Analysis of a Very Large AltaVista Query Log (1998)
Silverstein, Henzinger, Marais, Moricz 

Jonathan Ledlie (jledlie@eecs)

The authors analyze a query log containing about 1 billion queries
submitted to the AltaVista search engine over a six-week period and
extrapolate to general Internet search engine usage from
their results.  Their central point, as noted in class on Tuesday, is
to show that Web queries are fundamentally different from pre-Web
"library" queries: information theorists say these types of queries
have been solved adequately, but the authors argue that their results
are different enough from earlier library-type searches that new rules
apply.  They perform two types of analysis: (1) they compare all of
the queries against all of the other queries (called "first-order"
analysis) and (2) they do a local analysis of queries which appeared to
come as part of the same "session."  First-order analysis produced a
histogram of the most popular search terms.  Second- (and higher-)
order analysis examined how users refined their queries: how often they
restarted queries and what the correlation was between pairs of words
and between geographical locations and domain restrictions (e.g.
Netherlands, nl).
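
As a rough illustration of the two kinds of counting described above,
here is a minimal sketch.  It is not the authors' method: the
whitespace tokenization, the function names, and the four-query sample
log are all assumptions made up for this example.

```python
from collections import Counter
from itertools import combinations


def term_histogram(queries):
    """Count how often each term appears across all queries."""
    counts = Counter()
    for query in queries:
        # Assumption: terms are lowercase, whitespace-separated tokens.
        counts.update(query.lower().split())
    return counts


def pair_counts(queries):
    """Count how often each unordered pair of distinct terms shares a query."""
    counts = Counter()
    for query in queries:
        # Sort so that each pair is always stored in the same order.
        terms = sorted(set(query.lower().split()))
        counts.update(combinations(terms, 2))
    return counts


# Hypothetical sample log, not taken from the paper.
sample_log = [
    "cheap flights amsterdam",
    "amsterdam hotels",
    "cheap hotels",
    "cheap flights",
]

hist = term_histogram(sample_log)
pairs = pair_counts(sample_log)
print(hist.most_common(1))
print(pairs[("cheap", "flights")])
```

Raw pair counts like these are only a starting point; turning them
into a correlation would also require normalizing by how often each
term occurs on its own.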

One criticism of their methodology is that they simplified their
analysis by treating the restricting "host" predicate as a four-word
query.  The difficulty with this is that many people have a
personalized version of AltaVista which only searches their own
documents, and therefore uses this predicate.  This may cause their
results to show more predicates than actually are supplied
(particularly because each "host" line was broken into four
operators).  They do not say what percentage of queries fall into this
category.

One interesting assumption, which is perhaps somewhat reasonable, is
that users are the same across all search engines, so that these
results apply beyond AltaVista users.  The risk is that AltaVista
users may be a self-selecting group who like the way AltaVista answers
their queries and therefore use it instead of some other engine.

From class
One major benefit of this work is the feedback loop it creates between
improved user interfaces and better queries.  Improved interfaces
might be able to suggest ways a user might prune his or her search.
In other words, the interface might prompt users toward the extra
syntax that many search engines offer but that often goes unused...