Analysis of a Very Large AltaVista Query Log (1998)
Silverstein, Henzinger, Marais, Moricz
Jonathan Ledlie (jledlie@eecs)

The authors analyze a query log containing about 1 billion search engine queries made on the AltaVista search engine over a six-week period and extrapolate from their results to general Internet search engine usage. Their central point, as noted in class on Tuesday, is that Web queries are fundamentally different from pre-Web "library" queries: information theorists say that library-type queries have been handled adequately, but the authors argue that their results differ enough from those earlier searches that new rules apply.

They perform two types of analysis: (1) they compare all of the queries against all of the other queries (called "first-order analysis") and (2) they do a local analysis of queries that appear to come from the same "session." First-order analysis produced a histogram of the most popular search terms. Second-order (and higher) analysis examined how users refined their queries: how often they restarted queries and what the correlation was between pairs of words and geographical locations and domain restrictions (e.g. Netherlands, nl).

One criticism of their methodology is that they simplify the analysis by treating the restricting "host" predicate as a four-word query. The difficulty is that many people have a personalized version of AltaVista that searches only their own documents and therefore uses this predicate. This may cause the results to show more predicates than users actually supplied (particularly because each "host" line was broken into four operators). The authors do not say what percentage of queries fall into this category.

One interesting assumption, which is perhaps somewhat reasonable, is that users are the same across all search engines, so that these results apply beyond AltaVista users.
This assumption may not hold: AltaVista users may be a self-selecting group who like the way AltaVista answers their queries and therefore use it rather than some other engine.

From class: One major benefit of this work is the feedback loop it creates between improved user interfaces and better queries. Improved interfaces might be able to suggest ways a user could prune his or her search. In other words, the interface could prompt users toward the extra syntax that many search engines offer but that often goes unused...
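
As a rough illustration of the first-order analysis summarized above (a histogram of the most popular search terms), the following is a minimal sketch and not the authors' actual method. The function name `term_histogram`, the whitespace tokenization, and the tiny `sample_log` are all assumptions made for illustration; the real log format and term-splitting rules are not described in these notes.

```python
from collections import Counter


def term_histogram(queries):
    """Count how often each term occurs across all queries.

    Assumption: terms are separated by whitespace and compared
    case-insensitively. The paper's own tokenization may differ.
    """
    counts = Counter()
    for query in queries:
        counts.update(query.lower().split())
    return counts


# Hypothetical four-query log, invented purely for illustration.
sample_log = [
    "cheap flights",
    "flights to amsterdam",
    "amsterdam hotels",
    "cheap hotels amsterdam",
]

histogram = term_histogram(sample_log)

# Print the most frequent terms first.
for term, count in histogram.most_common(3):
    print(term, count)
```

A session-level (second-order) analysis would additionally need to group queries by user and time, which this sketch does not attempt.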