Breadth-First Search Crawling Yields High-Quality Pages
Marc Najork and Janet Wiener
Reviewed by Jonathan Ledlie (jonathan@eecs)
October 2, 2001

The authors argue that the PageRank algorithm for computing page quality, while very good at finding pages of high quality, is too computationally intensive to scale to the billions of pages on the Internet, and that it is well approximated by a breadth-first search. Their analysis compares a breadth-first crawl of 351 million pages with a PageRank analysis of the same set of pages and finds that the pages with the highest PageRank are heavily concentrated at the beginning of the crawl.

They do not perform a strict breadth-first search to download the pages they analyze. Instead, their web crawler, called Mercator, pauses briefly between downloads of pages from the same web server. It seemed unclear to me that this design change -- originally made to avoid overloading a crawled web server -- would actually lead to significantly different results. Still, it does show that crawlers which observe this politeness do not get poor results.

One area the authors leave unresolved is how local links should be weighted in comparison with remote ones. To me, they successfully show that, essentially whatever further analysis is going to be performed on the crawled pages, a breadth-first search will yield better, more timely results than other crawling methods.
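A minimal sketch of the crawling strategy described above -- a FIFO frontier for breadth-first order, plus a brief pause between downloads from the same host -- might look like the following. This is my own illustration, not the authors' Mercator code (which is a large, distributed Java system); the toy in-memory link graph and the delay value are assumptions for demonstration only:

```python
import time
from collections import deque

# Hypothetical in-memory "web": url -> outgoing links (illustrative data only).
WEB = {
    "a.com/1": ["a.com/2", "b.com/1"],
    "a.com/2": ["c.com/1"],
    "b.com/1": ["a.com/1", "c.com/1"],
    "c.com/1": [],
}

def host(url):
    return url.split("/")[0]

def bfs_crawl(seeds, delay=0.0):
    """Breadth-first crawl with a per-host politeness delay.

    A plain FIFO frontier yields strict BFS order; sleeping until
    `delay` seconds have elapsed since the last request to a host
    approximates Mercator's pause between downloads from one server.
    """
    frontier = deque(seeds)
    seen = set(seeds)
    last_hit = {}              # host -> time of previous download
    order = []
    while frontier:
        url = frontier.popleft()
        h = host(url)
        last = last_hit.get(h)
        if last is not None:
            wait = last + delay - time.monotonic()
            if wait > 0:
                time.sleep(wait)   # be polite to this server
        last_hit[h] = time.monotonic()
        order.append(url)          # stand-in for downloading the page
        for link in WEB.get(url, []):
            if link not in seen:   # enqueue each page at most once
                seen.add(link)
                frontier.append(link)
    return order
```

Note that the politeness delay only reorders downloads in time, not in the frontier: pages are still dequeued in breadth-first order, which matches the reviewer's point that the pause is unlikely to change the results dramatically.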