PicASHOW: Pictorial Authority Search by Hyperlinks on the Web
Ronny Lempel, Aya Soffer

Jonathan Ledlie (jonathan@eecs)
October 2, 2001

The authors argue that a hybrid of two major text-oriented web page ranking schemes, PageRank and Kleinberg's algorithm, also applies to images and could be included in search engines with little modification. They merge the probabilities of visiting random pages from PageRank's link analysis with Kleinberg's idea of hubs (pointers to authorities) and authorities (high-quality containers of information on a given topic) to form topical image collections IC = (P,I,L,E), where P is the page set, I is the set of images in P, L is the page-to-page link set, and E is the page-to-image relation.

They use several heuristics to ameliorate problem areas where pages and images differ. In particular, they try to account for images that have been copied rather than linked to. They hash the first 1 KB of each image and compare the result against the hashes of other images, arguing that two images are the same only if their hashes match. Because they also hash the header information, I think these hashes are too restrictive and, in fact, they say that "very rarely" do images have the same signature. The problem is that headers contain timestamps and cookie information, which would likely differ even if the images were the same.

A second heuristic is the elimination of banners and other files they deem unimportant, based on aspect ratios and sizes. I think these should instead be included by default and excluded at the user's discretion, along with the other image analysis criteria they mention as future work.

Class discussion: Comparison between Kleinberg/PageRank and PicASHOW. The relations are of different kinds: here we have both Page->Page and Page->Image relations, which adds a level of indirection. Which matrix do you use? They end up suggesting a linear combination (maybe 1/2, 1/2).
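To make the "which matrix" question concrete, here is a minimal sketch of one possible reading of the linear combination. It assumes that a page endorses an image with weight 1/2 when it contains the image (the E relation) and with weight 1/2 when it links to a page that contains it (L followed by E), and it then runs a Kleinberg-style hub/authority iteration over that weighted page-to-image matrix. The function names, the toy data, and the exact weighting are assumptions made for illustration, not the authors' formulation.

```python
from math import sqrt


def normalize(scores):
    """Scale a score dict to unit Euclidean length (no-op if all zero)."""
    norm = sqrt(sum(v * v for v in scores.values())) or 1.0
    return {k: v / norm for k, v in scores.items()}


def image_authorities(pages, images, links, contains,
                      w_direct=0.5, w_linked=0.5, iterations=50):
    """Hub/authority iteration on a weighted page-to-image matrix.

    links    -- (page, page) pairs, the L relation
    contains -- (page, image) pairs, the E relation
    The 1/2, 1/2 default weights are an assumption taken from the notes.
    """
    images_on = {p: [] for p in pages}
    for p, i in contains:
        images_on[p].append(i)

    # weight[p][i]: how strongly page p endorses image i.
    weight = {p: {} for p in pages}
    for p in pages:
        for i in images_on[p]:          # direct containment (E)
            weight[p][i] = weight[p].get(i, 0.0) + w_direct
    for p, q in links:
        for i in images_on[q]:          # one link away (L, then E)
            weight[p][i] = weight[p].get(i, 0.0) + w_linked

    hub = {p: 1.0 for p in pages}
    auth = {i: 1.0 for i in images}
    for _ in range(iterations):
        # An image is authoritative if good hubs endorse it.
        auth = normalize({
            i: sum(weight[p].get(i, 0.0) * hub[p] for p in pages)
            for i in images
        })
        # A page is a good hub if it endorses authoritative images.
        hub = normalize({
            p: sum(w * auth[i] for i, w in weight[p].items())
            for p in pages
        })
    return auth, hub


# Hypothetical toy collection: three pages, two images.
PAGES = ["a", "b", "c"]
IMAGES = ["x", "y"]
LINKS = [("a", "b"), ("c", "b")]
CONTAINS = [("a", "x"), ("b", "x"), ("c", "y")]

auth, hub = image_authorities(PAGES, IMAGES, LINKS, CONTAINS)
ranking = sorted(IMAGES, key=auth.get, reverse=True)
print(ranking)
```

In this toy collection, image x is contained in two pages and is also reachable through both links, so it collects more endorsement than y. Changing w_direct and w_linked shows how sensitive the ranking is to the choice of combination.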
- Fingerprints for uniqueness: The headers are actually part of the image file. These are not HTTP headers.
- Noise and banners
- StopImages: These are manual; is this reasonable?
- Could this also apply to sound files?
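The fingerprinting heuristic discussed above can be sketched as follows. This is a minimal illustration that assumes SHA-1 as the hash function and a 1024-byte prefix; the notes only say that the first 1k of each image is hashed, so the hash choice, the function names, and the sample byte strings are all assumptions.

```python
import hashlib
from collections import defaultdict

PREFIX_BYTES = 1024  # "first 1k" of the file, in-file header included


def fingerprint(image_bytes, prefix_bytes=PREFIX_BYTES):
    """Hash only the leading bytes of an image file (SHA-1 is an assumption)."""
    return hashlib.sha1(image_bytes[:prefix_bytes]).hexdigest()


def group_duplicates(images):
    """Group image URLs whose prefix fingerprints match.

    images -- dict mapping a URL (or any label) to the raw file bytes
    Returns a sorted list of groups that have more than one member.
    """
    groups = defaultdict(list)
    for url, data in images.items():
        groups[fingerprint(data)].append(url)
    return sorted(sorted(urls) for urls in groups.values() if len(urls) > 1)


# Hypothetical byte strings standing in for image files.
body = bytes(range(256)) * 8                      # 2048 bytes of "pixel data"
samples = {
    "site1/logo": b"HDR-A" + body,
    "site2/logo-copy": b"HDR-A" + body,           # byte-for-byte copy
    "site3/logo-resaved": b"HDR-B" + body,        # same pixels, different header
}

print(group_duplicates(samples))
```

Because the prefix includes the header stored inside the image file, a copy that was re-saved with different header bytes gets a different fingerprint even when the pixel data is identical, which is the restrictiveness noted in the review. Conversely, two files that differ only after the first 1024 bytes would be treated as the same image.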