Google Image Results

Kimo Johnson

08 June 2006

This week, I have been researching radial lens distortion. I have a simple mathematical model of the distortion and an algorithm for estimating the parameters of the model. Everything works well in simulation, so it's time to test the algorithm on real images. I remembered that the camera review site dpreview.com had detailed images of lens distortion for many different cameras. Rather than navigate through their multi-page reviews searching for the images, I did a Google image search: "distortion site:dpreview.com". This search returned over 20 pages of results, more than enough for testing my algorithm. Downloading all the images by hand would be slow and painful, so I wrote a Python script to do the work for me. The script is coded to the specifics of this problem, but it could be modified to automate the downloading of any set of Google image results.

Here’s the script; save it in a file called get_image_results.py. First save the HTML page with the search results to a file, then run the script on that file: get_image_results.py page1.html output_dir. The last argument, output_dir, is an existing directory in which to save the images. The script downloads each image by calling curl, so curl must be installed.

#
# Download Google image results for query: "distortion site:dpreview.com"
# Can be modified to work with other image queries
# Last updated: 6/8/2006
#
import re
import os.path
import sys
import subprocess

#
# Extract the image URLs from the saved results page and download each one
#
def main(argv):
    (html_file,output_dir) = process_args(argv)
    f = open(html_file)
    lines = f.readlines()
    f.close()
    
    href = re.compile(r'imgurl=(?P<imgurl>[^&]+)&')
    
    url_list = []
    for line in lines:
        matches = href.findall(line)
        # findall returns an empty list (never None) when nothing matches
        if not matches:
            continue

        # keep only URLs under 'reviews/', which make_filename relies on
        url_list += [s for s in matches if 'reviews/' in s]
    
    for url in url_list:
        filename = os.path.join(output_dir,make_filename(url))
        retcode = subprocess.call(['curl',url,'-o',filename])
        if retcode != 0:
            print('curl failed (exit code %d) for %s' % (retcode, url))
        
#
# Check the command-line arguments and return (html_file, output_dir)
#
def process_args(argv):
    argc = len(argv)
    if argc < 3:
        print("Usage: get_image_results.py <html file> <output directory>")
        sys.exit(1)

    # use only the first two arguments so the caller can unpack a pair
    args = [s.strip() for s in argv[1:3]]

    # Make sure file exists
    if not os.path.isfile(args[0]):
        print('File "%s" does not exist.' % args[0])
        sys.exit(1)

    # Make sure directory exists
    if not os.path.isdir(args[1]):
        print('Directory "%s" does not exist.' % args[1])
        sys.exit(1)

    return tuple(args)

#
# Build a flat local filename from the part of the URL after 'reviews/'
#
def make_filename(url):
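    # drop the site address: keep only what follows 'reviews/'
    ix = url.find('reviews/') + len('reviews/')
    
    # lowercase, flatten the path into a single name, then strip the
    # 'samples_' and 'distortion_' components if they appear
    smaller = url[ix:].lower()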
    smaller = smaller.replace('/','_')
    smaller = smaller.replace('samples_','')
    smaller = smaller.replace('distortion_','')
    return smaller

#
# Run main() when the file is executed as a script
#
if __name__ == "__main__":
    main(sys.argv)
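
To adapt the script to another query, the two pieces tied to dpreview.com are the 'reviews' filter and the filename rule. The following is only a sketch of one way to pull them out into parameters; it is written for Python 3, the names extract_image_urls, keyword, and marker are my own and hypothetical, and the sample line is made up to match the imgurl=...& shape the script expects rather than copied from a real results page.

```python
import re

# Same pattern as in the script: capture the value of the imgurl parameter.
IMGURL = re.compile(r'imgurl=(?P<imgurl>[^&]+)&')

def extract_image_urls(html, keyword):
    """Return every imgurl value in html that contains keyword.

    keyword is a hypothetical parameter that stands in for the
    hard-coded 'reviews' filter.
    """
    return [url for url in IMGURL.findall(html) if keyword in url]

def make_filename(url, marker):
    """Flatten the part of url after marker into one lowercase name."""
    ix = url.find(marker)
    if ix < 0:
        # marker missing: fall back to the last path component
        return url.rsplit('/', 1)[-1].lower()
    return url[ix + len(marker):].lower().replace('/', '_')

# Made-up line in the shape the script expects; not real search output.
sample = ('<a href="/imgres?imgurl=http://example.com/reviews/CamX/Samples/'
          'Distortion/wide.jpg&imgrefurl=http://example.com/page&h=1">')

urls = extract_image_urls(sample, 'reviews/')
names = [make_filename(u, 'reviews/') for u in urls]
print(urls)
print(names)
```

Unlike the original make_filename, this version does not strip 'samples_' or 'distortion_', since those are specific to the dpreview.com paths; it also handles a URL that lacks the marker instead of slicing at the wrong offset.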