A few sketchy notes about SIFT
------------------------------

See the papers for more details:
  http://www.cs.ubc.ca/~lowe/keypoints/
  http://www.cs.toronto.edu/~kyros/courses/320/Lectures/lowe-sift-ijcv04.pdf

We assume that keypoints have already been detected, as "corners" in
scale-space, namely local minima/maxima in 3x3x3 regions of the DOG
(Difference-of-Gaussian) pyramids.

How the SIFT feature descriptors are computed:
(rough code sketches of the descriptor and matching steps appear at the
end of these notes)

[1] each keypoint is assigned an orientation, corresponding to the
    *dominant* orientation within that patch -- specifying this gives
    rotation invariance, as all other descriptors are computed relative
    to this orientation
    - find the gradient magnitude and orientation in a neighborhood
      (all of this is at the relevant scale)
    - bin the orientation, \theta = atan2(G_y, G_x), into one of 36 bins,
      each covering 10 degrees
    - each pixel in the neighborhood contributes to the bin corresponding
      to its orientation:
      - proportional to its gradient magnitude, \sqrt{G_x^2 + G_y^2}
      - and with a circular Gaussian falloff from the keypoint center
        (\sigma = 1.5 pixels at the current scale, so the effective
        neighborhood is about 9x9)
    - the highest peak of the histogram is the *dominant* orientation
    - (if there are multiple peaks within 0.8 of the max, add each of
      those as separate SIFT features)
    - (interpolate the peak using a parabola fit to the +-1 neighboring
      bins, for better accuracy)

[2] given a keypoint and a dominant orientation, we build the descriptor
    as follows
    - consider a 16x16 patch centered on the keypoint (at the relevant
      scale)
    - decompose this patch into 4x4 pixel tiles .. so there will be
      (16/4)*(16/4) = 16 such tiles
    - for each such tile, we compute a histogram of its pixels' gradient
      orientations with 8 bins, each covering 45 degrees
    - actually, we specify these gradient orientations *relative* to the
      keypoint's dominant orientation, ie. \theta' = atan2(G_y, G_x) - \theta
    - again weight the contribution to the bin by the gradient magnitude
    - again we use a circular Gaussian falloff from the keypoint center
      (\sigma = 8 pixels, half the descriptor window)
    - so this gives 16 tiles * 8 histogram bins/tile = a 128-dimensional
      vector representing our SIFT feature descriptor
    - normalize this to unit length, for better illumination invariance
    - (also, threshold values in this normalized vector to be no larger
      than 0.2, and then renormalize)

How to match SIFT features:

[1] Define a distance measure between two SIFT feature vectors
    - this is easy ... just find the Euclidean distance in the space of
      128-dimensional vectors, d = [\sum_{i=1}^{128} (p_i - q_i)^2]^{1/2}
    - in practice use kd-trees or other specialized structures for
      efficient (maybe even approximate) nearest-neighbor queries in a
      large database of SIFT features (perhaps ~100k in a big scene)
    - (can also do dimensionality reduction using e.g. Haar wavelets; in
      fact this transform would preserve the Euclidean distance, since
      the Haar wavelet basis is *orthogonal*)

[2] For applications like recognition, panorama stitching, etc. we need
    to know: does the best SIFT feature match in the database constitute
    a *real* match, or is it just a false positive?
    - setting a hard distance threshold in 128-dimensional space
      (ie. below this threshold = a match) doesn't work well
    - better: compare the distance to the best matching SIFT feature in
      the database with the distance to the *2nd-best* matching feature
      -- if the ratio of the closest distance to the 2nd-closest distance
      is > 0.8, reject it as a false match!
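
Rough code sketches
-------------------

These are minimal Python/numpy illustrations of the steps above, not
Lowe's actual implementation; the function names and extra parameters
(radius, the small epsilons, etc.) are my own choices.

First, a sketch of orientation assignment (step [1] of descriptor
computation), assuming a grayscale numpy image that has already been
smoothed to the keypoint's scale:

import numpy as np

def dominant_orientations(img, kx, ky, sigma=1.5, nbins=36, radius=4):
    """Return the dominant orientation(s), in radians, of keypoint (kx, ky)."""
    hist = np.zeros(nbins)
    for y in range(ky - radius, ky + radius + 1):
        for x in range(kx - radius, kx + radius + 1):
            if not (0 < x < img.shape[1] - 1 and 0 < y < img.shape[0] - 1):
                continue
            gx = img[y, x + 1] - img[y, x - 1]            # central differences
            gy = img[y + 1, x] - img[y - 1, x]
            mag = np.hypot(gx, gy)                        # gradient magnitude
            theta = np.arctan2(gy, gx) % (2 * np.pi)      # orientation in [0, 2pi)
            weight = np.exp(-((x - kx)**2 + (y - ky)**2) / (2 * sigma**2))
            b = int(theta / (2 * np.pi) * nbins) % nbins  # 10-degree bins
            hist[b] += weight * mag
    peaks = []
    for b in range(nbins):
        left, right = hist[(b - 1) % nbins], hist[(b + 1) % nbins]
        if hist[b] > left and hist[b] > right and hist[b] >= 0.8 * hist.max():
            # parabolic interpolation of the peak using the +-1 neighboring bins
            offset = 0.5 * (left - right) / (left - 2 * hist[b] + right)
            peaks.append(((b + offset) * 2 * np.pi / nbins) % (2 * np.pi))
    return peaks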
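
Next, a sketch of descriptor construction (step [2]), using the same
gradient conventions as above.  For simplicity this version bins each
pixel by its raw patch coordinates; the full method also rotates the
sampling grid by the dominant orientation and spreads each sample over
neighboring bins by trilinear interpolation, which is omitted here.

def sift_descriptor(img, kx, ky, dominant_theta):
    """Build a 128-d descriptor from the 16x16 patch centered on (kx, ky)."""
    desc = np.zeros((4, 4, 8))                  # 4x4 tiles, 8 orientation bins each
    sigma = 8.0                                 # Gaussian falloff, half the window
    for dy in range(-8, 8):
        for dx in range(-8, 8):
            x, y = kx + dx, ky + dy
            if not (0 < x < img.shape[1] - 1 and 0 < y < img.shape[0] - 1):
                continue
            gx = img[y, x + 1] - img[y, x - 1]
            gy = img[y + 1, x] - img[y - 1, x]
            mag = np.hypot(gx, gy)
            # orientation *relative* to the keypoint's dominant orientation
            theta = (np.arctan2(gy, gx) - dominant_theta) % (2 * np.pi)
            weight = np.exp(-(dx**2 + dy**2) / (2 * sigma**2))
            tile_r, tile_c = (dy + 8) // 4, (dx + 8) // 4   # which of the 16 tiles
            b = int(theta / (2 * np.pi) * 8) % 8            # 45-degree bins
            desc[tile_r, tile_c, b] += weight * mag
    v = desc.ravel()                            # 16 tiles * 8 bins = 128 dimensions
    v /= np.linalg.norm(v) + 1e-12              # normalize to unit length
    v = np.minimum(v, 0.2)                      # clip large values ...
    v /= np.linalg.norm(v) + 1e-12              # ... and renormalize
    return v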
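
Finally, a sketch of matching with the ratio test, using a brute-force
nearest-neighbor search instead of a kd-tree just to make the
accept/reject logic concrete; query and database are assumed to be
arrays of descriptors produced as above.

def match_features(query, database, ratio=0.8):
    """Return (query_index, database_index) pairs passing the ratio test.

    query:    (M, 128) array of descriptors from one image
    database: (N, 128) array of descriptors to match against (N >= 2)
    """
    matches = []
    for i, q in enumerate(query):
        d = np.linalg.norm(database - q, axis=1)  # Euclidean distance to all N
        best, second = np.argsort(d)[:2]
        # accept only if the best match is clearly closer than the 2nd best
        if d[best] < ratio * d[second]:
            matches.append((i, int(best)))
    return matches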