Now for the final step of SIFT. Till now, we had scale and rotation invariance. Now we create a fingerprint for each keypoint. This is to identify a keypoint. If an eye is a keypoint, then using this fingerprint, we’ll be able to distinguish it from other keypoints, like ears, noses, fingers, etc.
We want to generate a unique fingerprint for the keypoint. It should be easy to calculate. We also want it to be relatively lenient when it is being compared against other keypoints. Things are never exactly the same when comparing two different images.
To do this, a 16×16 window is taken around the keypoint. This 16×16 window is broken into sixteen 4×4 windows.
Within each 4×4 window, gradient magnitudes and orientations are calculated. These orientations are put into an 8 bin histogram.
Any gradient orientation in the range 0-44 degrees adds to the first bin, 45-89 degrees adds to the next bin, and so on. And (as always) the amount added to the bin depends on the magnitude of the gradient.
Unlike in the earlier steps, the amount added also depends on the distance from the keypoint. So gradients that are far away from the keypoint will add smaller values to the histogram.
This is done using a “gaussian weighting function”. This function simply generates a smooth falloff around the keypoint (it’s like a 2D bell curve). You multiply it with the gradient magnitudes, and you get weighted magnitudes. The farther away a gradient is from the keypoint, the smaller its weighted magnitude.
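The binning for a single 4×4 window can be sketched in a few lines of NumPy. This is only an illustration: the function name `cell_histogram` and its parameters are made up for this example, and it assumes the gradient magnitudes and orientations (in degrees) have already been computed.

```python
import numpy as np

def cell_histogram(magnitudes, orientations_deg, num_bins=8):
    """Add each gradient's magnitude to the bin its orientation falls in."""
    bin_width = 360.0 / num_bins  # 45 degrees per bin when num_bins is 8
    # Wrap angles into [0, 360) and turn each one into a bin index.
    bins = (np.mod(orientations_deg, 360.0) // bin_width).astype(int)
    bins = bins % num_bins  # guard against a rounding result of exactly 360
    hist = np.zeros(num_bins)
    # np.add.at accumulates correctly even when a bin index repeats.
    np.add.at(hist, bins.ravel(), magnitudes.ravel())
    return hist

# Tiny example: 10 degrees goes to bin 0, both 50-degree gradients go to
# bin 1 (2 + 3 = 5), and 350 degrees goes to the last bin.
mags = np.array([[1.0, 2.0], [3.0, 4.0]])
oris = np.array([[10.0, 50.0], [50.0, 350.0]])
hist = cell_histogram(mags, oris)
```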
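Here is a rough sketch of that weighting. The text above does not say how wide the bell curve should be, so the default `sigma` of half the window width is an assumption, and `gaussian_window` is just a name chosen for this example.

```python
import numpy as np

def gaussian_window(size=16, sigma=None):
    """Build a size x size 2D bell curve centred on the middle of the window."""
    if sigma is None:
        sigma = size / 2.0  # assumed width; not specified in the text
    # Pixel coordinates measured from the window centre, which falls
    # in between pixels for an even window size.
    coords = np.arange(size) - (size - 1) / 2.0
    xx, yy = np.meshgrid(coords, coords)
    return np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))

weights = gaussian_window()
magnitudes = np.ones((16, 16))       # stand-in for real gradient magnitudes
weighted = magnitudes * weights      # far-away gradients now count for less
```

With this window, a gradient in a corner of the 16×16 patch contributes less than half of what the same gradient would contribute right next to the keypoint.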
Doing this for all 16 pixels of a 4×4 window, you would’ve “compiled” 16 arbitrary orientations into 8 predetermined bins. You do this for all sixteen 4×4 regions, so you end up with 4×4×8 = 128 numbers. Once you have all 128 numbers, you normalize them (just like you would normalize a vector in school: divide by the root of the sum of squares). These 128 numbers form the “feature vector”. This keypoint is uniquely identified by this feature vector.
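Putting the pieces together, a minimal sketch of building the 128 numbers might look like this. It assumes you already have 16×16 arrays of (weighted) gradient magnitudes and orientations in degrees; `raw_descriptor` is a hypothetical helper name, not something from a library.

```python
import numpy as np

def raw_descriptor(magnitudes, orientations_deg, num_bins=8, cell=4):
    """Turn a 16x16 patch of gradients into a normalized 128-element vector."""
    size = magnitudes.shape[0]
    bin_width = 360.0 / num_bins
    bins = (np.mod(orientations_deg, 360.0) // bin_width).astype(int) % num_bins
    histograms = []
    # Walk over the sixteen 4x4 windows, row by row.
    for r in range(0, size, cell):
        for c in range(0, size, cell):
            hist = np.zeros(num_bins)
            np.add.at(hist,
                      bins[r:r + cell, c:c + cell].ravel(),
                      magnitudes[r:r + cell, c:c + cell].ravel())
            histograms.append(hist)
    vec = np.concatenate(histograms)   # 16 histograms x 8 bins = 128 numbers
    norm = np.linalg.norm(vec)         # root of the sum of squares
    return vec / norm if norm > 0 else vec

# Random data standing in for a real image patch.
rng = np.random.default_rng(0)
feature_vector = raw_descriptor(rng.random((16, 16)), rng.random((16, 16)) * 360.0)
```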
You might have seen that in the pictures above, the keypoint lies “in between”. It does not lie exactly on a pixel. That’s because it does not. The 16×16 window takes orientations and magnitudes of the image “in-between” pixels. So you need to interpolate the image to generate orientation and magnitude data “in between” pixels.
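One common way to read an image “in between” pixels is bilinear interpolation. The text above does not say which interpolation to use, so treat this as one possible choice; `bilinear_sample` is a name made up for this sketch, and it assumes the point lies at least one pixel away from the right and bottom image borders.

```python
import numpy as np

def bilinear_sample(image, x, y):
    """Estimate the image value at a non-integer position (x, y)."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = x0 + 1, y0 + 1
    fx, fy = x - x0, y - y0            # fractional offsets in [0, 1)
    # Blend horizontally along the two neighbouring rows, then vertically.
    top = (1 - fx) * image[y0, x0] + fx * image[y0, x1]
    bottom = (1 - fx) * image[y1, x0] + fx * image[y1, x1]
    return (1 - fy) * top + fy * bottom

image = np.array([[0.0, 10.0],
                  [20.0, 30.0]])
centre_value = bilinear_sample(image, 0.5, 0.5)  # average of the four pixels
```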
This feature vector introduces a few complications. We need to get rid of them before finalizing the fingerprint.
You take a 16×16 window of “in-between” pixels around the keypoint. You split that window into sixteen 4×4 windows. From each 4×4 window you generate a histogram of 8 bins, with each bin corresponding to 0-44 degrees, 45-89 degrees, etc. Gradient orientations from the 4×4 window are put into these bins. This is done for all 4×4 blocks. Finally, you normalize the 128 values you get.
To make the fingerprint independent of rotation, you subtract the keypoint’s orientation from each gradient orientation. To make it less sensitive to lighting changes, you also threshold the value of each element of the feature vector to 0.2 (and normalize again).
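These two fixes can be sketched as follows. The function names are invented for this example; the 0.2 threshold comes from the text above, and the sketch assumes orientations are in degrees.

```python
import numpy as np

def relative_orientation(orientations_deg, keypoint_orientation_deg):
    """Measure every gradient orientation relative to the keypoint's orientation."""
    return np.mod(orientations_deg - keypoint_orientation_deg, 360.0)

def finalize_descriptor(vec, clip=0.2):
    """Normalize, cap every element at `clip`, then normalize again."""
    norm = np.linalg.norm(vec)
    if norm > 0:
        vec = vec / norm
    vec = np.minimum(vec, clip)        # any value above 0.2 becomes 0.2
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

# A keypoint oriented at 30 degrees: 10 degrees becomes 340, 350 becomes 320.
rotated = relative_orientation(np.array([10.0, 350.0]), 30.0)

# One huge entry dominates the raw vector; after clamping it matters less.
raw = np.ones(128)
raw[0] = 100.0
final = finalize_descriptor(raw)
```

Note that the orientation subtraction has to happen before the gradients are binned, while the clamping is applied to the finished 128 numbers.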