Moved to http://jacobgil.github.io: May 2014

I wanted to play around with Bag of Words for visual classification, so I coded a MATLAB implementation that uses VLFeat for the features and clustering.

It was tested on classifying Mac/Windows desktop screenshots.

For a small data set (about 50 images for each category), the best vocabulary size was about 80.
It scored 97% accuracy on the training set and 85% accuracy on the cross validation set,
so there is still some over-fitting that could be reduced.

Overview:

1. Collect a data set of examples. I used a Python script to download images from Google.

2. Partition the data set into a training set, and a cross validation set (80% - 20%).

3. Find key points in each image, using SIFT.

4. Take a patch around each key point, and calculate its Histogram of Oriented Gradients (HoG). Gather all these features.

5. Build a visual vocabulary by finding representatives of the gathered features (quantization).
This is done by k-means clustering.

6. Find the distribution of the vocabulary in each image in the training set.
This is done by a histogram with a bin for each vocabulary word.
The histogram values can be either hard values or soft values.
Hard values mean that for each descriptor of a key point patch in an image, we add 1 to the bin of the vocabulary word closest to it in squared distance.
Soft values mean that each patch votes for all histogram bins, but gives a higher weight to the bins representing words that are similar to that patch. Take a look here.

7. Train an SVM on the resulting histograms (each histogram is a feature vector, with a label).

8. Test the classifier on the cross validation set.

9. If the results are not satisfactory, repeat from step 5 with a different vocabulary size and different SVM parameters.
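
To make the difference between hard and soft histogram values in step 6 concrete, here is a minimal sketch in plain Python. It is not the MATLAB/VLFeat code from this project. The function names, the toy 2-D descriptors, and the Gaussian weighting with a `sigma` parameter are all assumptions made for illustration; real descriptors would have many more dimensions, and other soft-weighting schemes are possible.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def hard_histogram(descriptors, vocabulary):
    """Each descriptor adds 1 to the bin of its closest vocabulary word."""
    hist = [0.0] * len(vocabulary)
    for d in descriptors:
        dists = [squared_distance(d, w) for w in vocabulary]
        hist[dists.index(min(dists))] += 1.0
    return hist


def soft_histogram(descriptors, vocabulary, sigma=1.0):
    """Each descriptor votes for every bin, with more weight on closer words.

    The Gaussian kernel used here is an assumed weighting, chosen only
    because it is simple. Each descriptor contributes a total vote of 1.
    """
    hist = [0.0] * len(vocabulary)
    for d in descriptors:
        dists = [squared_distance(d, w) for w in vocabulary]
        nearest = min(dists)
        # Subtract the smallest distance first so the closest word always
        # gets weight 1.0 and the sum can never underflow to zero.
        weights = [math.exp(-(dist - nearest) / (2.0 * sigma ** 2))
                   for dist in dists]
        total = sum(weights)
        for i, weight in enumerate(weights):
            hist[i] += weight / total
    return hist


# Toy data: a vocabulary of two words and three descriptors from one image.
vocabulary = [(0.0, 0.0), (10.0, 10.0)]
descriptors = [(1.0, 0.0), (0.0, 1.0), (9.0, 10.0)]

hard = hard_histogram(descriptors, vocabulary)
soft = soft_histogram(descriptors, vocabulary, sigma=5.0)
print(hard)
print(soft)
```

With this toy data, two descriptors fall closest to the first word and one to the second, so the hard histogram counts 2 and 1. The soft histogram spreads each vote over both bins, and a smaller `sigma` pushes it toward the hard result. Either histogram would then be the feature vector passed to the SVM in step 7.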