A Computer Vision and Machine learning blog

I'm a Computer Vision and Machine learning developer.

Monday, May 5, 2014

Bag Of Visual Words model for image classification

I wanted to play around with Bag Of Words for visual classification, so I coded a Matlab implementation that uses VLFeat for the features and clustering.
I tested it on classifying Mac/Windows desktop screenshots.

For a small testing data set (about 50 images for each category), the best vocabulary size was about 80.
It scored 97% accuracy on the training set, and 85% accuracy on the cross validation set,
so there is still some over-fitting that could be reduced.


1. Collect a data set of examples. I used a python script to download images from Google.
2. Partition the data set into a training set, and a cross validation set (80% - 20%).
3. Find key points in each image, using SIFT.
4. Take a patch around each key point, and calculate its Histogram of Oriented Gradients (HoG). Gather all these features.
5. Build a visual vocabulary by finding representatives of the gathered features (quantization).
This is done by k-means clustering.
6. Find the distribution of the vocabulary in each image in the training set.
This is done by a histogram with a bin for each vocabulary word.
The histogram values can be either hard values, or soft values.
Hard values mean that for each descriptor of a key point patch in an image, we add 1 to the bin of the vocabulary word closest to it in squared distance.
Soft values mean that each patch votes for all histogram bins, but gives a higher weight to the bins representing words that are similar to that patch. Take a look here.
7. Train an SVM on the resulting histograms (each histogram is a feature vector, with a label).
8. Test the classifier on the cross validation set.
9. If the results are not satisfactory, repeat from step 5 with a different vocabulary size and different SVM parameters.
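
Steps 5 and 6 above can be sketched in a few lines. This is not the Matlab/VLFeat code from this post, just a minimal pure-Python illustration. The function names, the toy 2-D descriptors, the Gaussian weighting with its `sigma` parameter, and the normalisation of each histogram to sum to 1 are all my own assumptions for the example; real SIFT descriptors are 128-dimensional and the soft weighting could be done differently.

```python
import math
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_word(descriptor, vocabulary):
    """Index of the vocabulary word closest to the descriptor."""
    return min(range(len(vocabulary)),
               key=lambda k: squared_distance(descriptor, vocabulary[k]))


def kmeans(descriptors, k, iterations=20, seed=0):
    """Plain Lloyd's k-means; the k cluster centres are the visual words."""
    rng = random.Random(seed)
    centres = [list(d) for d in rng.sample(descriptors, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for d in descriptors:
            clusters[nearest_word(d, centres)].append(d)
        for i, members in enumerate(clusters):
            if members:  # keep the old centre if a cluster ends up empty
                centres[i] = [sum(col) / len(members) for col in zip(*members)]
    return centres


def normalise(hist):
    """Scale a histogram to sum to 1 (assumed here; not stated in the post)."""
    total = sum(hist)
    return [h / total for h in hist] if total else hist


def hard_histogram(descriptors, vocabulary):
    """Hard values: each descriptor adds 1 to the bin of its nearest word."""
    hist = [0.0] * len(vocabulary)
    for d in descriptors:
        hist[nearest_word(d, vocabulary)] += 1.0
    return normalise(hist)


def soft_histogram(descriptors, vocabulary, sigma=1.0):
    """Soft values: each descriptor votes for every bin, weighted by similarity.

    The Gaussian weight is an assumption made for this sketch.
    """
    hist = [0.0] * len(vocabulary)
    for d in descriptors:
        dists = [squared_distance(d, w) for w in vocabulary]
        closest = min(dists)  # subtracting it avoids underflow to all zeros
        weights = [math.exp(-(dist - closest) / (2.0 * sigma ** 2))
                   for dist in dists]
        norm = sum(weights)
        for k, w in enumerate(weights):
            hist[k] += w / norm  # each descriptor contributes one vote in total
    return normalise(hist)


if __name__ == "__main__":
    # Toy "descriptors" gathered from the training images (made-up values).
    gathered = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
    vocabulary = kmeans(gathered, k=2)
    one_image = [[0.05, 0.05], [4.9, 5.0], [5.2, 5.1]]
    print(hard_histogram(one_image, vocabulary))
    print(soft_histogram(one_image, vocabulary, sigma=1.0))
```

Each histogram is then the feature vector that goes into the SVM in step 7. With a small `sigma` the soft histogram gets close to the hard one, and with a large `sigma` the votes spread almost evenly over all bins, so `sigma` would be one more parameter to tune in step 9.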

Visualization of the vocabulary learned by the clustering

Source Code