Implementing Bag of Visual Words (BoW) classifier
for image classification and scene recognition
Report | Python Notebook | Rendered Notebook
1. Author Info
| Name | Surname | Student ID | UniTS mail | Google mail | Master |
|------|---------|------------|------------|-------------|--------|
| Marco | Tallone | SM3600002 | marco.tallone@studenti.units.it | marcotallone85@gmail.com | SDIC |
**Warning** (Generative Tools Notice):
Generative AI tools have been used as a support for the development of this project. In particular, the GitHub Copilot generative tool, based on the OpenAI GPT-4o model, has been used as an assistance medium for the following tasks:
- writing documentation and comments in implemented functions, adhering to the NumPy Style Guide
- improving variable naming and code readability
- minor bug fixing in implemented functions
- grammar and spelling checks, both in this README and in the report
- tweaking aesthetic improvements in the report plots
- formatting the tables of results in the report
The Bag of Visual Words (BoW) model is a popular computer vision technique used for image classification or retrieval. It is based on the idea of treating images as documents and representing them as histograms of visual words belonging to a visual vocabulary, which is obtained by clustering local features extracted from a set of images.
This project implements a BoW image classifier for scene recognition by first building a visual vocabulary from a set of training images and then performing multi-class classification using K-Nearest Neighbors (KNN) and Support Vector Machines (SVM) classifiers.
In particular, the visual vocabulary is built by clustering SIFT descriptors extracted from the training images with the K-Means algorithm. Descriptors have been computed both from keypoints detected with the SIFT algorithm and from dense sampling of the images on a fixed grid, in order to compare the two approaches, as sketched below.
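As an illustration, here is a minimal sketch of the vocabulary-building step, assuming grayscale images already loaded as NumPy arrays; the function names and parameters (`step`, `size`, `n_words`) are illustrative, not necessarily the ones used in the notebook:

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

sift = cv2.SIFT_create()

def sift_descriptors(image):
    """SIFT descriptors from keypoints detected by SIFT itself."""
    _, descriptors = sift.detectAndCompute(image, None)
    return descriptors

def dense_descriptors(image, step=8, size=8):
    """SIFT descriptors computed on a fixed grid (dense sampling)."""
    keypoints = [
        cv2.KeyPoint(float(x), float(y), float(size))
        for y in range(step, image.shape[0] - step, step)
        for x in range(step, image.shape[1] - step, step)
    ]
    _, descriptors = sift.compute(image, keypoints)
    return descriptors

def build_vocabulary(images, n_words=200):
    """Cluster all training descriptors into a visual vocabulary."""
    descriptors = []
    for image in images:
        d = sift_descriptors(image)  # or dense_descriptors(image)
        if d is not None:
            descriptors.append(d)
    all_descriptors = np.vstack(descriptors)
    # The K-Means cluster centers play the role of the visual words
    return KMeans(n_clusters=n_words, random_state=0).fit(all_descriptors)
```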
In the classification phase, instead, the performance of a simple KNN classifier is compared with that of different SVM classifiers, all adopting the "one-vs-all" strategy for multi-class classification. The SVM classifiers differ in the kernel used and in the kind of input features they are trained on.
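A minimal sketch of this classification step, using scikit-learn with random stand-in features in place of the real BoW vectors (all hyperparameters here are purely illustrative):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier

# Random stand-ins for the BoW feature vectors of the 15-Scenes images
rng = np.random.default_rng(0)
X_train, y_train = rng.random((150, 200)), rng.integers(0, 15, 150)
X_test, y_test = rng.random((50, 200)), rng.integers(0, 15, 50)

# Baseline: a simple K-Nearest Neighbors classifier
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# One-vs-all SVM; the kernel (and its parameters) can be swapped
svm = OneVsRestClassifier(SVC(kernel="rbf", C=1.0)).fit(X_train, y_train)

print("KNN accuracy:", knn.score(X_test, y_test))
print("SVM accuracy:", svm.score(X_test, y_test))
```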
Additionally, different ways to represent images as input feature vectors are tested. These include the classic representation as normalized histograms of visual words, the soft-assignment techniques proposed by Van Gemert et al. [3], and the spatial pyramid feature representation proposed by Lazebnik et al. [1].
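A minimal sketch of two of these representations, assuming `kmeans` is the vocabulary fitted above; the `sigma` value is illustrative, [3] also proposes per-descriptor-normalized variants, and the spatial pyramid of [1] concatenates such histograms computed over image sub-regions at multiple resolutions:

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

def hard_histogram(descriptors, kmeans):
    """Classic BoW: each descriptor votes for its nearest visual word."""
    words = kmeans.predict(descriptors)
    hist = np.bincount(words, minlength=kmeans.n_clusters).astype(float)
    return hist / hist.sum()  # L1-normalized histogram of visual words

def soft_histogram(descriptors, kmeans, sigma=100.0):
    """Kernel codebook (soft assignment): each descriptor spreads a
    Gaussian-weighted vote over all visual words instead of one hard vote."""
    distances = euclidean_distances(descriptors, kmeans.cluster_centers_)
    weights = np.exp(-(distances ** 2) / (2.0 * sigma ** 2))
    hist = weights.sum(axis=0)
    return hist / hist.sum()
```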
The objectives of this study are to compare the performance of the different classifiers and image representations and to reproduce the results obtained by Van Gemert et al. [3] and Lazebnik et al. [1] on the 15-Scenes dataset.
For further details on the specific feature extraction techniques used and the machine learning algorithms implemented, as well as the results obtained with them, please refer to the official report. For a description of the implementation of the BoW classifier, read instead the dedicated notebook.
The project is structured as follows:
├── 🐍 cv-conda.yaml # Conda environment
├── 📁 datasets # Datasets folder
│ ├── test
│ └── train
├── 📁 doc # Project assignment
├── ⚜️ LICENSE # License file
├── 📓 notebooks # Jupyter Notebooks
│ ├── bow-classifier.ipynb
│ ├── results-plots.ipynb
│ └── utils.py
├── 📜 README.md # This README file
└── 📁 report # Report folder
├── images
├── main.tex
└── ...
In particular, the notebooks/ folder contains the following files:
- bow-classifier.ipynb: the main notebook, containing a step-by-step description of the implementation of the BoW classifier
- results-plots.ipynb: a notebook containing the code to generate the plots and the final results presented in the report
- utils.py: a Python script containing the utility functions implemented for this project
The project is developed in Python and mostly requires the following libraries:
- `numpy`, version `1.26.4`
- `opencv`, version `4.10.0`
- `scikit-learn`, version `1.5.2`
- `tqdm`, version `4.67.0`
All the necessary libraries can be easily installed using the `pip` package manager.
Additionally, a conda environment yaml file containing all the necessary libraries for the project is provided in the root folder. To create the environment, you need a working `conda` installation; then run the following command:
`conda env create -f cv-conda.yaml`
After the environment has been created, you can activate it with:
`conda activate cv`
For a detailed step-by-step description of the main tasks performed for this project and the implementation of the BoW classifier, please refer to the dedicated notebook.
The goal of this repository was to implement a classifier based on the Bag of Visual Words approach and to reproduce the results presented in the referenced papers, in the context of a university exam project. However, if you have a suggestion that would make this better or extend its functionality and want to share it with me, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement" or "extension".
Suggested contribution procedure:
- Fork the Project
- Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
- Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the Branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
Distributed under the MIT License. See LICENSE for more information.
[1] S. Lazebnik, C. Schmid, J. Ponce, "Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories", 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), Volume 2, 2006, Pages 2169-2178, https://doi.org/10.1109/CVPR.2006.68
[2] L. Fei-Fei, P. Perona, "A Bayesian hierarchical model for learning natural scene categories", 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), Volume 2, 2005, Pages 524-531, https://doi.org/10.1109/CVPR.2005.16
[3] J.C. van Gemert, J.-M. Geusebroek, C.J. Veenman, A.W.M. Smeulders, "Kernel Codebooks for Scene Categorization", Computer Vision -- ECCV 2008, 2008, Springer Berlin Heidelberg, Pages 696-709, https://doi.org/10.1007/978-3-540-88693-8_52
[4] C.-C. Chang, C.-J. Lin, "LIBSVM: A library for support vector machines", ACM Transactions on Intelligent Systems and Technology (TIST), Volume 2, Number 3, 2011, Pages 1-27, https://doi.org/10.1145/1961189.1961199
[5] P. J. Rousseeuw, "Silhouettes: A graphical aid to the interpretation and validation of cluster analysis", Journal of Computational and Applied Mathematics, Volume 20, 1987, Pages 53-65, https://doi.org/10.1016/0377-0427(87)90125-7
[6] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, "Scikit-learn: Machine learning in Python", Journal of Machine Learning Research, Volume 12, 2011, Pages 2825-2830, https://www.jmlr.org/papers/volume12/pedregosa11a/pedregosa11a.pdf
- Computer Vision and Pattern Recognition course material (UniTS, Fall 2024) (access restricted to UniTS students and staff)
- Best-README-Template: for the README template
- Flaticon: for the icons used in the README