Written in MATLAB.
This project serves as an interactive image classifier. Users can select either (1) a region/object within an image of choice, or (2) an entire image. The program then retrieves the top n = 5 most similar images to the given queries.
Similarity scores are computed using bag-of-words modeling and k-means clustering.
Results are based off of a dataset of 6,600+ distinct video frames from the American T.V. series Friends. Note: The dataset has not been provided in this repository. Please see visual demo instead.
A brief description of the terminology used can be found here.
- Sample Results
- Full-Frame Query
- Region-Based Query
- Directory Layout & Contents
- Terminology Mentioned
This program makes use of Scale-Invariant Feature Transform (SIFT) descriptors, as well as their associated images.
Sample results have been provided below for both full-frame and region-based queries.
Retrieves top n = 5 most similar video frames to selected image.
Retrieves top n = 5 most similar video frames containing queried region/object (in this example, a kitchen table, which is outlined in blue).
Query | Retrieved Images |
---|---|
Please see the sample_outputs
directory for additional examples. Its layout and contents are detailed in the next section.
This section pertains to the sample_outputs
directory. Its subdirectories and their contents are summarized below.
Subdirectory Name | Description of Contents |
---|---|
full_frames |
Sample results based on full-frame queries. |
full_frames_comparison |
Visual comparison between AlexNet Image Classification and SIFT-based descriptors. This project is based on the latter. Serves to illustrate program's accuracy/effectiveness. |
raw_matches |
Sample queried region versus computed SIFT descriptors. |
region_based |
Sample results based on region-based queries. |
visual_vocab |
Sample visual vocabulary (aka bag-of-words, where each image patch represents a "word"). |
Terminology | Description |
---|---|
Bag-of-Words (BoW) Modeling | A histogram of visual image patches/literal words within a given image/text; describes the frequency of unique (visual) words |
SIFT (algorithm/descriptors) | An abbreviation for Scale-Invariant Feature Transform; describes local, unique features within images |
AlexNet | A well-known Computer Vision application designed by Alex Krizhevsky that detected and classified objects |