A Markov chain that describes an image based on extracted features, in this case the object categories and their locations.
This is my code for the 10-701 Introduction to Machine Learning course project titled "Markov vs Neural Network: A Comparative Study of Classic and Modern Models for Image Captioning". You can find the project report here.
The model selects the next best word (or sequence of words when the beam width is greater than 1) by maximizing the conditional probability of the word given its previous few words (the Markov assumption) together with the conditional probability of the word given all provided features.
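The scoring step above can be sketched as follows. This is a minimal illustration, not the repository's actual API: the data structures (`ngram_probs`, `feature_probs`) and the way the two conditional probabilities are combined (a product, i.e. a sum of log-probabilities) are assumptions for the sake of the example.

```python
import math

NGRAM_N = 3  # illustrative n-gram size (the project's default is 4)

def score_next_words(history, features, ngram_probs, feature_probs, vocab):
    """Score each candidate next word as
        log P(w | last NGRAM_N - 1 words) + sum over features f of log P(w | f).
    All structures here are illustrative, not the repo's actual API."""
    context = tuple(history[-(NGRAM_N - 1):])  # Markov assumption: fixed-size history
    eps = 1e-12  # floor for unseen (context, word) pairs
    scores = {}
    for w in vocab:
        s = math.log(ngram_probs.get(context, {}).get(w, eps))
        for f in features:
            s += math.log(feature_probs.get(f, {}).get(w, eps))
        scores[w] = s
    return scores

# Toy example: two detected objects, "dog" in grid cell 0 and "ball" in cell 3.
vocab = ["dog", "ball", "runs"]
ngram_probs = {("a", "dog"): {"runs": 0.7, "ball": 0.2, "dog": 0.1}}
feature_probs = {("dog", 0): {"dog": 0.5, "runs": 0.4, "ball": 0.1},
                 ("ball", 3): {"ball": 0.6, "runs": 0.3, "dog": 0.1}}
scores = score_next_words(["a", "dog"], [("dog", 0), ("ball", 3)],
                          ngram_probs, feature_probs, vocab)
best = max(scores, key=scores.get)  # "runs"
```

With a beam width greater than 1, the top-scoring extensions of each candidate are kept instead of only the single best word.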
- `numpy`
- `sparse`
- `DataLoader.py`: helper functions to load training captions and encode object category and location information to train and test the Markov-based model
- `gen_test_captions.py`: automation script that runs a trained Markov-based model on the Karpathy offline test or validation split, with sentence-generation parameters provided through the command line; stores the generated captions in a JSON file for scoring
- `Heatmap.py`: borrowed script that shows the object-word location heatmap for a trained Markov-based model
- `MarkovCaptioner.py`: the core encapsulated `MarkovCaptioner` module for training and testing the Markov-based model; also defines the `BeamSearchCandidate` class used for beam search during sentence generation
- `train_markov.py`: automation script that trains a Markov-based model with training parameters provided through the command line and serializes the trained model to a file on disk
- `Utility.py`: utility functions and constants
- training
  - `ngram_n`: n-gram size used to train the Markov chain
  - `grid_size`: object locations are encoded on an n×n grid; this controls n
- sentence generation
  - `sentence_length_limit`: sentence cutoff length
  - `beam_width`: beam width for beam search
  - `decay_factor`: incremental penalty for generating the same word again; this reduces rambling
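The `grid_size` encoding can be illustrated with a small helper that maps an object's bounding-box center to a grid cell. This is a hypothetical sketch of the idea; `DataLoader.py` may encode locations differently.

```python
def encode_location(bbox, image_w, image_h, grid_size=2):
    """Map an object's bounding-box center to a cell index on a
    grid_size x grid_size grid laid over the image.
    bbox = (x, y, w, h) in pixels. Illustrative helper only."""
    x, y, w, h = bbox
    cx, cy = x + w / 2, y + h / 2
    col = min(int(cx / image_w * grid_size), grid_size - 1)
    row = min(int(cy / image_h * grid_size), grid_size - 1)
    return row * grid_size + col  # cell index in [0, grid_size**2)

# On a 640x480 image with grid_size = 2, an object in the top-left lands
# in cell 0 and one in the bottom-right lands in cell 3.
top_left = encode_location((0, 0, 100, 100), 640, 480, 2)       # 0
bottom_right = encode_location((500, 300, 100, 100), 640, 480, 2)  # 3
```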
`ngram_n = 4`, `grid_size = 2`, `sentence_length_limit = 16`, `beam_width = 20`, `decay_factor = 1e-2`
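One way to realize the `decay_factor` repetition penalty is to subtract a small amount from a word's score for each time it already appears in the partial caption. This is a sketch of the idea, not necessarily the exact formula used in `MarkovCaptioner.py`.

```python
def decayed_score(base_log_prob, word, generated, decay_factor=1e-2):
    """Apply an incremental repetition penalty: each prior occurrence of
    `word` in the partial caption subtracts decay_factor from its score.
    Illustrative only; the repo's exact penalty may differ."""
    return base_log_prob - decay_factor * generated.count(word)

# "dog" already appears twice, so its score drops by 2 * decay_factor.
penalized = decayed_score(-1.0, "dog", ["a", "dog", "and", "a", "dog"])
```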
With better features extracted from the images, the model might perform better than the results reported here.