# Image Captioning with Deep Learning

This repository is an unofficial PyTorch implementation of the paper *Show and Tell: A Neural Image Caption Generator*.

After cloning this repository, please refer to `instructions.md` for a quick setup so that you can use our code.

## Training details

As described in the paper, the model consists of a CNN acting as the encoder and an RNN acting as the decoder. For the CNN we tried VGG19, Inception v3, and ResNet34; the differences in results were insignificant. For the RNN we use an LSTM, as in the paper.
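
As a rough illustration of this encoder-decoder setup, here is a minimal sketch (not the exact code in this repository; `embed_size`, `hidden_size`, and `vocab_size` are placeholder parameters):

```python
import torch
import torch.nn as nn
import torchvision.models as models

class EncoderCNN(nn.Module):
    """ResNet34 backbone (one of the CNNs we tried) mapping an image to a fixed-size embedding."""
    def __init__(self, embed_size):
        super().__init__()
        resnet = models.resnet34(pretrained=True)
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop the final FC layer
        self.fc = nn.Linear(resnet.fc.in_features, embed_size)

    def forward(self, images):
        with torch.no_grad():                    # keep the pretrained CNN frozen
            features = self.backbone(images)
        return self.fc(features.flatten(1))      # (batch, embed_size)

class DecoderRNN(nn.Module):
    """Multi-layer LSTM generating a caption conditioned on the image embedding."""
    def __init__(self, embed_size, hidden_size, vocab_size, num_layers=2, dropout=0.7):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers,
                            batch_first=True, dropout=dropout)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # Prepend the image embedding as the first "token", as in Show and Tell.
        inputs = torch.cat([features.unsqueeze(1), self.embed(captions)], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.fc(hidden)                   # (batch, seq_len + 1, vocab_size)
```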

Hyperparameters are chosen as follows (a sketch of how they could be wired into a training loop follows the list):

- `lr=1e-3`
- `batch_size=128`
- `lr-decay-rate=0.99` (exponential decay, applied every 2000 iterations)
- `nb-epochs=100`
- `nb-of-LSTM-layers=2`
- `LSTM-Dropout-rate=0.7`
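
To make the schedule concrete, here is a hedged sketch of a training loop using these values and the `EncoderCNN`/`DecoderRNN` classes sketched above; the optimizer choice (Adam), `train_loader`, and the concrete layer sizes are assumptions, not values taken from this repository:

```python
import torch

# Hypothetical wiring of the hyperparameters listed above.
encoder = EncoderCNN(embed_size=256)
decoder = DecoderRNN(embed_size=256, hidden_size=512, vocab_size=10000,
                     num_layers=2, dropout=0.7)
params = list(decoder.parameters()) + list(encoder.fc.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)   # optimizer choice is an assumption

# Exponential decay by a factor of 0.99, applied every 2000 iterations.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2000, gamma=0.99)

for epoch in range(100):                        # nb-epochs=100
    for images, captions in train_loader:       # batch_size=128 set in the DataLoader
        features = encoder(images)
        logits = decoder(features, captions[:, :-1])
        loss = torch.nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), captions.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()                        # stepped per iteration, not per epoch
```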

For evaluation metrics, we use BLEU-n for n = 1, 2, 3, 4. The evaluation strategy is beam search: the image is fed to the CNN, the resulting embedding is passed to the RNN, the first token is sampled from the output probability vector and fed back into the RNN to produce the second token, and so on. The procedure terminates when the END token is generated or the sequence length exceeds a limit. (As is typical for NLP tasks, training and evaluation use different strategies and metrics.)
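
A minimal sketch of the decoding loop described above, taking the most probable token at each step (effectively beam search with a beam width of 1); it reuses the `DecoderRNN` attributes from the earlier sketch, and `end_id` and `max_len` are placeholders rather than names from this repository:

```python
import torch

@torch.no_grad()
def generate_caption(encoder, decoder, image, end_id, max_len=20):
    """Feed the image embedding, then feed each predicted token back into the LSTM."""
    features = encoder(image.unsqueeze(0))        # (1, embed_size)
    inputs, states = features.unsqueeze(1), None  # image embedding is the first input
    tokens = []
    for _ in range(max_len):
        hidden, states = decoder.lstm(inputs, states)
        logits = decoder.fc(hidden.squeeze(1))    # (1, vocab_size)
        next_token = logits.argmax(dim=-1)        # most probable token
        if next_token.item() == end_id:           # stop at the END token
            break
        tokens.append(next_token.item())
        inputs = decoder.embed(next_token).unsqueeze(1)
    return tokens
```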

With all these settings, we achieved good BLEU scores on the COCO dataset: BLEU-1 = 64 and BLEU-4 = 18 (max = 100). These are not yet comparable with the paper: our BLEU-1 matches, but the paper reports 27 for BLEU-4, which is much higher. Reaching this score takes several hours of training, about 60 epochs.

Some automatically generated captions:

Good results.

Results with minor errors.

Somewhat relevant.

Bad results.