Dense captioning with joint inference and visual context

VL-Group/densecap

 
 

Run

Run from the pre-built, modified Docker image.

Make sure nvidia-docker2 is installed and the default Docker runtime is set to nvidia.

# Assumes /HOST_DIR on the host contains a.jpg and dense_cap_late_fusion_sum.caffemodel
docker run -v /HOST_DIR:/mnt -it --gpus all zhongbazhu/densecap /bin/bash
# Then, inside the container:
python ./lib/tools/demo.py --image /mnt/a.jpg --gpu 0 --net /mnt/dense_cap_late_fusion_sum.caffemodel

Compile Manually

Use the latest caffe

make
cd lib
make
cd ../python
make

Dense Captioning with Joint Inference and Visual Context

This repo contains the released code for the dense image captioning models described in the CVPR 2017 paper:

 @InProceedings{CVPR17,
  author       = "Linjie Yang and Kevin Tang and Jianchao Yang and Li-Jia Li",
  title        = "Dense Captioning with Joint Inference and Visual Context",
  booktitle    = "IEEE Conference on Computer Vision and Pattern Recognition (CVPR)",
  month        = "Jul",
  year         = "2017"
}

All code is provided for research purposes only and without any warranty. Any commercial use requires our consent. When using the code in your research work, please cite the above paper. Our code is adapted from the popular Faster-RCNN repo written by Ross Girshick, which is based on the open source deep learning framework Caffe. The evaluation code is adapted from COCO captioning evaluation code.

Compiling

Compile Caffe

Follow the official Caffe installation guide. CUDA 7.5+ and cuDNN 5.0+ are supported. Tested on Ubuntu 14.04.

Compile local libraries

cd lib
make

Demo

Download the official sample model here. This model is the Twin-LSTM with late context fusion (fused by summation) described in the paper. To test the model, run the following command from the library root folder.

python ./lib/tools/demo.py --image [IMAGE_PATH] --gpu [GPU_ID] --net [MODEL_PATH]

It will generate a folder named "demo" in the library root. Inside the "demo" folder, there will be an HTML page showing the predicted results.
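For scripted use, the demo invocation above can be assembled programmatically. The following is a minimal sketch, assuming the command is launched from the library root. The helper names `build_demo_command` and `run_demo` are hypothetical and not part of this repo; only the `--image`, `--gpu`, and `--net` flags shown above are used.

```python
import subprocess


def build_demo_command(image_path, model_path, gpu_id=0):
    """Return the argument list for the demo command shown above.

    Hypothetical helper, not part of the repo. It only uses the
    --image, --gpu and --net flags documented in this README.
    """
    return [
        "python", "./lib/tools/demo.py",
        "--image", str(image_path),
        "--gpu", str(gpu_id),
        "--net", str(model_path),
    ]


def run_demo(image_path, model_path, gpu_id=0, library_root="."):
    """Run the demo from library_root (assumed to be the repo checkout)."""
    cmd = build_demo_command(image_path, model_path, gpu_id)
    # cwd is set so the relative script path resolves against the library root.
    subprocess.check_call(cmd, cwd=library_root)
```

Calling `run_demo("/mnt/a.jpg", "/mnt/dense_cap_late_fusion_sum.caffemodel")` would then be equivalent to typing the command by hand, with the results written to the "demo" folder as described above.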

Training

Data preparation

For model training, download the Visual Genome dataset from the Visual Genome website; either version 1.0 or 1.2 is fine. Download the pre-trained VGG16 model from link. Then modify the data paths in models/dense_cap/preprocess.py and run it from the library root to generate the training, validation, and testing data.

Start training

Run models/dense_cap/dense_cap_train.sh to start training. For example, to train a model with joint inference and visual context (late fusion, feature summation) on Visual Genome 1.0:

./models/dense_cap/dense_cap_train.sh [GPU_ID] visual_genome late_fusion_sum [VGG_MODEL_PATH] 

Training typically takes 3 days. Note that, due to a limitation of Python, multi-GPU training is not available for this library. This library provides only the Twin-LSTM structure for joint inference and late fusion for context fusion (with three fusion operators: summation, multiplication, and concatenation). Other structures described in the paper can be implemented by adapting the existing code.
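To make the three late-fusion operators concrete, here is a small pure-Python illustration of summation, multiplication, and concatenation applied to a local (region) feature and a global (context) feature. This is only a sketch: the repo implements fusion as Caffe layers in the prototxt files, and the function name `late_fusion` and the operator keys "sum", "mul", and "concat" are assumptions made for this example.

```python
def late_fusion(local_feat, global_feat, op="sum"):
    """Combine a local (region) feature with a global (context) feature.

    Illustration only: the repo implements fusion as Caffe layers, and the
    operator names used here ("sum", "mul", "concat") are assumptions.
    """
    if op == "concat":
        # Concatenation keeps both vectors, so their lengths may differ.
        return list(local_feat) + list(global_feat)
    if len(local_feat) != len(global_feat):
        raise ValueError("element-wise fusion needs equal-length features")
    if op == "sum":
        return [a + b for a, b in zip(local_feat, global_feat)]
    if op == "mul":
        return [a * b for a, b in zip(local_feat, global_feat)]
    raise ValueError("unknown fusion operator: %r" % (op,))


local = [1.0, 2.0, 3.0]
context = [0.5, 0.5, 2.0]
print(late_fusion(local, context, "sum"))     # [1.5, 2.5, 5.0]
print(late_fusion(local, context, "mul"))     # [0.5, 1.0, 6.0]
print(late_fusion(local, context, "concat"))  # [1.0, 2.0, 3.0, 0.5, 0.5, 2.0]
```

Summation and multiplication keep the feature dimension unchanged, while concatenation doubles it, so any layer consuming the fused feature has to be sized accordingly.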

Evaluation

Modify models/dense_cap/dense_cap_test.sh according to the model you want to test. For example, if you want to test the provided sample model, it will look like this:

GPU_ID=0
NET_FINAL=models/dense_cap/dense_cap_late_fusion_sum.caffemodel
TEST_IMDB="vg_1.0_test"
PT_DIR="dense_cap"
time ./lib/tools/test_net.py --gpu ${GPU_ID} \
  --def_feature models/${PT_DIR}/vgg_region_global_feature.prototxt \
  --def_recurrent models/${PT_DIR}/test_cap_pred_context.prototxt \
  --def_embed models/${PT_DIR}/test_word_embedding.prototxt \
  --net ${NET_FINAL} \
  --imdb ${TEST_IMDB} \
  --cfg models/${PT_DIR}/dense_cap.yml \

The sample model achieves an mAP of around 9.05. Apart from the model path (NET_FINAL), the only setting you should change is def_recurrent: use models/${PT_DIR}/test_cap_pred_no_context.prototxt for models without context information and models/${PT_DIR}/test_cap_pred_context.prototxt for models with context fusion. To test late fusion models with other fusion operators, modify test_cap_pred_context.prototxt and change the "local_global_fusion" layer to eltwise multiplication or concatenation accordingly.

To visualize the results, add --vis to the end of the above script. It will generate an HTML page for each image under the folder output/dense_cap/${TEST_IMDB}/vis.
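The choice of def_recurrent and the optional --vis flag can also be captured in a short script. The sketch below mirrors the shell snippet above, using only the flags and file names that appear in it. The helper names `recurrent_prototxt` and `build_test_command` are hypothetical and not part of the repo.

```python
def recurrent_prototxt(use_context, pt_dir="dense_cap"):
    """Pick the recurrent-net definition for models with or without context."""
    if use_context:
        name = "test_cap_pred_context.prototxt"
    else:
        name = "test_cap_pred_no_context.prototxt"
    return "models/%s/%s" % (pt_dir, name)


def build_test_command(net_final, use_context, gpu_id=0,
                       test_imdb="vg_1.0_test", pt_dir="dense_cap", vis=False):
    """Return the argument list for test_net.py, as in dense_cap_test.sh.

    Hypothetical helper: defaults are copied from the example script above.
    """
    cmd = [
        "./lib/tools/test_net.py", "--gpu", str(gpu_id),
        "--def_feature", "models/%s/vgg_region_global_feature.prototxt" % pt_dir,
        "--def_recurrent", recurrent_prototxt(use_context, pt_dir),
        "--def_embed", "models/%s/test_word_embedding.prototxt" % pt_dir,
        "--net", net_final,
        "--imdb", test_imdb,
        "--cfg", "models/%s/dense_cap.yml" % pt_dir,
    ]
    if vis:
        # Writes per-image HTML pages under output/dense_cap/<TEST_IMDB>/vis.
        cmd.append("--vis")
    return cmd
```

The resulting list can be passed to `subprocess.check_call` from the library root. Note that the fusion operator itself is not a command-line option; it is still set by editing the "local_global_fusion" layer in the prototxt.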

Contact

If you have any questions about the repo, please email Linjie Yang (yljatthu@gmail.com).
