This repository is the implementation for the video description task introduced in the paper Describing Unseen Videos via Multi-Modal Cooperative Dialog Agents. Our code is based on AudioVisualSceneAwareDialog (Hori et al.) and Baseline on AVSD (Schwartz et al.); we thank the authors of these prior works for sharing their data and code.
We have published an extended version of this video description work in TPAMI with novel settings and experiments. You can check the paper Saying the Unseen: Video Descriptions via Dialog Agents. The code will be updated in this repo.
We introduce a task whose ultimate goal is for one conversational agent to describe an unseen video based on the dialog and two static frames from the video, as shown below.
- python 2.7
- pytorch 0.4.1
- Numpy
- six
- java 1.8.0
The original AVSD dataset used in our experiments can be found here.
The annotations can be downloaded here. Please extract them to `data/`.
The audio-visual features can be downloaded here. Please extract them to `data/charades_features`.
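After extraction, it may help to verify that the feature files load correctly. The snippet below is a minimal sketch, assuming the archive extracts to per-video `.npy` arrays somewhere under `data/charades_features` (the directory walk makes no assumption about the exact sub-directory names, which are not specified here):

```python
import os
import numpy as np

FEATURE_ROOT = "data/charades_features"

# Collect every .npy file under the extracted feature directory and
# load one of them, just to confirm the download and extraction succeeded.
paths = []
for root, _, files in os.walk(FEATURE_ROOT):
    paths.extend(os.path.join(root, f) for f in files if f.endswith(".npy"))

print("found %d feature files" % len(paths))
if paths:
    feat = np.load(sorted(paths)[0])
    print("example feature shape: %s" % (feat.shape,))
```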
Run `./qa_run.sh` to launch the full pipeline. The code runs in four stages: evaluation tool preparation, training, inference, and score calculation. Note that to compute the SPICE scores, please follow the additional instructions from the coco-caption project.
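For reference, the score-calculation stage follows the coco-caption convention of comparing generated descriptions against ground-truth references. The sketch below is not the repo's own evaluation script; it only illustrates the `compute_score(gts, res)` interface that coco-caption scorers expose, assuming the `pycocoevalcap` package from that project is importable (the example ids and sentences are made up):

```python
# Minimal sketch of coco-caption style scoring (not the repo's evaluation code).
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider

# Both dicts map a video id to a list of tokenized sentences.
gts = {"vid1": ["a person is sitting on a couch reading a book"]}
res = {"vid1": ["a person sits on the couch and reads a book"]}

for name, scorer in [("Bleu", Bleu(4)), ("CIDEr", Cider())]:
    score, _ = scorer.compute_score(gts, res)
    print(name, score)
```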
The pretrained model is available here.
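As a rough sketch (the checkpoint's exact contents and the model class are not specified here), a downloaded PyTorch checkpoint can be inspected along these lines; the path below is hypothetical:

```python
import torch

# Hypothetical path to the downloaded checkpoint, not the release's actual file name.
CKPT_PATH = "pretrained/model.pth"

# map_location="cpu" lets the checkpoint load on machines without a GPU.
checkpoint = torch.load(CKPT_PATH, map_location="cpu")

# Inspect what was saved: either a bare state_dict or a dict wrapping one.
if isinstance(checkpoint, dict):
    print(list(checkpoint.keys())[:10])
```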
Please consider citing our papers if you find them useful.
@InProceedings{zhu2020describing,
author = {Zhu, Ye and Wu, Yu and Yang, Yi and Yan, Yan},
title = {Describing Unseen Videos via Multi-Modal Cooperative Dialog Agents},
booktitle = {The European Conference on Computer Vision (ECCV)},
year = {2020}
}
@article{zhu2021saying,
author = {Zhu, Ye and Wu, Yu and Yang, Yi and Yan, Yan},
title = {Saying the Unseen: Video Descriptions via Dialog Agents},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)},
year = {2021}
}