This repository is an implementation of Bilinear Attention Networks for the visual question answering task using the KVQA dataset.
The validation scores (accuracy, %) over 5 runs are shown as follows:
Embedding | Dimension | All | Yes/No | Number | Other | Unanswerable |
---|---|---|---|---|---|---|
Word2vec | 200 | 29.75 ± 0.28 | 72.59 | 16.94 | 17.16 | 78.74 |
GloVe | 100 | 30.93 ± 0.19 | 71.91 | 17.65 | 18.93 | 78.26 |
fastText | 200 | 30.94 ± 0.09 | 72.48 | 17.74 | 18.96 | 77.92 |
BERT | 768 | 30.56 ± 0.12 | 69.28 | 17.48 | 18.65 | 78.28 |
This repository is based on and inspired by @hengyuan-hu's work. We sincerely thank them for sharing their code.
You may need a machine with a Titan-grade GPU, 64 GB of memory, and PyTorch v1.1.0 for Python 3. We highly recommend using this Docker image.
```bash
pip install -r requirements.txt
```
Install MeCab:

```bash
sudo apt-get install default-jre curl
bash <(curl -s https://raw.githubusercontent.com/konlpy/konlpy/master/scripts/mecab.sh)
```
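To confirm that MeCab is wired up correctly, a minimal check through konlpy (the sample sentence is arbitrary) looks like this:

```python
# Quick sanity check that the MeCab tokenizer is usable through konlpy.
from konlpy.tag import Mecab

tokenizer = Mecab()
print(tokenizer.morphs("한국어 시각 질의응답"))  # expect a list of morphemes
```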
You can download the KVQA dataset via this link. Please be aware that it is licensed under the Korean VQA License.
Our implementation uses the pretrained image features from bottom-up-attention, i.e., the adaptive 10-100 features per image for the detected objects, as well as the pretrained Korean word vectors: Word2vec, GloVe, fastText, and BERT.
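As a rough sketch of how such word vectors can be inspected, the snippet below loads them with gensim; the file name is an assumption for illustration, not a file shipped with this repository:

```python
# Sketch: loading pretrained Korean word vectors with gensim.
# The file path below is hypothetical; point it at wherever you place the vectors.
from gensim.models import KeyedVectors

# Works for vectors saved in the word2vec text format (header line, then "word v1 v2 ...").
vectors = KeyedVectors.load_word2vec_format("data/ko_fasttext_200d.vec", binary=False)
print(vectors.vector_size)               # e.g. 200 for the fastText vectors above
print(vectors.most_similar("서울")[:3])  # nearest neighbours of a sample word
```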
For simplicity, you can prepare the KVQA data as follows and use the scripts below to avoid the hassle:
- Place the downloaded files from the KVQA Dataset as follows:
```
data
├── KVQA_annotations_train.json
├── KVQA_annotations_val.json
├── KVQA_annotations_test.json
└── features
    ├── KVQA_resnet101_faster_rcnn_genome.tsv
    └── VizWiz_resnet101_faster_rcnn_genome.tsv
```
Note that if you download the preprocessed features (the TSV files), you do not need to download the image sources; a sketch of how a TSV row is laid out follows the scripts below.
- Run the two scripts, `download.sh` and `process.sh`:
```bash
./tools/download.sh
./tools/process.sh
```
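For reference, each row of a bottom-up-attention TSV stores the detected boxes and their features as base64-encoded arrays. The sketch below decodes one row; the column names follow the usual bottom-up-attention convention and are an assumption here, so they may not match exactly what `process.sh` expects:

```python
# Sketch: decoding one row of a bottom-up-attention feature TSV.
# The column names follow the common bottom-up-attention convention (an assumption here).
import base64
import csv
import sys

import numpy as np

csv.field_size_limit(sys.maxsize)
FIELDNAMES = ["image_id", "image_w", "image_h", "num_boxes", "boxes", "features"]

with open("data/features/KVQA_resnet101_faster_rcnn_genome.tsv") as f:
    reader = csv.DictReader(f, delimiter="\t", fieldnames=FIELDNAMES)
    row = next(reader)
    num_boxes = int(row["num_boxes"])
    # boxes: (num_boxes, 4) corner coordinates; features: (num_boxes, 2048) for ResNet-101.
    boxes = np.frombuffer(base64.b64decode(row["boxes"]), dtype=np.float32).reshape(num_boxes, 4)
    feats = np.frombuffer(base64.b64decode(row["features"]), dtype=np.float32).reshape(num_boxes, -1)
    print(row["image_id"], boxes.shape, feats.shape)
```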
Run `python3 main.py` to start training. The training and validation scores will be printed at every epoch, and the best model will be saved under the directory `saved_models`.
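Once training finishes, the best checkpoint can be reloaded for inspection; this is only a sketch, and the file name under `saved_models` is hypothetical:

```python
# Sketch: reloading the best checkpoint after training.
# The file name under saved_models/ is hypothetical; check the directory for the actual name.
import torch

checkpoint = torch.load("saved_models/model.pth", map_location="cpu")
print(type(checkpoint))  # typically a state_dict or a dict wrapping one
```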
You can train a model with a different question embedding by running, for example:

```bash
python3 main.py --q_emb glove-rg
```
If you use this code as part of any published research, please consider citing the following papers:
```bibtex
@inproceedings{Kim_Lim2019,
  author = {Kim, Jin-hwa and Lim, Soohyun and Park, Jaesun and Cho, Hansu},
  booktitle = {AI for Social Good workshop at NeurIPS},
  title = {{Korean Localization of Visual Question Answering for Blind People}},
  year = {2019}
}

@inproceedings{Kim2018,
  author = {Kim, Jin-Hwa and Jun, Jaehyun and Zhang, Byoung-Tak},
  booktitle = {Advances in Neural Information Processing Systems 31},
  title = {{Bilinear Attention Networks}},
  pages = {1571--1581},
  year = {2018}
}
```
- Korean VQA License for the KVQA Dataset
- Creative Commons License Deed (CC BY 4.0) for the VizWiz subset
- GNU GPL v3.0 for the Code
We sincerely thank the collaborators from TestWorks for helping with the data collection.