MultiModal Fusion Transformer with BERT Encodings for Visual Question Answering (EMNLP 2020 Findings)

Abstract

We present MMFT-BERT (MultiModal Fusion Transformer with BERT encodings) to solve Visual Question Answering (VQA), ensuring individual and combined processing of multiple input modalities. Our approach processes multimodal data (video and text) by adopting BERT encodings for each modality individually and using a novel transformer-based fusion method to fuse them together. Our method decomposes the different modality sources into different BERT instances with similar architectures but different weights. This achieves SOTA results on the TVQA dataset. Additionally, we provide TVQA-Visual, an isolated diagnostic subset of TVQA that strictly requires knowledge of the visual (V) modality, based on a human annotator's judgment. This set of questions helps us study the model's behavior and the challenges TVQA poses that prevent the achievement of superhuman performance. Extensive experiments show the effectiveness and superiority of our method.
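To make the fusion idea concrete, here is a minimal, dependency-free sketch of attention-based fusion over per-modality vectors. It is a simplification and not the model in this repo: it uses single-head scaled dot-product attention with no learned projections, and the names `attention_fuse`, `fuse_query`, and the three toy stream vectors are assumptions made for illustration only. In the real model, each stream would be encoded by its own BERT instance first.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(fuse_query, modality_vectors):
    """Fuse modality vectors with single-head scaled dot-product attention.

    fuse_query stands in for a fusion vector (hypothetical here), and each
    entry of modality_vectors stands in for one stream's pooled encoding.
    Returns the fused vector and the attention weight given to each stream.
    """
    dim = len(fuse_query)
    scores = [
        sum(q * k for q, k in zip(fuse_query, vec)) / math.sqrt(dim)
        for vec in modality_vectors
    ]
    weights = softmax(scores)
    fused = [
        sum(w * vec[i] for w, vec in zip(weights, modality_vectors))
        for i in range(dim)
    ]
    return fused, weights


# Toy 4-dimensional stand-ins for pooled stream encodings (made-up numbers).
qa_vec = [0.2, 0.1, 0.0, 0.5]
video_vec = [0.9, 0.3, 0.1, 0.0]
subtitle_vec = [0.1, 0.8, 0.4, 0.2]
fuse_query = [1.0, 0.0, 0.0, 0.0]

fused, weights = attention_fuse(fuse_query, [qa_vec, video_vec, subtitle_vec])
print("attention weights per stream:", weights)
print("fused vector:", fused)
```

The fused vector is a weighted average of the stream vectors, so each stream is first processed on its own and then combined in a single step. A full transformer layer would add learned query, key, and value projections, multiple heads, and a feed-forward block on top of this.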

Paper

This code repository is adapted from the TVQA codebase.

Dataset

Follow the instructions in the TVQA GitHub repo to download the data. Alternatively, the processed data files are provided here. Download the files and extract them:

unzip data.zip

Copy the extracted files into the data directory.

Download the TVQA vocabulary and indexing dictionaries from here, then unzip them in the cache folder by running unzip cache.zip inside the cache/ directory.
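If `unzip` is not available, the extraction steps above can be sketched in Python using only the standard library. The helper name `extract_archive` and the archive locations (`data.zip` in the repo root, `cache.zip` inside `cache/`) are assumptions; adjust them to wherever the downloads were saved.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir):
    """Extract zip_path into dest_dir and return the archive's member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()


# Assumed locations of the downloaded archives (hypothetical, adjust as needed).
for zip_name, target in (("data.zip", "data"), ("cache/cache.zip", "cache")):
    if Path(zip_name).is_file():
        members = extract_archive(zip_name, target)
        print(f"Extracted {len(members)} entries from {zip_name} into {target}/")
    else:
        print(f"{zip_name} not found; download it first")
```

If an archive already contains a top-level folder, extracting it into `data/` or `cache/` will nest that folder one level deeper, so check the resulting layout before training.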

Training files

main_dict_multiple_losses.py is the main training script. tvqa_vqa_2bert_bertfusion_sub.py contains the model definition. tvqa_dataset_vqa_bert_attn.py contains the dataloader for the TVQA dataset.
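Before starting a run, it can help to confirm that the three files named above are present in the working directory. The following is a small sketch of such a check; the helper `missing_training_files` is hypothetical and not part of this repo.

```python
from pathlib import Path

# The three files named in this section, with the role each one plays.
TRAINING_FILES = {
    "main_dict_multiple_losses.py": "training script",
    "tvqa_vqa_2bert_bertfusion_sub.py": "model definition",
    "tvqa_dataset_vqa_bert_attn.py": "TVQA dataloader",
}


def missing_training_files(repo_root="."):
    """Return the names of expected training files not found under repo_root."""
    root = Path(repo_root)
    return [name for name in TRAINING_FILES if not (root / name).is_file()]


missing = missing_training_files(".")
if missing:
    for name in missing:
        print(f"Missing {TRAINING_FILES[name]}: {name}")
else:
    print("All training files found.")
```

The check only looks for the files. It does not import them or start training, and any command-line options accepted by main_dict_multiple_losses.py are not covered here.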

BibTeX

@misc{khan2020mmftbert,
      title={MMFT-BERT: Multimodal Fusion Transformer with BERT Encodings for Visual Question Answering}, 
      author={Aisha Urooj Khan and Amir Mazaheri and Niels da Vitoria Lobo and Mubarak Shah},
      year={2020},
      eprint={2010.14095},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

