Hi👋! This is a PyTorch implementation of the explainability system proposed in A Study on Multimodal and Interactive Explanations for Visual Question Answering (Alipour et al., 2020). We adapted their experiments to obtain interactive multimodal explanations, based on text and images, on the Easy-VQA dataset.
Our adaptation is shown below. We use a Conditioned U-Net to combine the visual information with textual information from the question words. Its output is a mask that is applied to the original input image, which is then fed to a convolutional classifier with text conditioning.
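As a rough illustration of that pipeline, here is a minimal PyTorch sketch: a question embedding conditions a small encoder-decoder (a toy stand-in for the Conditioned U-Net, using FiLM-style conditioning as one possible choice) that emits a spatial mask; the mask gates the input image before a small text-conditioned classifier. All layer sizes and names here are hypothetical; the actual architecture lives in `modules/` and `model.py`.

```python
import torch
import torch.nn as nn

class MaskedVQAClassifier(nn.Module):
    """Toy sketch of the masking pipeline described above.

    Hypothetical sizes throughout; the repository's modules/ code
    defines the real architecture.
    """

    def __init__(self, vocab_size=30, embed_dim=16, num_answers=13):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, embed_dim)  # bag of question words
        # Stand-in for the Conditioned U-Net: image features + text -> mask
        self.enc = nn.Conv2d(3, 8, 3, padding=1)
        self.film = nn.Linear(embed_dim, 8 * 2)  # FiLM-style conditioning (assumption)
        self.dec = nn.Conv2d(8, 1, 3, padding=1)
        # Convolutional classifier on the masked image, again text-conditioned
        self.cls_conv = nn.Conv2d(3, 8, 3, stride=2, padding=1)
        self.head = nn.Linear(8 + embed_dim, num_answers)

    def forward(self, image, question_ids):
        t = self.embed(question_ids)                       # (B, embed_dim)
        h = torch.relu(self.enc(image))                    # (B, 8, H, W)
        gamma, beta = self.film(t).chunk(2, dim=1)         # condition on the question
        h = h * gamma[:, :, None, None] + beta[:, :, None, None]
        mask = torch.sigmoid(self.dec(h))                  # (B, 1, H, W) explanation mask
        masked = image * mask                              # apply mask to the input image
        z = torch.relu(self.cls_conv(masked)).mean(dim=(2, 3))  # (B, 8)
        logits = self.head(torch.cat([z, t], dim=1))
        return logits, mask

model = MaskedVQAClassifier()
img = torch.randn(2, 3, 64, 64)
q = torch.randint(0, 30, (2, 5))
logits, mask = model(img, q)
print(logits.shape, mask.shape)  # torch.Size([2, 13]) torch.Size([2, 1, 64, 64])
```

The mask is the part that serves as the visual explanation: because the classifier only sees the masked image, the mask must highlight the regions relevant to answering the question.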
Installation: install the libraries listed in requirements.txt to train and deploy our system:
    pip3 install -r requirements.txt
Before training or predicting with our system, you must download the Easy-VQA dataset: the folder easy_vqa/data must be placed at the root of this repository and renamed easy-vqa. Alternatively, run the following terminal commands:
    wget https://github.com/vzhou842/easy-VQA/archive/refs/heads/master.zip
    unzip master.zip
    mv easy-VQA-master/easy_vqa/data/ easy-vqa
    rm -r easy-VQA-master/
    rm master.zip
The folder structure should look like this:
    easy-vqa/
        test/
        train/
        answers.txt
    modules/
    utils/
    model.py
    system.py
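To catch path mistakes early, a small check like the following can verify that the dataset landed where the code expects it (this helper is a convenience sketch, not part of the repository):

```python
from pathlib import Path

def check_dataset(root="."):
    """Return any expected Easy-VQA dataset paths missing under `root`.

    Convenience helper, not part of the repository's code.
    """
    expected = ["easy-vqa/train", "easy-vqa/test", "easy-vqa/answers.txt"]
    return [p for p in expected if not (Path(root) / p).exists()]

missing = check_dataset()
print("Dataset OK" if not missing else f"Missing: {', '.join(missing)}")
```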
Before deploying the system, the neural network must be trained. We suggest running the system.py script directly to train the model (some parameters, such as the batch size or the number of epochs, can be configured). With the default configuration the model should reach an F-score of about 90%. At the end of training, the system parameters are stored in the results/ folder.
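Restoring the stored parameters for prediction follows the standard PyTorch state-dict round trip, sketched below with a stand-in network. The checkpoint filename is hypothetical; check system.py for the name actually written under results/.

```python
import torch
import torch.nn as nn

# Stand-in for the trained network; "model.pt" is a hypothetical
# filename -- system.py defines the real checkpoint path under results/.
model = nn.Linear(4, 2)
torch.save(model.state_dict(), "model.pt")   # what training does at the end

restored = nn.Linear(4, 2)                   # must match the saved architecture
restored.load_state_dict(torch.load("model.pt"))
restored.eval()                              # inference mode (dropout/BN frozen)
```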
See the prepared demo.ipynb for documented examples of how to work with interactive multimodal explanations.