[Paper] [Code] [Video] [DeepREAL Lab]
This repository holds the PyTorch implementation of DEAL, presented in "DEAL: Disentangle and Localize Concept-level Explanations for VLMs" by Tang Li, Mengmeng Ma, and Xi Peng. If you find our code useful in your research, please consider citing:
@inproceedings{li2024deal,
title={DEAL: Disentangle and Localize Concept-level Explanations for VLMs},
author={Li, Tang and Ma, Mengmeng and Peng, Xi},
booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
year={2024}
}
Can we trust Vision-Language Models (VLMs) in their predictions? Our findings say NO! The fine-grained visual evidence behind their predictions can be wrong! Our empirical results indicate that CLIP cannot disentangle and localize fine-grained visual evidence, and this phenomenon can be observed in many popular VLMs across different benchmark datasets. However, this issue is challenging to solve. First, human annotations for fine-grained visual evidence are missing. Second, existing VLMs align an image with its entire textual caption, without disentangling and localizing fine-grained visual evidence. To this end, we propose to Disentangle and Localize (DEAL) concept-level explanations of VLMs without relying on expensive human annotations.
- Fine-tuned on ImageNet: DEAL-ImageNet-ViT-B/32
- Fine-tuned on EuroSAT: DEAL-EuroSAT-ViT-B/32
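A minimal sketch of how such a checkpoint could be loaded, assuming it is a PyTorch state dict compatible with OpenAI's CLIP ViT-B/32 (the checkpoint filename below is hypothetical):

```python
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the base CLIP ViT-B/32 architecture and its preprocessing pipeline.
model, preprocess = clip.load("ViT-B/32", device=device)

# Overwrite the pre-trained weights with the fine-tuned DEAL checkpoint.
# "deal_imagenet_vitb32.pth" is a hypothetical filename; the checkpoint is
# assumed here to store a plain state dict.
state_dict = torch.load("deal_imagenet_vitb32.pth", map_location=device)
model.load_state_dict(state_dict)
model.eval()
```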
This repository reproduces our results on the ImageNet, CUB, EuroSAT, OxfordPets, and Food101 datasets; please download these datasets as needed. Our code is built upon Python 3 and PyTorch v2.0.1 on Ubuntu 18.04. Please install all required packages by running:
pip install -r requirements.txt
You will need to add your OpenAI API token and run the following notebook. Note that the notebook showcases our best prompt for this task; you can switch to any category list or modify the prompts as needed.
./deal/generate_descriptors.ipynb
OpenAI may update their API library; please modify the code accordingly if needed.
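As a rough illustration of what the notebook does, the snippet below queries the Chat Completions API for concept descriptors of each category; the prompt wording, model name, category list, and parsing are simplified placeholders, not our exact prompt:

```python
from openai import OpenAI  # openai >= 1.0

client = OpenAI()  # reads OPENAI_API_KEY from the environment

categories = ["golden retriever", "annual crop land"]  # any category list

descriptors = {}
for name in categories:
    # Illustrative prompt only; see generate_descriptors.ipynb for the prompt we use.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": f"List the visual features useful for recognizing a {name} "
                       f"in a photo, one short phrase per line.",
        }],
    )
    text = response.choices[0].message.content
    # Keep non-empty lines, stripping simple bullet markers.
    descriptors[name] = [line.strip("-• ").strip() for line in text.splitlines() if line.strip()]

print(descriptors)
```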
Before training, please replace the paths in load.py with the paths to your own datasets.
python train.py --dataset imagenet --model ViT-B/32 --batch_size 256 --lr 5e-7 --save_path "/path/to/save/"
Note that we use adaptive batch sizes for different datasets to alleviate ambiguity within a batch. Specifically, we use a batch size that is smaller than the number of classes in the dataset: for example, 128 for CUB, 64 for Food101, 32 for OxfordPets, and 8 for EuroSAT. We usually fine-tune for one epoch on each dataset; please adjust the number of training steps according to your batch size, as sketched below.
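A small sketch of the bookkeeping involved (the CUB training-set size used in the example is illustrative):

```python
import math

# Batch sizes used per dataset (kept below the number of classes).
batch_sizes = {"imagenet": 256, "cub": 128, "food101": 64, "oxford_pets": 32, "eurosat": 8}

def steps_for_one_epoch(num_train_samples: int, dataset: str) -> int:
    """Number of optimizer steps needed to see every training sample once."""
    return math.ceil(num_train_samples / batch_sizes[dataset])

# Example: with roughly 5,994 CUB training images and batch size 128,
# one epoch corresponds to 47 steps.
print(steps_for_one_epoch(5994, "cub"))
```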
We report results for both prediction accuracy and explanation quality.
To evaluate the prediction accuracy, please run:
./deal/evaluation.ipynb
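For intuition, classification by description (the setup this repo builds on) scores an image against each class's concept descriptors and averages the per-descriptor similarities. A minimal CLIP-based sketch, independent of the notebook and with hypothetical descriptors and image path:

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical descriptors; in practice these come from generate_descriptors.ipynb.
class_descriptors = {
    "golden retriever": ["a dog with golden fur", "floppy ears", "a long snout"],
    "tabby cat": ["a cat with striped fur", "pointed ears", "whiskers"],
}

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)

    scores = {}
    for cls, descs in class_descriptors.items():
        tokens = clip.tokenize(descs).to(device)
        text_feat = model.encode_text(tokens)
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
        # Average the image-descriptor similarities to score this class.
        scores[cls] = (image_feat @ text_feat.T).mean().item()

prediction = max(scores, key=scores.get)
print(prediction, scores)
```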
To evaluate concept-level explanation disentanglability, please run:
./deal/exp_disentanglability.ipynb
To evaluate concept-level explanation localizability (fidelity), please run:
./deal/exp_localizability.ipynb
Part of our code is borrowed from the following repositories.
- Visual Classification via Description from Large Language Models
- Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers
We thank the authors for releasing their code. Please also consider citing their works.