Code and data for the paper "Unraveling Cross-Modality Knowledge Conflicts in Large Vision-Language Models".
[Website] • [Paper] • [Dataset] • [Twitter]
Each entry in the dataset is a JSON object of the following form:
{
"id": "infoseek_val_00068862",
"entity": "Arctic hare",
"question": "What is the closest upper taxonomy of this animal?",
"image": "oven_05050169.jpg",
"multiple_choices": {
"A": "Lepus",
"B": "Marmota",
"C": "Oryctolagus",
"D": "Sylvilagus"
},
"multiple_choices_answer": "A"
}
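As a quick illustration, here is a minimal sketch of turning such an entry into a multiple-choice prompt. The file name and prompt template are illustrative assumptions, not the exact ones used by our scripts:

import json

# Illustrative only: the real data loading and prompt template live in the
# prediction scripts; adjust the path and format to your copy of the data.
with open("infoseek_val.json") as f:      # hypothetical file name
    entries = json.load(f)

def build_prompt(entry):
    # Concatenate the question with the lettered options, e.g.
    # "What is the closest upper taxonomy of this animal?\nA. Lepus\n..."
    options = "\n".join(f"{k}. {v}" for k, v in entry["multiple_choices"].items())
    return f"{entry['question']}\n{options}\nAnswer with the option's letter."

example = entries[0]
print(build_prompt(example))                       # prompt shown to the model with the image
print("gold:", example["multiple_choices_answer"]) # gold option letter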
At the time of this code release, vLLM does not support all of the VLMs used in our paper, and different transformers versions cause conflicts in the environment. We therefore provide two environments: 1) one with the latest transformers library, covering all the popular VLMs, and 2) one for the models supported by vLLM, to accelerate inference.
To install the base environment containing only the transformers lib, run the following code:
conda env create -f base.yml
To install the vllm environment, run the following code:
conda env create -f vllm.yml
To conduct the analyses in our paper, run the code in src/analysis. We provide code for answer prediction and for the contrastive metric; refer to the scripts in scripts/ for example usage.
For answer prediction, run the following command:
python $PREDICT_FILE \
--dataset $DATASET \
--model_name $MODEL_NAME \
--output_dir $OUTPUT_DIR \
--greedy \
--max_new_tokens 20 \
--is_scored
To analyze a new model: if the model is supported by vLLM, use src/analysis/predict_vllm.py for faster prediction; otherwise, implement the model in src/models/local.py and run src/analysis/predict.py.
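If you add a new locally-hosted model, a minimal sketch of what a wrapper in src/models/local.py could look like is shown below. The class name, method signature, and checkpoint are illustrative assumptions; mirror the interface of the existing wrappers in src/models/local.py.

import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

class MyNewVLM:  # hypothetical wrapper, not part of the repo
    def __init__(self, model_name="llava-hf/llava-1.5-7b-hf", device="cuda"):
        self.processor = AutoProcessor.from_pretrained(model_name)
        self.model = LlavaForConditionalGeneration.from_pretrained(
            model_name, torch_dtype=torch.float16
        ).to(device)
        self.device = device

    @torch.no_grad()
    def generate(self, prompt, image=None, max_new_tokens=20):
        # Greedy decoding, mirroring the --greedy / --max_new_tokens flags above.
        # The prompt must follow the model's own template, e.g.
        # "USER: <image>\n{question} ASSISTANT:" for LLaVA-1.5 checkpoints.
        inputs = self.processor(text=prompt, images=image, return_tensors="pt")
        inputs = inputs.to(self.device, torch.float16)
        output_ids = self.model.generate(**inputs, do_sample=False, max_new_tokens=max_new_tokens)
        return self.processor.batch_decode(output_ids, skip_special_tokens=True)[0]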
For the contrastive metric, first generate both the textual and the visual logits, then run the following command:
python src/analysis/post_hoc_contrastive_decoding_metric.py \
--dataset $DATASET \
--model_name $MODEL_NAME
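Conceptually, the metric contrasts the answer distribution conditioned on the image (visual logits) with the distribution conditioned on text only (textual logits). The sketch below uses the standard contrastive-decoding style combination purely as an illustration; the exact metric is implemented in post_hoc_contrastive_decoding_metric.py.

import torch
import torch.nn.functional as F

def contrastive_scores(visual_logits, textual_logits, alpha=1.0):
    # Standard contrastive-decoding style combination (illustrative):
    # amplify what the image-conditioned distribution prefers over the
    # text-only (parametric) distribution.
    visual_logp = F.log_softmax(visual_logits, dim=-1)
    textual_logp = F.log_softmax(textual_logits, dim=-1)
    return (1 + alpha) * visual_logp - alpha * textual_logp

# Dummy option logits over choices A-D (hypothetical values).
visual = torch.tensor([2.1, 0.3, 0.5, 0.2])   # with the image
textual = torch.tensor([0.4, 1.8, 0.6, 0.3])  # text-only
print("ABCD"[int(contrastive_scores(visual, textual).argmax())])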
To run dynamic contrastive decoding, use the following command:
python src/inference_time/post_hoc_contrastive_decoding.py \
--dataset $DATASET \
--model_name $MODEL_NAME \
--output_dir $OUTPUT_DIR \
--method dynamic
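For intuition only, one plausible dynamic scheme is sketched below: the contrastive weight is scaled by how much the two distributions disagree. This is an illustrative assumption, not necessarily the strategy implemented by --method dynamic; see post_hoc_contrastive_decoding.py for the actual method.

import torch
import torch.nn.functional as F

def dynamic_contrastive_scores(visual_logits, textual_logits, max_alpha=1.0):
    # Illustrative dynamic weighting: scale the contrastive weight by the
    # total-variation distance between the two distributions, so examples
    # with little cross-modality conflict are barely modified.
    p_v = F.softmax(visual_logits, dim=-1)
    p_t = F.softmax(textual_logits, dim=-1)
    disagreement = 0.5 * (p_v - p_t).abs().sum()   # in [0, 1]
    alpha = max_alpha * disagreement
    return (1 + alpha) * torch.log(p_v) - alpha * torch.log(p_t)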
To calculate the accuracy (Acc), run the following command:
python src/evaluate/evaluate_mc.py \
--dataset $DATASET \
--input_file $FILE_PATH
To calculate the recognized accuracy (R. Acc), run the following command:
python src/evaluate/evaluate_mc.py \
--dataset $DATASET \
--input_file $FILE_PATH \
--cleaned_model $MODEL_NAME
To calculate the flip rate (FR), run the following command:
python src/evaluate/evaluate_mc.py \
--dataset $DATASET \
--input_file $FILE_PATH \
--input_file_2 $FILE_PATH_2 \
--cleaned_model $MODEL_NAME
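All three metrics are computed by the same script with different inputs. Purely as an illustration of what they measure (the exact definitions are implemented in src/evaluate/evaluate_mc.py, and the list-based input format here is hypothetical):

def accuracy(preds, golds):
    # Acc: fraction of questions answered with the gold option letter.
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def recognized_accuracy(preds, golds, clean_correct):
    # R. Acc (illustrative reading): accuracy restricted to the questions the
    # "cleaned" model answers correctly, i.e. questions whose answer the model
    # already recognizes from its parametric knowledge.
    idx = [i for i, ok in enumerate(clean_correct) if ok]
    return sum(preds[i] == golds[i] for i in idx) / len(idx)

def flip_rate(preds_a, preds_b, clean_correct):
    # FR (illustrative reading): among the questions the cleaned model knows,
    # how often the prediction flips between the two runs being compared
    # (--input_file vs --input_file_2).
    idx = [i for i, ok in enumerate(clean_correct) if ok]
    return sum(preds_a[i] != preds_b[i] for i in idx) / len(idx)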
If you find this repo useful, please cite the following paper:
@article{zhu2024unraveling,
title={Unraveling Cross-Modality Knowledge Conflict in Large Vision-Language Models},
author={Zhu, Tinghui and Liu, Qin and Wang, Fei and Tu, Zhengzhong and Chen, Muhao},
journal={arXiv preprint arXiv:2410.03659},
year={2024}
}