A benchmark focused on the real-world robustness of LMMs
**Why are LMMs excellent in benchmarks but limited in the real world?** Robustness is a crucial factor. In experiments, LMMs usually receive high-quality images, but real-world scenarios contain numerous corruptions, such as object motion, lens blur, etc. If the robustness issue of LMMs can be solved, they may become as widely used as single-modality LLMs, bringing far greater convenience to daily human life. Therefore, we have established R-Bench to evaluate the robustness of LMMs in the real world. R-Bench aims to test the resistance of different LMMs to corruptions and to identify the corruptions that most significantly affect LMM performance, thereby pointing out optimization directions for future LMMs and helping them adapt to real-world images.
- [2024/10/12] 🔥 Add support for OpenCompass. Test your LMM's robustness on the MCQ task with one command.
- [2024/10/10] 🔥 Release the technical report for R-Bench.
- [2024/10/9] 🔥 GitHub repo for R-Bench is online! Dataset Download
Reference Image: The selection of references is based on three principles: (1) Diversity: The data must contain different subjects, backgrounds, styles, etc. (2) Reality: The images must come from natural scenes, such as UGC taken by average users. (3) Quality: As high-quality reference information, the images must not already be distorted.
Distorted Image: We consider 33 common real-world corruption scenarios as dimensions for our benchmark. They are organized into: (1) 7 steps, from capturing to receiving; (2) 7 groups, from a low-level vision perspective; (3) 3 levels of corruption strength. A toy example of applying one corruption at the three strength levels is sketched below.
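The sketch below is illustrative only and is not the pipeline used to build R-Bench; the corruption type (lens blur) and the blur radii are hypothetical values chosen purely for demonstration of the three strength levels.

```python
# Illustrative only: simulate one corruption type (lens blur) at three strengths.
# The actual R-Bench distortion pipeline may use different operators and parameters.
from PIL import Image, ImageFilter

STRENGTHS = {'low': 1.5, 'mid': 3.0, 'high': 6.0}  # hypothetical blur radii

def distort(ref_path, strength):
    """Apply Gaussian blur to a reference image at the given strength level."""
    image = Image.open(ref_path).convert('RGB')
    return image.filter(ImageFilter.GaussianBlur(radius=STRENGTHS[strength]))

# Example: generate all three distorted versions of one reference image.
# for level in STRENGTHS:
#     distort('reference.jpg', level).save(f'reference_{level}.jpg')
```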
Robustness can be divided into absolute and relative aspects. Absolute robustness refers to the performance that LMMs exhibit on distorted images alone, while relative robustness measures whether LMM outputs remain stable between reference and distorted images.
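A minimal sketch of the two notions, assuming each answer has already been scored in [0, 1] on both the reference and the distorted image; the exact formulas used by R-Bench are defined in the paper:

```python
# Sketch only: the precise definitions may differ from the paper's formulation.
def absolute_robustness(dis_scores):
    """Average performance on distorted images only."""
    return sum(dis_scores) / len(dis_scores)

def relative_robustness(ref_scores, dis_scores):
    """How stable performance stays when the reference image is distorted."""
    stability = [1 - abs(r - d) for r, d in zip(ref_scores, dis_scores)]
    return sum(stability) / len(stability)
```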
GPT4o is superior to the other models at every distortion step, with an overwhelming advantage in absolute robustness and a slight lead in relative robustness. The open-source LMMs InternLM-XComposer2 and InternVL2 perform relatively well and can surpass proprietary LMMs (except GPT4o) in some dimensions. Most LMMs score lower in the first two steps and relatively higher in the last five.
Absolute | MCQ low | MCQ mid | MCQ high | VQA low | VQA mid | VQA high | CAP low | CAP mid | CAP high | Overall |
---|---|---|---|---|---|---|---|---|---|---|
GPT4o | 0.8176 | 0.7744 | 0.7391 | 0.7184 | 0.7291 | 0.6898 | 0.4235 | 0.4200 | 0.3997 | 0.6348 |
GPT4Turbo | 0.7059 | 0.6398 | 0.6220 | 0.7055 | 0.7048 | 0.6806 | 0.3698 | 0.3811 | 0.3383 | 0.5722 |
GeminiPro | 0.7529 | 0.7012 | 0.6708 | 0.6233 | 0.6315 | 0.5796 | 0.4006 | 0.4040 | 0.3734 | 0.5710 |
InternX2 | 0.7176 | 0.6770 | 0.6220 | 0.6288 | 0.6255 | 0.6180 | 0.4204 | 0.3982 | 0.3659 | 0.5638 |
InternVL2 | 0.7118 | 0.7019 | 0.6280 | 0.6442 | 0.6436 | 0.6383 | 0.3759 | 0.3669 | 0.3412 | 0.5614 |
GeminiFlash | 0.7235 | 0.6708 | 0.7073 | 0.5975 | 0.6036 | 0.5575 | 0.3840 | 0.3522 | 0.3487 | 0.5495 |
Relative | MCQ low | MCQ mid | MCQ high | VQA low | VQA mid | VQA high | CAP low | CAP mid | CAP high | Overall |
---|---|---|---|---|---|---|---|---|---|---|
GPT4o | 0.7471 | 0.6894 | 0.6159 | 0.5787 | 0.5725 | 0.5622 | 0.2274 | 0.2134 | 0.2083 | 0.4907 |
InternX2 | 0.6353 | 0.6087 | 0.5488 | 0.5038 | 0.5127 | 0.4639 | 0.2440 | 0.2317 | 0.2070 | 0.4396 |
MPlugOwl3 | 0.6087 | 0.5882 | 0.5488 | 0.5242 | 0.4877 | 0.4938 | 0.2423 | 0.2106 | 0.2205 | 0.4359 |
GPT4Turbo | 0.5941 | 0.5590 | 0.4817 | 0.5872 | 0.5575 | 0.5196 | 0.1972 | 0.1910 | 0.1836 | 0.4302 |
DeepseekVL | 0.5706 | 0.5342 | 0.4756 | 0.5384 | 0.5164 | 0.4934 | 0.2540 | 0.2341 | 0.2089 | 0.4251 |
GeminiPro | 0.6706 | 0.6211 | 0.5793 | 0.4640 | 0.4799 | 0.4510 | 0.1773 | 0.1874 | 0.1649 | 0.4219 |
Additionally, we find that proprietary models outperform open-source models but still lag significantly behind humans, meaning current LMMs are not yet ready for the real world. We therefore welcome LMM developers to join R-Bench and extend their models' real-world applications. The above is a quick look at our benchmark; please refer to our preprint for the full benchmark results.
You can evaluate your LMM with one command! Please install VLMEvalKit from OpenCompass and run:
```bash
git clone https://github.com/open-compass/VLMEvalKit.git
cd VLMEvalKit
pip install -e .
python run.py --data R-Bench-Dis --model InternVL2-1B --verbose
```
Note that this covers only the R-Bench MCQ section! For the full dataset, please use the following steps.
First, please download the dataset from ModelScope:
```python
from modelscope.msdatasets import MsDataset

ms_dataset = MsDataset.load(
    'R-Bench', namespace='lcysyzxdxc',
    subset_name='default', split='test')
```
Each instance in the dataset looks like this:
```python
{
    'name': 'MMBench_35.jpg',
    'question': "What's the function of the demonstrated object?",
    'choice': 'A.running; B.Play football; C.Play tennis; D.Play basketball',
    'answer': 'C',
    'type': 'MCQ',
    'distortion': 1,
    'strength': 2,
    'ref_image': ...,
    'dis_image': ...
}
```
Then you may define an inference function based on `Your_LMM`. It should generate answers from the `ms_dataset` above:
```python
import io

from PIL import Image

# Fields of one sample (index `num`) used by the inference function.
question = ms_dataset[num]['question']
choice = ms_dataset[num]['choice']
task = ms_dataset[num]['type']
byte = ms_dataset[num]['ref_image']['bytes']

def inference(question, choice, task, byte):
    # Build the task-specific prompt.
    if task == 'MCQ':
        prompt = question + "\n" + choice + "\nAnswer with the option's letter from the given choices directly."
    elif task == 'VQA':
        prompt = question + ". Please answer no more than 10 words"
    elif task == 'CAP':
        prompt = "Please describe this image in general. Directly provide the description, do not include prefix like 'This image depicts'"
    else:
        raise ValueError('No task named ' + task)
    # Decode the image bytes and query your model.
    image_file = io.BytesIO(byte)
    image = Image.open(image_file).convert('RGB')
    answer = Your_LMM(image=image, prompt=prompt)
    return answer
```
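A minimal sketch of a full evaluation loop built on the `inference` function above: it answers each question on both the reference and the distorted image and writes the two answer files described next. The column set is an assumption; the exact columns expected by `R-Bench-Script.ipynb` may differ.

```python
# Sketch of a full evaluation run; adapt column names to your own setup.
import pandas as pd

model_name = 'Your_LMM'  # placeholder
keys = ('name', 'question', 'choice', 'answer', 'type', 'distortion', 'strength')
records_ref, records_dis = [], []

for num in range(len(ms_dataset)):
    sample = ms_dataset[num]
    meta = {k: sample[k] for k in keys}
    args = (sample['question'], sample['choice'], sample['type'])
    # Ask the same question on the reference image and on its distorted version.
    records_ref.append({**meta, 'response': inference(*args, sample['ref_image']['bytes'])})
    records_dis.append({**meta, 'response': inference(*args, sample['dis_image']['bytes'])})

pd.DataFrame(records_ref).to_csv(model_name + '_ref.csv', index=False)
pd.DataFrame(records_dis).to_csv(model_name + '_dis.csv', index=False)
```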
Finally, you will get `model_name + '_ref.csv'` and `model_name + '_dis.csv'`. Check the `R-Bench-Script.ipynb` code for details. We strongly recommend testing in your own environment beyond this script, at your convenience.
Please use GPT-3.5 for evaluation (recommended). If you don't have an API key, you may try other LLM-assisted evaluation. Both examples are provided in `R-Bench-Script.ipynb`.
The `msg` prompts for the MCQ/VQA/CAP tasks are:
```python
if ans_file['type'][num] == 'MCQ':
    for i in range(5):
        msg = f'''You will now be provided with a question [{question}] and a set of options [{answers}] with option [{correct_ans}] being the correct answer.
Additionally, there will be an answer [{answer}] provided by a respondent. Please determine whether the respondent's answer is correct considering the context of the question.
Even if the word choice is not completely the same, you can decide based on the given options and see whether the one in the answer is close enough to the given correct answer.
The result is 1 if the answer is correct and else the result is 0. Please only provide the result in the following format: Score:'''
        ...
elif ans_file['type'][num] == 'VQA':
    for i in range(5):
        msg = f'''Given the question [{question}], evaluate whether the response [{answer}] completely matches the correct answer [{correct_ans}].
First, check the response and please rate score 0 if the response is not a valid answer.
Please rate score 2 if the response completely or almost completely matches the correct answer on completeness, accuracy, and relevance.
Please rate score 1 if the response partly matches the correct answer on completeness, accuracy, and relevance.
Please rate score 0 if the response doesn't match the correct answer on completeness, accuracy, and relevance at all.
Please only provide the result in the following format: Score:'''
        ...
elif ans_file['type'][num] == 'CAP':
    corrects = eval(correct_ans)
    for correct in corrects:
        msg = f'''Evaluate whether the sentence [{answer}] completely matches the correct answer [{correct}].
First, check the response and please rate score 0 if the response is not a valid answer.
Please rate score 2 if the response completely or almost completely matches the correct answer on completeness, accuracy, and relevance.
Please rate score 1 if the response partly matches the correct answer on completeness, accuracy, and relevance.
Please rate score 0 if the response doesn't match the correct answer on completeness, accuracy, and relevance at all.
Please only provide the result in the following format: Score:'''
        ...
```
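A minimal sketch of sending one such `msg` to GPT-3.5 and parsing the returned score, assuming the official `openai` Python client (v1 interface); the actual evaluation code lives in `R-Bench-Script.ipynb` and may differ:

```python
# Hedged sketch: query GPT-3.5 with an evaluation prompt and extract the score.
import re

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def llm_score(msg):
    """Send the evaluation prompt and parse the trailing 'Score: N' answer."""
    reply = client.chat.completions.create(
        model='gpt-3.5-turbo',
        messages=[{'role': 'user', 'content': msg}],
        temperature=0,
    ).choices[0].message.content
    match = re.search(r'Score:\s*([0-2])', reply)
    return float(match.group(1)) if match else 0.0
```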
Your final results will be two tables, representing absolute and relative robustness along multiple dimensions:

- Task: MCQ, VQA, CAP
- Strength: high, mid, low
- Step: Environment, Camera, Analog, Source, Channel, Receive, Enhance
- Group: Blur, Luminance, Chrominance, Spatial, Noise, Compression, Wild
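For illustration, one possible way to aggregate per-sample scores into such a table with pandas; the file name and the `type`, `strength`, and `score` columns are assumptions about the scored output, not guaranteed to match the notebook:

```python
# Hedged sketch: build one robustness table (task x strength) from scored answers.
import pandas as pd

scored = pd.read_csv('Your_LMM_dis_scored.csv')  # hypothetical file name
absolute_table = scored.pivot_table(index='type', columns='strength',
                                    values='score', aggfunc='mean')
print(absolute_table)
```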
Feel free to contact the R-Bench team for queries.
Chunyi Li, lcysyzxdxc@sjtu.edu.cn, @lcysyzxdxc
If you find our work interesting, please feel free to cite our paper:
```bibtex
@misc{li2024rbench,
      title={R-Bench: Are your Large Multimodal Model Robust to Real-world Corruptions?},
      author={Chunyi Li and Jianbo Zhang and Zicheng Zhang and Haoning Wu and Yuan Tian and Wei Sun and Guo Lu and Xiaohong Liu and Xiongkuo Min and Weisi Lin and Guangtao Zhai},
      year={2024},
      eprint={2410.05474},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```