A-Bench: Are LMMs Masters at Evaluating AI-generated Images?

What do we expect from LMMs as AIGI evaluators and how do they perform?

Zicheng Zhang¹^*, Haoning Wu²^*, Chunyi Li¹^*, Yingjie Zhou¹, Wei Sun¹,

Xiongkuo Min¹, Zijian Chen¹, Xiaohong Liu¹, Weisi Lin², Guangtao Zhai¹^#

¹Shanghai Jiaotong University, ²Nanyang Technological University

^*Equal contribution. ^#Corresponding author.

Paper | Project Page | Github | Data

T2I models aim to create images that accurately align with the text and showcase high perceptual quality. Therefore, the proposed A-Bench includes two parts to diagnose whether LMMs are masters at evaluating AIGIs: 1) Semantic Understanding, 2) Quality Perception.

Release

[2024/9/26] 🔥 Updated the performance of GPT and Gemini on A-Bench with their latest versions.
[2024/8/1] 🔥 A-Bench is released on VLMEvalKit; come and test your LMM with one command.
[2024/6/17] 🔥 A-Bench has joined lmms-eval, which makes it easier to test LMMs!
[2024/6/5] 🔥 We are releasing the A-Bench data and meta information on Hugging Face.
[2024/6/3] 🔥 The GitHub repo for A-Bench is online. Do you want to find out if your LMM is a master at evaluating AI-generated images? Come and test on A-Bench!

A-Bench Construction

Two key diagnostic subsets are defined: A-Bench-P1 → high-level semantic understanding, and A-Bench-P2 → low-level quality perception. For high-level semantic understanding, A-Bench-P1 targets three critical areas: Basic Recognition, Bag-of-Words Pitfalls Discrimination, and Outside Knowledge Realization, which are designed to progressively test the LMM’s capability in AIGI semantic understanding, moving from simple to complex prompt-related content. For low-level quality perception, A-Bench-P2 concentrates on Technical Quality Perception, Aesthetic Quality Evaluation, and Generative Distortion Assessment, which are designed to cover the common quality issues and AIGI-specific quality problems.

Specifically, a comprehensive dataset of 2,864 AIGIs sourced from various T2I models is compiled, including 1,408 AIGIs for A-Bench-P1 and 1,456 for A-Bench-P2. Each AIGI is paired with a question-answer set annotated by human experts. We are open to submission-based evaluation for A-Bench. Submission details are in the "Evaluate your model on A-Bench" section.

Glance at A-Bench Performance

For open-source models, LLaVA-NeXT (Qwen-110B) takes the first place. For closed-source models, GEMINI 1.5 PRO takes the first place.

A Quick Look at the A-Bench Outcomes.

Participant Name	Major↑	Minor↑	Attr.↑	N. Adj.↑	Comp.↑	Number↑	Term↑	Contra.↑	Technical↑	Aesthetic↑	Generative↑
Gemini 1.5 Pro	93.82%	95.18%	94.35%	80.27%	72.14%	79.35%	72.88%	61.56%	84.70%	71.22%	77.61%
GPT-4v	92.95%	96.00%	87.40%	82.67%	64.39%	68.84%	77.60%	66.73%	83.60%	67.82%	68.34%
GPT-4o	94.34%	95.14%	91.99%	79.54%	76.40%	73.30%	77.47%	68.59%	85.44%	70.59%	61.61%
Qwen-VL-Max	92.56%	94.75%	91.99%	85.78%	68.94%	75.85%	78.94%	65.05%	84.47%	71.31%	69.77%
Human (Worst)	95.18%	94.24%	96.78%	88.70%	85.49%	82.46%	81.76%	88.91%	92.40%	94.32%	84.49%
Human (Best)	95.40%	95.21%	99.42%	95.17%	93.34%	91.73%	84.29%	96.05%	94.02%	94.69%	86.01%

We report the performance of top-tier closed-source LMMs against humans. Two conclusions can be drawn:

LMMs excel at basic recognition tasks but tend to be less effective when it comes to nuanced semantic understanding.
LMMs are poor evaluators of image quality.
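To make these two conclusions concrete, the short sketch below recomputes, for each column, the margin between the weakest human score and the strongest LMM score, using only the numbers from the table above. The variable and function names are illustrative and are not part of the A-Bench codebase.

```python
# Scores (in %) copied from the table above; all names here are illustrative.
COLUMNS = ["Major", "Minor", "Attr.", "N. Adj.", "Comp.", "Number",
           "Term", "Contra.", "Technical", "Aesthetic", "Generative"]

LMM_SCORES = {
    "Gemini 1.5 Pro": [93.82, 95.18, 94.35, 80.27, 72.14, 79.35,
                       72.88, 61.56, 84.70, 71.22, 77.61],
    "GPT-4v": [92.95, 96.00, 87.40, 82.67, 64.39, 68.84,
               77.60, 66.73, 83.60, 67.82, 68.34],
    "GPT-4o": [94.34, 95.14, 91.99, 79.54, 76.40, 73.30,
               77.47, 68.59, 85.44, 70.59, 61.61],
    "Qwen-VL-Max": [92.56, 94.75, 91.99, 85.78, 68.94, 75.85,
                    78.94, 65.05, 84.47, 71.31, 69.77],
}

HUMAN_WORST = [95.18, 94.24, 96.78, 88.70, 85.49, 82.46,
               81.76, 88.91, 92.40, 94.32, 84.49]


def human_margin():
    """Return {column: worst-human score minus best-LMM score}, in points."""
    margins = {}
    for idx, col in enumerate(COLUMNS):
        best_lmm = max(scores[idx] for scores in LMM_SCORES.values())
        margins[col] = round(HUMAN_WORST[idx] - best_lmm, 2)
    return margins


if __name__ == "__main__":
    # Print columns from the largest human advantage to the smallest.
    ranked = sorted(human_margin().items(), key=lambda kv: kv[1], reverse=True)
    for col, margin in ranked:
        print(f"{col:<11}{margin:+.2f}")
```

On these numbers the margin is under one point for Major and negative for Minor (the best LMM edges out the weakest human), while it exceeds 20 points for Contra. and Aesthetic.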

Evaluate your model on A-Bench

With LMMs-Eval

Use LMMs-eval to automatically evaluate A-Bench:

git clone https://github.com/EvolvingLMMs-Lab/lmms-eval.git
cd lmms-eval
pip install -e .

export NUM_GPUS=8
export MODEL_NAME=idefics2
python3 -m accelerate.commands.launch --num_processes=$NUM_GPUS -m lmms_eval --model $MODEL_NAME --tasks abench_dev --batch_size 1 --log_samples --log_samples_suffix ${MODEL_NAME}_a_bench --output_path ./logs/

With VLMEvalKit

Use VLMEvalKit to automatically evaluate A-Bench:

git clone https://github.com/open-compass/VLMEvalKit.git
cd VLMEvalKit
pip install -e .

For example, to quickly test InternVL2-1B on the val and test sets of A-Bench:

python run.py --data A-Bench_VAL A-Bench_TEST --model InternVL2-1B --verbose

The val set includes the correct answers, so you can obtain accuracy results directly. For test set performance, please submit the results by e-mail (see Contact).

With `datasets` API

To evaluate your custom model, you can use our converted dataset in Hugging Face datasets format:

pip install datasets

from datasets import load_dataset

ds = load_dataset("q-future/A-Bench-HF")
ds["dev"][0]

Outputs should be as follows:

{'id': 0,
 'image': <PIL.PngImagePlugin.PngImageFile image mode=RGB size=512x288>,
 'question': 'May I ask where the scene in the picture is located?',
 'option0': 'Forest',
 'option1': 'Riverside',
 'option2': 'Desert',
 'option3': 'N/A',
 'category': 'part1 -> bag_of_words -> attribute',
 'correct_choice': 'B'}

Each item can then be evaluated using your own model's format for multiple-choice questions (MCQ).

For example, if your model follows llava's format, it should be as follows:

di = ds["dev"][0]
prompt = di["question"] + "\n"
for i in range(4):
    if di[f"option{i}"] != "N/A":
        prompt += chr(ord("A")+i) + ". " + di[f"option{i}"] + "\n"
prompt = prompt + "Answer with the option's letter from the given choices directly."

print(prompt)

The prompt for the previous data item should be:

May I ask where the scene in the picture is located?
A. Forest
B. Riverside
C. Desert
Answer with the option's letter from the given choices directly.
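The prompt construction above can be wrapped in a small helper and paired with a scorer. The following is a minimal sketch, not part of A-Bench, lmms-eval, or VLMEvalKit: `build_prompt`, `extract_choice`, and `is_correct` are hypothetical names, and the letter-extraction rule is a simple heuristic that assumes the model replies with an uppercase option letter.

```python
import re


def build_prompt(item):
    """Build an MCQ prompt from one record, skipping 'N/A' placeholder options."""
    lines = [item["question"]]
    for i in range(4):
        option = item[f"option{i}"]
        if option != "N/A":
            lines.append(f"{chr(ord('A') + i)}. {option}")
    lines.append("Answer with the option's letter from the given choices directly.")
    return "\n".join(lines)


def extract_choice(response):
    """Return the first standalone uppercase letter A-D, or None (heuristic)."""
    match = re.search(r"\b([A-D])\b", response.strip())
    return match.group(1) if match else None


def is_correct(item, response):
    """Compare the extracted letter with the record's 'correct_choice' field."""
    return extract_choice(response) == item["correct_choice"]


# Record copied from the example output above (the image field is omitted).
sample = {
    "id": 0,
    "question": "May I ask where the scene in the picture is located?",
    "option0": "Forest",
    "option1": "Riverside",
    "option2": "Desert",
    "option3": "N/A",
    "category": "part1 -> bag_of_words -> attribute",
    "correct_choice": "B",
}

if __name__ == "__main__":
    print(build_prompt(sample))
    print(is_correct(sample, "B. Riverside"))
```

Note that the heuristic would misread a free-form reply beginning with the article "A" (for example "A desert scene") as choice A, so stricter parsing may be needed for models that do not follow the answer instruction.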

Legacy

First download the dataset and meta information from Huggingface.

The imgs.zip archive contains all the AI-generated images, and Abench.json contains all the meta information, including the img_path, questions, answers, and categories. Each item in Abench.json is structured as follows:

"img_path": "part1_0000.png",
"question": "What is the color of the windows in the house in the picture?",
"answers": [
    "white",
    "yellow",
    "blue"
],
"category": "part1 -> basic_recognition -> major"

The "img_path" indicates the path to the image in imgs.zip, the "question" is a string, the "answers" is a list of answer candidates (several false answers and the correct answer).
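The "category" field encodes a three-level path such as "part1 -> basic_recognition -> major". As a minimal sketch, assuming every item follows this three-level pattern (only the two examples in this README have been checked), the hypothetical helpers below split the path and count items per part and dimension.

```python
from collections import Counter


def parse_category(category):
    """Split 'part1 -> basic_recognition -> major' into its three levels."""
    parts = [p.strip() for p in category.split("->")]
    if len(parts) != 3:
        # Assumption: all categories have exactly three levels.
        raise ValueError(f"unexpected category format: {category!r}")
    return tuple(parts)


def count_by_dimension(items):
    """Count items per (part, dimension) pair, ignoring the last level."""
    counts = Counter()
    for item in items:
        part, dimension, _ = parse_category(item["category"])
        counts[(part, dimension)] += 1
    return counts


if __name__ == "__main__":
    # The two category strings that appear in this README.
    items = [
        {"category": "part1 -> basic_recognition -> major"},
        {"category": "part1 -> bag_of_words -> attribute"},
    ]
    print(parse_category(items[0]["category"]))
    print(count_by_dimension(items))
```

With the full Abench.json loaded as a list, `count_by_dimension(data)` gives a quick overview of how the questions are distributed before running a model.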

The correct answers are kept confidential to ensure A-Bench retains its long-term value as a benchmark for assessing AIGI evaluation capabilities.

Test without API

To test with your LMM, we suggest using the following prompt:

import json
import os

IMG_DIR = "path-to-imgs"  # directory where imgs.zip was extracted


def LMM(image_file, message):
    # Placeholder: replace with your own model's inference call.
    raise NotImplementedError("plug in your LMM here")


with open("Abench.json", "r") as f:
    data = json.load(f)

for item in data:
    image_file = os.path.join(IMG_DIR, item["img_path"])
    message = item["question"] + "\n"
    for choice, ans in zip(["A.", "B.", "C.", "D."], item["answers"]):
        message += f"{choice} {ans}\n"
    message = message + "Answer with the option's letter from the given choices directly."
    print(message)

    # What is the color of the windows in the house in the picture?
    # A. white
    # B. yellow
    # C. blue
    # Answer with the option's letter from the given choices directly.

    # do your test here
    response = LMM(image_file, message)
    item["response"] = response
    # Append one JSON record per line to results.jsonl
    with open("results.jsonl", "a") as wf:
        json.dump(item, wf)
        wf.write("\n")

After finishing validation, you can submit the results via e-mail to get your LMM's results on A-Bench!
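Before sending results.jsonl, it may help to sanity-check the file. The sketch below is not an official validator: `check_results` is a hypothetical helper, and the rule that each response should begin with one of the valid option letters is an assumption based on the prompt used above, not a documented submission requirement.

```python
import json

LETTERS = "ABCD"


def check_results(path):
    """Return (line_number, problem) pairs for records that look malformed."""
    problems = []
    with open(path, "r") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            if "img_path" not in record:
                problems.append((line_no, "missing img_path"))
            # Valid letters depend on how many answer candidates the item has.
            valid = set(LETTERS[:len(record.get("answers", []))])
            response = str(record.get("response", "")).strip()
            # Assumption: a usable response starts with a valid option letter.
            if response[:1] not in valid:
                problems.append((line_no, f"unexpected response: {response!r}"))
    return problems


if __name__ == "__main__":
    for line_no, problem in check_results("results.jsonl"):
        print(f"line {line_no}: {problem}")
```

An empty list means every record has an img_path and a response that starts with a letter matching one of its answer candidates.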

Contact

Please contact any of the first authors of this paper for queries.

Zicheng Zhang, zzc1998@sjtu.edu.cn, @zzc-1998
Haoning Wu, haoning001@e.ntu.edu.sg, @teowu

Citation

If you find our work interesting, please feel free to cite our paper:

@misc{zhang2024abench,
      title={A-Bench: Are LMMs Masters at Evaluating AI-generated Images?}, 
      author={Zicheng Zhang and Haoning Wu and Chunyi Li and Yingjie Zhou and Wei Sun and Xiongkuo Min and Zijian Chen and Xiaohong Liu and Weisi Lin and Guangtao Zhai},
      year={2024},
      eprint={2406.03070},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
