- [12/09/2024] ⭐ We release code for compositional framework (Gemini/Claude + SD3/SD2.1/Flux, ISG-Agent) today!
- [11/27/2024] 📄 We release our paper and dataset today!
- Updates & News
- Contents
- Interleaved Scene Graph
- Evaluating Your Own Model
- ISG-Agent: Exploring the Upper Bound for Interleaved Generation
- Acknowledgments
- Citation
This evaluation method and benchmark is designed for evaluating interleaved generation in four levels: Structural, Block, Image, and Holistic. It is an well established testbed for model can perform both multimodal understanding and generation such as Show-o and Anole.
Given that we mainly use GPT-4o for VQA in Image and Block level as well as MLLM-as-a-Judge in Holistic level, you can simply setup by: pip install openai
.
/ISG_eval
├── images (You should download it from huggingface and place here)
├── ISG-Bench.jsonl
├── ...
-
images: Contains images in queries and golden answer. You can download it from here and place them under ISG_eval.
-
ISG-Bench.jsonl: Contains ground truth compiled previously by ISG. One data sample is as follows. It contains
Query
for question andGolden
for human-annotated golden answer.
{
"id": "0000",
"Category": "Prediction",
"Query": [
{
"type": "text",
"content": "I will give you a picture of a person washing their hands. Please use a combination of 4 images and text to show what will happen next. Please generate an overall description first, then directly generate adjacent image blocks. For example, [whole description] <object1 image> <object2 image> <object3 image> <object4 image>."
},
{
"type": "image",
"content": "images/0000_q1.jpg"
}
],
"Golden": [
{
"type": "text",
"content": "The person continues to scrub their hands thoroughly, with the soap lathering up. The hands are cleaned under running water, and the lather is rinsed away."
},
{
"type": "image",
"content": "images/0000_g1.jpg"
},
{
"type": "image",
"content": "images/0000_g2.jpg"
},
{
"type": "image",
"content": "images/0000_g3.jpg"
},
{
"type": "image",
"content": "images/0000_g4.jpg"
}
],
"predict": {
"structural": {
"Query": [
"<query_text1>",
"<query_img1>"
],
"Answer": [
"<gen_text1>",
"<gen_img1>",
"<gen_img2>",
"<gen_img3>",
"<gen_img4>"
]
},
"block_tuple": {
"relation": [
[
"<gen_text1>",
"<query_img1>",
"is an overall description of"
],
...
]
},
"block_qa": {
"questions": [
{
"subject": "<gen_text1>",
"object": "<query_img1>",
"relation": "is an overall description of",
"Question": "Does <gen_text1> describe this image?"
},
...
]
},
"image_tuple": [
[
"entity",
"hands",
"<gen_img1>"
],
...
],
"image_qa": {
"questions": [
{
"image": "<gen_img1>",
"Question": "Are there hands in this image?",
"id": 0,
"Preliminary": []
},
...
]
}
}
}
{
"id": "0000",
"Category": "Prediction",
"output": [
{
"type": "text",
"content": "<text-content>"
},
{
"type": "image",
"content": "<path_of_the_input_image>"
}
]
}
Then, run the following script:
python ISG-eval.py --input_file <your file>
python summarize_performance.py --input_file <output of ISG-eval.py>
We provide Gemini/Claude + SD3/SD2.1/Flux for compositional framework. You can run the following script to generate interleaved content.
python compositional_inference.py \
--text_generator <gemini/claude> \
--image_generator <sd3/sd2.1/flux> \
--input_file ./ISG_eval/ISG-Bench.jsonl
Please See ISG_agent/README.md
for enviroment setup and how to use. You can also reproduct the experiment result by comparing to the chart.
Category | Model | Avg. | Style | Prog. | 3D | Dec. | I-T C. | Temp. | VST | VQA |
---|---|---|---|---|---|---|---|---|---|---|
Block | ISG-AGENT | 5.515 | 5.391 | 6.181 | 6.081 | 4.243 | 6.408 | 6.816 | 5.678 | 3.321 |
Image | ISG-AGENT | 0.574 | 0.538 | 0.752 | 0.359 | 0.617 | 0.368 | 0.670 | 0.713 | - |
Structural | ISG-AGENT | 0.871 | 0.944 | 0.967 | 0.788 | 0.902 | 0.800 | 1.000 | 0.987 | 0.577 |
Holistic | ISG-AGENT | 6.262 | 5.873 | 6.459 | 4.887 | 7.582 | 6.932 | 4.540 | 7.030 | 6.795 |
This project is a follow-up of MLLM-as-a-Judge. This work is partially funded by Toyota Motor Corporation. We’d also like to extend a thank you to Jieyu Zhang, Weikai Huang, and Zixian Ma for their insightful feedback and support.
@article{chen2024interleaved,
title={Interleaved Scene Graph for Interleaved Text-and-Image Generation Assessment},
author={Dongping Chen and Ruoxi Chen and Shu Pu and Zhaoyi Liu and Yanru Wu and Caixi Chen and Benlin Liu and Yue Huang and Yao Wan and Pan Zhou and Ranjay Krishna},
journal={arXiv preprint arXiv:2411.17188},
year={2024},
}