Authors: Jun Chen, Dannong Xu, Junjie Fei, Chun-Mei Feng, Mohamed Elhoseiny
The official implementation of our paper: Document Haystacks: Vision-Language Reasoning Over Piles of 1000+ Documents.
While large multimodal models (LMMs) have achieved impressive progress in vision-language understanding, they fall short in reasoning over a large number of images, a complex but common real-world application. Existing benchmarks for Multi-Image Question Answering fail to comprehensively evaluate this capability of LMMs. To bridge this gap, we introduce two document haystack benchmarks, DocHaystack and InfoHaystack, designed to evaluate LMMs' performance on large-scale visual document retrieval and understanding. Unlike previous benchmarks, DocHaystack and InfoHaystack map each question to a substantially larger document collection, scaling up to 1,000 visual documents. This expanded scope more accurately represents large-scale document retrieval scenarios and poses a greater challenge in retrieval accuracy and visual question answering. Additionally, we propose V-RAG, a novel vision-centric retrieval-augmented generation (RAG) framework that enables efficient question answering across thousands of images, setting a new standard on our DocHaystack and InfoHaystack benchmarks.
First, download the DocHaystack and InfoHaystack benchmarks from Hugging Face 🤗 and place them in the data/ directory. The data should be organized in the following format:
├── dochaystacks
│ ├── data
│ │ ├── Train
│ │ │ ├── infographicsvqa_images
│ │ │ ├── spdocvqa_images
│ │ ├── Test
│ │ │ ├── DocHaystack_100
│ │ │ ├── DocHaystack_200
│ │ │ ├── DocHaystack_1000
│ │ │ ├── InfoHaystack_100
│ │ │ ├── InfoHaystack_200
│ │ │ ├── InfoHaystack_1000
│ │ ├── test_docVQA.json
│ │ ├── test_infoVQA.json
│ │ ├── train_specific.json
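If it helps, here is a minimal Python sketch of this setup step using huggingface_hub; the repo_id is a placeholder (use the actual dataset id from Hugging Face), and the check simply verifies the layout above.

```python
# Minimal sketch: fetch the benchmarks from the Hugging Face Hub and verify the
# expected directory layout. The repo_id is a PLACEHOLDER -- replace it with the
# actual DocHaystack/InfoHaystack dataset id before running.
import os
from huggingface_hub import snapshot_download

DATA_ROOT = "data"
snapshot_download(
    repo_id="<dochaystacks-dataset-id>",  # placeholder, replace with the real id
    repo_type="dataset",
    local_dir=DATA_ROOT,
)

expected = [
    "Train/infographicsvqa_images",
    "Train/spdocvqa_images",
    "Test/DocHaystack_100", "Test/DocHaystack_200", "Test/DocHaystack_1000",
    "Test/InfoHaystack_100", "Test/InfoHaystack_200", "Test/InfoHaystack_1000",
    "test_docVQA.json", "test_infoVQA.json", "train_specific.json",
]
for rel in expected:
    path = os.path.join(DATA_ROOT, rel)
    status = "OK" if os.path.exists(path) else "MISSING"
    print(f"{status:7s} {path}")
```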
To evaluate the performance of LMMs on DocHaystack and InfoHaystack, execute the scripts provided in the scripts/ directory.
By running the following commands, you can obtain the results of current LMMs on large-scale visual document understanding without any additional processing. For Qwen2-VL, we reduce the input image resolution via the --low_res and --scale_factor options so that all inputs fit on a single A100 GPU (80GB). LLaVA-OneVision, however, cannot process large-scale visual documents, even when the multiple input images are handled as a video via the --no_patch option; for this reason, we only provide a script demonstrating how to run LLaVA-OneVision on our benchmarks.
Note: Because the API-based models are not fully deterministic, results may vary slightly across runs; the overall conclusions should remain consistent.
sh scripts/zero-shot/qwen2vl/*.sh
sh scripts/zero-shot/llava_ov/*.sh
sh scripts/zero-shot/gpt4o/*.sh
sh scripts/zero-shot/gemini/*.sh
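As a rough illustration of the resolution-reduction idea behind the --low_res and --scale_factor options, the Hugging Face Qwen2-VL processor exposes min_pixels/max_pixels caps on the number of visual tokens per image; the values below are illustrative and the exact flag-to-parameter mapping in our scripts may differ.

```python
# Sketch: cap the image resolution seen by Qwen2-VL so that large document collections
# fit on a single 80GB A100. min_pixels/max_pixels bound the visual tokens per image;
# the concrete values below are illustrative, not what --low_res/--scale_factor set.
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    min_pixels=64 * 28 * 28,    # lower bound on image tokens per image
    max_pixels=256 * 28 * 28,   # upper bound; smaller values trade accuracy for memory
)
```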
sh scripts/retrieval/*.sh
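Conceptually, the retrieval scripts rank all candidate document images against each question and keep the top-k. The sketch below shows this with a single CLIP encoder; the actual retrieval pipeline in the repository may combine multiple vision encoders and use a different output layout, so treat the paths and file names here as assumptions.

```python
# Simplified sketch of vision-centric retrieval: score every candidate image against
# the question with a single CLIP encoder and keep the top-k. The repository's actual
# retrieval pipeline may differ; paths and file names here are assumptions.
import glob, json, os
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").to(device).eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def retrieve(question: str, image_dir: str, top_k: int = 5):
    paths = sorted(glob.glob(os.path.join(image_dir, "*")))
    with torch.no_grad():
        text = processor(text=[question], return_tensors="pt", truncation=True).to(device)
        q = model.get_text_features(**text)
        q = q / q.norm(dim=-1, keepdim=True)
        scores = []
        for p in paths:
            img = processor(images=Image.open(p).convert("RGB"), return_tensors="pt").to(device)
            v = model.get_image_features(**img)
            v = v / v.norm(dim=-1, keepdim=True)
            scores.append((q @ v.T).item())
    ranked = sorted(zip(paths, scores), key=lambda x: x[1], reverse=True)
    return [p for p, _ in ranked[:top_k]]

os.makedirs("output/retrieval", exist_ok=True)
top_images = retrieve("example question", "data/Test/DocHaystack_100", top_k=5)
json.dump(top_images, open("output/retrieval/example_topk.json", "w"))
```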
We enhance the large-scale visual document understanding capabilities of existing LMMs through vision-centric retrieval-augmented generation (V-RAG). To evaluate V-RAG, you first need to obtain the vision-centric retrieval results and save them in the /output/retrieval/* directory. Once the retrieved results are available, augmenting any LMM is straightforward: simply feed the top-k retrieved images into the model. By running the following commands, you can easily evaluate LMMs augmented with vision-centric retrieval on DocHaystack and InfoHaystack.
Note: For LLaVA-OneVision, we observed that the model collapses when handling multiple images directly (without video-like processing) with top_k = 5.
sh scripts/zero-shot-vrag/qwen2vl/eval.sh
sh scripts/zero-shot-vrag/llava_ov/eval.sh
sh scripts/zero-shot-vrag/gpt4o/eval.sh
sh scripts/zero-shot-vrag/gemini/eval.sh
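For reference, the augmentation step itself amounts to prompt construction: the retrieved images are sent to the LMM together with the question. A minimal GPT-4o sketch, assuming the retrieval output is a JSON list of image paths as in the illustrative retrieval sketch above, looks like this.

```python
# Sketch of the V-RAG answering step with GPT-4o: send the question together with the
# top-k retrieved document images in one chat request. The retrieval file format here
# follows the illustrative retrieval sketch above, not necessarily the repo's layout.
import base64, json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def answer(question: str, retrieval_file: str, top_k: int = 3) -> str:
    image_paths = json.load(open(retrieval_file))[:top_k]
    content = [{"type": "text", "text": question}]
    for p in image_paths:
        b64 = base64.b64encode(open(p, "rb").read()).decode()
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{b64}"}})
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content

print(answer("example question", "output/retrieval/example_topk.json"))
```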
| Model | DocHaystack-100 | DocHaystack-200 | DocHaystack-1000 | InfoHaystack-100 | InfoHaystack-200 | InfoHaystack-1000 |
|---|---|---|---|---|---|---|
| LLaVA-OV+V-RAG | 69.72 | 65.14 | 55.05 | 43.22 | 41.94 | 36.77 |
| Gemini+V-RAG | 73.39 | 65.14 | 58.72 | 57.42 | 57.42 | 47.10 |
| GPT-4o+V-RAG | 81.65 | 72.48 | 66.97 | 65.16 | 63.23 | 56.77 |
| Qwen2-VL+V-RAG | 82.57 | 74.31 | 66.06 | 65.81 | 65.81 | 60.00 |
We fine-tune Qwen2-VL on our curated dataset using LLaMA-Factory, which makes the implementation straightforward by following their instructions. To keep the fine-tuning data balanced, we match the number of infographicsvqa samples (899) to the number of docvqa samples (899).
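As an illustration of the balancing step, the sketch below samples 899 examples from each source and writes them in a ShareGPT-style multimodal format similar to LLaMA-Factory's example datasets; the exact schema used by our fine-tuning setup may differ, so treat the field names as assumptions.

```python
# Sketch of the data-balancing step: take 899 samples from each source so that
# infographicsvqa and docvqa contribute equally. The record layout follows the
# ShareGPT-style multimodal format of LLaMA-Factory's example datasets; the exact
# schema used by our fine-tuning setup may differ, so treat field names as assumptions.
import json, random

def make_record(question: str, answer: str, image_path: str) -> dict:
    return {
        "messages": [
            {"role": "user", "content": f"<image>{question}"},
            {"role": "assistant", "content": answer},
        ],
        "images": [image_path],
    }

def build_balanced(info_samples, doc_samples, per_source=899, seed=0,
                   out="train_balanced.json"):
    """info_samples / doc_samples: lists of (question, answer, image_path) tuples."""
    rng = random.Random(seed)
    picked = rng.sample(info_samples, per_source) + rng.sample(doc_samples, per_source)
    rng.shuffle(picked)
    json.dump([make_record(*s) for s in picked], open(out, "w"), indent=2)
```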
Our repository builds on Qwen2-VL, LLaVA-OneVision, LLaMA-Factory, GPT-4o, and Gemini. Thanks to them!
If you find our paper and code helpful, we would greatly appreciate it if you could leave a star and cite our work. Thanks!
@article{chen2024document,
title={Document Haystacks: Vision-Language Reasoning Over Piles of 1000+ Documents},
author={Chen, Jun and Xu, Dannong and Fei, Junjie and Feng, Chun-Mei and Elhoseiny, Mohamed},
journal={arXiv preprint arXiv:2411.16740},
year={2024}
}
If you have any questions, please feel free to contact us.