Authors: Jun Chen, Dannong Xu, Junjie Fei, Chun-Mei Feng, Mohamed Elhoseiny
The official implementation of our paper: Document Haystacks: Vision-Language Reasoning Over Piles of 1000+ Documents.
While large multimodal models (LMMs) have achieved impressive progress in vision-language understanding, they fall short in reasoning over a large number of images, a complex but common real-world application. Existing benchmarks for Multi-Image Question Answering fail to comprehensively evaluate this capability of LMMs. To bridge this gap, we introduce two document haystack benchmarks, DocHaystack and InfoHaystack, designed to evaluate LMMs' performance on large-scale visual document retrieval and understanding. Unlike previous benchmarks, DocHaystack and InfoHaystack map each question to a substantially larger document collection, scaling up to 1,000 visual documents. This expanded scope more accurately represents large-scale document retrieval scenarios and poses a greater challenge in retrieval accuracy and visual question answering. Additionally, we propose V-RAG, a novel vision-centric retrieval-augmented generation (RAG) framework that enables efficient question answering across thousands of images, setting a new standard on our DocHaystack and InfoHaystack benchmarks.
First, download the DocHaystack and InfoHaystack benchmarks from Hugging Face 🤗 and place them in the data/ directory. The data should be organized in the following format:
├── dochaystacks
│ ├── data
│ │ ├── Train
│ │ │ ├── infographicsvqa_images
│ │ │ ├── spdocvqa_images
│ │ ├── Test
│ │ │ ├── DocHaystack_100
│ │ │ ├── DocHaystack_200
│ │ │ ├── DocHaystack_1000
│ │ │ ├── InfoHaystack_100
│ │ │ ├── InfoHaystack_200
│ │ │ ├── InfoHaystack_1000
│ │ ├── test_docVQA.json
│ │ ├── test_infoVQA.json
│ │ ├── train_specific.json
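If it helps, here is a minimal Python sketch of this setup step using huggingface_hub; the repo_id is a placeholder (use the actual dataset id from Hugging Face), and the check simply verifies the layout above.

```python
# Minimal sketch: fetch the benchmarks from the Hugging Face Hub and verify the
# expected directory layout. The repo_id is a PLACEHOLDER -- replace it with the
# actual DocHaystack/InfoHaystack dataset id before running.
import os
from huggingface_hub import snapshot_download

DATA_ROOT = "data"
snapshot_download(
    repo_id="<dochaystacks-dataset-id>",  # placeholder, replace with the real id
    repo_type="dataset",
    local_dir=DATA_ROOT,
)

expected = [
    "Train/infographicsvqa_images",
    "Train/spdocvqa_images",
    "Test/DocHaystack_100", "Test/DocHaystack_200", "Test/DocHaystack_1000",
    "Test/InfoHaystack_100", "Test/InfoHaystack_200", "Test/InfoHaystack_1000",
    "test_docVQA.json", "test_infoVQA.json", "train_specific.json",
]
for rel in expected:
    path = os.path.join(DATA_ROOT, rel)
    status = "OK" if os.path.exists(path) else "MISSING"
    print(f"{status:7s} {path}")
```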
To evaluate the performance of LMMs on DocHaystack and InfoHaystack, execute the scripts provided in the scripts/ directory.
By running the following commands, you can obtain the results of current LMMs on large-scale visual document understanding without any additional processing. For Qwen2-VL, we reduce the input image resolution via the --low_res and --scale_factor options so that all inputs fit on a single A100 GPU (80GB). LLaVA-OneVision, however, cannot process large-scale visual documents, even when the multiple input images are handled as a video via the --no_patch option; for this reason, we only provide a script demonstrating how to run LLaVA-OneVision on our benchmarks.
Note: Because the API-based models are not fully deterministic, results may vary slightly across runs; the overall conclusions should remain consistent.
sh scripts/zero-shot/qwen2vl/*.sh
sh scripts/zero-shot/llava_ov/*.sh
sh scripts/zero-shot/gpt4o/*.sh
sh scripts/zero-shot/gemini/*.sh
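As a rough illustration of the resolution-reduction idea behind the --low_res and --scale_factor options, the Hugging Face Qwen2-VL processor exposes min_pixels/max_pixels caps on the number of visual tokens per image; the values below are illustrative and the exact flag-to-parameter mapping in our scripts may differ.

```python
# Sketch: cap the image resolution seen by Qwen2-VL so that large document collections
# fit on a single 80GB A100. min_pixels/max_pixels bound the visual tokens per image;
# the concrete values below are illustrative, not what --low_res/--scale_factor set.
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    min_pixels=64 * 28 * 28,    # lower bound on image tokens per image
    max_pixels=256 * 28 * 28,   # upper bound; smaller values trade accuracy for memory
)
```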
sh scripts/retrieval/*.sh
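Conceptually, the retrieval scripts rank all candidate document images against each question and keep the top-k. The sketch below shows this with a single CLIP encoder; the actual retrieval pipeline in the repository may combine multiple vision encoders and use a different output layout, so treat the paths and file names here as assumptions.

```python
# Simplified sketch of vision-centric retrieval: score every candidate image against
# the question with a single CLIP encoder and keep the top-k. The repository's actual
# retrieval pipeline may differ; paths and file names here are assumptions.
import glob, json, os
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").to(device).eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def retrieve(question: str, image_dir: str, top_k: int = 5):
    paths = sorted(glob.glob(os.path.join(image_dir, "*")))
    with torch.no_grad():
        text = processor(text=[question], return_tensors="pt", truncation=True).to(device)
        q = model.get_text_features(**text)
        q = q / q.norm(dim=-1, keepdim=True)
        scores = []
        for p in paths:
            img = processor(images=Image.open(p).convert("RGB"), return_tensors="pt").to(device)
            v = model.get_image_features(**img)
            v = v / v.norm(dim=-1, keepdim=True)
            scores.append((q @ v.T).item())
    ranked = sorted(zip(paths, scores), key=lambda x: x[1], reverse=True)
    return [p for p, _ in ranked[:top_k]]

os.makedirs("output/retrieval", exist_ok=True)
top_images = retrieve("example question", "data/Test/DocHaystack_100", top_k=5)
json.dump(top_images, open("output/retrieval/example_topk.json", "w"))
```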
We enhance the large-scale visual document understanding capabilities of existing LMMs through vision-centric retrieval-augmented generation (V-RAG). To evaluate V-RAG, you first need to obtain the vision-centric retrieval results and save them in the /output/retrieval/* directory. Once the retrieved results are available, augmenting any LMM is straightforward: simply feed the top-k retrieved images into the model. By running the following commands, you can easily evaluate LMMs augmented with vision-centric retrieval on DocHaystack and InfoHaystack.
Note: For LLaVA-OneVision, we observed that the model collapses when handling multiple images directly (without video-like processing) with top_k = 5.
sh scripts/zero-shot-vrag/qwen2vl/eval.sh
sh scripts/zero-shot-vrag/llava_ov/eval.sh
sh scripts/zero-shot-vrag/gpt4o/eval.sh
sh scripts/zero-shot-vrag/gemini/eval.sh
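For reference, the augmentation step itself amounts to prompt construction: the retrieved images are sent to the LMM together with the question. A minimal GPT-4o sketch, assuming the retrieval output is a JSON list of image paths as in the illustrative retrieval sketch above, looks like this.

```python
# Sketch of the V-RAG answering step with GPT-4o: send the question together with the
# top-k retrieved document images in one chat request. The retrieval file format here
# follows the illustrative retrieval sketch above, not necessarily the repo's layout.
import base64, json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def answer(question: str, retrieval_file: str, top_k: int = 3) -> str:
    image_paths = json.load(open(retrieval_file))[:top_k]
    content = [{"type": "text", "text": question}]
    for p in image_paths:
        b64 = base64.b64encode(open(p, "rb").read()).decode()
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{b64}"}})
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content

print(answer("example question", "output/retrieval/example_topk.json"))
```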
| Model | DocHaystack-100 | DocHaystack-200 | DocHaystack-1000 | InfoHaystack-100 | InfoHaystack-200 | InfoHaystack-1000 |
|---|---|---|---|---|---|---|
| LLaVA-OV+V-RAG | 69.72 | 65.14 | 55.05 | 43.22 | 41.94 | 36.77 |
| Gemini+V-RAG | 73.39 | 65.14 | 58.72 | 57.42 | 57.42 | 47.10 |
| GPT-4o+V-RAG | 81.65 | 72.48 | 66.97 | 65.16 | 63.23 | 56.77 |
| Qwen2-VL+V-RAG | 82.57 | 74.31 | 66.06 | 65.81 | 65.81 | 60.00 |
We fine-tune Qwen2-VL on our curated dataset using LLaMA-Factory, which makes the implementation straightforward by following their instructions. To keep the fine-tuning data balanced, we match the number of infographicsvqa samples (899) to the number of docvqa samples (899).
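As an illustration of the balancing step, the sketch below samples 899 examples from each source and writes them in a ShareGPT-style multimodal format similar to LLaMA-Factory's example datasets; the exact schema used by our fine-tuning setup may differ, so treat the field names as assumptions.

```python
# Sketch of the data-balancing step: take 899 samples from each source so that
# infographicsvqa and docvqa contribute equally. The record layout follows the
# ShareGPT-style multimodal format of LLaMA-Factory's example datasets; the exact
# schema used by our fine-tuning setup may differ, so treat field names as assumptions.
import json, random

def make_record(question: str, answer: str, image_path: str) -> dict:
    return {
        "messages": [
            {"role": "user", "content": f"<image>{question}"},
            {"role": "assistant", "content": answer},
        ],
        "images": [image_path],
    }

def build_balanced(info_samples, doc_samples, per_source=899, seed=0,
                   out="train_balanced.json"):
    """info_samples / doc_samples: lists of (question, answer, image_path) tuples."""
    rng = random.Random(seed)
    picked = rng.sample(info_samples, per_source) + rng.sample(doc_samples, per_source)
    rng.shuffle(picked)
    json.dump([make_record(*s) for s in picked], open(out, "w"), indent=2)
```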
Our repository builds on Qwen2-VL, LLaVA-OneVision, LLaMA-Factory, GPT-4o, and Gemini. Thanks to them!
If you find our paper and code helpful, we would greatly appreciate it if you could leave a star and cite our work. Thanks!
@article{chen2024document,
title={Document Haystacks: Vision-Language Reasoning Over Piles of 1000+ Documents},
author={Chen, Jun and Xu, Dannong and Fei, Junjie and Feng, Chun-Mei and Elhoseiny, Mohamed},
journal={arXiv preprint arXiv:2411.16740},
year={2024}
}
If you have any questions, please feel free to contact us.