author: Dongfu Jiang, Xuan He, Huaye Zeng, Cong Wei, Max Ku, Qian Liu, Wenhu Chen
# categories: jekyll update
---

# Introduction

While many previous works have focused on single-image reasoning tasks, the ability to reason over multiple images is still an open challenge. **Can we enhance large multimodal models with multi-image reasoning ability via instruction tuning?** To achieve this goal, we propose **Mantis**, a family of LMMs with strong multi-image reasoning ability, achieving SOTA performance on a range of multi-image benchmarks. Mantis is trained on the newly curated **Mantis-Instruct**, a large-scale multi-image QA dataset that covers various multi-image reasoning tasks.

[Resources](#resources) are listed at the end of the blog.

# Methodology

We have modified the training and inference code of Fuyu and LLaVA to support **interleaved text-image inputs**. We use LLaVA-1.5 as the base model for the first version of Mantis, called **Mantis-llava** (Mantis-Fuyu will be released in the future).

Similar to LLaVA, we use `<image>` as the placeholder for each image in the text. Each `<image>` token is then automatically expanded into the format `(image {i}: <Image><image></Image>)`, where `i` is the image's index in the sequence.
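
As an illustration, here is a minimal sketch of this placeholder expansion; the function name `expand_image_placeholders` is ours and not part of the Mantis codebase:

```python
import re


def expand_image_placeholders(text: str) -> str:
    """Replace each bare `<image>` placeholder with its numbered wrapper,
    e.g. the second occurrence becomes `(image 2: <Image><image></Image>)`."""
    counter = 0

    def wrap(_match: re.Match) -> str:
        nonlocal counter
        counter += 1
        return f"(image {counter}: <Image><image></Image>)"

    return re.sub(r"<image>", wrap, text)


# Example: a two-image prompt.
prompt = "<image> <image> What is different between the two images?"
print(expand_image_placeholders(prompt))
# (image 1: <Image><image></Image>) (image 2: <Image><image></Image>) What is different ...
```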

# Training: Mantis-Instruct dataset

![Figure 2: Illustrations of parts of the datasets in Mantis-Instruct]({{"/assets/Mantis/images/mantis-instruct-cases.jpeg" | relative_url }})

We curate data from multiple publicly available datasets to satisfy the training requirements above, forming Mantis-Instruct. The dataset is divided into four parts (a sketch of an example record follows the list):

1. Multi-Image Reasoning.
The first part comes from existing datasets that involve multiple images, including Spot-the-diff, Birds-to-words, Dreamsim, and NLVR2, all of which require reasoning across images.

2. Contrastive Captioning.
The second part is newly curated from existing captioning datasets, ShareGPT-4V and LAION-GPT4-Vision. Given multiple images, the model needs to either generate a caption for a specific image (`Caption Generation`) or judge which image matches a provided caption (`Caption Matching`).

3. Multi-Image Instruction Following.
The third part reformats previous single-image, multi-turn instruction-following datasets into a multi-image version, including LLaVA-665k-merged and LRV-Instruction-merged. We add an explicit image reference such as `For the second image, ...` to each question.

4. Single-Image Reasoning.
Similar to LLaVA-NeXT, we include DocVQA, DVQA, and ChartQA to strengthen the model's ability on diagrams and OCR, and to keep the model from forgetting single-image reasoning.
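
For concreteness, a single training record in this interleaved format could look roughly like the sketch below; the field names and file paths are illustrative and not the actual schema of the released dataset:

```python
# A hypothetical Caption Matching record with two images. The text refers to the
# images through the same `<image>` placeholders described in the Methodology section.
example = {
    "images": ["cat_on_sofa.jpg", "dog_in_park.jpg"],  # illustrative paths
    "conversation": [
        {
            "role": "user",
            "content": "<image> <image> Which image matches the caption "
                       "'a dog playing fetch in a park'?",
        },
        {"role": "assistant", "content": "The second image matches the caption."},
    ],
}
```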

# Evaluation: Multi-image evaluation set

We release two versions of Mantis, [Mantis-llava-7b-v1.0](https://huggingface.co/TIGER-Lab/Mantis-llava-7b-v1.0) and [Mantis-llava-7b-v1.1](https://huggingface.co/TIGER-Lab/Mantis-llava-7b-v1.1). Mantis-llava-7b-v1.0 is trained on the Mantis-Instruct dataset, while Mantis-llava-7b-v1.1 is trained on Mantis-Instruct along with [Co-Instruct](https://co-instruct.github.io/). They are based on [llava-hf/llava-1.5-7b-hf](https://huggingface.co/llava-hf/llava-1.5-7b-hf) and [llava-hf/bakLlava-v1-hf](https://huggingface.co/llava-hf/bakLlava-v1-hf), respectively (a usage sketch is given after the benchmark list below). We evaluate the models on multiple benchmarks, including:

1. NLVR2: Visual reasoning over image pairs
2. Birds-to-words: Comparative captioning of bird image pairs
3. Qbench2: Visual quality comparison across image pairs
4. [Mementos](https://arxiv.org/abs/2401.10529): Multi-image captioning
5. Mantis-eval: Multiple-choice question answering curated by ourselves
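
As a usage illustration, the sketch below runs multi-image inference with one of the released checkpoints through the generic Hugging Face LLaVA classes. This assumes the Mantis-llava checkpoints are compatible with that interface; the official Mantis repository provides its own loading utilities, so treat this as an approximation rather than the supported path:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Assumption: the checkpoint follows the standard HF LLaVA format.
model_id = "TIGER-Lab/Mantis-llava-7b-v1.0"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Two local images and a prompt with one <image> placeholder per image.
images = [Image.open("left.jpg"), Image.open("right.jpg")]
prompt = "USER: <image> <image> What is different between the two images? ASSISTANT:"

inputs = processor(text=prompt, images=images, return_tensors="pt").to(
    model.device, torch.float16
)
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```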

## NLVR2, Birds-to-words, and Mantis-eval

| Models | NLVR2 | Birds-to-words | Mantis-eval |
|-----------------------|-----------|----------------|------------|

As shown in Table 1, our models achieve decent performance on these benchmarks. Specifically, Mantis-llava-7b-v1.0 reaches 84.93 accuracy on NLVR2 and 52.52 on Birds-to-words. Mantis-llava-7b-v1.1 reaches 50.69 on Mantis-eval, surpassing LLaVA-1.5 by 19.35 and BLIP2 by 0.92.

## Qbench2

| Models | Accuracy |
|-------------------------------|------------|
Qbench2 is a benchmark evaluating whether LMMs can properly judge and compare the visual quality of images.
On the Qbench2 leaderboard, our model achieves 52.30 accuracy, surpassing all previous open-source LMMs reported on the leaderboard. After introducing the Co-Instruct training data, the accuracy further increases to 72.00, which surpasses three closed-source models and is only 4.52 behind GPT-4V. However, there are some performance decreases on the two held-in datasets, NLVR2 and Birds-to-words, which we attribute to the trade-off across datasets. Even after this decrease, our model still surpasses the baseline models on NLVR2, and on Birds-to-words it is only 2.08 behind BLIP2.

## Mementos

![Table 3: The performance of Mantis-llava-7b-v1.1 on Mementos behavior recognition evaluation.]({{"/assets/Mantis/images/mementos.jpeg" | relative_url }})

Mementos is a benchmark for reasoning over image sequences. It evaluates whether LMMs can accurately capture the contents of an image sequence, understand the situation, and then generate a detailed description of the sequence. The evaluation uses GPT-4 to extract an action list and an object list from the generated text, then compares them with the reference description using precision, recall, and F1-score.
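
A simplified sketch of the scoring step is shown below; the official evaluation relies on GPT-4 for the extraction and may match synonyms, so this exact-match version is only meant to make the metric concrete:

```python
def precision_recall_f1(predicted: list[str], reference: list[str]) -> tuple[float, float, float]:
    """Compare a predicted keyword list (e.g. actions extracted from the generated
    description) against the reference list using exact set matching."""
    pred, ref = set(predicted), set(reference)
    if not pred or not ref:
        return 0.0, 0.0, 0.0
    hits = len(pred & ref)
    precision = hits / len(pred)
    recall = hits / len(ref)
    f1 = 0.0 if hits == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Example: the model mentions 3 actions, 2 of which appear among the 4 reference actions.
print(precision_recall_f1(["walk", "sit", "wave"], ["walk", "sit", "run", "jump"]))
# -> (0.667, 0.5, 0.571), up to rounding
```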

We report performance in three domains: daily life, robotics, and comics. Results are shown in Table 3. Mantis-llava-7b-v1.1 achieves 29.80%, 32.38%, and 15.99% F1-score in the daily life, robotics, and comics domains, respectively. Our model surpasses all previous open-source LMMs, and its performance in the daily life domain is only behind GPT-4. We found that Mantis is particularly good at capturing information from dynamic scenes that span multiple images. This is akin to the insect mantis, which is adept at recognizing moving, dynamic objects but struggles with static ones, and it is why we named our model after the insect.

# Ongoing Work

Mantis is an active work in progress. We have demonstrated that Mantis-llava-7b-v1.1 achieves strong performance on various benchmarks, including NLVR2, Birds-to-words, Mementos, and Qbench2. However, some limitations remain, such as performance drops on single-image reasoning tasks and the model's limited context length. We plan to keep improving performance on single-image reasoning and to explore more efficient ways to handle multiple images. Larger models and more diverse datasets will be used to further improve the model's performance.
We also plan to investigate the effects of the heuristics we have applied in the data curation process and further improve the model's performance on multi-image reasoning tasks. We hope that our work can inspire more research in the field of multi-image reasoning and contribute to the development of large multimodal models.

# Resources

- [Code](https://github.com/TIGER-AI-Lab/Mantis)
- 🤗 [Demo](https://huggingface.co/spaces/TIGER-Lab/Mantis)