EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models

🌐 Homepage | 🤗 Dataset | 🤗 Paper | 📖 arXiv | 🏆 Leaderboard

Figure 1: The main categories of EgoThink to comprehensively assess the capability of thinking from a first-person perspective.

🔔 News

[2024-10]: Our related paper VidEgoThink: Assessing Egocentric Video Understanding Capabilities for Embodied AI has been released.
[2024-09]: EgoThink and VidEgoThink are invited to be presented at ZhiDX.
[2024-04]: EgoThink is invited to be presented at ByteDance.
[2024-04]: EgoThink will be presented as a Poster (Highlight 👀) at CVPR 2024.
[2024-03]: EgoThink is presented at AITIME.
[2024-02]: EgoThink has been accepted by CVPR 2024.
[2023-11]: Our paper Can Vision-Language Models Think from a First-Person Perspective? has been released.

💾 Dataset

Overview

Figure 2: Categories with fine-grained dimensions and their corresponding examples from the EgoThink benchmark.

Download (Choose One of the Two Options)

  1. Clone our GitHub Repo.
git clone https://github.com/AdaCheng/EgoThink.git
cd EgoThink/data
  2. Download from our Hugging Face Repo (a programmatic download sketch is given below).
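
For option 2, the dataset can also be pulled programmatically with huggingface_hub. The snippet below is a sketch only; the repo id is a placeholder, so substitute the dataset id shown on the Hugging Face page linked above.

# Sketch: download the dataset snapshot from Hugging Face into the local data folder.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="<hf-org>/EgoThink",  # placeholder: use the dataset id from the Hugging Face page
    repo_type="dataset",
    local_dir="data",
)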

🔧 Dependencies

We provide the basic environment below; you additionally need to install the requirements of the open-source models you want to evaluate.

conda create --name egothink python=3.10
conda activate egothink
pip install -U pip

# Install requirements
pip install -r requirements.txt

📊 Evaluation

Add New Open-Source Models

🫰 Contributions of the code for new models you have deployed are very welcome. Thank you!

  1. Create test_{new_model}.py in /models (a hypothetical skeleton is sketched after the snippet below).
  2. Register the new model in get_model() in /models/__init__.py, for example:
# BLIP2-7B
if model_name == 'blip2-7b':
  from .test_blip2 import TestBlip2
  return TestBlip2(name='blip2_opt', model_type='pretrain_opt6.7b', config_path='/models/blip_configs/blip2_pretrain_opt6.7b.yaml', device=device)
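
For step 1, the wrapper only needs to load the model once and answer one (image, question) pair at a time. The skeleton below is a hypothetical example built on the Hugging Face transformers BLIP-2 classes; the class and method names are illustrative, so mirror an existing test_* module in /models for the exact interface EgoThink expects.

# models/test_new_model.py: hypothetical skeleton, not part of the repository.
import torch
from transformers import Blip2ForConditionalGeneration, Blip2Processor

class TestNewModel:
    def __init__(self, name='Salesforce/blip2-opt-2.7b', device=torch.device('cuda')):
        self.device = device
        self.processor = Blip2Processor.from_pretrained(name)
        self.model = Blip2ForConditionalGeneration.from_pretrained(
            name, torch_dtype=torch.float16
        ).to(device)

    @torch.no_grad()
    def generate(self, image, question, max_new_tokens=128):
        # One (image, question) pair in, one plain-text answer out.
        inputs = self.processor(images=image, text=question, return_tensors='pt').to(
            self.device, torch.float16
        )
        output_ids = self.model.generate(**inputs, max_new_tokens=max_new_tokens)
        return self.processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip()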

Inference

  • API-based Model

Please update the API-based models' keys and base_urls between lines 23 and 33 of gpt_eval.py (an illustrative client configuration is shown after the commands below).

# dataset: Activity, Object/existence, etc.
# MODEL: GPT-series models, such as gpt-4o
python gpt_eval.py \
    --model_name $MODEL \
    --annotation_path /${dataset}/annotations.json \
    --answer_path /answer/${dataset}
  • Open-Source Model
# dataset: Activity, Object/existence, etc.
# MODEL: models registered in get_model() in /models/__init__.py
# DEVICE: GPU id (0/1/2, ...); currently only a single GPU is supported
python eval.py \
    --model_name $MODEL \
    --annotation_path /${dataset}/annotations.json \
    --answer_path /answer/${dataset} \
    --batch_size 1 \
    --device $DEVICE
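
For reference, an OpenAI-compatible client configuration for the API-based path typically looks like the sketch below. This is an illustration only, not the repository's code: the actual variable names and structure live in gpt_eval.py (lines 23 to 33) and, for the judge in the Evaluation step below, in common.py (lines 463 to 546), so edit those files directly.

# Illustration only: an OpenAI-compatible client with a custom key and base_url.
from openai import OpenAI

client = OpenAI(
    api_key="sk-...",                      # your provider key
    base_url="https://api.openai.com/v1",  # or the base_url of your proxy/provider
)
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)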

Evaluation

Please update the API-based models' keys and base_urls between lines 463 and 546 of common.py.

# data-folder: the folder that stores the answers (e.g., /answer)
# bench-name: Activity, Object/existence, etc.
# EVA_MODELS: a list of models to be evaluated (separated by spaces), for example "llava-13b-llama2 llava-1.5-13b llava-1.5-7b"
# EVA_JUDGE_MODEL: gpt-4o (default), gpt-3.5-turbo, claude-2, etc.
python gen_judgment.py \
    --data-folder /answer \
    --bench-name $dataset \
    --mode single \
    --model-list $EVA_MODELS \
    --judge-model $EVA_JUDGE_MODEL \
    --parallel 4 \
    --judge-file judge_prompts.jsonl

Show Results

# EVA_MODELS: a list of models to be evaluated (separated by spaces), for example "llava-13b-llama2 llava-1.5-13b llava-1.5-7b"
# EVA_JUDGE_MODEL: gpt-4 (default), gpt-3.5-turbo, claude-2, etc.
python show_result.py \
    --input-file {data_folder}/{bench-name}/model_judgment/{judge-model}_single.jsonl \
    --judge-model $EVA_JUDGE_MODEL \
    --model-list  $EVA_MODELS \
    --mode single

πŸ† Leaderboard

Update

👋 Feel free to contribute the performance of your model by adding it to the "RESULTS SECTION" (from line 398) of index.html; we will review and merge it accordingly.

<tr style="background-color: #f8fffe;">
    <td style="text-align: left;"><b>GPT-4V(ision)</b></td>
    <td><b>65.5</b></td>
    <td>62.0</td>
    <td><b>82.0</b></td>
    <td><b>58.0</b></td>
    <td><b>59.5</b></td>
    <td style="text-decoration: underline;">86.0</td>
    <td style="text-decoration: underline;">62.0</td>
    <td><b>42.0</b></td>
    <td>48.0</td>
    <td><b>83.0</b></td>
    <td><b>55.0</b></td>
    <td><b>64.0</b></td>
    <td><b>84.0</b></td>
</tr> 

Overview

The detailed table can be found here.

Table 1: Combined single-answer grading scores in zero-shot setups across various dimensions. Bold indicates the best performance and underline indicates the second-best. Exist, Attr, Afford, Loc, Spatial, Count, Compar, Situated, Nav, and Assist denote existence, attribute, affordance, location, spatial relationship, counting, comparison, situated reasoning, navigation, and assistance, respectively.

Contact

Citation

@InProceedings{Cheng_2024_CVPR,
    author    = {Cheng, Sijie and Guo, Zhicheng and Wu, Jingwen and Fang, Kechen and Li, Peng and Liu, Huaping and Liu, Yang},
    title     = {EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {14291-14302}
}

Acknowledgements

Thanks to Xiaolong Wang, Yangyang Yu, Zixin Sun, and Zhaoyang Li for their contributions to data collection and construction. We appreciate Zeyuan Yang, Szymon Tworkowski, Guan Wang, and Zonghan Yang for their support with API resources; Xinghang Li for his valuable discussion; and Siyu Wang for her codebase for the annotation system.

Furthermore, we appreciate the developers behind the following projects for their significant contributions to our research: Ego4D, Multi-Modality-Arena, FastChat.
