👾 E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding (NeurIPS 2024)

E.T. Bench: Towards Open-Ended Event-Level
Video-Language Understanding

Ye Liu<sup>1,2</sup>, Zongyang Ma<sup>2,3</sup>, Zhongang Qi<sup>2</sup>, Yang Wu<sup>4</sup>, Ying Shan<sup>2</sup>, Chang Wen Chen<sup>1</sup>

<sup>1</sup>The Hong Kong Polytechnic University <sup>2</sup>ARC Lab, Tencent PCG
<sup>3</sup>Institute of Automation, Chinese Academy of Sciences <sup>4</sup>Tencent AI Lab

E.T. Bench (Event-Level & Time-Sensitive Video Understanding Benchmark) is a comprehensive solution for open-ended event-level video-language understanding. This project consists of the following three contributions:

  • E.T. Bench: A large-scale, high-quality benchmark for event-level and time-sensitive video understanding, comprising 7.3K samples across 12 tasks, built on 7K videos (251.4 hours in total) from 8 domains.
  • E.T. Chat: A multi-modal large language model (MLLM) that specializes in time-sensitive video-conditioned chatting. It reformulates timestamp prediction as a novel embedding matching problem.
  • E.T. Instruct 164K: A meticulously collected instruction-tuning dataset tailored for time-sensitive video understanding scenarios.

We focus on 4 essential capabilities for time-sensitive video understanding: referring, grounding, dense captioning, and complex understanding. See the teaser figure for examples of each, distinguished by background color.
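To give a flavor of the embedding-matching idea behind E.T. Chat, here is a minimal, illustrative sketch (not the actual model): a query embedding, such as one produced for a special timestamp token, is matched against per-frame visual features by cosine similarity, and the best-matching frame index is converted to a timestamp. All function names and the toy data below are hypothetical.

```python
import numpy as np

def match_timestamp(query_emb, frame_embs, fps=1.0):
    """Toy embedding matching: return the timestamp (in seconds) of the
    frame whose embedding is most similar to the query embedding."""
    # L2-normalize so the dot product equals cosine similarity
    q = query_emb / np.linalg.norm(query_emb)
    f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    sims = f @ q                      # similarity to every frame
    idx = int(np.argmax(sims))        # best-matching frame index
    return idx / fps                  # frame index -> timestamp

# Toy example: 4 one-hot "frame embeddings"; the query is closest to frame 2.
frames = np.eye(4)
query = np.array([0.1, 0.2, 0.9, 0.1])
print(match_timestamp(query, frames, fps=2.0))  # frame 2 at 2 fps -> 1.0 s
```

In the real model, frame features come from a visual encoder and the query embedding from the LLM's hidden state; the sketch only shows how a timestamp can be read off as an argmax over similarities rather than decoded as text.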

🔥 News

  • 2024.09.28 ⭐️ Code, model, and dataset release.
  • 2024.09.27 🎉 E.T. Bench has been accepted to NeurIPS 2024 (Datasets and Benchmarks Track).

๐Ÿ† Leaderboard

Our online leaderboard is under construction. Stay tuned!

🔮 Benchmark

Please refer to the Benchmark page for details about E.T. Bench.

๐Ÿ› ๏ธ Model

Please refer to the Model page for training and testing E.T. Chat.

📦 Dataset

Please refer to the Dataset page for downloading E.T. Instruct 164K.

📖 Citation

Please cite our paper if you find this project helpful.

@inproceedings{liu2024etbench,
  title={E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding},
  author={Liu, Ye and Ma, Zongyang and Qi, Zhongang and Wu, Yang and Chen, Chang Wen and Shan, Ying},
  booktitle={Neural Information Processing Systems (NeurIPS)},
  year={2024}
}

💡 Acknowledgements

This project builds upon the following repositories; many thanks to their authors.

LLaVA, LAVIS, EVA, LLaMA-VID, TimeChat, densevid_eval
