Ye Liu<sup>1,2</sup>, Zongyang Ma<sup>2,3</sup>, Zhongang Qi<sup>2</sup>, Yang Wu<sup>4</sup>, Ying Shan<sup>2</sup>, Chang Wen Chen<sup>1</sup>
<sup>1</sup>The Hong Kong Polytechnic University <sup>2</sup>ARC Lab, Tencent PCG
<sup>3</sup>Institute of Automation, Chinese Academy of Sciences <sup>4</sup>Tencent AI Lab
E.T. Bench (Event-Level & Time-Sensitive Video Understanding Benchmark) is a comprehensive solution for open-ended event-level video-language understanding. This project consists of the following three contributions:
- E.T. Bench: A large-scale and high-quality benchmark for event-level and time-sensitive video understanding, comprising 7.3K samples spanning 12 tasks, with 7K videos (251.4 hours in total) from 8 domains.
- E.T. Chat: A multi-modal large language model (MLLM) specialized in time-sensitive video-conditioned chatting. It reformulates timestamp prediction as a novel embedding matching problem (see the sketch after this list).
- E.T. Instruct 164K: A meticulously collected instruction-tuning dataset tailored for time-sensitive video understanding scenarios.
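To make the embedding-matching formulation concrete, below is a minimal PyTorch sketch: the embedding the LLM emits at a timestamp placeholder token is matched against per-frame visual embeddings, and the time of the best-matching frame is returned as the predicted timestamp. All names (`match_timestamp`, `frame_embeds`, `query_embed`, `frame_times`) are illustrative assumptions, not the actual E.T. Chat implementation.

```python
# Illustrative sketch only; not the actual E.T. Chat code.
import torch
import torch.nn.functional as F

def match_timestamp(frame_embeds: torch.Tensor,
                    query_embed: torch.Tensor,
                    frame_times: torch.Tensor) -> torch.Tensor:
    """Predict a timestamp by matching a query embedding to frame embeddings.

    frame_embeds: (num_frames, dim) visual embeddings, one per sampled frame
    query_embed:  (dim,) embedding the LLM produces at a timestamp placeholder
    frame_times:  (num_frames,) timestamps (in seconds) of the sampled frames
    """
    # Cosine similarity between the query embedding and every frame embedding
    sims = F.cosine_similarity(frame_embeds, query_embed.unsqueeze(0), dim=-1)
    # The predicted timestamp is the time of the best-matching frame
    return frame_times[sims.argmax()]

# Toy usage with random tensors
frames = torch.randn(64, 256)        # 64 sampled frames, 256-d embeddings
query = torch.randn(256)             # embedding at the placeholder token
times = torch.linspace(0, 63.0, 64)  # frame timestamps in seconds
print(match_timestamp(frames, query, times))
```

Compared with emitting timestamps as text tokens, matching against frame embeddings keeps the prediction anchored to frames the model has actually seen.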
We focus on 4 essential capabilities for time-sensitive video understanding: referring, grounding, dense captioning, and complex understanding. Example tasks for each capability (distinguished by background colors) are shown below.
- 2024.09.28 ⭐️ Code, model, and dataset release.
- 2024.09.27 🎉 E.T. Bench has been accepted to NeurIPS 2024 (Datasets and Benchmarks Track).
Our online leaderboard is under construction. Stay tuned!
Please refer to the Benchmark page for details about E.T. Bench.
Please refer to the Model page for training and testing E.T. Chat.
Please refer to the Dataset page for downloading E.T. Instruct 164K.
Please cite our paper if you find this project helpful.
@inproceedings{liu2024etbench,
title={E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding},
author={Liu, Ye and Ma, Zongyang and Qi, Zhongang and Wu, Yang and Chen, Chang Wen and Shan, Ying},
booktitle={Neural Information Processing Systems (NeurIPS)},
year={2024}
}
This project was built upon the following repositories. Many thanks to their authors.