[ECCV2024] Video Foundation Models & Data for Multimodal Understanding
benchmark
action-recognition
video-understanding
video-data
self-supervised
multimodal
video-dataset
open-set-recognition
video-retrieval
video-question-answering
masked-autoencoder
temporal-action-localization
contrastive-learning
spatio-temporal-action-localization
zero-shot-retrieval
video-clip
vision-transformer
zero-shot-classification
foundation-models
instruction-tuning
Updated Sep 23, 2024 - Python