
Video-Question-Answering (VideoQA) Resources

The Video-Question-Answering-Resources repository is a curated guide for beginners and researchers interested in the Video Question Answering (VideoQA) field. It provides an organized collection of the most relevant papers, models, datasets, and additional resources to help users understand and contribute to this evolving area. The repository focuses on the intersection of computer vision and natural language processing, particularly how video data can be used to answer complex questions, offering a range of materials from introductory guides to advanced research. (Last Update on 12/14/2024)

Keywords:

Video question answering (VideoQA), LLMs, Long video understanding, Spatial Reasoning, Temporal Reasoning, Multi-Choice QA, Open-Ended QA;

Curators:

Bharatesh Chakravarthi, Ph.D
Joseph Raj Vishal



Beginners Guide to Video Question Answering

  1. Answering Questions from YouTube Videos with OpenAI Whisper and GPT-4 (Medium article)

  2. Try a quick example of how to use LLMs for Video Question Answering here (check Additional Resources for an API key); a minimal sketch of the Whisper + GPT-4 pipeline from item 1 appears after this list

  3. Community Computer Vision Course (Unit 4): Multimodal Models
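
A minimal sketch of the transcript-based pipeline from item 1, assuming the openai Python SDK (v1+), an OPENAI_API_KEY in the environment, and a hypothetical local audio file lecture.mp3 already extracted from the video; "gpt-4o" stands in for whichever chat model you use, and answering from the transcript alone ignores purely visual content:

```python
# Sketch: transcribe a video's audio with Whisper, then answer a question
# about it with a chat model. "lecture.mp3" and "gpt-4o" are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1. Transcribe the audio track with Whisper.
with open("lecture.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# 2. Ask a question, using the transcript as the only context.
question = "What are the three main points the speaker makes?"
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Answer using only the provided transcript."},
        {"role": "user", "content": f"Transcript:\n{transcript.text}\n\nQuestion: {question}"},
    ],
)
print(response.choices[0].message.content)
```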


Publications

Survey/Review Papers

  • A Survey on Generative AI and LLM for Video Generation, Understanding, and Streaming (2024) [Paper]
  • Video Question Answering: a Survey of Models and Datasets (2021) [Paper]
  • A survey on VQA: Datasets and approaches (2020, ITCA) [Paper]

Conference/Journal Papers

2024

  • (Our Paper) Eyes on the Road: State-of-the-art Video Question Answering Models Assessment for Traffic Monitoring Tasks [Paper]
  • TimeCraft: Navigate Weakly-Supervised Temporal Grounded Video Question Answering via Bi-directional Reasoning [Paper]
  • MoReVQA: Exploring Modular Reasoning Models for Video Question Answering (CVPR) [Paper]
  • Align and Aggregate: Compositional Reasoning with Video Alignment and Answer Aggregation for Video Question Answering (CVPR) [Paper]
  • VideoCLIP-XL: Advancing Long Description Understanding for Video CLIP Models [Paper]
  • MVBench: A Comprehensive Multi-modal Video Understanding Benchmark (CVPR) [Paper]
  • Can I Trust Your Answer? Visually Grounded Video Question Answering (CVPR) [Paper]
  • Event Graph Guided Compositional Spatial-Temporal Reasoning for Video Question Answering [Paper]
  • LVBench: An Extreme Long Video Understanding Benchmark [Paper]
  • Short Film Dataset (SFD): A Benchmark for Story-Level Video Understanding [Paper]
  • Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input [Paper]
  • CinePile: A Long Video Question Answering Dataset and Benchmark [Paper]
  • Video-Language Alignment via Spatio-Temporal Graph Transformer [Paper]
  • Neural-Symbolic VideoQA: Learning Compositional Spatio-Temporal Reasoning for Real-world Video Question Answering [Paper]
  • VideoChat: Chat-Centric Video Understanding [Paper]
  • LITA: Language Instructed Temporal-Localization Assistant [Paper]
  • Sports-QA: A Large-Scale Video Question Answering Benchmark for Complex and Professional Sports [Paper]
  • VideoAgent: Long-form Video Understanding with Large Language Model as Agent [Paper]
  • AMEGO: Active Memory from Long EGOcentric Videos [Paper]
  • Video Instruction Tuning With Synthetic Data [Paper]
  • Multi-Frame, Lightweight & Efficient Vision-Language Models for Question Answering in Autonomous Driving [Paper]
  • Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis [Paper]
  • How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites [Paper]
  • Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution. [Paper]
  • VideoINSTA: Zero-shot Long Video Understanding via Informative Spatial-Temporal Reasoning with LLMs [Paper]
  • TC-LLaVA: Rethinking the Transfer from Image to Video Understanding with Temporal Considerations [Paper]
  • Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context [Paper]
  • ViLA: Efficient Video-Language Alignment for Video Question Answering (ECCV) [Paper]
  • STAIR: Spatial-Temporal Reasoning with Auditable Intermediate Results for Video Question Answering [Paper]
  • STAR: A Benchmark for Situated Reasoning in Real-World Videos [Paper]
  • LongVLM: Efficient Long Video Understanding via Large Language Models [Paper]
  • FunQA: Towards Surprising Video Comprehension [Paper]

2023

  • Open-Vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models (CVPR) [Paper]
  • ANetQA: A Large-Scale Benchmark for Fine-Grained Compositional Reasoning Over Untrimmed Videos (CVPR) [Paper]
  • Mist: Multi-modal iterative spatial-temporal transformer for long-form video question answering (CVPR) [Paper]
  • Discovering Spatio-Temporal Rationales for Video Question Answering (ICCV) [Paper]
  • EgoSchema: A Diagnostic Benchmark for Very Long-Form Video Language Understanding (NeurIPS) [Paper]
  • Visual Instruction Tuning (NeurIPS) [Paper]
  • A Simple LLM Framework for Long-Range Video Question-Answering (Preprint) [Paper]
  • A Large Cross-Modal Video Retrieval Dataset with Reading Comprehension [Paper]
  • InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning [Paper]
  • Video Question Answering Using CLIP-Guided Visual-Text Attention [Paper]
  • Building an Open-Vocabulary Video CLIP Model with Better Architectures, Optimization and Data [Paper]
  • Open-VCLIP: Transforming CLIP to an Open-vocabulary Video Model via Interpolated Weight Optimization [Paper]
  • Multi-Modal Correlated Network with Emotional Reasoning Knowledge for Social Intelligence Question-Answering [Paper]
  • A Dataset for Medical Instructional Video Classification and Question Answering [Paper]

2022

  • Measuring Compositional Consistency for Video Question Answering (CVPR) [Paper]
  • From Representation to Reasoning: Towards Both Evidence and Commonsense Reasoning for Video Question Answering (CVPR) [Paper]
  • Zero-Shot Video Question Answering via Frozen Bidirectional Language Models (NeurIPS) [Paper]
  • Dynamic Spatio-Temporal Modular Network for Video Question Answering [Paper]
  • Ego4D: Around the World in 3,000 Hours of Egocentric Video [Paper]
  • Flamingo: A Visual Language Model for Few-Shot Learning [Paper]
  • Saying the Unseen: Video Descriptions via Dialog Agents [Paper]
  • Learning to Answer Visual Questions from Web Videos [Paper]
  • In-the-Wild Video Question Answering [Paper]
  • FIBER: Fill-in-the-Blanks as a Challenging Video Understanding Framework [Paper]
  • VQuAD: Video Question Answering Diagnostic Dataset [Paper]
  • NEWSKVQA: Knowledge-Aware News Video Question Answering [Paper]
  • Learning to Answer Questions in Dynamic Audio-Visual Scenarios (CVPR) [Paper]

2021

  • NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions (CVPR) [Paper]
  • Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling (CVPR) [Paper]
  • AGQA: A Benchmark for Compositional Spatio-Temporal Reasoning (CVPR) [Paper]
  • On the Hidden Treasure of Dialog in Video Question Answering (ICCV) [Paper]
  • Self-Supervised Pre-training and Contrastive Representation Learning for Multiple-choice Video QA (AAAI) [Paper]
  • Hierarchical Conditional Relation Networks for Multimodal Video Question Answering [Paper]
  • TrUMAn: Trope Understanding in Movies and Animations [Paper]
  • Perceiver IO: A General Architecture for Structured Inputs & Outputs [Paper]
  • VideoGPT: Video Generation using VQ-VAE and Transformers [Paper]
  • CLIP4Clip: An Empirical Study of CLIP for End-to-End Video Clip Retrieval and Captioning (ACM) [Paper]
  • VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding [Paper]
  • Just Ask: Learning to Answer Questions from Millions of Narrated Videos [Paper]
  • AssistSR: Task-oriented Video Segment Retrieval for Personal AI Assistant [Paper]
  • SUTD-TrafficQA: A Question Answering Benchmark and an Efficient Network for Video Reasoning over Traffic Events. (CVPR) [Paper]
  • Env-QA: A Video Question Answering Benchmark for Comprehensive Understanding of Dynamic Environments (CVPR) [Paper]
  • Progressive Graph Attention Network for Video Question Answering [Paper]
  • Transferring Domain-Agnostic Knowledge in Video Question Answering [Paper]
  • Video Question Answering with Phrases via Semantic Roles [Paper]

2020

  • BERT Representations for Video Question Answering (WACV) [Paper]
  • KnowIT VQA: Answering Knowledge-Based Questions about Videos (AAAI) [Paper]
  • Divide and Conquer: Question-Guided Spatio-Temporal Contextual Attention for Video Question Answering (AAAI) [Paper]
  • TVQA+: Spatio-Temporal Grounding for Video Question Answering [Paper]
  • Video Question Answering for Surveillance (TechRxiv - Not Peer Reviewed) [Paper]
  • The MSR-Video to Text Dataset with Clean Annotations [Paper]
  • HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training (ACL) [Paper]
  • CLEVRER: CoLlision Events for Video REpresentation and Reasoning [Paper]
  • LifeQA: A Real-Life Dataset for Video Question Answering [Paper]
  • TutorialVQA: Question Answering Dataset for Tutorial Videos [Paper]
  • Video Question Answering on Screencast Tutorials (ACM) [Paper]
  • Video2Commonsense: Generating Commonsense Descriptions to Enrich Video Captioning [Paper]
  • DramaQA: Character-Centered Video Story Understanding with Hierarchical QA [Paper]

2019

  • EgoVQA: An Egocentric Video Question Answering Benchmark Dataset (CVPR) [Paper]
  • Beyond RNNs: Positional Self-Attention with Co-Attention for Video Question Answering (AAAI) [Paper]
  • Compositional Attention Networks with Two-Stream Fusion for Video Question Answering [Paper]
  • Learning to Reason with Relational Video Representation for Question Answering [Paper]
  • Video Question Answering with Spatio-Temporal Reasoning [Paper]
  • Spatio-Temporal Relation Reasoning for Video Question Answering [Paper]
  • Moments in Time Dataset: one million videos for event understanding [Paper]
  • Social-IQ: A Question Answering Benchmark for Artificial Social Intelligence [Paper]

2018

  • Multimodal Dual Attention Memory for Video Story Question Answering (CVPR) [Paper]
  • TVQA: Localized, Compositional Video Question Answering [Paper]
  • Explore Multi-Step Reasoning in Video Question Answering [Paper]
  • Towards Automatic Learning of Procedures From Web Instructional Videos [Paper]
  • Weakly-Supervised Video Object Grounding from Text by Loss Weighting and Object Interaction [Paper]
  • On the effectiveness of task granularity for transfer learning [Paper]

2017

  • A Dataset and Exploration of Models for Understanding Video Data through Fill-in-the-Blank Question Answering (CVPR) [Paper]
  • MarioQA: Answering Questions by Watching Gameplay (CVPR) [Paper]
  • Leveraging Video Description to Learn Video Question Answering (AAAI) [Paper]
  • Video Question Answering via Gradually Refined Attention over Appearance and Motion [Paper]
  • DeepStory: Video Story QA by Deep Embedded Memory Networks [Paper]
  • Video Question Answering via Hierarchical Spatio-Temporal Attention Networks [Paper]
  • The "something something" video database for learning and evaluating visual common sense [Paper]
  • Video Question Answering via Attribute-Augmented Attention Network Learning (ACM) [Paper]

2016

  • MovieQA: Understanding Stories in Movies through Question-Answering (CVPR) [Paper]
  • MSR-VTT: A Large Video Description Dataset for Bridging Video and Language (CVPR) [Paper]
  • TGIF: A New Dataset and Benchmark on Animated GIF Description (CVPR) [Paper]

2015

  • Uncovering Temporal Context for Video Question Answering [Paper]

Datasets

Year Name Key Features
2024 NExT-GQA The NExT-GQA dataset augments the NExT-QA dataset with temporal labels for Causal (“why/how”) and Temporal (“before/when/after”) questions. The annotations are done in a weakly supervised setup by labeling the validation and test sets: 8,911 QA pairs from 1,557 videos are annotated with 10,531 valid temporal segments.
2024 MVBench The MVBench dataset focuses on evaluating multi-modal video understanding by covering 20 complex video tasks that emphasize temporal reasoning, from perception to cognition. It includes over 566,747 video clips from diverse sources such as COCO, WebVid, and YouCook2, and covers a wide variety of task types, such as question answering, captioning, and conversation, with more than 200 multiple-choice questions generated for each temporal understanding task.
2024 LVBench The LVBench dataset consists of 103 videos, each with a minimum duration of 30 minutes. There are a total of 1549 question-answer pairs associated with these videos, with an average of 24 questions per hour of video content
2024 FunQA FunQA is a video question-answering dataset featuring 4.3K counter-intuitive and humorous video clips with 312K free-text QA pairs, an average answer length of 34.2 words, and subsets like HumorQA, CreativeQA, and MagicQA highlighting humor, creativity, and magic-themed reasoning
2024 MedVidQA The MedVidQA dataset comprises 3,010 human-annotated instructional questions and visual answers from 900 health-related videos. It forms part of a challenge with two tasks: medical instructional question generation and Video Corpus Visual Answer Localization (VCVAL).
2024 Video-MME Video-MME is a comprehensive benchmark designed to evaluate Multi-Modal Large Language Models (MLLMs) in video analysis. It covers short (< 2 min), medium (4-15 min), and long (30-60 min) videos to test MLLMs' ability to process varying time frames, spans 6 primary domains (such as Knowledge, Film and TV, Sports, Life Records, and Multilingualism) with 30 subfields to ensure broad generalizability, and integrates video frames, subtitles, and audio.
2024 CinePile The CinePile dataset consists of 9,396 movie clips sourced from the Movieclips YouTube channel, divided into training and testing splits of 9,248 and 148 videos, respectively. Through a question-answer generation and filtering pipeline, the dataset produced 298,888 training points and 4,940 test-set points, averaging 32 questions per video scene.
2023 TextVR The TextVR dataset is a large-scale cross-modal video retrieval dataset, containing 42,200 sentence queries for 10,500 videos across eight scenario domains, including Street View, Game, Sports, Driving, Activity, TV Show, and Cooking.
2023 Social-IQ-2.0 This dataset is from the Social-IQ 2.0 challenge and consists of 1,000 videos, 6,000 questions, and 24,000 answers. The challenge was co-hosted with the Artificial Social Intelligence Workshop at ICCV'23.
2023 VideoChat VideoChat is a video-centric multimodal instruction dataset based on WebVid-10M. The project features a 100K video-instruction dataset created using human-assisted and semi-automatic annotation techniques.
2022 Ego4D Ego4D is a comprehensive egocentric video dataset comprising 3,670 hours of daily-life activities recorded by 931 camera wearers across 74 locations in 9 countries, covering various scenarios like household, outdoor, and workplace settings.
2022 NEWSKVQA NEWSKVQA is a new dataset of 12K news videos spanning across 156 hours with 1M multiple-choice question-answer pairs covering 8263 unique entities.
2022 MedVidQACL This dataset consists of 899 medical instructional videos (each about 4 minutes long) and 3K manually annotated questions based on those videos.
2022 FIBER The FIBER dataset consists of 28K videos (each about 10 seconds long) and 28K questions, covering MCQ-type questions as well as video captioning data.
2022 Causal-VidQA This dataset consists of 26K videos with 107K manually annotated questions.
2022 MUSIC-AVQA This dataset consists of 9.3K music videos, each 60 seconds long, with 45K manually annotated QA pairs.
2022 VQuAD This dataset consists of 7K synthetic videos with 1.3 million questions covering spatial and temporal properties.
2022 STAR STAR is a dataset for situated reasoning that provides challenging question-answering tasks, symbolic situation descriptions, and logic-grounded diagnosis via real-world video situations. It consists of 4 question types, 60K situated questions, 23K situation video clips, and 140K situation hypergraphs.
2022 In-the-Wild This dataset consists of videos recorded outdoors (survival, agriculture, natural disaster, and military): 369 videos with 916 questions, each video about a minute and 10 seconds long.
2022 AGQA 2.0 AGQA 2.0 is the successor to AGQA. It provides a benchmark of 96.85M question-answer pairs and a balanced subset of 2.27M question-answer pairs.
2022 WebVidVQA3M Consists of 2M web videos (each about 4 minutes long) with 3M automatically generated questions.
2021 HowToVQA69M Consists of 69M video clips (each about 2 minutes long) with 69M question-answer pairs automatically generated from narrated videos.
2021 iVQA This dataset consists of 10K videos with 10K questions, each video about 8 minutes long.
2021 PanoAVQA The PanoAVQA dataset consists of 360-degree panoramic videos: a total of 5.4K videos with 20K spatial and 31.7K audio-visual QA pairs.
2021 AGQA Action Genome Question Answering (AGQA) is a benchmark for compositional spatiotemporal reasoning. AGQA contains 192M unbalanced question-answer pairs for 9.6K videos. It also contains a balanced subset of 3.9M question-answer pairs
2021 Video-QAP The Video-QAP dataset consists of 35K web videos (averaging 36.2 seconds each) and 162K questions.
2021 KnowIT-X-VQA An extension of the KnowIT dataset; it consists of 12.1K TV video clips and 21.4K questions.
2021 Charades-SRL-QA Consists of 9.5K home-made videos from Charades (each about 29 seconds long) with 71K questions.
2021 NExTQA The NExT-QA dataset comprises 5,440 videos, split into 3,870 for training, 570 for validation, and 1,000 for testing. It features around 52,044 question-answer pairs, with approximately 47,692 for multiple-choice QA and 52,044 for open-ended QA. The questions are divided into three main types: causal questions (48% of the dataset), temporal questions (29%), and descriptive questions (23%).
2021 LSMDC-QA (Requires request access) LSMDC-QA (Large Scale Movie Description Challenge) contains 118,081 short video clips extracted from 202 movies. It consists of 7,408 clips, and evaluation is performed on a test set of 1,000 videos from a disjoint set of movies.
2021 Env-QA Env-QA consists of 23.3K videos collected in the AI2-THOR simulator and 85.1K questions.
2021 SUTD-TrafficQA SUTD-TrafficQA is a VideoQA dataset consisting of 10,080 in-the-wild videos and 62,535 annotated QA pairs for complex traffic-based scenarios.
2020 CLEVRER CLEVRER focuses on temporal reasoning and inference over synthetic videos. It consists of 10K videos (each 5 seconds long) and 305K questions.
2020 LifeQA LifeQA consists of videos of day-to-day activities. It contains 275 video clips and over 2.3K multiple-choice questions.
2020 How2R-and-How2QA The How2R and How2QA datasets contain 9,371 and 9,035 episodes, with 24,328 and 21,509 clips averaging around 17 seconds each, divided into training, validation, and testing sets.
2021 TGIF-QA-R This dataset consists of 71K GIFs (each about 3 seconds long) and 165K questions. It is an extended version of the TGIF-QA dataset.
2020 DramaQA DramaQA dataset is built upon the TV drama "Another Miss Oh" and it contains 17,983 QA pairs from 23,928 various-length video clips, with each QA pair belonging to one of four difficulty levels.
2020 KnowITVQA KnowITVQA is a video dataset with 24,282 human-generated question-answer pairs about The Big Bang Theory, spanning 207 videos of about 20 minutes each.
2020 V2C-QA The Video2Commonsense dataset consists of 1.5K web videos and 37K questions for video captioning and VideoQA.
2020 PsTuts-VQA The PsTuts dataset includes 76 videos (5.6 hours in total), 17,768 question-answer pairs, and a domain knowledge base with 1,236 entities and 2,196 options. It focuses on video tutorials.
2019 Social-IQ This Kaggle repository hosts the Social-IQ dataset, which contains 1,250 natural in-the-wild social situations, 7,500 questions, and 52,500 correct and incorrect answers.
2019 TutorialVQAD TutorialVQAD consists of tutorials for image-editing software, with 76 videos and 6,195 questions in total.
2019 Moments in Time Dataset The Moments in Time dataset consists of one million videos, each 3 seconds long, with 339 different classes.
2018 TVQA TVQA is a large-scale video question-answering dataset built from six popular TV shows, including Friends, The Big Bang Theory, and How I Met Your Mother. It contains 152.5K QA pairs sourced from 21.8K video clips, covering over 460 hours of content.
2018 SVQA The SVQA dataset consists of attribute comparison, count, integer comparison, exist, and query type questions over almost 12K synthetic videos, with 118K questions in total.
2018 YouCook2 YouCook2 is one of the largest instructional video datasets focused on task-oriented cooking, featuring 2,000 untrimmed videos from 89 recipes, with an average of 22 videos per recipe. Each video, averaging 5.26 minutes and totalling 176 hours, includes annotated procedure steps with their corresponding temporal boundaries.
2018 TVQA+ TVQA+ includes 29.4K multiple-choice questions grounded in both temporal and spatial domains. A set of visual concept words—objects and people—are identified to collect spatial groundings, and corresponding object regions in individual frames are annotated with bounding boxes.
2017 TGIF-QA TGIF-QA, a large-scale dataset, contains 165K question-answer pairs based on animated GIFs, testing video-based Visual Question Answering (VQA) across four question types: Repetition Count, Repeating Action, State Transition, and Frame QA.
2017 MarioQA MarioQA is a dataset specifically designed for video-based question-answering in the context of Super Mario Bros. gameplay, containing over 70,000 question-answer pairs linked to gameplay footage.
2017 VideoQA A dataset of 18,100 automatically crawled user-generated web videos and titles (each about 90 seconds long), with 174K questions.
2017 Something-Something v1 & v2 Something-Something is a collection of 220,847 labelled video clips of humans performing predefined basic actions with everyday objects. The dataset comprises 220,847 videos divided into a training set of 168,913, a validation set of 24,777, and a test set of 27,157 (without labels), totalling 174 unique labels.
2016 MSVD-QA The MSVD-QA dataset is a Video Question Answering (VideoQA) dataset derived from the Microsoft Research Video Description (MSVD) dataset, which includes around 120K sentences describing over 2,000 video snippets. The dataset includes 1,970 video clips and approximately 50.5K QA pairs.
2016 MSRVTT-QA MSRVTT-QA consists of 10K web video clips with a total duration of 41.2 hours. It spans 200k clip-sentence pairs. Each video clip is annotated with about 20 natural sentences.
2016 MovieQA The MovieQA dataset is designed for movie question answering, aimed at evaluating automatic story comprehension through both video and text. It contains nearly 15,000 multiple-choice questions derived from over 400 movies.
2016 PororoQA The Pororo dataset based on children's cartoons features a simple story structure with episodes averaging 7.2 minutes, where similar events are frequently repeated. The dataset comprises 8,834 QA pairs, with an average of 51.66 questions per episode, excluding ambiguous or unrelated questions.
2015 VideoQA(FIB) This dataset consists of 109K video clips from multiple sources, totaling over 1,000 hours, with 390,744 questions.
2014 ActivityNet ActivityNet is a large-scale video benchmark for human activity understanding, aiming to cover a wide range of complex human activities. It provides samples from 203 activity classes with an average of 137 untrimmed videos per class and 1.41 activity instances per video, for a total of 849 video hours.
2013 YouTube2Text-QA The YouTube2Text data consists of 1,987 videos with 122,708 short descriptions.
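
Most of the datasets above pair each clip with either multiple-choice or open-ended QA annotations. A minimal sketch of a multiple-choice evaluation loop is shown below; the record fields (video, question, options, answer_idx) are illustrative assumptions rather than any specific dataset's schema, so adapt them to the benchmark you use:

```python
# Minimal multiple-choice VideoQA accuracy sketch. Field names are
# illustrative; real datasets (NExT-QA, TVQA, MVBench, ...) each define
# their own annotation schema.
from typing import Callable, Iterable


def evaluate_multiple_choice(
    records: Iterable[dict],
    predict: Callable[[str, str, list], int],
) -> float:
    """Accuracy of predict(video_path, question, options) -> chosen option index."""
    correct, total = 0, 0
    for rec in records:
        pred_idx = predict(rec["video"], rec["question"], rec["options"])
        correct += int(pred_idx == rec["answer_idx"])
        total += 1
    return correct / max(total, 1)


if __name__ == "__main__":
    # Toy example with a dummy predictor that always picks option 0.
    toy_records = [
        {"video": "clip1.mp4", "question": "What happens first?",
         "options": ["A dog barks", "A car stops"], "answer_idx": 0},
        {"video": "clip2.mp4", "question": "Why does the person stop?",
         "options": ["Red light", "Phone call"], "answer_idx": 1},
    ]
    print(evaluate_multiple_choice(toy_records, lambda v, q, o: 0))  # prints 0.5
```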

Models

Open Source Models

Model Name Links
InternVL Hugging Face, GitHub
LLaVA Hugging Face, GitHub
LITA GitHub
End2End ChatBot Hugging Face, GitHub
VideoLLaMA2 Hugging Face, GitHub
FrozenBiLM GitHub
Perceiver IO Hugging Face, GitHub
InstructBlipVideo Hugging Face, GitHub
VideoGPT Hugging Face, GitHub
Qwen2-VL Hugging Face, GitHub
ViLA GitHub
LongVLM GitHub
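
As a starting point, a minimal single-frame sketch using a LLaVA checkpoint from the Hugging Face Hub is shown below; the checkpoint id, prompt format, and local video path are assumptions, and the video-native models in the table (e.g., VideoLLaMA2, InstructBlipVideo) ingest multiple frames, so check each model card for its exact processor and prompt format:

```python
# Sketch: answer a question about one sampled frame with LLaVA-1.5 via
# Hugging Face transformers. A single frame is only an approximation of
# true VideoQA; "traffic_clip.mp4" is a hypothetical local file.
import cv2
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint name on the Hub
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id, device_map="auto")

# Grab the middle frame of the clip.
cap = cv2.VideoCapture("traffic_clip.mp4")
cap.set(cv2.CAP_PROP_POS_FRAMES, int(cap.get(cv2.CAP_PROP_FRAME_COUNT)) // 2)
ok, frame = cap.read()
cap.release()
image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

# LLaVA-1.5 chat-style prompt with an <image> placeholder.
prompt = "USER: <image>\nHow many vehicles are waiting at the intersection? ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```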

Closed Source Models

Model Name API Link
ChatGPT Here
Gemini Here
Llama 3.2 Here
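
The hosted models above are typically queried by sampling a handful of frames and sending them alongside the question. A minimal sketch against the OpenAI chat completions API is shown below; the model name gpt-4o and the local video path are placeholders, so check the docs listed under Additional Resources for current model names and API keys:

```python
# Sketch: frame-sampled VideoQA against a hosted multimodal API.
# Assumes the openai Python SDK (v1+) and OPENAI_API_KEY in the environment;
# "traffic_clip.mp4" and "gpt-4o" are placeholders.
import base64

import cv2
from openai import OpenAI


def sample_frames_b64(path, num_frames=8):
    """Uniformly sample frames and return them as base64-encoded JPEGs."""
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, i * total // num_frames)
        ok, frame = cap.read()
        if not ok:
            break
        ok, jpg = cv2.imencode(".jpg", frame)
        frames.append(base64.b64encode(jpg.tobytes()).decode("utf-8"))
    cap.release()
    return frames


client = OpenAI()
question = "Did any vehicle run the red light in this clip?"
content = [{"type": "text", "text": question}] + [
    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}
    for b64 in sample_frames_b64("traffic_clip.mp4")
]
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": content}],
)
print(response.choices[0].message.content)
```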

Additional Resources

  1. OpenAI Docs
  2. Gemini Docs
  3. LLAMA Docs
  4. Azure Samples