Shehan Munasinghe, Hanan Gani, Wenqi Zhu, Jiale Cao, Eric Xing, Fahad Shahbaz Khan, Salman Khan
Mohamed bin Zayed University of Artificial Intelligence, Tianjin University, Linköping University, Australian National University, Carnegie Mellon University
- 📦 Code and checkpoints will be released soon. Stay tuned!
VideoGLaMM is a large multimodal video model capable of pixel-level visual grounding. The model responds to natural language queries from the user and intertwines spatio-temporal object masks in its generated textual responses to provide a detailed understanding of video content. VideoGLaMM seamlessly connects three key components: a Large Language Model (LLM), dual vision encoders, and a spatio-temporal pixel decoder. The dual vision encoders extract spatial and temporal features separately, which are jointly passed to the LLM to produce responses rich in both spatial and temporal cues. This is facilitated by end-to-end training on our proposed Grounded Conversation Generation (GCG) benchmark dataset featuring 38k video-QA triplets with 83k objects and 671k fine-grained masks.
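The flow described above can be summarized with a minimal PyTorch-style sketch. All module names, toy dimensions, the stand-in backbones, and the single segmentation-token position are assumptions made purely for illustration; they do not reflect the released implementation.

```python
# Minimal sketch of the described flow, assuming toy dimensions and stand-in
# backbones; module names and shapes are illustrative, not the released code.
import torch
import torch.nn as nn

class VideoGLaMMSketch(nn.Module):
    def __init__(self, d_spatial=256, d_temporal=256, d_llm=512, d_prompt=128):
        super().__init__()
        # Dual vision encoders: per-frame spatial features and clip-level temporal features
        self.spatial_encoder = nn.Linear(3 * 32 * 32, d_spatial)    # stand-in for an image backbone
        self.temporal_encoder = nn.Linear(3 * 32 * 32, d_temporal)  # stand-in for a video backbone
        # V-L adapters project both feature streams into the LLM token space
        self.spatial_adapter = nn.Linear(d_spatial, d_llm)
        self.temporal_adapter = nn.Linear(d_temporal, d_llm)
        # Stand-in for the LLM over interleaved visual + text tokens
        layer = nn.TransformerEncoderLayer(d_model=d_llm, nhead=8, batch_first=True)
        self.llm = nn.TransformerEncoder(layer, num_layers=2)
        # L-V adapter maps segmentation-token hidden states to pixel-decoder prompts
        self.lv_adapter = nn.Linear(d_llm, d_prompt)

    def forward(self, frames, text_embeds):
        # frames: (T, 3, 32, 32); text_embeds: (N_text, d_llm)
        flat = frames.flatten(1)
        spatial_tok = self.spatial_adapter(self.spatial_encoder(flat))     # (T, d_llm)
        temporal_tok = self.temporal_adapter(self.temporal_encoder(flat))  # (T, d_llm)
        tokens = torch.cat([spatial_tok, temporal_tok, text_embeds], 0).unsqueeze(0)
        hidden = self.llm(tokens)                    # joint spatial + temporal + text reasoning
        # The last token stands in for a segmentation token whose hidden state
        # would prompt the spatio-temporal pixel decoder to produce object masks.
        seg_prompt = self.lv_adapter(hidden[:, -1])  # (1, d_prompt)
        return hidden, seg_prompt

model = VideoGLaMMSketch()
frames = torch.randn(8, 3, 32, 32)   # 8 sampled video frames
text = torch.randn(16, 512)          # 16 already-embedded text tokens
hidden, seg_prompt = model(frames, text)
```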
- We introduce Video Grounded Large Multi-modal Model (VideoGLaMM), a large multimodal video model capable of pixel-level visual grounding, featuring an end-to-end alignment mechanism.
- To achieve fine-grained spatio-temporal alignment, we introduce a benchmark Grounded Conversation Generation (GCG) dataset consisting of 38k grounded video-QA triplets, 83k objects, and roughly 671k fine-grained spatio-temporal masks (see the illustrative data sketch after this list).
- We assess VideoGLaMM across diverse tasks spanning grounded conversation generation, visual grounding, and referring video segmentation, where it achieves state-of-the-art performance.
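For a concrete sense of what a grounded video-QA triplet contains, here is a purely illustrative example; the field names, the `<p>...</p>` phrase tags, and the `[SEG]` markers are assumptions rather than the dataset's actual schema.

```python
# Hypothetical structure of one grounded video-QA triplet; every field name,
# the <p>...</p> phrase tags, and [SEG] markers are assumptions for illustration.
sample = {
    "video": "example_clip.mp4",
    "question": "What is the person on the left doing?",
    # Grounded answer: noun phrases are tagged and tied to segmentation targets.
    "answer": "<p>The person on the left</p> [SEG] is riding <p>a bicycle</p> [SEG].",
    "objects": [
        {   # One entry per tagged phrase, with a mask for each annotated frame.
            "phrase": "The person on the left",
            "masks": {"frame_000": "rle_or_polygon", "frame_010": "rle_or_polygon"},
        },
        {
            "phrase": "a bicycle",
            "masks": {"frame_000": "rle_or_polygon", "frame_010": "rle_or_polygon"},
        },
    ],
}
```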
VideoGLaMM consists of the following key components: (i) a Spatio-Temporal Dual Encoder, (ii) Dual Alignment V-L Adapters for image and video features, (iii) a Large Language Model (LLM), (iv) an L-V Adapter, and (v) a Promptable Pixel Decoder.
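To make the last two components concrete, the sketch below shows how a prompt vector produced by the L-V adapter might condition a promptable mask decoder to yield per-frame masks; the decoder here is a simplified stand-in, not the actual Promptable Pixel Decoder.

```python
# Stand-in promptable pixel decoder: given per-frame visual features and one
# prompt vector (from the L-V adapter), emit a binary mask per frame.
import torch
import torch.nn as nn

class PromptablePixelDecoderSketch(nn.Module):
    def __init__(self, d_prompt=128, d_feat=128):
        super().__init__()
        self.to_query = nn.Linear(d_prompt, d_feat)

    def forward(self, frame_feats, prompt):
        # frame_feats: (T, d_feat, H, W); prompt: (1, d_prompt)
        query = self.to_query(prompt)                        # (1, d_feat)
        # Dot-product between the prompt query and every spatial feature location
        logits = torch.einsum("tchw,qc->tqhw", frame_feats, query)
        return logits.sigmoid() > 0.5                        # (T, 1, H, W) boolean masks

decoder = PromptablePixelDecoderSketch()
masks = decoder(torch.randn(8, 128, 16, 16), torch.randn(1, 128))
```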
We propose a semi-automatic annotation pipeline for creating a Grounded Conversation Generation (GCG) dataset for videos.
Given user queries, VideoGLaMM generates textual responses and grounds objects and phrases with pixel-level masks, demonstrating its detailed understanding of the video.
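Since the code and checkpoints are not yet released, the snippet below only illustrates how such an interaction might look; the loader, the `chat` call, and the `[SEG]`-token convention are hypothetical placeholders, not a real API.

```python
# Hypothetical interaction flow; `load_videoglamm` and `model.chat` are made-up
# placeholders to illustrate grounded responses, not a released API.
# model = load_videoglamm("checkpoints/videoglamm")            # not yet available
# response, masks = model.chat(
#     video="demo.mp4",
#     query="Describe the video and segment the main objects.",
# )
# Expected form of the output: text with segmentation markers, plus one
# spatio-temporal mask (frames x H x W) per marked phrase, e.g.:
# response == "<p>A dog</p> [SEG] chases <p>a ball</p> [SEG] across the lawn."
# len(masks) == 2   # one mask tensor per [SEG]-tagged phrase
```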
@article{munasinghe2024videoglamm,
  title={VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos},
  author={Shehan Munasinghe and Hanan Gani and Wenqi Zhu and Jiale Cao and Eric Xing and Fahad Khan and Salman Khan},
  journal={arXiv preprint arXiv:2411.04923},
  year={2024},
  url={https://arxiv.org/abs/2411.04923}
}