
VideoGLaMM

Shehan Munasinghe, Hanan Gani, Wenqi Zhu, Jiale Cao, Eric Xing, Fahad Shahbaz Khan, Salman Khan

Mohamed bin Zayed University of Artificial Intelligence, Tianjin University, Linköping University, Australian National University, Carnegie Mellon University

Website | Paper


📢 Latest Updates

  • 📦 Code and checkpoints will be released soon. Stay tuned!

Overview

VideoGLaMM Architectural Overview

VideoGLaMM is a large multimodal video model capable of pixel-level visual grounding. The model responds to natural language queries from the user and intertwines spatio-temporal object masks in its generated textual responses to provide a detailed understanding of video content. VideoGLaMM seamlessly connects three key components: a Large Language Model (LLM), dual vision encoders, and a spatio-temporal pixel decoder. The dual vision encoders extract spatial and temporal features separately, which are jointly passed to the LLM to output responses rich in both spatial and temporal cues. This is facilitated by end-to-end training on our proposed benchmark Grounded Conversation Generation (GCG) dataset featuring 38k Video-QA triplets with 87k objects and 671k fine-grained masks.
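
To make the GCG data format concrete, a training sample can be thought of as a video-QA triplet whose answer is linked, phrase by phrase, to spatio-temporal masks. The sketch below is a minimal, hypothetical schema; the field and class names are ours for illustration and do not reflect the released dataset format.

```python
from dataclasses import dataclass
from typing import List, Tuple
import numpy as np

@dataclass
class GroundedPhrase:
    """A noun phrase in the answer, linked to one object's mask track."""
    phrase: str                       # e.g. "the brown dog"
    char_span: Tuple[int, int]        # (start, end) offsets into `answer`
    masks: List[np.ndarray]           # one binary H x W mask per sampled frame

@dataclass
class GCGSample:
    """One video-QA triplet from the grounded conversation generation set."""
    video_path: str                   # source video clip
    question: str                     # user query about the video
    answer: str                       # free-form response containing the phrases
    grounded_phrases: List[GroundedPhrase]  # spatio-temporal grounding per object
```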


🏆 Highlights

  1. We introduce the Video Grounded Large Multimodal Model (VideoGLaMM), a large multimodal video model capable of pixel-level visual grounding, featuring an end-to-end alignment mechanism.

  2. To achieve fine-grained spatio-temporal alignment, we introduce a benchmark Grounded Conversation Generation (GCG) dataset consisting of 38k grounded video-QA triplets, 83k objects, and roughly 671k fine-grained spatio-temporal masks.

  3. We assess the performance of VideoGLaMM across diverse tasks spanning grounded conversation generation, visual grounding, and referring video segmentation, where it achieves state-of-the-art performance.


Architecture

VideoGLaMM Architecture

VideoGLaMM consists of the following key components: (i) Spatio-Temporal Dual Encoder, (ii) Dual Alignment V-L Adapters for image and video features, (iii) Large Language Model (LLM), (iv) L-V Adapter, and (v) Promptable Pixel Decoder.
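
As a rough picture of how these five components fit together, the PyTorch-style sketch below wires them into a single forward pass. This is a minimal sketch under our own assumptions; all submodules, names, and return values are placeholders and do not describe the released implementation.

```python
import torch
import torch.nn as nn

class VideoGLaMMSketch(nn.Module):
    """Illustrative wiring of the five components; every submodule is a placeholder."""

    def __init__(self, image_encoder, video_encoder, img_adapter, vid_adapter,
                 llm, lv_adapter, pixel_decoder):
        super().__init__()
        self.image_encoder = image_encoder  # (i) spatial branch of the dual encoder
        self.video_encoder = video_encoder  # (i) temporal branch of the dual encoder
        self.img_adapter = img_adapter      # (ii) V-L adapter for image features
        self.vid_adapter = vid_adapter      # (ii) V-L adapter for video features
        self.llm = llm                      # (iii) large language model
        self.lv_adapter = lv_adapter        # (iv) L-V adapter: LLM states -> decoder prompts
        self.pixel_decoder = pixel_decoder  # (v) promptable pixel decoder

    def forward(self, frames, text_tokens):
        # Extract spatial (per-frame) and temporal (clip-level) features separately.
        spatial_feats = self.image_encoder(frames)
        temporal_feats = self.video_encoder(frames)

        # Project both streams into the LLM token space and feed them with the text.
        visual_tokens = torch.cat(
            [self.img_adapter(spatial_feats), self.vid_adapter(temporal_feats)], dim=1)
        llm_out = self.llm(visual_tokens, text_tokens)  # text logits + hidden states

        # Hidden states tied to segmentation tokens prompt the pixel decoder for masks.
        seg_prompts = self.lv_adapter(llm_out.hidden_states)
        masks = self.pixel_decoder(frames, seg_prompts)
        return llm_out.logits, masks
```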


Benchmark and Annotation Pipeline

Annotation Pipeline

We propose a semi-automatic annotation pipeline for creating a grounded conversation generation (GCG) dataset for videos.


Examples 🔍

Given user queries, VideoGLaMM generates textual responses and grounds objects and phrases using pixel-level masks, demonstrating its detailed understanding of the video.

VideoGLaMM Architecture
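
To show what consuming such a grounded response might look like, the sketch below assumes the model returns answer text containing one segmentation token per grounded object, together with a per-object sequence of frame masks, and overlays each object on the video. The function, the `[SEG]` token name, and the output layout are illustrative assumptions, not the repository's API.

```python
import numpy as np

def overlay_grounded_response(frames, answer_text, masks, seg_token="[SEG]"):
    """Pair each segmentation token in the answer with its predicted mask track.

    frames:      list of T frames, each an H x W x 3 uint8 array
    answer_text: model response containing one seg_token per grounded object (assumed)
    masks:       list of per-object mask sequences, each of shape (T, H, W), bool
    """
    assert answer_text.count(seg_token) == len(masks), "one mask track per token"

    overlays = []
    for obj_masks in masks:                      # one entry per grounded object
        colored = []
        for frame, mask in zip(frames, obj_masks):
            out = frame.copy()
            # Blend the grounded region with red to visualize the mask.
            out[mask] = (0.5 * out[mask] + 0.5 * np.array([255, 0, 0])).astype(np.uint8)
            colored.append(out)
        overlays.append(colored)                 # per-object highlighted frame sequences
    return overlays
```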


Citation 📜

```bibtex
@article{munasinghe2024videoglamm,
  title={VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos},
  author={Shehan Munasinghe and Hanan Gani and Wenqi Zhu and Jiale Cao and Eric Xing and Fahad Khan and Salman Khan},
  journal={ArXiv},
  year={2024},
  url={https://arxiv.org/abs/2411.04923}
}
```