VideoLLaMB: Long Video Understanding with Recurrent Memory Bridges

Recent advancements in large-scale video-language models demonstrate remarkable capabilities in real-time planning and interaction with real-world environments, yet their training is constrained by high computational costs and limited annotated datasets. Traditional methods, such as video compression and sliding-window techniques, often compromise critical visual information or disrupt semantic flow. In addition, current predesigned QA benchmarks fail to adequately assess long video understanding because of inherent biases from static image features and the base LLM. To address these issues, we introduce VideoLLaMB, a framework that uses Memory Bridge Layers with recurrent memory tokens to encode entire video content without discarding vital information. We also propose the SceneTilling algorithm, which splits a video into semantic units to preserve semantic flow. Finally, we present the "Needle in a Video Haystack" benchmark to comprehensively evaluate long video understanding with needles of different modalities.
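As a rough illustration of splitting a video into semantic units, the sketch below applies a TextTiling-style boundary search to per-frame feature vectors. It is not the repository's SceneTilling implementation: the function names, the cosine similarity between adjacent frames, the simplified depth score, and the `num_std` threshold are all assumptions made for this example.

```python
import numpy as np


def cosine_adjacent(features: np.ndarray) -> np.ndarray:
    """Cosine similarity between each frame feature and the next one."""
    # Assumes no all-zero feature rows (those would divide by zero).
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    return np.sum(normed[:-1] * normed[1:], axis=1)


def depth_scores(sims: np.ndarray) -> np.ndarray:
    """Simplified depth score: how far each gap sits below the highest
    similarity on its left and on its right (an assumption; TextTiling
    proper climbs only to the nearest peak on each side)."""
    depths = np.zeros_like(sims)
    for i in range(len(sims)):
        left_peak = sims[: i + 1].max()
        right_peak = sims[i:].max()
        depths[i] = (left_peak - sims[i]) + (right_peak - sims[i])
    return depths


def split_segments(features: np.ndarray, num_std: float = 1.0):
    """Return (start, end) frame index pairs, end exclusive."""
    sims = cosine_adjacent(features)
    depths = depth_scores(sims)
    # Hypothetical threshold: gaps much deeper than average become cuts.
    threshold = depths.mean() + num_std * depths.std()
    # Gap i lies between frame i and frame i + 1.
    cuts = [i + 1 for i, d in enumerate(depths) if d > threshold]
    bounds = [0] + cuts + [len(features)]
    return list(zip(bounds[:-1], bounds[1:]))


if __name__ == "__main__":
    # Eight toy frames: four from one "scene", four from another.
    frames = np.array([[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4)
    print(split_segments(frames))
```

In this toy input the only similarity drop is between frames 3 and 4, so the video is split into two segments at that point. Each segment could then be fed to the model in order, which is where the recurrent memory tokens described above would carry information from one segment to the next.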

Install

  1. Clone this repository and navigate to the VideoLLaMB folder
git clone https://github.com/bigai-nlco/VideoLLaMB.git
cd VideoLLaMB
  2. Install the package
conda create -n videollamb python=3.10 -y
conda activate videollamb
pip install --upgrade pip
pip install -e .
conda install ffmpeg
  3. Install additional packages for training
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
pip install flash-attn --no-build-isolation --no-cache-dir
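
After installation, a quick sanity check can confirm that the pieces named above are visible from the active environment. The helper below is a hypothetical convenience script, not part of the repository; it assumes the package is importable as `llava` (as the `python -m llava.serve.cli` command suggests) and that flash-attn installs a module named `flash_attn`.

```python
import importlib.util
import shutil
import sys


def check_environment(modules=("llava", "flash_attn"), tools=("ffmpeg",)):
    """Report which modules are importable and which tools are on PATH."""
    report = {"python": "%d.%d" % sys.version_info[:2]}
    for name in modules:
        # find_spec returns None when a top-level module cannot be located.
        report[f"module:{name}"] = importlib.util.find_spec(name) is not None
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        report[f"tool:{tool}"] = shutil.which(tool) is not None
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

A `False` next to `module:flash_attn` is expected if you skipped the training extras.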

QuickStart With CLI

Download the checkpoint, place it in the checkpoints directory, then run the following command:

python -m llava.serve.cli --model-path checkpoints/videollamb-llava-1.5-7b --video-file XXX.mp4
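
If you prefer to launch the CLI from a script, the command above can be wrapped as follows. This is a minimal sketch: `build_cli_command` and `run_cli` are hypothetical helper names, and only the `--model-path` and `--video-file` flags shown above are used.

```python
import subprocess
import sys
from pathlib import Path


def build_cli_command(model_path: str, video_file: str) -> list:
    """Build the argument list for the QuickStart CLI command."""
    return [
        sys.executable, "-m", "llava.serve.cli",
        "--model-path", model_path,
        "--video-file", video_file,
    ]


def run_cli(model_path: str, video_file: str) -> int:
    """Fail early on missing paths, then run the CLI and return its exit code."""
    for path in (model_path, video_file):
        if not Path(path).exists():
            raise FileNotFoundError(path)
    return subprocess.run(build_cli_command(model_path, video_file)).returncode


if __name__ == "__main__":
    # XXX.mp4 is the placeholder from the command above; substitute your own video.
    print(" ".join(build_cli_command("checkpoints/videollamb-llava-1.5-7b", "XXX.mp4")))
```

Checking the paths first gives a clearer error than waiting for the model loader to fail.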

Streaming Video Caption with CLI

Download the checkpoint, place it in the checkpoints directory, then run the following command:

python -m llava.serve.cli_streaming --model_path checkpoints/videollamb-llava-1.5-7b

Gradio Demo

Download the checkpoint, place it in the checkpoints directory, then run the following command:

python -m llava.serve.gradio_demo

Train

  1. Prepare data

We combine the video instruction data from PLLaVA with the image instruction data from LLaVA for training. Please check DATA for details.

  2. Prepare model weights for initialization

Our model is initialized from LLaVA. Download llava-v1.5-7b and place it in checkpoints/llava-v1.5-7b. The visual encoders come from LanguageBind: download LanguageBind_Image and LanguageBind_Video_merge, and place them in checkpoints/LanguageBind_Image and checkpoints/LanguageBind_Video_merge.

  3. Start training

Training takes 23 hours for LLaVA-1.5-7B on 4 A800 (80 GB) GPUs.

bash scripts/finetune_video_image.slurm # bash
sbatch scripts/finetune_video_image.slurm # slurm cluster

We also provide a script to backpropagate the LLM loss to the bridge for each recurrent iteration.

bash scripts/finetune_video_image_loss.slurm # bash
sbatch scripts/finetune_video_image_loss.slurm # slurm cluster
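
Before submitting a training job, it can help to confirm that the initialization weights from step 2 are where the scripts expect them. The check below is a hypothetical helper, not part of the repository; it only tests that the three directories named above exist, not that their contents are complete.

```python
from pathlib import Path

# Directory names taken from the "Prepare model weights" step above.
EXPECTED_DIRS = (
    "llava-v1.5-7b",
    "LanguageBind_Image",
    "LanguageBind_Video_merge",
)


def missing_checkpoints(root="checkpoints"):
    """Return the expected checkpoint directories that are absent under root."""
    root = Path(root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing under checkpoints/:", ", ".join(missing))
    else:
        print("All initialization weights found.")
```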

Evaluate

  1. Prepare data

We provide evaluation pipelines for EgoSchema, NExTQA, EgoPlan, and MVBench. Please check DATA for details.

  2. Start evaluating

a. Traditional Benchmark

bash scripts/eval/egoschema.sh # egoschema
bash scripts/eval/nextqa.sh # nextqa
bash scripts/eval/egoplan.sh # egoplan
bash scripts/eval/mvbench.sh # mvbench

b. MM-NIAVH

See our benchmark, Needle In A Video Haystack (NIAVH).
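
For readers unfamiliar with needle-in-a-haystack testing, the general idea is to place an unrelated item (the needle) at a chosen depth inside a long input (the haystack) and ask the model to retrieve it. The sketch below illustrates only that general idea on a list of frames; it is not the MM-NIAVH code, and `insert_needle`, `depth_sweep`, and the depth convention are assumptions made for this example.

```python
def insert_needle(haystack, needle, depth):
    """Insert needle into a copy of haystack at a relative depth in [0, 1].

    depth=0.0 places the needle at the start, depth=1.0 at the end.
    Returns the new sequence and the index where the needle landed.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    index = round(depth * len(haystack))
    return haystack[:index] + [needle] + haystack[index:], index


def depth_sweep(haystack, needle, depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Yield one (depth, sequence, index) test case per requested depth."""
    for depth in depths:
        sequence, index = insert_needle(haystack, needle, depth)
        yield depth, sequence, index


if __name__ == "__main__":
    frames = [f"frame_{i}" for i in range(8)]  # stand-ins for video frames
    for depth, sequence, index in depth_sweep(frames, "NEEDLE"):
        print(depth, index, len(sequence))
```

The needle here is a plain string, but it could be any object, for example an image frame or a text overlay. Refer to the NIAVH repository for the actual benchmark construction.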

Model Zoo

| Model | Base Model | Training Data | Download Link |
| --- | --- | --- | --- |
| VideoLLaMB-7B | llava-v1.5-7b | magic_json, LLaVA | 🤗videollamb-llava-1.5-7b |
| VideoLLaMB-7B-Mem (MM-NIAVH) | llava-v1.5-7b | magic_json, LLaVA | 🤗videollamb-mem-llava-1.5-7b |

Acknowledgement

Model:

Data:

Demo:

Citation

@misc{mm-niavh,
    title={MLLM Pressure Test: Needle In A Video Haystack},
    author={Wang, Yuxuan and Xie, Cihang and Liu, Yang and Zheng, Zilong},
    publisher={github},
    url={https://github.com/bigai-nlco/NeedleInAVideoHaystack},
    year={2024}
}

@article{videollamb,
    title={VideoLLaMB: Long Video Understanding with Recurrent Memory Bridges},
    author={Wang, Yuxuan and Xie, Cihang and Liu, Yang and Zheng, Zilong},
    journal={arxiv},
    year={2024}
}