Updates

  • (2024.09.20) VideoTGB is accepted at EMNLP 2024! 🔥🔥
  • (2024.02.27) Paper released; check it out on arXiv.
  • (2024.02.26) Initial Release (´▽`ʃ♡ƪ)

Overview

This repository provides a chat agent based on our paper, Efficient Temporal Extrapolation of Multimodal Large Language Models with Temporal Grounding Bridge for Long Video Understanding. The model is fine-tuned on video-instruction and image-instruction datasets.

We study two architectural paradigms: the encoder-decoder architecture, exemplified by BLIP2-Flan-T5-xl, and the decoder-only architecture, represented by InstructBLIP-Vicuna-7B. We also provide code to tune the LLM with LoRA.

Installation

# clone project
git clone https://github.com/bigai-nlco/VideoTGB
cd VideoTGB

# create conda environment
conda create -n VideoTGB
conda activate VideoTGB

# install requirements
pip install -r requirements.txt

Data Preparation

You can download all the instruction data and evaluation data from Video-LLaVA/DATA and arrange the instruction data as follows:

inputs/ivinstruct
├── llava_image_tune
└── videochatgpt_tune
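
Before launching training, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the repository: the helper name `missing_subdirs` is hypothetical, and it only checks that the two directories listed above exist, not that their contents are complete.

```python
from __future__ import annotations

from pathlib import Path

# Subdirectories named in the layout above.
EXPECTED_SUBDIRS = ("llava_image_tune", "videochatgpt_tune")


def missing_subdirs(root: str = "inputs/ivinstruct") -> list[str]:
    """Return the expected subdirectories that are absent under `root`."""
    base = Path(root)
    return [name for name in EXPECTED_SUBDIRS if not (base / name).is_dir()]


if __name__ == "__main__":
    missing = missing_subdirs()
    if missing:
        print("Missing data directories:", ", ".join(missing))
    else:
        print("Data layout looks complete.")
```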

How to run

The training framework provides scripts for both local runs and cluster (Slurm) jobs.

Train model

# run on local
python src/train.py experiment=LSTP_SF_blip2flant5xl_videoinstruct # blip2-flan-t5-xl + video-instruct
python src/train.py experiment=LSTP_SF_instructblipvicuna7b_videoinstruct # instructblip-vicuna-7b + video-instruct

# run on cluster
sbatch scripts/videoinstruct_train.slurm # blip2-flan-t5-xl + video-instruct
sbatch scripts/videoinstruct_vicuna_train.slurm # instructblip-vicuna-7b + video-instruct

For those with limited GPU resources, we also provide a staged pipeline that shortens the training procedure:

# step 1: generate the pseudo labels from the base-model, and extract the optical flow in advance

# step 2: train the temporal sampler
python src/train.py experiment=LSTP_TG_blip2flant5xl_videoinstruct

# step 3: train VideoTGB with fixed temporal sampler
python src/train.py experiment=LSTP_blip2flant5xl_ivinstruct # blip2-flan-t5-xl + video-instruct + image-instruct
python src/train.py experiment=LSTP_instructblipvicuna7b_ivinstruct # instructblip-vicuna-7b + video-instruct + image-instruct
python src/train.py experiment=LSTP_blip2flant5xl_ivtinstruct # blip2-flan-t5-xl (LoRA) + video-instruct + image-instruct + text-instruct
python src/train.py experiment=LSTP_instructblipvicuna7b_ivtinstruct # instructblip-vicuna-7b (LoRA) + video-instruct + image-instruct + text-instruct
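
Steps 2 and 3 above can be chained so that the second stage starts only if the first succeeds. This is a minimal sketch under stated assumptions: the experiment names are copied from the commands above (Flan-T5-xl variant), while `build_command` and `run_stages` are hypothetical helpers, not part of the repository. Step 1 has no command listed in this README, so it is not covered here. The sketch defaults to a dry run that only prints the commands.

```python
from __future__ import annotations

import subprocess
import sys

# Experiment names copied from the commands above (Flan-T5-xl variant).
STAGES = [
    ("train temporal sampler", "LSTP_TG_blip2flant5xl_videoinstruct"),
    ("train VideoTGB with fixed sampler", "LSTP_blip2flant5xl_ivinstruct"),
]


def build_command(experiment: str) -> list[str]:
    """Build the training command for one experiment."""
    return [sys.executable, "src/train.py", f"experiment={experiment}"]


def run_stages(dry_run: bool = True) -> list[list[str]]:
    """Print each stage's command in order; execute it unless dry_run is set."""
    commands = []
    for label, experiment in STAGES:
        cmd = build_command(experiment)
        commands.append(cmd)
        print(f"[{label}] {' '.join(cmd)}")
        if not dry_run:
            # check=True raises CalledProcessError, so later stages are skipped
            # when an earlier one fails.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    run_stages(dry_run=True)
```

Pass `dry_run=False` from the repository root to actually launch the runs.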

Evaluate model

# run inference for VideoTGB-Vicuna-7B
bash eval/scripts/run_qa_msvd_vicuna.sh
bash eval/scripts/run_qa_msrvtt_vicuna.sh
bash eval/scripts/run_qa_activitynet_vicuna.sh

# run inference for VideoTGB-Flan-T5-xl
bash eval/scripts/run_qa_msvd.sh
bash eval/scripts/run_qa_msrvtt.sh
bash eval/scripts/run_qa_activitynet.sh

# run evaluation
bash eval/scripts/eval_qa_msvd.sh
bash eval/scripts/eval_qa_msrvtt.sh
bash eval/scripts/eval_qa_activitynet.sh

Configuration

data:
  - text_dir
  - video_dir
  - processor_name
  - sampler_processor_name
  - nframe # final sampled frames
  - target_size # image size
  - batch_size
model:
  - model_name_or_path
  - sampler_name_or_path
  - of_extractor_name_or_path
  - optimizer
  - scheduler
  - generate_configs
path:
  - data_dir
  - video_dir
  - text_dir
  - output_dir
trainer: 
  - strategy
  - accelerator
  - devices
  - num_nodes
  - precision
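
The `experiment=...` argument used in the training commands suggests a `key=value` command-line style. Assuming dotted overrides such as `data.nframe=4` are also accepted (this README does not confirm it), the sketch below flattens a nested settings dictionary into override strings. The helper `to_overrides` and all values are hypothetical; only the key names come from the list above.

```python
from __future__ import annotations


def to_overrides(cfg: dict, prefix: str = "") -> list[str]:
    """Flatten a nested dict into dotted `key=value` strings."""
    overrides = []
    for key, value in cfg.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            overrides.extend(to_overrides(value, path))
        else:
            overrides.append(f"{path}={value}")
    return overrides


if __name__ == "__main__":
    # Hypothetical values, chosen only for illustration.
    example = {
        "data": {"nframe": 4, "batch_size": 8},
        "trainer": {"devices": 1, "precision": 16},
    }
    print(" ".join(to_overrides(example)))
```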

Evaluation Results

Metrics: Accuracy/Score

| Methods | LLM size | MSVD-QA | MSRVTT-QA | ActivityNet-QA |
|---|---|---|---|---|
| FrozenBiLM | 1B | 32.2/- | 16.8/- | 24.7/- |
| VideoChat | 7B | 56.4/2.8 | 45.0/2.5 | -/2.2 |
| LLaMA-Adapter | 7B | 54.9/3.1 | 43.8/2.7 | 34.2/2.7 |
| Video-LLaMA | 7B | 51.6/2.5 | 29.6/1.8 | 12.4/1.1 |
| Video-ChatGPT | 7B | 64.9/3.3 | 49.3/2.8 | 35.2/2.7 |
| Video-LLaVA | 7B | 70.7/3.9 | 59.2/3.5 | 45.3/3.3 |
| VideoTGB-7B | 7B | 71.3/3.9 | 57.3/3.3 | 43.9/3.3 |
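
Each cell above reports an accuracy/score pair. This README does not define how the two numbers are computed, so the following is only a sketch under an assumption: each judged answer carries a boolean `correct` flag and a numeric `score`, accuracy is the percentage of correct answers, and the score is the mean. The field names and sample records are hypothetical, not output from the evaluation scripts.

```python
from __future__ import annotations


def summarize(records: list[dict]) -> tuple[float, float]:
    """Return (accuracy in percent, mean score) over judged answers."""
    if not records:
        raise ValueError("no records to summarize")
    n = len(records)
    accuracy = 100.0 * sum(1 for r in records if r["correct"]) / n
    mean_score = sum(r["score"] for r in records) / n
    return accuracy, mean_score


if __name__ == "__main__":
    # Hypothetical judged answers, for illustration only.
    sample = [
        {"correct": True, "score": 5},
        {"correct": True, "score": 4},
        {"correct": False, "score": 1},
        {"correct": True, "score": 4},
    ]
    acc, score = summarize(sample)
    print(f"{acc:.1f}/{score:.1f}")  # prints 75.0/3.5
```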

Demo

We provide a chat demo built with Gradio. We also provide some checkpoints; download them and place them in ckpts/VideoTGB-Chat/.

Model Zoo

| Model | Base Model | Training Data | Strategy for LLM | Download Link |
|---|---|---|---|---|
| LSTP-7B | InstructBlip-Vicuna-7B | Video-ChatGPT, LLaVA | fixed | Huggingface |
| LSTP-FlanT5xl | FlanT5-xl | Video-ChatGPT, LLaVA | fixed | Huggingface |

# launch the chat demo
python -m demo.demo

Acknowledgement

Citation

If you find our work helpful, please consider giving the repository a ⭐️ and citing our work:

@article{wang2024videotgb,
    title={Efficient Temporal Extrapolation of Multimodal Large Language Models with Temporal Grounding Bridge},
    author={Wang, Yuxuan and Wang, Yueqian and Wu, Pengfei and Liang, Jianxin and Zhao, Dongyan and Liu, Yang and Zheng, Zilong},
    year={2024},
    journal = {arXiv preprint arXiv:2402.16050}
}