VTimeLLM [Paper]

Official PyTorch implementation of the paper "VTimeLLM: Empower LLM to Grasp Video Moments".

📢 Latest Updates

Jan-2: Thanks to Xiao Xia, Shengbo Tong, and Beining Wang, the code has been refactored to support both the LLaMA and ChatGLM3 architectures. We also translated the training data into Chinese and fine-tuned a Chinese version based on ChatGLM3-6b.
Dec-14: Released the training code and data. All resources, including models, datasets, and extracted features, are available here. 🔥🔥
Dec-4: VTimeLLM demo released.

VTimeLLM Overview 💡

VTimeLLM is a novel Video LLM designed for fine-grained video moment understanding and reasoning with respect to time boundaries.

VTimeLLM adopts a boundary-aware three-stage training strategy: image-text pairs for feature alignment, multiple-event videos to increase temporal-boundary awareness, and high-quality video-instruction tuning to further improve temporal understanding and align with human intent.
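The three stages described above can be summarized as a small data structure. This is only a descriptive sketch: the `Stage` class, the `STAGES` list, and `describe` are hypothetical names introduced for illustration and are not part of the repository's training configuration (see train.md for the actual instructions).

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stage:
    """One training stage (hypothetical structure, for illustration only)."""
    order: int
    data: str
    goal: str


# Stage descriptions paraphrased from the overview; not read from any repo config.
STAGES = [
    Stage(1, "image-text pairs", "feature alignment"),
    Stage(2, "multi-event videos with temporal QA", "temporal-boundary awareness"),
    Stage(3, "high-quality video-instruction dialog data",
          "temporal reasoning and alignment with human intent"),
]


def describe(stages: List[Stage]) -> List[str]:
    """Return one summary line per stage, in training order."""
    ordered = sorted(stages, key=lambda s: s.order)
    return [f"Stage {s.order}: train on {s.data} for {s.goal}" for s in ordered]


if __name__ == "__main__":
    for line in describe(STAGES):
        print(line)
```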

Contributions 🏆

We propose VTimeLLM, the first boundary-aware Video LLM, to the best of our knowledge.
We propose a boundary-aware three-stage training strategy that consecutively leverages i) large-scale image-text data for feature alignment, ii) large-scale multi-event video-text data together with temporal single-turn and multi-turn QA to enhance awareness of time boundaries, and iii) instruction tuning on a high-quality dialog dataset for better temporal reasoning ability.
We conduct extensive experiments to demonstrate that the proposed VTimeLLM significantly outperforms existing Video LLMs in various fine-grained temporal-related video tasks, showing its superior ability for video understanding and reasoning.
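To make "time boundary" quality concrete: predicted moments in temporal tasks are commonly scored by the temporal intersection-over-union (IoU) between a predicted segment and a ground-truth segment. The sketch below illustrates that general metric. It is not taken from this repository, and the `(start, end)` representation and the function names `temporal_iou` and `recall_at` are assumptions made for the example.

```python
from typing import Sequence, Tuple

# Hypothetical representation: a segment is (start, end) in seconds.
Segment = Tuple[float, float]


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection-over-union of two time intervals."""
    # Overlap is the gap between the later start and the earlier end, floored at 0.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at(preds: Sequence[Segment], gts: Sequence[Segment],
              threshold: float) -> float:
    """Fraction of paired predictions whose IoU meets the threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)


if __name__ == "__main__":
    print(temporal_iou((0.0, 10.0), (5.0, 15.0)))
    print(recall_at([(0.0, 10.0), (0.0, 5.0)], [(0.0, 10.0), (5.0, 10.0)], 0.5))
```

A prediction that overlaps the ground truth only partially gets a fractional score, so a model has to place both the start and the end of a moment correctly to score well.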

Installation 🔧

We recommend setting up a conda environment for the project:

conda create --name=vtimellm python=3.10
conda activate vtimellm

git clone https://github.com/huangb23/VTimeLLM.git
cd VTimeLLM
pip install -r requirements.txt

For training, also install the following packages:

pip install ninja
pip install flash-attn --no-build-isolation
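After installation, a quick check that the key packages are importable can save time before launching a run. The sketch below is not part of the repository. It assumes that `requirements.txt` installs PyTorch under the module name `torch`, and that the two training packages are importable as `ninja` and `flash_attn`; the helper name `check_packages` is hypothetical.

```python
import importlib.util
from typing import Dict, Iterable


def check_packages(names: Iterable[str]) -> Dict[str, bool]:
    """Map each top-level module name to whether it can be found."""
    # find_spec locates a module without importing it, so this stays fast
    # and does not require a GPU to be present.
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Module names are assumptions; adjust them to match your environment.
    for name, found in check_packages(["torch", "ninja", "flash_attn"]).items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

A `MISSING` entry for `flash_attn` only matters for training, since the commands above install it as an extra.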

Running Demo Offline 💿

To run the demo offline, please refer to the instructions in offline_demo.md.

Training 🚋

For training instructions, check out train.md.

Qualitative Analysis 🔍

A Comprehensive Evaluation of VTimeLLM's Performance across Multiple Tasks.

Video Understanding and Conversational Tasks 💬

Creative Tasks 🖌️

Fine-grained Understanding Tasks 🌐

Video Reasoning Tasks ❓

Acknowledgements 🙏

We are grateful to the following awesome projects, on which VTimeLLM builds:

LLaVA: Large Language and Vision Assistant
FastChat: An Open Platform for Training, Serving, and Evaluating Large Language Model based Chatbots
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
LLaMA: Open and Efficient Foundation Language Models
Vid2seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning
InternVid: A Large-scale Video-Text dataset

If you're using VTimeLLM in your research or applications, please cite using this BibTeX:

@inproceedings{huang2024vtimellm,
  title={Vtimellm: Empower llm to grasp video moments},
  author={Huang, Bin and Wang, Xin and Chen, Hong and Song, Zihan and Zhu, Wenwu},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={14271--14280},
  year={2024}
}

License 📜

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License.

Looking forward to your feedback, contributions, and stars! 🌟
