Momentor (ICML 2024)

The official repository of the paper Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning.

Momentor Overview

Momentor is a Video-LLM designed for fine-grained comprehension and localization in videos. It is composed of a frame encoder, a linear projection layer, a Temporal Perception Module (TPM), and a Large Language Model (LLM). The TPM is designed to improve fine-grained temporal modeling and representation. The architecture and training of Momentor are shown in the following figure.

Installation

Clone our repository and create a conda environment:

cd Momentor/momentor
conda create --name=momentor python=3.10
conda activate momentor
pip install -r requirements.txt

Training

For training instructions, check out train_momentor.md.

Moment-10M

We present Moment-10M, a large-scale video instruction dataset with segment-level annotations. We use videos from YTTemporal-1B to construct Moment-10M. We propose an automatic data generation engine that extracts instance and event information from these videos and generates segment-level instruction-following data. We design 5 single-segment tasks and 3 cross-segment tasks, which enable Video-LLMs to perform comprehensive segment-level reasoning.

We are releasing the Moment-10M dataset. You can download it from the following links: part1, part2.

You can also download the data for Grounded Event-Sequence Modeling here: GESM.

After downloading and extracting the dataset to obtain the data files, you can use convert_data.py to transform the data into a text dialogue format and download_videos.py to download the corresponding video files. The usage for these scripts is as follows:

python convert_data.py --source_path <path_to_data_file> --target_path <path_to_converted_file>

Parameters:

--source_path: The path to the input data file that needs to be converted.
--target_path: The path where the converted file will be saved.

python download_videos.py --source_path <path_to_data_file> --video_path <path_to_store_videos>

Parameters:

--source_path: The path to the input data file containing identifiers for the videos.
--video_path: The path where the downloaded video files will be stored.

For GESM data extraction, use convert_data_gesm.py as follows:

python convert_data_gesm.py --source_path <path_to_data_file> --target_path <path_to_converted_file>

Parameters:

--source_path: The path to the input data file that needs to be converted.
--target_path: The path where the converted file will be saved.
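
When several data files need processing, the three commands above can be driven from one small Python script. The following is a minimal sketch, not part of this repository: the helper names (`convert_command`, `download_command`, `run_all`) and the example file names are hypothetical, and it assumes the script is run from the directory that contains convert_data.py, convert_data_gesm.py, and download_videos.py. Only the flags documented above are used.

```python
import subprocess
import sys


def convert_command(source_path, target_path, gesm=False):
    """Build the argv list for convert_data.py (or convert_data_gesm.py when gesm=True)."""
    script = "convert_data_gesm.py" if gesm else "convert_data.py"
    return [sys.executable, script,
            "--source_path", str(source_path),
            "--target_path", str(target_path)]


def download_command(source_path, video_path):
    """Build the argv list for download_videos.py."""
    return [sys.executable, "download_videos.py",
            "--source_path", str(source_path),
            "--video_path", str(video_path)]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError if a script exits with a non-zero status.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical file and directory names; substitute the files you actually extracted.
    commands = [
        convert_command("moment10m_part1.json", "moment10m_part1_dialogue.json"),
        download_command("moment10m_part1.json", "videos"),
        convert_command("gesm.json", "gesm_dialogue.json", gesm=True),
    ]
    run_all(commands, dry_run=True)
```

With `dry_run=True` the script only prints the commands, which makes it easy to check the paths before switching to `dry_run=False` to run them.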

Citation

If you find our work useful in your research, please consider giving this repository a star and citing our paper as follows:

@misc{qian2024momentor,
      title={Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning}, 
      author={Long Qian and Juncheng Li and Yu Wu and Yaobo Ye and Hao Fei and Tat-Seng Chua and Yueting Zhuang and Siliang Tang},
      year={2024},
      eprint={2402.11435},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Acknowledgment

We thank the following open-source projects:
