Recent progress in generative diffusion models has greatly advanced text-to-video generation. While text-to-video models trained on large-scale, diverse datasets can produce varied outputs, these generations often deviate from user preferences, highlighting the need for preference alignment of pre-trained models. Although Direct Preference Optimization (DPO) has demonstrated significant improvements in language and image generation, it has not previously been adapted to video; we pioneer its adaptation to video diffusion models and propose a VideoDPO pipeline by making several key adjustments. Unlike previous image alignment methods that focus solely on either (i) visual quality or (ii) semantic alignment between text and videos, we comprehensively consider both dimensions and construct a preference score accordingly, which we term OmniScore. We design a pipeline to automatically collect preference pair data based on the proposed OmniScore and discover that re-weighting these pairs based on the score significantly impacts overall preference alignment. Our experiments demonstrate substantial improvements in both visual quality and semantic alignment, ensuring that no preference aspect is neglected.
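As a rough, non-authoritative illustration of this pipeline, the sketch below builds preference pairs from N sampled videos per prompt and re-weights each pair by its OmniScore gap. The function names (`generate_videos`, `omniscore`) and the exponential weighting form are assumptions for illustration, not the repo's actual API.

```python
# Conceptual sketch of OmniScore-based preference-pair collection.
# generate_videos / omniscore and the weighting form are illustrative
# assumptions, not the actual VideoDPO implementation.
import math
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # path to the higher-scored video
    rejected: str  # path to the lower-scored video
    weight: float  # pair weight used during DPO training

def build_pairs(prompts, generate_videos, omniscore, n=8, alpha=1.0):
    pairs = []
    for prompt in prompts:
        videos = generate_videos(prompt, n=n)       # sample N videos per prompt
        scores = {v: omniscore(v) for v in videos}  # quality + semantic alignment
        ranked = sorted(videos, key=scores.get)
        worst, best = ranked[0], ranked[-1]
        gap = scores[best] - scores[worst]
        # Re-weight by the score gap: a larger gap gives a cleaner preference
        # signal (one plausible weighting, controlled by alpha).
        pairs.append(PreferencePair(prompt, best, worst, math.exp(alpha * gap)))
    return pairs
```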
@article{liu2024videodpo,
title={VideoDPO: Omni-Preference Alignment for Video Diffusion Generation},
author={Liu, Runtao and Wu, Haoyu and Zheng, Ziqiang and Wei, Chen and He, Yingqing and Pi, Renjie and Chen, Qifeng},
journal={arXiv preprint arXiv:2412.14167},
year={2024}
}
- [2024/12/19] 🔥 We release the paper and the project.
- Merge into VideoTuna
- Release the VideoCrafter2 and T2V-Turbo training datasets
- Release code for CogVideoX
- Release code for VideoCrafter2 and T2V-Turbo
conda create -n videodpo python=3.10 -y
conda activate videodpo
pip install -r requirements.txt
Run the following commands to create the initial checkpoints:
mkdir -p checkpoints/vc2
wget -P checkpoints/vc2 https://huggingface.co/VideoCrafter/VideoCrafter2/resolve/main/model.ckpt
python utils/create_ref_model.py
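For context: DPO keeps a frozen reference model alongside the trainable one, and `utils/create_ref_model.py` prepares that reference checkpoint. Conceptually this amounts to duplicating the pre-trained weights; the sketch below shows the idea only (the output filename is hypothetical, and the actual script may differ):

```python
# Sketch of what creating a DPO reference model amounts to (assumed
# behavior; utils/create_ref_model.py may differ in detail): duplicate
# the pre-trained weights so the DPO loss can compare the fine-tuned
# policy against a frozen copy.
import torch

state = torch.load("checkpoints/vc2/model.ckpt", map_location="cpu")
torch.save(state, "checkpoints/vc2/ref_model.ckpt")  # hypothetical output path
```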
T2V-Turbo is a latent consistency model (LCM). We provide fine-tuning of the LCM based on VC2. Please download the VC2 checkpoint first, and then run:
mkdir -p checkpoints/t2v-turbo
wget -O checkpoints/t2v-turbo/unet_lora.pt "https://huggingface.co/jiachenli-ucsb/T2V-Turbo-VC2/resolve/main/unet_lora.pt?download=true"
Download vidpro-vc2-dataset.tar from the following link, then symlink (`ln -s`) the dataset to /data/vidpro-dpo-dataset. Alternatively, you can add a dataset with the same structure in configs/dpo/vidpro/train_data.yaml.
To reduce peak memory usage during training, we recommend disabling validation by not providing val_data.yaml.
bash configs/vc_dpo/run.sh
We support inference with different types of inputs and outputs, and prompts can be read in both JSON and plain-text formats.
bash script_sh/inference_t2v.sh
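For reference, the snippet below writes two plausible prompt files: one prompt per line in plain text, and a JSON list of strings. The exact schema the inference script expects is defined in the repo, so treat these formats as assumptions:

```python
# Two plausible prompt-file formats (illustrative; check the inference
# script for the exact schema it expects).
import json

prompts = [
    "a corgi running on a beach at sunset",
    "timelapse of clouds drifting over a mountain lake",
]

with open("prompts.txt", "w") as f:   # plain text: one prompt per line
    f.write("\n".join(prompts))

with open("prompts.json", "w") as f:  # JSON: a list of prompt strings
    json.dump(prompts, f, indent=2)
```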
To fine-tune T2V-Turbo with DPO, run:
bash configs/t2v_turbo_dpo/run.sh
To visualize the results, run:
bash configs/t2v_turbo_dpo/turbo_visualize.sh
Besides, we also provide some useful tools to improve your fine-tuning experience. The following command automatically removes training logs that have no saved checkpoints:
python utils/clean_results.py -d ./results
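The idea is simply to drop result directories that contain no checkpoint; here is a minimal sketch of the assumed behavior (not the script's actual code):

```python
# Assumed behavior of utils/clean_results.py (sketch, not the actual code):
# delete run directories under ./results that contain no saved checkpoint.
import shutil
from pathlib import Path

def clean_results(results_dir: str = "./results") -> None:
    for run in Path(results_dir).iterdir():
        if run.is_dir() and not any(run.rglob("*.ckpt")):
            shutil.rmtree(run)  # no checkpoint found; remove stale logs

if __name__ == "__main__":
    clean_results()
```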
Analysis of OmniScore on videos from VC2. (a) The difference between the maximum and minimum OmniScore among N videos as N increases. (b) Histogram of OmniScore. (c) Histogram of the difference in OmniScore between two samples in a preference pair. (d) Correlation heatmap of the OmniScore across dimensions.
VideoDPO alignment performance. We apply our proposed VideoDPO on three state-of-the-art open-source models and evaluate performance on VBench, HPS (V), and PickScore. After training with VideoDPO, all models achieve the best performance on VBench, with improvements also observed on HPS (V) or PickScore, demonstrating the effectiveness of our approach.
Comparison of sub-dimension scores before and after alignment on VBench for VC2, T2V-Turbo, and CogVideoX.
Ablation studies. We study different strategies and configurations, including (a) the pair strategy, (b) the filter strategy, (c) values of α, the re-weighting hyper-parameter, and (d) values of N, the number of video samples per text prompt. Q is short for visual quality, and S is short for semantic alignment.
Our work is developed on the following open-source projects, and we would like to express our sincere thanks for their contributions: VideoCrafter2, T2V-Turbo, CogVideoX, VideoTuna, VBench, and VidProM.
Thanks to I Chieh Chen for valuable suggestions on the demos.
Qualitative comparison of video generations before and after alignment.