LVDM: Latent Video Diffusion Models for High-Fidelity Long Video Generation

Yingqing He ¹ Tianyu Yang ² Yong Zhang ² Ying Shan ² Qifeng Chen ¹

¹ The Hong Kong University of Science and Technology ² Tencent AI Lab

TL;DR: An efficient video diffusion model that can:
1️⃣ conditionally generate videos based on input text;
2️⃣ unconditionally generate videos with thousands of frames.

🍻 Results

☝️ Text-to-Video Generation

"A corgi is swimming fastly"
"astronaut riding a horse"
"A glass bead falling into water with a huge splash. Sunset in the background"
"A beautiful sunrise on mars. High definition, timelapse, dramaticcolors."
"A bear dancing and jumping to upbeat music, moving his whole body."
"An iron man surfing in the sea. cartoon style"

✌️ Unconditional Long Video Generation (40 seconds)

⏳ TODO

Release pretrained text-to-video generation models and inference code
Release unconditional video generation models
Release training code
Update training and sampling for long video generation

⚙️ Setup

Install Environment via Anaconda

conda create -n lvdm python=3.8.5
conda activate lvdm
pip install -r requirements.txt

Pretrained Models and Used Datasets

Download the pretrained checkpoints by running the following commands in a Linux terminal:

mkdir -p models/ae
mkdir -p models/lvdm_short
mkdir -p models/t2v

# sky timelapse
wget -O models/ae/ae_sky.ckpt https://huggingface.co/Yingqing/LVDM/resolve/main/ae/ae_sky.ckpt
wget -O models/lvdm_short/short_sky.ckpt https://huggingface.co/Yingqing/LVDM/resolve/main/lvdm_short/short_sky.ckpt  

# taichi
wget -O models/ae/ae_taichi.ckpt https://huggingface.co/Yingqing/LVDM/resolve/main/ae/ae_taichi.ckpt
wget -O models/lvdm_short/short_taichi.ckpt https://huggingface.co/Yingqing/LVDM/resolve/main/lvdm_short/short_taichi.ckpt

# text2video
wget -O models/t2v/model.ckpt https://huggingface.co/Yingqing/LVDM/resolve/main/lvdm_short/t2v.ckpt
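
Before sampling, it can help to confirm that every download landed where the commands above put it. The following is a minimal sketch and not part of the repository: the helper name `missing_checkpoints` is hypothetical, and the paths simply mirror the `wget -O` targets above.

```python
from pathlib import Path

# Paths copied from the `wget -O` targets above.
EXPECTED_CHECKPOINTS = [
    "models/ae/ae_sky.ckpt",
    "models/lvdm_short/short_sky.ckpt",
    "models/ae/ae_taichi.ckpt",
    "models/lvdm_short/short_taichi.ckpt",
    "models/t2v/model.ckpt",
]


def missing_checkpoints(root=".", expected=EXPECTED_CHECKPOINTS):
    """Return the expected checkpoint paths that are absent or empty under root."""
    missing = []
    for rel in expected:
        path = Path(root) / rel
        # An interrupted `wget -O` can leave a zero-byte file behind.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(rel)
    return missing


if __name__ == "__main__":
    for rel in missing_checkpoints():
        print(f"missing or empty: {rel}")
```

Run it from the repository root; no output means all five files are present and non-empty. It does not verify that a checkpoint is complete or loadable.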

Prepare UCF-101 dataset

mkdir temp; cd temp

# Download UCF-101 from the official website https://www.crcv.ucf.edu/data/UCF101.php (the UCF101 data set)

wget https://www.crcv.ucf.edu/data/UCF101/UCF101.rar --no-check-certificate
unrar x UCF101.rar

# Download annotations from https://www.crcv.ucf.edu/data/UCF101.php (The Train/Test Splits for Action Recognition on UCF101 data set)

wget https://www.crcv.ucf.edu/data/UCF101/UCF101TrainTestSplits-RecognitionTask.zip --no-check-certificate
unzip UCF101TrainTestSplits-RecognitionTask.zip

# Split the data into train and test sets
cd ..
python lvdm/data/split_ucf101.py # please check this script
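
Since the comment above asks you to check the split script, it may help to see what reading the official split lists involves. The sketch below is not the repository's `split_ucf101.py`. It assumes each line of a `trainlistNN.txt` or `testlistNN.txt` file has the form `ClassName/video.avi`, optionally followed by a numeric label, and the function name `parse_split_list` is hypothetical.

```python
def parse_split_list(lines):
    """Parse lines of a UCF-101 split file into (class_name, video_file) pairs."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Assumed format: train lists append a numeric label after the path,
        # test lists contain the path only. Keep just the path.
        rel_path = line.split()[0]
        class_name, video_file = rel_path.split("/", 1)
        entries.append((class_name, video_file))
    return entries


if __name__ == "__main__":
    sample = [
        "ApplyEyeMakeup/v_ApplyEyeMakeup_g08_c01.avi 1",
        "",
        "Archery/v_Archery_g01_c01.avi",
    ]
    for class_name, video_file in parse_split_list(sample):
        print(class_name, video_file)
```

The resulting pairs could then be used to copy or link each video into a train or test directory; where those directories should live depends on what the repository's script and configs expect.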

Download manually:

Sky Timelapse: VideoAE, LVDM_short, LVDM_pred, LVDM_interp, dataset
Taichi: VideoAE, LVDM_short, dataset
Text2Video: model

💫 Inference

Sample Short Videos

unconditional generation

bash shellscripts/sample_lvdm_short.sh

text to video generation

bash shellscripts/sample_lvdm_text2video.sh

Sample Long Videos

bash shellscripts/sample_lvdm_long.sh
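
To get a feel for the cost of long-video sampling, the sketch below counts model calls under a simple autoregressive extension scheme: each call after the first reuses the last few frames as conditioning and appends the rest as new frames. This scheme, the function name `extension_steps`, and the numbers in the usage note are illustrative assumptions, not values taken from the repository's scripts or configs.

```python
import math


def extension_steps(target_frames, clip_len, cond_frames):
    """Count the model calls needed to reach target_frames autoregressively.

    The first call yields clip_len frames. Each later call reuses the last
    cond_frames frames as conditioning and adds clip_len - cond_frames new ones.
    """
    if not 0 <= cond_frames < clip_len:
        raise ValueError("cond_frames must be in [0, clip_len)")
    if target_frames <= clip_len:
        return 1
    new_per_step = clip_len - cond_frames
    return 1 + math.ceil((target_frames - clip_len) / new_per_step)


if __name__ == "__main__":
    # Hypothetical settings: 16-frame clips, 8 conditioning frames.
    print(extension_steps(1000, clip_len=16, cond_frames=8))
```

With these hypothetical settings, 1000 frames take 124 calls. Each call conditions on frames the model generated itself, which is why the accumulated errors discussed in the abstract matter for long videos.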

💫 Training

Train video autoencoder

bash shellscripts/train_lvdm_videoae.sh

Remember to set PROJ_ROOT, EXPNAME, DATADIR, and CONFIG.

Train unconditional lvdm for short video generation

bash shellscripts/train_lvdm_short.sh

Remember to set PROJ_ROOT, EXPNAME, DATADIR, AEPATH, and CONFIG.
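
A forgotten variable tends to surface only after the job has started, so a quick pre-flight check can save time. The sketch below assumes the shell scripts define these names as plain `NAME=value` assignments, which you should confirm by opening the script. The function name `unset_variables` and the example paths are hypothetical.

```python
import re


def unset_variables(script_text, names):
    """Return the names that have no non-empty NAME=value assignment in script_text."""
    unset = []
    for name in names:
        # Match lines such as `EXPNAME="run1"` or `export DATADIR=/data`.
        pattern = re.compile(
            rf"^\s*(?:export\s+)?{re.escape(name)}=(.*)$", re.MULTILINE
        )
        # An assignment counts as set when something remains after stripping
        # whitespace and surrounding quotes.
        values = [m.group(1).strip().strip("\"'") for m in pattern.finditer(script_text)]
        if not any(values):
            unset.append(name)
    return unset


if __name__ == "__main__":
    # Hypothetical script contents for illustration only.
    script = 'PROJ_ROOT="/home/me/LVDM"\nEXPNAME=""\nDATADIR=/data/sky\nAEPATH=\n'
    names = ["PROJ_ROOT", "EXPNAME", "DATADIR", "AEPATH", "CONFIG"]
    print(unset_variables(script, names))
```

In practice you would pass the contents of the script you are about to run, for example `Path("shellscripts/train_lvdm_short.sh").read_text()`. The check is purely textual: it does not expand variables, and it does not recognize trailing comments.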

Train unconditional lvdm for long video generation

# TBD

💫 Evaluation

bash shellscripts/eval_lvdm_short.sh

Remember to set DATACONFIG, FAKEPATH, REALPATH, and RESDIR.

📃 Abstract

AI-generated content has attracted lots of attention recently, but photo-realistic video synthesis is still challenging. Although many attempts using GANs and autoregressive models have been made in this area, the visual quality and length of generated videos are far from satisfactory. Diffusion models have shown remarkable results recently but require significant computational resources. To address this, we introduce lightweight video diffusion models by leveraging a low-dimensional 3D latent space, significantly outperforming previous pixel-space video diffusion models under a limited computational budget. In addition, we propose hierarchical diffusion in the latent space such that longer videos with more than one thousand frames can be produced. To further overcome the performance degradation issue for long video generation, we propose conditional latent perturbation and unconditional guidance that effectively mitigate the accumulated errors during the extension of video length. Extensive experiments on small domain datasets of different categories suggest that our framework generates more realistic and longer videos than previous strong baselines. We additionally provide an extension to large-scale text-to-video generation to demonstrate the superiority of our work. Our code and models will be made publicly available.
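
The abstract names unconditional guidance without giving a formula. A widely used way to mix conditional and unconditional predictions is classifier-free guidance, and the sketch below shows that standard form on plain Python lists. Treating it as LVDM's exact formulation is an assumption; see the paper for the authors' definition. The function name `guided_prediction` is hypothetical.

```python
def guided_prediction(cond_pred, uncond_pred, scale):
    """Blend conditional and unconditional noise predictions element-wise.

    Standard classifier-free guidance form (assumed here, not taken from LVDM):
        uncond + scale * (cond - uncond)
    scale = 1 returns the conditional prediction, scale = 0 the unconditional one.
    """
    if len(cond_pred) != len(uncond_pred):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for c, u in zip(cond_pred, uncond_pred)]


if __name__ == "__main__":
    cond = [1.0, 2.0]
    uncond = [0.5, 0.0]
    print(guided_prediction(cond, uncond, scale=2.0))
```

A scale above 1 pushes the result further in the direction of the conditional prediction, while a scale below 1 moves it back toward the unconditional one.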

🔮 Pipeline

😉 Citation

@article{he2022lvdm,
      title={Latent Video Diffusion Models for High-Fidelity Long Video Generation}, 
      author={Yingqing He and Tianyu Yang and Yong Zhang and Ying Shan and Qifeng Chen},
      year={2022},
      eprint={2211.13221},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

🤗 Acknowledgements

We built our code partially on latent diffusion models and TATS. Thanks to the authors for sharing their awesome codebases! We also adopt Xintao Wang's Real-ESRGAN for upscaling our text-to-video generation results. Thanks for their wonderful work!
