Code for NeurIPS'21 paper CCVS: Context-aware Controllable Video Synthesis.
CCVS: Context-aware Controllable Video Synthesis
Guillaume Le Moing, Jean Ponce, Cordelia Schmid
Paper: https://arxiv.org/abs/2107.08037
Project page: https://16lemoing.github.io/ccvs
Abstract: This presentation introduces a self-supervised learning approach to the synthesis of new video clips from old ones, with several new key elements for improved spatial resolution and realism: It conditions the synthesis process on contextual information for temporal continuity and ancillary information for fine control. The prediction model is doubly autoregressive, in the latent space of an autoencoder for forecasting, and in image space for updating contextual information, which is also used to enforce spatio-temporal consistency through a learnable optical flow module. Adversarial training of the autoencoder in the appearance and temporal domains is used to further improve the realism of its output. A quantizer inserted between the encoder and the transformer in charge of forecasting future frames in latent space (and its inverse inserted between the transformer and the decoder) adds even more flexibility by affording simple mechanisms for handling multimodal ancillary information for controlling the synthesis process (eg, a few sample frames, an audio track, a trajectory in image space) and taking into account the intrinsically uncertain nature of the future by allowing multiple predictions. Experiments with an implementation of the proposed approach give very good qualitative and quantitative results on multiple tasks and standard benchmarks.
The code is tested with PyTorch 1.7.0 and Python 3.8.6.
To install dependencies with conda run:
conda env create -f env.yml
conda activate ccvs
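As a quick sanity check after activating the environment, the sketch below compares the running interpreter and the installed PyTorch against the tested versions stated above. The helper names (version_tuple, matches_tested) are illustrative and not part of this repository; other versions may still work, they are simply untested.

```python
import re
import sys

# Versions this README reports as tested.
TESTED_PYTHON = (3, 8, 6)
TESTED_TORCH = "1.7.0"


def version_tuple(version: str) -> tuple:
    """Parse the leading 'X.Y.Z' of a version string, e.g. '1.7.0+cu110' -> (1, 7, 0)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def matches_tested(python_version, torch_version: str) -> bool:
    """Return True if both versions equal the tested ones (hypothetical helper)."""
    return (tuple(python_version[:3]) == TESTED_PYTHON
            and version_tuple(torch_version) == version_tuple(TESTED_TORCH))


try:
    import torch  # only available once the conda environment is set up
    torch_version = torch.__version__
except ImportError:
    torch_version = None

if torch_version is None:
    print("PyTorch is not installed in this environment")
else:
    print("matches tested setup:", matches_tested(sys.version_info, torch_version))
```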
To install apex run:
cd tools
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
cd ../..
BAIR Robot Pushing - (Repo) - (License)
Create corresponding directory:
mkdir datasets/bairhd
Download the high resolution data from this link and put it in the new directory, then run:
tar -xvf datasets/bairhd/softmotion_0511.tar.gz -C datasets/bairhd
Preprocess BAIR dataset for resolution 256x256:
python data/scripts/preprocess_bairhd.py --data_root datasets/bairhd --dim 256
We also provide our annotation tool to later estimate the (x,y) position of the arm:
python data/scripts/annotate_bairhd.py --data_root datasets/bairhd/original_frames_256 --out_dir datasets/bairhd/annotated_frames
Kinetics-600 - (Repo) - (License)
This dataset is a collection of YouTube links. Download the corresponding train and test videos by running:
mkdir datasets/kinetics
wget https://storage.googleapis.com/deepmind-media/Datasets/kinetics600.tar.gz -P datasets/kinetics
tar -xvf datasets/kinetics/kinetics600.tar.gz -C datasets/kinetics
python data/scripts/download_kinetics.py datasets/kinetics/kinetics600/train.csv datasets/kinetics/kinetics600/train_videos --trim
python data/scripts/download_kinetics.py datasets/kinetics/kinetics600/test.csv datasets/kinetics/kinetics600/test_videos --trim
Preprocess the dataset:
python data/scripts/preprocess_kinetics.py --src_folder datasets/kinetics/kinetics600/train_videos --out_root datasets/kinetics/preprocessed_videos --out_name train_64p_square_32t --max_vid_len 32 --resize 64 --square_crop
python data/scripts/preprocess_kinetics.py --src_folder datasets/kinetics/kinetics600/test_videos --out_root datasets/kinetics/preprocessed_videos --out_name test_64p_square_32t --max_vid_len 32 --resize 64 --square_crop
Split the data into folds and precompute metadata for faster training/testing:
python data/scripts/compute_folds_kinetics.py train 100 64p_square_32t
python data/scripts/compute_folds_kinetics.py test 40 64p_square_32t --max_per_fold 1248
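To illustrate what splitting a dataset into folds with a per-fold cap can look like, here is a minimal sketch. It is an assumption-laden illustration only: split_into_folds is a hypothetical helper, and the round-robin assignment and the dropping of overflow items are choices made here, not a description of what compute_folds_kinetics.py actually does.

```python
from typing import List, Optional, Sequence


def split_into_folds(items: Sequence[str], n_folds: int,
                     max_per_fold: Optional[int] = None) -> List[List[str]]:
    """Distribute items over n_folds in round-robin order.

    If max_per_fold is given, items that would overflow a fold are dropped.
    Hypothetical helper; it does not reproduce the repository script.
    """
    if n_folds <= 0:
        raise ValueError("n_folds must be positive")
    folds: List[List[str]] = [[] for _ in range(n_folds)]
    for i, item in enumerate(items):
        fold = folds[i % n_folds]
        if max_per_fold is None or len(fold) < max_per_fold:
            fold.append(item)
    return folds


# Placeholder file names, for illustration only.
videos = [f"video_{i:03d}.mp4" for i in range(10)]
folds = split_into_folds(videos, n_folds=3, max_per_fold=3)
print([len(f) for f in folds])
```

Under these assumptions, the test command above (40 folds, --max_per_fold 1248) would keep at most 40 × 1248 = 49,920 videos.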
AudioSet-Drums - (Repo) - (License) - (License of curated version)
Create corresponding directory:
mkdir datasets/drums
Download the data from this link and run:
unzip datasets/drums/AudioSet_Drums.zip -d datasets/drums
UCF101 - (Repo)
Create corresponding directory:
mkdir datasets/ucf101
Download the data from this link and run:
mkdir datasets/ucf101/videos
unrar e datasets/ucf101/UCF101.rar datasets/ucf101/videos
BAIR Robot Pushing
First, train the frame autoencoder:
bash scripts/bairhd/train_frame_autoencoder.sh
Then, train the transformer for different tasks (one should change --q_load_path in the corresponding files to point to the checkpoints of the trained autoencoder):
- Video prediction
bash scripts/bairhd/train_transformer.sh
- Point-to-point synthesis
bash scripts/bairhd/train_transformer_p2p.sh
- State-conditioned synthesis (this requires training a state estimator first and changing the corresponding --s_load_path before training the transformer)
bash scripts/bairhd/train_state_estimator.sh
bash scripts/bairhd/train_transformer_state.sh
- Unconditional synthesis
bash scripts/bairhd/train_transformer_unc.sh
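Editing --q_load_path (or --s_load_path) in the training scripts can be done by hand; the sketch below shows one way to script it. It assumes the flag appears in the script as `--flag VALUE`, separated by whitespace, with an unquoted value containing no spaces. The set_flag helper, the example command line, and the checkpoint path are all hypothetical.

```python
import re


def set_flag(script_text: str, flag: str, value: str) -> str:
    """Replace the value that follows `flag` in the text of a shell script.

    Assumes the form `--flag VALUE` (whitespace-separated, unquoted value).
    Raises ValueError if the flag does not occur.
    """
    pattern = re.compile(r"(?<!\S)(" + re.escape(flag) + r"\s+)(\S+)")
    # A function replacement avoids any backslash interpretation in `value`.
    new_text, count = pattern.subn(lambda m: m.group(1) + value, script_text)
    if count == 0:
        raise ValueError(f"{flag} not found in script")
    return new_text


# Made-up command line standing in for the contents of a training script.
example = "python train.py --name demo --q_load_path OLD_PATH --batch_size 4\n"
print(set_flag(example, "--q_load_path", "checkpoints/my_autoencoder"))
```

To apply it to a real file, read the script with pathlib.Path.read_text, pass the text through set_flag, and write the result back.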
Kinetics-600
The same applies, e.g., for video prediction:
bash scripts/kinetics/train_frame_autoencoder.sh
bash scripts/kinetics/train_transformer.sh
UCF101
The same applies, e.g., for video prediction:
bash scripts/ucf101/train_frame_autoencoder.sh
bash scripts/ucf101/train_transformer.sh
AudioSet-Drums
For audio-conditioned synthesis, we train two encoders (one to compress frames, the other to compress sound features) and then train the transformer:
bash scripts/drums/train_frame_autoencoder.sh
bash scripts/drums/train_stft_autoencoder.sh
bash scripts/drums/train_transformer_audio.sh
We provide checkpoints for various configurations:
Dataset | Future prediction | Point-to-point synthesis | State-conditioned synthesis | Sound-conditioned synthesis | Unconditional synthesis | Download |
---|---|---|---|---|---|---|
BAIR Robot Pushing | ✓ | ✓ | ✓ | ✗ | ✓ | checkpoint |
Kinetics-600 | ✓ | ✓ | ✗ | ✗ | ✗ | checkpoint |
UCF101 | ✓ | ✗ | ✗ | ✗ | ✗ | checkpoint |
AudioSet-Drums | ✓ | ✗ | ✗ | ✓ | ✗ | checkpoint |
Extract checkpoints with the following command (replacing CKPT.zip with the corresponding name):
unzip CKPT.zip -d checkpoints/
Synthesize videos from downloaded checkpoints.
BAIR Robot Pushing
bash scripts/bairhd/save_videos_state_off.sh
bash scripts/bairhd/save_videos_p2p.sh
bash scripts/bairhd/save_videos_state_on.sh
bash scripts/bairhd/save_videos_unc.sh
Kinetics-600
bash scripts/kinetics600/save_videos.sh
bash scripts/kinetics600/save_videos_p2p.sh
UCF101
bash scripts/ucf101/save_videos.sh
AudioSet-Drums
bash scripts/drums/save_videos_audio_off.sh
bash scripts/drums/save_videos_audio_on.sh
Here are some important flags:
- --vid_len: the total number of frames in synthetic videos (including conditioning frames)
- --x_cond_len: the number of tokens corresponding to conditioning frames. In the preceding experiments one frame is represented by 64 tokens, so one can set this flag to 0 for unconditional synthesis, 64 for one input frame, 128 for two, and so on.
- --keep_state: add this flag in sound- or state-conditioned synthesis to effectively use the control (otherwise sound / state are also predicted)
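The arithmetic behind --x_cond_len and --vid_len can be sketched as follows, using the 64 tokens per frame stated for the preceding experiments. The function names are illustrative only and do not exist in the codebase; a different autoencoder configuration may use a different number of tokens per frame.

```python
TOKENS_PER_FRAME = 64  # value stated for the experiments in this README


def x_cond_len(n_cond_frames: int, tokens_per_frame: int = TOKENS_PER_FRAME) -> int:
    """Number of conditioning tokens for a given number of conditioning frames."""
    if n_cond_frames < 0:
        raise ValueError("n_cond_frames must be non-negative")
    return n_cond_frames * tokens_per_frame


def n_generated_frames(vid_len: int, n_cond_frames: int) -> int:
    """Frames left to synthesize, since --vid_len also counts conditioning frames."""
    if not 0 <= n_cond_frames <= vid_len:
        raise ValueError("n_cond_frames must lie between 0 and vid_len")
    return vid_len - n_cond_frames


for n in (0, 1, 2):
    print(f"{n} conditioning frame(s) -> --x_cond_len {x_cond_len(n)}")
# e.g. a 16-frame video with 1 conditioning frame leaves 15 frames to synthesize
print(n_generated_frames(16, 1))
```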
After inference, compute evaluation metrics with the following commands:
python tools/tf_fvd/fvd.py --exp_tag TAG
python tools/pytorch_metrics/metrics.py --exp_tag TAG
where TAG is the name of the directory (inside the results/ folder) under which videos were saved during inference.
The first command computes the Fréchet video distance (FVD), and the second one the peak signal-to-noise ratio (PSNR) and the structural similarity index measure (SSIM).
One can use the --idx flag to compute PSNR / SSIM for specific timesteps.
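For reference, PSNR between a reference frame and a predicted frame follows the standard definition PSNR = 10 · log10(MAX² / MSE). The sketch below is a plain-Python illustration of that formula on flattened pixel values, assuming an 8-bit range (MAX = 255); it is not the implementation in tools/pytorch_metrics/metrics.py, whose details may differ.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], prediction: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio (in dB) between two equally sized pixel sequences."""
    if len(reference) != len(prediction) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - p) ** 2 for r, p in zip(reference, prediction)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy pixel values, for illustration only.
ref = [0.0, 128.0, 255.0, 64.0]
pred = [0.0, 130.0, 250.0, 64.0]
print(f"PSNR: {psnr(ref, pred):.2f} dB")
```

Evaluating a specific timestep amounts to applying the same formula to the frames at that index only.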
If you find this code useful in your research, please consider citing:
@inproceedings{lemoing2021ccvs,
title = {{CCVS}: Context-aware Controllable Video Synthesis},
author = {Guillaume Le Moing and Jean Ponce and Cordelia Schmid},
booktitle = {NeurIPS},
year = {2021}
}
This code borrows from StyleGAN2, minGPT, pytorch-liteflownet and VQVAE.
CCVS is released under the MIT license.