SV3D fine-tuning

Fine-tuning code for SV3D

[Result comparison: input image · before training · after training]

Setting up

Requires PyTorch 2.0.

conda create -n sv3d python=3.10.14
conda activate sv3d
pip3 install -r requirements.txt

Install DeepSpeed for training:

pip3 install deepspeed
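
To confirm the environment is usable before training, a quick sanity check (generic Python, not specific to this repo) is:

import torch
import deepspeed  # verifies the DeepSpeed install imports cleanly

print(torch.__version__)          # expect a 2.x build
print(torch.cuda.is_available())  # should be True on a GPU machine
print(deepspeed.__version__)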

Get checkpoints 💾

Store them in the following structure:

cd SV3D-fine-tuning
    .
    └── checkpoints
        └── sv3d_p.safetensors
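
If you do not have the checkpoint yet, one option is to pull it with huggingface_hub, assuming the weights come from the stabilityai/sv3d repository on Hugging Face (a gated repo, so accept the license and log in first):

from huggingface_hub import hf_hub_download

# Assumption: the checkpoint is hosted in the gated stabilityai/sv3d repo.
# Run `huggingface-cli login` before this if you have not authenticated.
hf_hub_download(
    repo_id="stabilityai/sv3d",
    filename="sv3d_p.safetensors",
    local_dir="checkpoints",
)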

Dataset 📀

Prepare the dataset as follows. We use the Objaverse 1.0 dataset with our preprocessing pipeline; see the objaverse dataloader for details. orbit_frame_0020.png is the input image, and orbit_frame.pt is the video latent encoded by the SV3D encoder without regularization (i.e., it has 8 channels). A loading sketch follows the directory tree below.

cd dataset
    .
    ├── 000-000
    │   ├── orbit_frame_0020.png  # input image
    │   └── orbit_frame.pt        # video latent
    ├── 000-001
    │   ├── orbit_frame_0020.png
    │   └── orbit_frame.pt
    └── ...
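
As a rough sketch of what each sample pair holds (the latent shape shown is an assumption based on a 21-frame orbit and the 8-channel unregularized encoder output; your preprocessing may differ):

import torch
from PIL import Image

sample_dir = "dataset/000-000"  # hypothetical sample directory

image = Image.open(f"{sample_dir}/orbit_frame_0020.png").convert("RGB")
latent = torch.load(f"{sample_dir}/orbit_frame.pt")

# The latent is the raw encoder output (mean and logvar concatenated),
# so the channel dimension should be 8, e.g. roughly (21, 8, H/8, W/8).
print(latent.shape)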

Training 🚀

I used a single A6000 GPU (48 GB VRAM) to fine-tune.

sh scripts/sv3d_finetune.sh

Inference ❄️

Store the input images in assets, then run:

sh scripts/inference.sh

Notes

  • The encoder weights of the VAE are not provided in sv3d_p.safetensors.
    • To obtain the video latents, run the encoder separately and feed the resulting latents into the training pipeline; this saves time and GPU VRAM during training.
    • Use the raw output of the VAE encoder, not a sample from the distribution defined by the encoder's mean and variance. We used AutoencoderKLTemporalDecoder, the same VAE used in the SVD pipeline. A minimal encoding sketch follows this list.
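
As a minimal sketch of how the latents could be precomputed with diffusers (the SVD VAE repo id, the 576x576 resolution, and the use of latent_dist.parameters to grab the 8-channel moments are assumptions to adapt to your setup):

import torch
from diffusers import AutoencoderKLTemporalDecoder

# Assumption: reuse the SVD VAE weights; swap the repo id if yours differ.
vae = AutoencoderKLTemporalDecoder.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", subfolder="vae"
).to("cuda").eval()

frames = torch.randn(21, 3, 576, 576, device="cuda")  # placeholder orbit frames in [-1, 1]

with torch.no_grad():
    # latent_dist.parameters is the concatenated mean and logvar
    # (8 channels): the raw encoder output, not a sample drawn from
    # the distribution, matching the note above.
    latents = vae.encode(frames).latent_dist.parameters

torch.save(latents.cpu(), "orbit_frame.pt")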

Acknowledgement 🤗

The source code is based on SV3D. Thanks for the wonderful codebase!

Additionally, GPU and NFS resources for training are supported by fal.ai🔥.

Feel free to refer to the fal Research Grants!