Fine-tuning code for SV3D
*(Results gallery: input image, generation before fine-tuning, generation after fine-tuning.)*
Set up the environment and install the dependencies:

```bash
conda create -n sv3d python=3.10.14
conda activate sv3d
pip3 install -r requirements.txt
pip3 install deepspeed
```
Download the SV3D checkpoint `sv3d_p.safetensors` and store it in the following structure at the repository root:

```
SV3D-fine-tuning
└── checkpoints
    └── sv3d_p.safetensors
```
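If the checkpoint is hosted on Hugging Face, a minimal sketch to fetch it (the `stabilityai/sv3d` repo id is an assumption, not stated in this README):

```python
# Hedged sketch: download sv3d_p.safetensors into ./checkpoints.
from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="stabilityai/sv3d",       # assumed hosting repo
    filename="sv3d_p.safetensors",
    local_dir="checkpoints",
)
```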
Prepare the dataset as follows. We use the Objaverse 1.0 dataset with a preprocessing pipeline; see the objaverse dataloader for details. `orbit_frame_0020.png` is the input image, and `orbit_frame.pt` is the video latent encoded by the SV3D encoder without regularization (i.e., it has 8 channels). A quick sanity-check sketch follows the tree below.
```
dataset
├── 000-000
│   ├── orbit_frame_0020.png   # input image
│   └── orbit_frame.pt         # video latent
├── 000-001
│   ├── orbit_frame_0020.png
│   └── orbit_frame.pt
└── ...
```
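A minimal sanity check on one prepared sample (paths match the tree above; the exact latent shape is an assumption, but the 8-channel dimension follows from storing the unregularized encoder output):

```python
# Hedged sketch: verify one prepared dataset sample.
import torch
from PIL import Image

img = Image.open("dataset/000-000/orbit_frame_0020.png")  # input image
lat = torch.load("dataset/000-000/orbit_frame.pt")        # video latent

# The latent is the raw encoder output (mean and logvar stacked),
# so the channel dimension should be 8, not the 4 of a sampled latent.
# The (T, 8, H/8, W/8) layout is an assumption for illustration.
assert lat.shape[1] == 8, f"unexpected latent shape: {lat.shape}"
print(img.size, lat.shape)
```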
We used a single A6000 GPU (48 GB VRAM) to fine-tune. Launch training with:

```bash
sh scripts/sv3d_finetune.sh
```
Store the input images in the `assets` folder, then run:

```bash
sh scripts/inference.sh
```
- The encoder weights of the VAE are not provided in `sv3d_p.safetensors`.
- To obtain the video latents, run the encoder separately and feed its outputs to the training pipeline; precomputing them saves training time and GPU VRAM.
- Note that you should use the output of the VAE encoder, not a sample from the distribution defined by the encoder's mean and variance. In our case, we used `AutoencoderKLTemporalDecoder`, the same VAE used in the SVD pipeline; a sketch of this precomputation follows below.
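A minimal sketch of that precomputation, assuming the VAE weights come from the `stabilityai/stable-video-diffusion-img2vid` Hugging Face repo (an assumption; use whichever checkpoint holds your encoder weights). In diffusers, `latent_dist.parameters` is the raw 8-channel encoder output (mean and log-variance), while `.sample()` would draw the regularized 4-channel latent that the note above says to avoid:

```python
# Hedged sketch: precompute 8-channel video latents for fine-tuning.
# The checkpoint id is an assumption; the key point is saving
# latent_dist.parameters (raw encoder output), not latent_dist.sample().
import torch
from diffusers import AutoencoderKLTemporalDecoder

device = "cuda"
vae = AutoencoderKLTemporalDecoder.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid",  # assumed source of VAE weights
    subfolder="vae",
    torch_dtype=torch.float16,
).to(device).eval()

@torch.no_grad()
def encode_orbit(frames: torch.Tensor) -> torch.Tensor:
    """frames: (T, 3, H, W) in [-1, 1]. Returns (T, 8, H/8, W/8) moments."""
    moments = []
    for frame in frames:
        dist = vae.encode(frame[None].to(device, torch.float16)).latent_dist
        moments.append(dist.parameters.cpu())  # mean + logvar, no sampling
    return torch.cat(moments, dim=0)

# Example: latents = encode_orbit(frames)
#          torch.save(latents, "dataset/000-000/orbit_frame.pt")
```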
The source code is based on SV3D. Thanks for the wonderful codebase!
Additionally, GPU and NFS resources for training are supported by fal.ai 🔥.
Feel free to refer to the fal Research Grants!