Sound effects are the unsung heroes of cinema and gaming, enhancing realism, impact, and emotional depth for an immersive audiovisual experience. FoleyCrafter is a video-to-audio generation framework that produces realistic sound effects which are semantically relevant to, and synchronized with, the input video.
Your star is our fuel! We're revving up the engines with it!
FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds
Yiming Zhang, Yicheng Gu, Yanhong Zeng†, Zhening Xing, Yuancheng Wang, Zhizheng Wu, Kai Chen†
(†Corresponding Author)
- A more powerful model 😝.
- Release training code.
- **2024/07/01**: Release the model and code of FoleyCrafter.
Use the following commands to install dependencies:

```bash
# install the conda environment
conda env create -f requirements/environment.yaml
conda activate foleycrafter

# install Git LFS for checkpoint downloads
conda install git-lfs
git lfs install
```
The checkpoints will be downloaded automatically when you run `inference.py`. You can also download them manually with the following commands:
```bash
git clone https://huggingface.co/auffusion/auffusion-full-no-adapter checkpoints/auffusion

git clone https://huggingface.co/ymzhang319/FoleyCrafter checkpoints/
```
Put the checkpoints in the following layout:

```
└── checkpoints
    ├── semantic
    │   └── semantic_adapter.bin
    ├── vocoder
    │   ├── vocoder.pt
    │   └── config.json
    ├── temporal_adapter.ckpt
    └── timestamp_detector.pth.tar
```
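If a download is interrupted, a quick sanity check of the layout can save debugging time. Below is a small hypothetical helper (not part of this repository) that verifies the expected files from the tree above exist under `checkpoints/`:

```python
from pathlib import Path

# Relative paths expected under the checkpoints directory,
# mirroring the tree shown above.
EXPECTED_FILES = [
    "semantic/semantic_adapter.bin",
    "vocoder/vocoder.pt",
    "vocoder/config.json",
    "temporal_adapter.ckpt",
    "timestamp_detector.pth.tar",
]

def missing_checkpoints(root="checkpoints"):
    """Return the list of expected checkpoint files that are absent."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]

if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All checkpoints in place.")
```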
You can launch the Gradio interface for FoleyCrafter by running the following command:
```bash
python app.py --share
```
You can also run inference directly from the command line:

```bash
python inference.py --save_dir=output/sora/
```
Results:

| Input Video | Generated Audio |
| :---: | :---: |
| 0.mp4 | 0.mp4 |
| 1.mp4 | 1.mp4 |
| 2.mp4 | 2.mp4 |
| 3.mp4 | 3.mp4 |
- Temporal Alignment with Visual Cues
```bash
python inference.py \
--temporal_align \
--input=input/avsync \
--save_dir=output/avsync/
```
Results:

| Ground Truth | Generated Audio |
| :---: | :---: |
| 0.mp4 | 0.mp4 |
| 1.mp4 | 1.mp4 |
| 2.mp4 | 2.mp4 |
- Using Prompt
```bash
# case1
python inference.py \
--input=input/PromptControl/case1/ \
--seed=10201304011203481429 \
--save_dir=output/PromptControl/case1/

python inference.py \
--input=input/PromptControl/case1/ \
--seed=10201304011203481429 \
--prompt='noisy, people talking' \
--save_dir=output/PromptControl/case1_prompt/

# case2
python inference.py \
--input=input/PromptControl/case2/ \
--seed=10021049243103289113 \
--save_dir=output/PromptControl/case2/

python inference.py \
--input=input/PromptControl/case2/ \
--seed=10021049243103289113 \
--prompt='seagulls' \
--save_dir=output/PromptControl/case2_prompt/
```
Results:

| Generated Audio (Without Prompt) | Generated Audio (Prompt: noisy, people talking) |
| :---: | :---: |
| 0.mp4 | 0.mp4 |

| Generated Audio (Without Prompt) | Generated Audio (Prompt: seagulls) |
| :---: | :---: |
| 0.mp4 | 0.mp4 |
- Using Negative Prompt
```bash
# case3
python inference.py \
--input=input/PromptControl/case3/ \
--seed=10041042941301238011 \
--save_dir=output/PromptControl/case3/

python inference.py \
--input=input/PromptControl/case3/ \
--seed=10041042941301238011 \
--nprompt='river flows' \
--save_dir=output/PromptControl/case3_nprompt/

# case4
python inference.py \
--input=input/PromptControl/case4/ \
--seed=10014024412012338096 \
--save_dir=output/PromptControl/case4/

python inference.py \
--input=input/PromptControl/case4/ \
--seed=10014024412012338096 \
--nprompt='noisy, wind noise' \
--save_dir=output/PromptControl/case4_nprompt/
```
Results:

| Generated Audio (Without Prompt) | Generated Audio (Negative Prompt: river flows) |
| :---: | :---: |
| 0.mp4 | 0.mp4 |

| Generated Audio (Without Prompt) | Generated Audio (Negative Prompt: noisy, wind noise) |
| :---: | :---: |
| 0.mp4 | 0.mp4 |
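When sweeping many cases, prompts, and seeds, it can help to assemble these command lines programmatically. Below is a minimal sketch; the `build_cmd` helper is hypothetical and not part of the repository. Note that each prompted run above reuses the seed of its no-prompt counterpart, which keeps the two outputs comparable:

```python
def build_cmd(input_dir, save_dir, seed, prompt=None, nprompt=None):
    """Assemble an inference.py invocation as an argument list."""
    cmd = [
        "python", "inference.py",
        f"--input={input_dir}",
        f"--seed={seed}",
    ]
    if prompt is not None:
        cmd.append(f"--prompt={prompt}")
    if nprompt is not None:
        cmd.append(f"--nprompt={nprompt}")
    cmd.append(f"--save_dir={save_dir}")
    return cmd

# e.g. the case4 negative-prompt run above:
cmd = build_cmd(
    "input/PromptControl/case4/",
    "output/PromptControl/case4_nprompt/",
    seed=10014024412012338096,
    nprompt="noisy, wind noise",
)
# subprocess.run(cmd) would launch it
```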
```
options:
  -h, --help            show this help message and exit
  --prompt PROMPT       prompt for audio generation
  --nprompt NPROMPT     negative prompt for audio generation
  --seed SEED           random seed
  --temporal_align TEMPORAL_ALIGN
                        use temporal adapter or not
  --temporal_scale TEMPORAL_SCALE
                        temporal align scale
  --semantic_scale SEMANTIC_SCALE
                        visual content scale
  --input INPUT         input video folder path
  --ckpt CKPT           checkpoints folder path
  --save_dir SAVE_DIR   generation result save path
  --pretrain PRETRAIN   generator checkpoint path
  --device DEVICE
```
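For reference, the help text above corresponds to an `argparse` parser along these lines. This is a sketch reconstructed from the help output, not the repository's actual code, and all default values here are assumptions:

```python
import argparse

def str2bool(value):
    """Parse a command-line boolean such as 'true', 'yes', or '1'."""
    return str(value).lower() in ("1", "true", "yes")

def get_parser():
    # Flags and help strings mirror the options listed above;
    # defaults are illustrative assumptions.
    parser = argparse.ArgumentParser(description="FoleyCrafter inference")
    parser.add_argument("--prompt", type=str, default="", help="prompt for audio generation")
    parser.add_argument("--nprompt", type=str, default="", help="negative prompt for audio generation")
    parser.add_argument("--seed", type=int, default=42, help="random seed")
    parser.add_argument("--temporal_align", type=str2bool, default=True, help="use temporal adapter or not")
    parser.add_argument("--temporal_scale", type=float, default=0.2, help="temporal align scale")
    parser.add_argument("--semantic_scale", type=float, default=1.0, help="visual content scale")
    parser.add_argument("--input", type=str, default="input/", help="input video folder path")
    parser.add_argument("--ckpt", type=str, default="checkpoints/", help="checkpoints folder path")
    parser.add_argument("--save_dir", type=str, default="output/", help="generation result save path")
    parser.add_argument("--pretrain", type=str, default="", help="generator checkpoint path")
    parser.add_argument("--device", type=str, default="cuda")
    return parser

args = get_parser().parse_args(["--seed=123", "--prompt=seagulls"])
```

`str2bool` is used instead of `type=bool` because `bool("false")` is `True` in Python, a common argparse pitfall.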
```bibtex
@misc{zhang2024foleycrafter,
  title={FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds},
  author={Yiming Zhang and Yicheng Gu and Yanhong Zeng and Zhening Xing and Yuancheng Wang and Zhizheng Wu and Kai Chen},
  year={2024},
  eprint={2407.01494},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```
Yiming Zhang: zhangyiming@pjlab.org.cn
Yicheng Gu: yichenggu@link.cuhk.edu.cn
Yanhong Zeng: zengyanhong@pjlab.org.cn
Please check the Apache-2.0 license for details.
The code is built upon Auffusion, CondFoleyGen and SpecVQGAN.
We also recommend Amphion 💝, a toolkit for audio, music, and speech generation.