FoleyCrafter

Sound effects are the unsung heroes of cinema and gaming, enhancing realism, impact, and emotional depth for an immersive audiovisual experience. FoleyCrafter is a video-to-audio generation framework that produces realistic sound effects that are semantically relevant to, and synchronized with, the input video.

Your star is our fuel! We're revving up the engines with it!

FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds

Yiming Zhang, Yicheng Gu, Yanhong Zeng†, Zhening Xing, Yuancheng Wang, Zhizheng Wu, Kai Chen†

(†Corresponding Author)

What's New

  • A more powerful one 😝 .
  • Release training code.
  • 2024/07/01 Release the model and code of FoleyCrafter.

Setup

Prepare Environment

Use the following commands to install dependencies:

# install conda environment
conda env create -f requirements/environment.yaml
conda activate foleycrafter

# install Git LFS for downloading checkpoints
conda install git-lfs
git lfs install

Download Checkpoints

The checkpoints are downloaded automatically when you run inference.py.

You can also download them manually using the following commands.

  • Download the text-to-audio base model. We use Auffusion.
    git clone https://huggingface.co/auffusion/auffusion-full-no-adapter checkpoints/auffusion
  • Download FoleyCrafter.
    git clone https://huggingface.co/ymzhang319/FoleyCrafter checkpoints/

    Put checkpoints as follows:

    └── checkpoints
        ├── semantic
        │   └── semantic_adapter.bin
        ├── vocoder
        │   ├── vocoder.pt
        │   └── config.json
        ├── temporal_adapter.ckpt
        └── timestamp_detector.pth.tar
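
After a manual download, it can help to confirm that the files sit where the layout above expects them. The following is a minimal sketch, not part of FoleyCrafter itself: the helper name `missing_checkpoints` and the `EXPECTED_FILES` list are illustrative, and the paths are copied from the tree above. It does not check the Auffusion base model under `checkpoints/auffusion`.

```python
from pathlib import Path

# Relative paths copied from the checkpoint layout shown above.
EXPECTED_FILES = [
    "semantic/semantic_adapter.bin",
    "vocoder/vocoder.pt",
    "vocoder/config.json",
    "temporal_adapter.ckpt",
    "timestamp_detector.pth.tar",
]


def missing_checkpoints(root="checkpoints"):
    """Return the expected checkpoint files that are absent under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing checkpoint files:", ", ".join(missing))
    else:
        print("All expected checkpoint files are present.")
```

Run it from the repository root; an empty result means every file in the listing was found.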
    

    Gradio demo

    You can launch the Gradio interface for FoleyCrafter by running the following command:

    python app.py --share

    Inference

    Video To Audio Generation

    python inference.py --save_dir=output/sora/

    Results (video demos; columns: Input Video, Generated Audio): 0.mp4, 1.mp4, 2.mp4, 3.mp4

    • Temporal Alignment with Visual Cues
    python inference.py \
    --temporal_align \
    --input=input/avsync \
    --save_dir=output/avsync/

    Results (video demos; columns: Ground Truth, Generated Audio): 0.mp4, 1.mp4, 2.mp4

    Text-based Video to Audio Generation

    • Using Prompt
    # case1
    python inference.py \
    --input=input/PromptControl/case1/ \
    --seed=10201304011203481429 \
    --save_dir=output/PromptControl/case1/
    
    python inference.py \
    --input=input/PromptControl/case1/ \
    --seed=10201304011203481429 \
    --prompt='noisy, people talking' \
    --save_dir=output/PromptControl/case1_prompt/
    
    # case2
    python inference.py \
    --input=input/PromptControl/case2/ \
    --seed=10021049243103289113 \
    --save_dir=output/PromptControl/case2/
    
    python inference.py \
    --input=input/PromptControl/case2/ \
    --seed=10021049243103289113 \
    --prompt='seagulls' \
    --save_dir=output/PromptControl/case2_prompt/

    Results (video demos of the generated audio, 0.mp4 in each case):

      • case1: Without Prompt vs. Prompt: noisy, people talking
      • case2: Without Prompt vs. Prompt: seagulls

    • Using Negative Prompt
    # case3
    python inference.py \
    --input=input/PromptControl/case3/ \
    --seed=10041042941301238011 \
    --save_dir=output/PromptControl/case3/
    
    python inference.py \
    --input=input/PromptControl/case3/ \
    --seed=10041042941301238011 \
    --nprompt='river flows' \
    --save_dir=output/PromptControl/case3_nprompt/
    
    # case4
    python inference.py \
    --input=input/PromptControl/case4/ \
    --seed=10014024412012338096 \
    --save_dir=output/PromptControl/case4/
    
    python inference.py \
    --input=input/PromptControl/case4/ \
    --seed=10014024412012338096 \
    --nprompt='noisy, wind noise' \
    --save_dir=output/PromptControl/case4_nprompt/
    

    Results (video demos of the generated audio, 0.mp4 in each case):

      • case3: Without Prompt vs. Negative Prompt: river flows
      • case4: Without Prompt vs. Negative Prompt: noisy, wind noise

    Commandline Usage Parameters

    options:
      -h, --help            show this help message and exit
      --prompt PROMPT       prompt for audio generation
      --nprompt NPROMPT     negative prompt for audio generation
      --seed SEED           random seed
      --temporal_align TEMPORAL_ALIGN
                            use temporal adapter or not
      --temporal_scale TEMPORAL_SCALE
                            temporal align scale
      --semantic_scale SEMANTIC_SCALE
                            visual content scale
      --input INPUT         input video folder path
      --ckpt CKPT           checkpoints folder path
      --save_dir SAVE_DIR   generation result save path
      --pretrain PRETRAIN   generator checkpoint path
      --device DEVICE
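
When scripting many runs, the options above can be assembled from Python rather than typed by hand. The sketch below is an illustration under stated assumptions, not part of the repository: `build_inference_command` and `KNOWN_OPTIONS` are hypothetical names, the flag set is copied from the help listing, and a value of `True` is emitted as a bare flag because that is how `--temporal_align` appears in the example command earlier in this README.

```python
import shlex

# Option names copied from the help listing above; anything else is rejected.
KNOWN_OPTIONS = {
    "prompt", "nprompt", "seed", "temporal_align", "temporal_scale",
    "semantic_scale", "input", "ckpt", "save_dir", "pretrain", "device",
}


def build_inference_command(script="inference.py", **options):
    """Return an argv list that invokes `script` with the given options."""
    unknown = set(options) - KNOWN_OPTIONS
    if unknown:
        raise ValueError(f"unknown option(s): {sorted(unknown)}")
    argv = ["python", script]
    for name, value in options.items():
        if value is None or value is False:
            continue  # option not requested, leave it out
        if value is True:
            argv.append(f"--{name}")  # bare flag (assumption, see lead-in)
        else:
            argv.append(f"--{name}={value}")
    return argv


# Same seed with and without a negative prompt, mirroring case3 above.
seed = 10041042941301238011
baseline = build_inference_command(
    input="input/PromptControl/case3/",
    seed=seed,
    save_dir="output/PromptControl/case3/",
)
negative = build_inference_command(
    input="input/PromptControl/case3/",
    seed=seed,
    nprompt="river flows",
    save_dir="output/PromptControl/case3_nprompt/",
)
print(shlex.join(baseline))
print(shlex.join(negative))
```

The returned list can be passed to `subprocess.run(argv, check=True)`. Building a list instead of a single string avoids shell quoting problems with prompts that contain spaces or commas.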

    BibTeX

    @misc{zhang2024pia,
      title={FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds},
      author={Yiming Zhang and Yicheng Gu and Yanhong Zeng and Zhening Xing and Yuancheng Wang and Zhizheng Wu and Kai Chen},
      year={2024},
      eprint={2407.01494},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
    }
    

    Contact Us

    Yiming Zhang: zhangyiming@pjlab.org.cn

    Yicheng Gu: yichenggu@link.cuhk.edu.cn

    Yanhong Zeng: zengyanhong@pjlab.org.cn

    LICENSE

    See the Apache-2.0 license for details.

    Acknowledgements

    The code is built upon Auffusion, CondFoleyGen and SpecVQGAN.

    We also recommend Amphion, a toolkit for audio, music, and speech generation 💝.