PyTorch implementation of [AudioLCM (ACM-MM'24)]: efficient and high-quality text-to-audio generation with a latent consistency model.
We provide our implementation and pretrained models as open-source in this repository.
Visit our demo page for audio samples.
- Oct, 2024: FlashAudio released.
- Sept, 2024: Make-An-Audio 3 (Lumina-Next) accepted by NeurIPS'24.
- July, 2024: AudioLCM accepted by ACM-MM'24.
- June, 2024: Make-An-Audio 3 (Lumina-Next) released in Github and HuggingFace.
- May, 2024: AudioLCM released in Github and HuggingFace.
We provide an example of how you can generate high-fidelity samples quickly using AudioLCM.
To try it on your own dataset, simply clone this repo on a local machine with an NVIDIA GPU + CUDA cuDNN and follow the instructions below.
Simply download the weights from Hugging Face.
Download:
- audiolcm.ckpt and put it into ./ckpts
- BigVGAN vocoder and put it into ./vocoder/logs/bigvnat16k93.5w
- t5-v1_1-large and put it into ./ldm/modules/encoders/CLAP
- bert-base-uncased and put it into ./ldm/modules/encoders/CLAP
- CLAP_weights_2022.pth and put it into ./wav_evaluation/useful_ckpts/CLAP
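If you prefer to script the downloads, the Hugging Face Hub client can place files into the directories above. This is only a sketch: t5-v1_1-large and bert-base-uncased are real public model repos, but the AudioLCM repo id below is a placeholder for the checkpoint page linked by this project, and the target subdirectory names are assumptions.

```python
# Sketch: fetch pretrained weights into the paths listed above.
# google/t5-v1_1-large and bert-base-uncased are public Hub repos; the
# AudioLCM repo id is a placeholder -- replace it with the repo linked on
# this project's Hugging Face page. The same pattern applies to the
# BigVGAN vocoder and the CLAP weights.
from huggingface_hub import hf_hub_download, snapshot_download

# Text encoders (target subdirectory names are assumptions)
snapshot_download("google/t5-v1_1-large",
                  local_dir="./ldm/modules/encoders/CLAP/t5-v1_1-large")
snapshot_download("bert-base-uncased",
                  local_dir="./ldm/modules/encoders/CLAP/bert-base-uncased")

# Main AudioLCM checkpoint (placeholder repo id)
hf_hub_download(repo_id="<audiolcm-hf-repo>", filename="audiolcm.ckpt",
                local_dir="./ckpts")
```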
Install the dependencies listed in requirement.txt.
Run the following command to generate samples from the captions in the test set:
python scripts/txt2audio_for_lcm.py --ddim_steps 2 -b configs/audiolcm.yaml --sample_rate 16000 --vocoder-ckpt vocoder/logs/bigvnat16k93.5w --outdir results --test-dataset audiocaps -r ckpts/audiolcm.ckpt
- We can't provide the dataset download link due to copyright issues. We provide the preprocessing code to generate mel-spectrograms.
- Before training, we need to gather the dataset information into a tsv file with the columns name (id of each audio), dataset (which dataset the audio belongs to), audio_path (path to the .wav file), caption (caption of the audio), and mel_path (path to the processed mel-spectrogram file of each audio); a minimal example is sketched below.
- We provide a tsv file of the audiocaps test set, ./audiocaps_test_16000_struct.tsv, as a sample.
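Purely as an illustration of the expected layout, here is a sketch that writes such a tsv with pandas; the row below is made up, and the mel_path column is added later by the preprocessing script.

```python
# Sketch: build the dataset tsv with the columns the training code expects.
# The example row is made up; the mel_path column is filled in by
# ldm/data/preprocess/mel_spec.py after the mels are computed.
import pandas as pd

rows = [
    {"name": "audio_0001", "dataset": "audiocaps",
     "audio_path": "data/audiocaps/audio_0001.wav",
     "caption": "A dog barks while birds chirp in the background"},
]
pd.DataFrame(rows).to_csv("tmp.tsv", sep="\t", index=False)
```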
Assume you already have a tsv file linking each caption to its audio_path, i.e. the tsv file has "name", "audio_path", "dataset", and "caption" columns. To get the mel-spectrogram of each audio clip, run the following command, which will save the mels in ./processed:
python ldm/data/preprocess/mel_spec.py --tsv_path tmp.tsv
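For intuition only, the snippet below shows how a log-mel spectrogram of a 16 kHz clip could be computed with torchaudio; the actual parameters (FFT size, hop length, number of mel bins, output format) are defined by mel_spec.py, and the values here are assumptions.

```python
# Illustration only: a log-mel spectrogram of a 16 kHz clip.
# The real preprocessing lives in ldm/data/preprocess/mel_spec.py; the
# n_fft/hop_length/n_mels values and the output format below are assumptions.
import os
import torch
import torchaudio

wav, sr = torchaudio.load("data/audiocaps/audio_0001.wav")
wav = torchaudio.functional.resample(wav, sr, 16000)
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=1024, hop_length=256, n_mels=80)(wav)
log_mel = torch.log(torch.clamp(mel, min=1e-5))

os.makedirs("processed", exist_ok=True)
torch.save(log_mel, "processed/audio_0001_mel.pt")
```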
Add the duration of each audio clip to the tsv file:
python ldm/data/preprocess/add_duration.py
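add_duration.py is the supported way to do this; just to show what the extra column is, here is an equivalent sketch using soundfile (the tsv path is the tmp.tsv from above).

```python
# Sketch: add a "duration" column (in seconds) to the tsv.
# Equivalent in spirit to ldm/data/preprocess/add_duration.py, which is the
# supported path; soundfile is used here only for illustration.
import pandas as pd
import soundfile as sf

df = pd.read_csv("tmp.tsv", sep="\t")
df["duration"] = [sf.info(p).duration for p in df["audio_path"]]
df.to_csv("tmp.tsv", sep="\t", index=False)
```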
Assume we have processed several datasets and saved the .tsv files in data/*.tsv. Replace data.params.spec_dir_path in the config file with the directory that contains the tsv files (here, data/). Then we can train the VAE with the following command. If you don't have 8 GPUs on your machine, change --gpus 0,1,...,gpu_nums to match the GPUs you have.
python main.py --base configs/train/vae.yaml -t --gpus 0,1,2,3,4,5,6,7
The training result will be saved in ./logs/
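The config edits in this section (data.params.spec_dir_path here, and model.params.first_stage_config.params.ckpt_path below) can be made in any text editor; if you prefer to script them, here is a minimal sketch that assumes the configs are plain YAML loadable with OmegaConf.

```python
# Sketch: point data.params.spec_dir_path at the directory holding the tsv
# files. Assumes the training configs are plain YAML that OmegaConf can load;
# the same pattern works for model.params.first_stage_config.params.ckpt_path.
from omegaconf import OmegaConf

cfg = OmegaConf.load("configs/train/vae.yaml")
OmegaConf.update(cfg, "data.params.spec_dir_path", "data/")
OmegaConf.save(cfg, "configs/train/vae.yaml")
```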
After training the VAE, replace model.params.first_stage_config.params.ckpt_path in the config file with the path to your trained VAE checkpoint. Then run the following command to train the diffusion model:
python main.py --base configs/audiolcm.yaml -t --gpus 0,1,2,3,4,5,6,7
The training result will be saved in ./logs/
For evaluation, please refer to Make-An-Audio.
This implementation uses parts of the code from the following GitHub repos: Make-An-Audio, CLAP, and Stable Diffusion, as described in our code.
If you find this code useful in your research, please consider citing:
@misc{liu2024audiolcm,
title={AudioLCM: Text-to-Audio Generation with Latent Consistency Models},
author={Huadai Liu and Rongjie Huang and Yang Liu and Hengyuan Cao and Jialei Wang and Xize Cheng and Siqi Zheng and Zhou Zhao},
year={2024},
eprint={2406.00356},
archivePrefix={arXiv},
primaryClass={eess.AS}
}
Any organization or individual is prohibited from using any technology mentioned in this paper to generate someone's speech without his/her consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this item, you could be in violation of copyright laws.