PyTorch implementation of [AudioLCM (ACM-MM'24)]: efficient and high-quality text-to-audio generation with a latent consistency model.
We provide our implementation and pretrained models as open-source in this repository.
Visit our demo page for audio samples.
- Oct, 2024: FlashAudio released.
- Sept, 2024: Make-An-Audio 3 (Lumina-Next) accepted by NeurIPS'24.
- July, 2024: AudioLCM accepted by ACM-MM'24.
- June, 2024: Make-An-Audio 3 (Lumina-Next) released in Github and HuggingFace.
- May, 2024: AudioLCM released in Github and HuggingFace.
We provide an example of how you can generate high-fidelity samples quickly using AudioLCM.
To try it on your own dataset, simply clone this repo on a local machine with an NVIDIA GPU + CUDA cuDNN and follow the instructions below.
Simply download the weights from Hugging Face.
Download:
- audiolcm.ckpt and put it into ./ckpts
- BigVGAN vocoder and put it into ./vocoder/logs/bigvnat16k93.5w
- t5-v1_1-large and put it into ./ldm/modules/encoders/CLAP
- bert-base-uncased and put it into ./ldm/modules/encoders/CLAP
- CLAP_weights_2022.pth and put it into ./wav_evaluation/useful_ckpts/CLAP
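If you prefer to script the downloads, the Hugging Face Hub client can place files into the directories above. This is only a sketch: t5-v1_1-large and bert-base-uncased are real public model repos, but the AudioLCM repo id below is a placeholder for the checkpoint page linked by this project, and the target subdirectory names are assumptions.

```python
# Sketch: fetch pretrained weights into the paths listed above.
# google/t5-v1_1-large and bert-base-uncased are public Hub repos; the
# AudioLCM repo id is a placeholder -- replace it with the repo linked on
# this project's Hugging Face page. The same pattern applies to the
# BigVGAN vocoder and the CLAP weights.
from huggingface_hub import hf_hub_download, snapshot_download

# Text encoders (target subdirectory names are assumptions)
snapshot_download("google/t5-v1_1-large",
                  local_dir="./ldm/modules/encoders/CLAP/t5-v1_1-large")
snapshot_download("bert-base-uncased",
                  local_dir="./ldm/modules/encoders/CLAP/bert-base-uncased")

# Main AudioLCM checkpoint (placeholder repo id)
hf_hub_download(repo_id="<audiolcm-hf-repo>", filename="audiolcm.ckpt",
                local_dir="./ckpts")
```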
Install the dependencies listed in requirement.txt.
Run the following command to generate samples from the captions in the test set:
python scripts/txt2audio_for_lcm.py --ddim_steps 2 -b configs/audiolcm.yaml --sample_rate 16000 --vocoder-ckpt vocoder/logs/bigvnat16k93.5w --outdir results --test-dataset audiocaps -r ckpts/audiolcm.ckpt
- We can't provide the dataset download link due to copyright issues. We provide the preprocessing code to generate mel-spectrograms.
- Before training, we need to gather the dataset information into a tsv file with the columns name (id of each audio), dataset (which dataset the audio belongs to), audio_path (path to the .wav file), caption (caption of the audio), and mel_path (path to the processed mel-spectrogram file of each audio); a minimal example is sketched below.
- We provide a tsv file of the audiocaps test set, ./audiocaps_test_16000_struct.tsv, as a sample.
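Purely as an illustration of the expected layout, here is a sketch that writes such a tsv with pandas; the row below is made up, and the mel_path column is added later by the preprocessing script.

```python
# Sketch: build the dataset tsv with the columns the training code expects.
# The example row is made up; the mel_path column is filled in by
# ldm/data/preprocess/mel_spec.py after the mels are computed.
import pandas as pd

rows = [
    {"name": "audio_0001", "dataset": "audiocaps",
     "audio_path": "data/audiocaps/audio_0001.wav",
     "caption": "A dog barks while birds chirp in the background"},
]
pd.DataFrame(rows).to_csv("tmp.tsv", sep="\t", index=False)
```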
Assume you already have a tsv file linking each caption to its audio_path, i.e. the tsv file has "name", "audio_path", "dataset", and "caption" columns. To get the mel-spectrogram of each audio clip, run the following command, which will save the mels in ./processed:
python ldm/data/preprocess/mel_spec.py --tsv_path tmp.tsv
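For intuition only, the snippet below shows how a log-mel spectrogram of a 16 kHz clip could be computed with torchaudio; the actual parameters (FFT size, hop length, number of mel bins, output format) are defined by mel_spec.py, and the values here are assumptions.

```python
# Illustration only: a log-mel spectrogram of a 16 kHz clip.
# The real preprocessing lives in ldm/data/preprocess/mel_spec.py; the
# n_fft/hop_length/n_mels values and the output format below are assumptions.
import os
import torch
import torchaudio

wav, sr = torchaudio.load("data/audiocaps/audio_0001.wav")
wav = torchaudio.functional.resample(wav, sr, 16000)
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=1024, hop_length=256, n_mels=80)(wav)
log_mel = torch.log(torch.clamp(mel, min=1e-5))

os.makedirs("processed", exist_ok=True)
torch.save(log_mel, "processed/audio_0001_mel.pt")
```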
Add the duration of each audio clip to the tsv file:
python ldm/data/preprocess/add_duration.py
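add_duration.py is the supported way to do this; just to show what the extra column is, here is an equivalent sketch using soundfile (the tsv path is the tmp.tsv from above).

```python
# Sketch: add a "duration" column (in seconds) to the tsv.
# Equivalent in spirit to ldm/data/preprocess/add_duration.py, which is the
# supported path; soundfile is used here only for illustration.
import pandas as pd
import soundfile as sf

df = pd.read_csv("tmp.tsv", sep="\t")
df["duration"] = [sf.info(p).duration for p in df["audio_path"]]
df.to_csv("tmp.tsv", sep="\t", index=False)
```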
Assume we have processed several datasets and saved the .tsv files in data/*.tsv. Replace data.params.spec_dir_path in the config file with the directory that contains the tsv files (here, data/). Then we can train the VAE with the following command. If you don't have 8 GPUs on your machine, change --gpus 0,1,...,gpu_nums to match the GPUs you have.
python main.py --base configs/train/vae.yaml -t --gpus 0,1,2,3,4,5,6,7
The training result will be saved in ./logs/
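The config edits in this section (data.params.spec_dir_path here, and model.params.first_stage_config.params.ckpt_path below) can be made in any text editor; if you prefer to script them, here is a minimal sketch that assumes the configs are plain YAML loadable with OmegaConf.

```python
# Sketch: point data.params.spec_dir_path at the directory holding the tsv
# files. Assumes the training configs are plain YAML that OmegaConf can load;
# the same pattern works for model.params.first_stage_config.params.ckpt_path.
from omegaconf import OmegaConf

cfg = OmegaConf.load("configs/train/vae.yaml")
OmegaConf.update(cfg, "data.params.spec_dir_path", "data/")
OmegaConf.save(cfg, "configs/train/vae.yaml")
```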
After training the VAE, replace model.params.first_stage_config.params.ckpt_path in the config file with the path to your trained VAE checkpoint. Then run the following command to train the diffusion model:
python main.py --base configs/audiolcm.yaml -t --gpus 0,1,2,3,4,5,6,7
The training result will be saved in ./logs/
For evaluation, please refer to Make-An-Audio.
This implementation uses parts of the code from the following GitHub repos: Make-An-Audio, CLAP, and Stable Diffusion, as described in our code.
If you find this code useful in your research, please consider citing:
@misc{liu2024audiolcm,
title={AudioLCM: Text-to-Audio Generation with Latent Consistency Models},
author={Huadai Liu and Rongjie Huang and Yang Liu and Hengyuan Cao and Jialei Wang and Xize Cheng and Siqi Zheng and Zhou Zhao},
year={2024},
eprint={2406.00356},
archivePrefix={arXiv},
primaryClass={eess.AS}
}
Any organization or individual is prohibited from using any technology mentioned in this paper to generate someone's speech without his/her consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this item, you could be in violation of copyright laws.