FastSpeech2-Clean

This repository provides a clean and modernized implementation of FastSpeech2 and LightSpeech. Existing repositories often have problems with reproducibility or contain bugs. This version aims to solve these problems by providing a cleaner and more up-to-date code base.

Currently, pre-processing and training are only implemented for Chinese speech datasets. However, the scripts are designed to be easily adapted to other languages.

Models

If you have suggestions to further enhance speech quality, please contribute by opening an issue or pull request.

AISHELL-3 (85.62 hours) and biaobei (11.86 hours)

| Model | UTMOS | CER | Val Loss | Params |
|-------|-------|-----|----------|--------|
| FastSpeech2 | 2.8628 | 0.2600 | 0.5460 | 25.36M |
| LightSpeech, d_model=512 | 2.7543 | 0.2603 | 0.5569 | 6.36M |
| LightSpeech, d_model=256 | 2.6096 | 0.2654 | 0.5716 | 1.67M |
| Ground Truth | 2.5376 | 0.2895 | 0.0 | - |

Even more data (total: 156.37 hours)

| Model | UTMOS | CER | Val Loss | Params |
|-------|-------|-----|----------|--------|
| LightSpeech, d_model=512 | 2.7720 | 0.2568 | 0.6322 | 6.36M |
| LightSpeech, d_model=256 | 2.6359 | 0.2607 | 0.6485 | 1.67M |
| Ground Truth | 2.5396 | 0.2911 | 0.0 | - |

Notes

  • MOS is calculated using UTMOS (higher is better), and CER is calculated using Whisper (lower is better); a minimal evaluation sketch follows this list.
  • The "ground truth" refers to the reconstruction of the true mel spectrograms by the vocoder bigvgan_v2_22khz_80band_fmax8k_256x.
  • To synthesize audio from the predicted spectrograms, hifigan_universal_v1 was used. Note that hifigan_lj_ft_t2_v1, hifigan_lj_v1 or bigvgan_base_22khz_80band can give better results. See also my other repository.
  • Approximately 20% of the dataset is used for validation.
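The following is a minimal sketch of how these two metrics can be computed. The UTMOS torch.hub entry point (tarepan/SpeechMOS), the Whisper model size, and the file name are assumptions for illustration; the actual evaluation code in this repository may differ.

```python
# A minimal UTMOS/CER evaluation sketch; the hub entry point, Whisper model
# size, and file names are assumptions, not necessarily what this repo uses.
import torch
import torchaudio
import whisper
from jiwer import cer

wav, sr = torchaudio.load("sample.wav")  # a generated utterance

# MOS estimation with a UTMOS predictor (higher is better)
utmos = torch.hub.load("tarepan/SpeechMOS:v1.2.0", "utmos22_strong", trust_repo=True)
mos = utmos(wav, sr).item()

# Character error rate against the reference transcript (lower is better)
asr = whisper.load_model("small")
hypothesis = asr.transcribe("sample.wav", language="zh")["text"]
reference = "展览将对全体观众实行免费入场提供义务讲解"
error_rate = cer(reference, hypothesis)

print(f"UTMOS: {mos:.4f}  CER: {error_rate:.4f}")
```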

Audio Samples

Hanzi: 展览将对全体观众 实行免费入场 提供义务讲解
Pinyin: zhan2 lan3 jiang1 dui4 quan2 ti3 guan1 zhong4 <sil> shi2 xing2 mian3 fei4 ru4 chang3 <sil> ti2 gong1 yi4 wu4 jiang2 jie3
IPA: ʈʂan2 lan3 tɕjaŋ1 twei̯4 tɕʰɥɛn2 tʰi3 kwan1 ʈʂʊŋ4 <sil>1 ʂɻ̩2 ɕiŋ2 mjɛn3 fei̯4 ɻu4 ʈʂʰaŋ3 <sil>1 tʰi2 kʊŋ1 i4 u4 tɕjaŋ2 tɕje3

LightSpeech, d_model=512

lightspeech_new.mp4

FastSpeech2

fastspeech2.mp4

Ground truth (true mel spectrogram reconstructed by the vocoder)

ground_truth.mp4

Prediction

After downloading a model, you can generate speech from Chinese characters, pinyin, or the International Phonetic Alphabet (IPA). Only PyTorch is required; matplotlib, librosa, and g2pw are optional.

Example Commands

See python predict.py --help for all commands.

  • Simplified Chinese: python predict.py "开车慢慢前行" --type simplified --model fastspeech2.pt --model_class "FastSpeech2" --speaker 0
  • Pinyin: python predict.py "kai1 che1 man4 man4 qian2 xing2" --type pinyin --model fastspeech2.pt --model_class "FastSpeech2" --speaker 1
  • IPA: python predict.py "kʰai̯1 ʈʂʰɤ1 man4 man4 tɕʰjɛn2 ɕiŋ2" --type ipa --model fastspeech2.pt --model_class "FastSpeech2" --speaker 1
  • Listing available speakers: python predict.py --list-speakers --model fastspeech2.pt
  • Simulating Other Languages: Since the model is trained on phonemes, it can simulate other languages. For example, "how are you?" could be transcribed in IPA as "hau̯2 aɻ2 ju2". However, the quality for out-of-distribution words is not as good.

The supported phones are ['<sil>', 'n', 'a', 'ŋ', 'j', 'i', 'w', 't', 'ɤ', 'ʂ', 'ə', 'tɕ', 'u', 'ɛ', 'ou̯', 'l', 'ʈʂ', 'ɕ', 'p', 'au̯', 'k', 'ei̯', 'ai̯', 'o', 'tɕʰ', 'm', 'ʊ', 'tʰ', 'ts', 'ʐ̩', 'ʈʂʰ', 's', 'y', 'f', 'e', 'ɻ̩', 'x', 'ɥ', 'ɹ̩', 'h', 'kʰ', 'pʰ', 'tsʰ', 'ɻ', 'ʐ', 'aɚ̯', 'ɚ', 'z̩', 'ɐ', 'ou̯˞', 'ɔ', 'ɤ̃', 'u˞', 'œ', 'ɑ̃', 'ʊ̃']

Here, <sil> denotes a silence marker.
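If you want to see how the optional g2pw dependency converts Hanzi to pinyin on its own, the following is a minimal sketch based on the g2pw package's documented interface; predict.py may wrap this step differently.

```python
# A minimal Hanzi-to-pinyin sketch with the optional g2pw dependency;
# predict.py may handle this conversion differently.
from g2pw import G2PWConverter

converter = G2PWConverter()
pinyin = converter("开车慢慢前行")
print(pinyin)  # e.g. [['kai1', 'che1', 'man4', 'man4', 'qian2', 'xing2']]
```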

ONNX Deployment

The script convert_to_onnx.py can be used to convert our models into the ONNX format, allowing deployment in various environments, including browsers via WebAssembly. Below is a comparison of the inference speed in Firefox using WebAssembly. The vocoder used is HiFi-GAN with three variants: v1, v2, and v3. A minimal export sketch follows the table.

| Model | Vocoder | Load Time (ms) | Avg. Inference Time (ms) |
|-------|---------|----------------|--------------------------|
| LightSpeech (d_model=512) | v1 | 517.00 | 2348.38 |
| LightSpeech (d_model=512) | v2 | 125.00 | 223.74 |
| LightSpeech (d_model=512) | v3 | 124.00 | 230.26 |
| LightSpeech (d_model=256) | v1 | 237.00 | 2437.76 |
| LightSpeech (d_model=256) | v2 | 113.00 | 190.44 |
| LightSpeech (d_model=256) | v3 | 83.00 | 209.72 |
| FastSpeech2 | v1 | 704.00 | 2992.72 |
| FastSpeech2 | v2 | 481.00 | 371.70 |
| FastSpeech2 | v3 | 504.00 | 379.18 |
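For reference, here is a minimal sketch of exporting an acoustic model to ONNX and running it with onnxruntime. The checkpoint name, input names, and shapes are assumptions for illustration; see convert_to_onnx.py for the actual export used in this repository.

```python
# A minimal ONNX export/inference sketch. Input names and shapes below are
# assumptions for illustration, not the exact interface of convert_to_onnx.py.
import torch
import onnxruntime as ort

model = torch.load("lightspeech.pt", map_location="cpu")  # hypothetical checkpoint
model.eval()

phonemes = torch.randint(0, 56, (1, 20))  # dummy phoneme IDs, shape (batch, time)
speaker = torch.tensor([0])               # dummy speaker ID

torch.onnx.export(
    model,
    (phonemes, speaker),
    "lightspeech.onnx",
    input_names=["phonemes", "speaker"],
    output_names=["mel"],
    dynamic_axes={"phonemes": {1: "time"}, "mel": {1: "frames"}},
    opset_version=17,
)

session = ort.InferenceSession("lightspeech.onnx")
mel = session.run(None, {"phonemes": phonemes.numpy(), "speaker": speaker.numpy()})[0]
print(mel.shape)  # mel spectrogram frames produced by the exported model
```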

Instructions for training

Preparing training data

Organize all audio files in a directory named dataset with the following structure: dataset/SPEAKER_NAME/FILENAME.wav and dataset/SPEAKER_NAME/FILENAME.TextGrid. For instance, the file SSB00050001.wav from AISHELL-3 would be located at dataset/SSB0005/SSB00050001.wav.

For AISHELL-3 (Apache v2.0 license) and biaobei (non-commercial use only), pre-generated TextGrid files are available in this repository. However, you can also generate your own annotations if needed.

Make sure that the .TextGrid files contain the following tiers: "hanzis", "pinyins", "phones".
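As a quick sanity check, the tgt package (installed in the next step) can be used to verify that a TextGrid file contains these tiers. The file path below is just an example.

```python
# A minimal sketch for inspecting the required tiers with the tgt package;
# the file path is illustrative.
import tgt

textgrid = tgt.io.read_textgrid("dataset/SSB0005/SSB00050001.TextGrid")

for tier_name in ("hanzis", "pinyins", "phones"):
    tier = textgrid.get_tier_by_name(tier_name)
    for interval in tier.intervals:
        print(tier_name, interval.start_time, interval.end_time, interval.text)
```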

Installing packages

  1. conda create --name fastspeech2-clean python=3.11 pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
  2. conda activate fastspeech2-clean
  3. pip install librosa pandas penn tgt openai-whisper jiwer

Preprocessing

Preprocess your dataset with preprocess.py. This script is tailored to the Chinese language, but it can be adapted to other languages. Make sure to change DATASET_PATH and OUTPUT_PATH in preprocess.py if your input or output files live in a different location. A rough sketch of the mel extraction step follows.
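The following is a minimal sketch of the kind of mel extraction such a preprocessing step performs. The sampling rate, number of mel bands, fmax, and hop length are inferred from the vocoder name (bigvgan_v2_22khz_80band_fmax8k_256x); the FFT size and file paths are assumptions and may not match preprocess.py exactly.

```python
# A minimal mel-extraction sketch. Parameters are inferred from the vocoder
# name (22 kHz, 80 bands, fmax 8 kHz, hop 256); n_fft and paths are assumptions.
import librosa
import numpy as np

wav, sr = librosa.load("dataset/SSB0005/SSB00050001.wav", sr=22050)
mel = librosa.feature.melspectrogram(
    y=wav, sr=sr, n_fft=1024, hop_length=256, win_length=1024,
    n_mels=80, fmin=0, fmax=8000, power=1.0,
)
log_mel = np.log(np.clip(mel, 1e-5, None))  # log-compress, avoid log(0)
np.save("SSB00050001_mel.npy", log_mel)
```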

Training

Run CUBLAS_WORKSPACE_CONFIG=:4096:8 python train.py to train the network. The CUBLAS_WORKSPACE_CONFIG=:4096:8 environment variable is only necessary because torch.use_deterministic_algorithms(True) is used.
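For context, this is a minimal sketch of the determinism setup that makes the environment variable necessary; train.py's actual configuration may include more (e.g. seeding data loaders).

```python
# A minimal reproducibility sketch; train.py's actual setup may differ.
import os
import torch

# cuBLAS requires this setting for deterministic GEMM kernels on CUDA >= 10.2
# when torch.use_deterministic_algorithms(True) is active.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

torch.manual_seed(0)
torch.use_deterministic_algorithms(True)
```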

Literature

  • LightSpeech: LightSpeech has demonstrated that CNN architectures can achieve similar performance to transformers with reduced computational overhead.
  • BigVGAN Vocoder: BigVGAN is also implemented for comparison. However, HiFi-GAN tends to be faster.
  • Pitch estimation: Many FastSpeech implementations use DIO + StoneMask, but these perform significantly worse than neural-network-based approaches. Here I use PENN, the current state of the art (see the sketch after this list).
  • Objective Metrics: Instead of looking only at the mel spectrogram loss, we employ UTMOS for MOS estimation and Whisper for the Character Error Rate (CER). The best parameters are selected based on speech quality (MOS), intelligibility (CER), and validation loss. I have found that MOS alone correlates only weakly with perceived speech quality; this paper came to the same conclusion.
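The following is a minimal sketch of pitch extraction with PENN, based on the penn package's documented interface; the hop size and frequency range here are illustrative and not necessarily the values used by preprocess.py.

```python
# A minimal PENN pitch-extraction sketch; hopsize/fmin/fmax are illustrative.
import torchaudio
import penn

audio, sample_rate = torchaudio.load("dataset/SSB0005/SSB00050001.wav")

pitch, periodicity = penn.from_audio(
    audio,
    sample_rate,
    hopsize=0.01,   # seconds between pitch frames
    fmin=50.0,      # lowest F0 considered (Hz)
    fmax=550.0,     # highest F0 considered (Hz)
    batch_size=2048,
    gpu=None,       # CPU; set to a device index to use CUDA
)
print(pitch.shape, periodicity.shape)
```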

Disclaimer

This Text-to-Speech (TTS) system is provided as-is, without any guarantees or warranty. By using this system, you agree that the developers hold no responsibility or liability for any harm or damages that may result from the use of the generated speech.

Responsibility for Generated Content

The developers of this TTS system are not responsible for the content generated by the system. Users are solely responsible for any speech generated using this tool and must ensure that their use complies with all applicable laws and regulations.

Use of Voice Data

All voice data used in this TTS system is the property of the original voice actors or respective owners. The use of this data is subject to the terms and conditions set by the original owners. Users must obtain appropriate permissions or licenses for any commercial use of the generated speech or underlying voice data.

Ethical Use

We encourage ethical use of this TTS system. The generated speech should not be used for any malicious activities, including, but not limited to, spreading misinformation, creating deepfakes, impersonation, or any other activities that could harm individuals or society.
