TTS (Text-To-Speech) model for study and research. This repository is mainly based on ming024/FastSpeech2, with some code modified or added. We use the AI-HUB Multi-Speaker-Speech dataset, the MLS (Multilingual LibriSpeech) dataset, and the LJSpeech dataset for training.
- AI-HUB: Multi-Speaker-Speech
  - Language: Korean 🇰🇷
  - sample_rate: 48kHz
- MLS (Multilingual LibriSpeech)
  - Language: German 🇩🇪
  - sample_rate: 16kHz
- LJSpeech
  - Language: English 🇺🇸
  - sample_rate: 22.05kHz
We trained the FastSpeech2 model on the following languages, embedding each language's phoneme set. We used the Montreal Forced Aligner tool to obtain the alignments between the utterances and the phoneme sequences, as described in the paper. As you can see below, we embedded the IPA phoneme set for each language.
🇰🇷 Korean
'b d dʑ e eː h i iː j k kʰ k̚ k͈ m n o oː p pʰ p̚ p͈ s sʰ s͈ t tɕ tɕʰ tɕ͈ tʰ t̚ t͈ u uː w x ç ŋ ɐ ɕʰ ɕ͈ ɛ ɛː ɡ ɣ ɥ ɦ ɨ ɨː ɭ ɰ ɲ ɸ ɾ ʌ ʌː ʎ ʝ β'
🇩🇪 German
a aj aw aː b c cʰ d eː f h iː j k kʰ l l̩ m m̩ n n̩ oː p pf pʰ s t ts tʃ tʰ uː v x yː z ç øː ŋ œ ɐ ɔ ɔʏ ə ɛ ɟ ɡ ɪ ɲ ʁ ʃ ʊ ʏ
🇺🇸 English(US)
a aj aw aː b bʲ c cʰ cʷ d dʒ dʲ d̪ e ej f fʲ fʷ h i iː j k kp kʰ kʷ l m mʲ m̩ n n̩ o ow p pʰ pʲ pʷ s t tʃ tʰ tʲ tʷ t̪ u uː v vʲ vʷ w z æ ç ð ŋ ɐ ɑ ɑː ɒ ɒː ɔ ɔj ə əw ɚ ɛ ɛː ɜ ɜː ɝ ɟ ɟʷ ɡ ɡb ɡʷ ɪ ɫ ɫ̩ ɲ ɹ ɾ ɾʲ ɾ̃ ʃ ʉ ʉː ʊ ʎ ʒ ʔ θ
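For illustration, here is a minimal sketch of how such a phoneme set can be embedded: each IPA symbol is mapped to an integer index and looked up in an `nn.Embedding` table that feeds the FastSpeech2 encoder. The symbols shown are only an excerpt of the Korean set above, and the special tokens, dimensions, and mapping are assumptions, not this repository's exact implementation.

```python
import torch
import torch.nn as nn

# Excerpt of the Korean IPA set above; the special tokens are hypothetical.
korean_phones = "b d dʑ e eː h i iː j k kʰ m n o oː".split()
symbols = ["<pad>", "<unk>"] + korean_phones
symbol_to_id = {s: i for i, s in enumerate(symbols)}

# Embedding table that would feed the FastSpeech2 encoder (dimension is illustrative).
phone_embedding = nn.Embedding(len(symbols), 256, padding_idx=symbol_to_id["<pad>"])

phones = ["k", "oː", "m"]
ids = torch.tensor([[symbol_to_id.get(p, symbol_to_id["<unk>"]) for p in phones]])
print(phone_embedding(ids).shape)  # torch.Size([1, 3, 256])
```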
If you want to see the training status, you can check it here. You can check the following things via the wandb link above:
- Listen to the samples (= label speech & predicted speech)
  - Available only in some experiments in 🇩🇪 German.
  - You can hear the samples at:
    - the Tables section in the dashboard
    - the Hidden Panels section at the bottom of each run's board (a minimal logging sketch is shown after the experiment list below)
  - Experiments with samples available: T4MR_4_x_summed_1800k_BS1, T4MR_6_x_summed_max_ ..., T4MR_10_rs_22k_msl_ ..., T4MR_15_hate_energy_ ..., T4MR_17_basic_but_bs64.
  - We wanted to keep collecting samples during training in 🇰🇷 Korean as well, but couldn't (we had to save storage).
- Training / eval mel-spectrograms
  - T27_Hope_that_u_can_replace_that_with_sth_better
    - FastSpeech2 + PostNet | 🇺🇸 English | Single Speaker
    - Batch_Size: 64, Epochs: 800
  - T25_END_Game
    - FastSpeech2 + PostNet | 🇰🇷 Korean | Single Speaker: 8505
    - Resampled (from 48kHz to 22.05kHz)
    - Batch_Size: 64, Epochs: 600
  - T24_Thank_you_Mobius
    - FastSpeech2 | 🇰🇷 Korean | Single Speaker: 8505
    - Non-Stationary Noise Reduction -> Resampled (from 48kHz to 22.05kHz)
    - Batch_Size: 64, Epochs: 600
  - T23_You_Just_Chosse_ur_Burden
    - FastSpeech2 | 🇰🇷 Korean | Single Speaker: 8505
    - Resampled (from 48kHz to 22.05kHz) -> Non-Stationary Noise Reduction
    - Batch_Size: 64, Epochs: 600
  - T22_Theres_No_comfort
    - FastSpeech2 | 🇰🇷 Korean | Single Speaker: 8505
    - Resampled (from 48kHz to 22.05kHz)
    - Batch_Size: 64, Epochs: 600
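As a rough illustration of the sample and mel-spectrogram logging described above, the snippet below shows one way to push label/predicted audio into a wandb Table and a mel-spectrogram image into a run's panels. The arrays, run name, step value, and log keys are placeholders, not the exact logging code used for these experiments.

```python
import numpy as np
import wandb

# Placeholder data standing in for real label/predicted waveforms and a mel-spectrogram.
label_wav = np.zeros(22050, dtype=np.float32)
pred_wav = np.zeros(22050, dtype=np.float32)
mel = np.random.rand(80, 400)

wandb.init(project="FastSpeech2", name="T4MR_4_x_summed_1800k_BS1")

# Audio samples go into a Table (shown in the dashboard's Tables section).
table = wandb.Table(columns=["step", "label_speech", "predicted_speech"])
table.add_data(1_800_000,
               wandb.Audio(label_wav, sample_rate=22050, caption="label"),
               wandb.Audio(pred_wav, sample_rate=22050, caption="predicted"))

# Mel-spectrograms can be logged as images and appear in each run's panels.
wandb.log({"samples": table, "eval/mel_spectrogram": wandb.Image(mel)})
wandb.finish()
```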
- 🤗 accelerate allows easy multi-GPU training: trained on 2 x NVIDIA GeForce RTX 4090 GPUs.
- torchmalloc.py and 🌈 colorama can show your resource usage in real time (during training) like the example below.
- 🔇 noisereduce is available when you run preprocessor.py (see the sketch after this list).
  - Non-Stationary Noise Reduction
  - prop_decrease can avoid data distortion. (0.0 ~ 1.0)
- wandb is used instead of Tensorboard. wandb is compatible with 🤗 accelerate and with 🔥 pytorch.
- 🔥 [Pytorch-Hub] NVIDIA/HiFi-GAN is used as the vocoder.
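For example, a non-stationary noise-reduction pass with noisereduce might look like the sketch below; the file paths, sampling rate, and the prop_decrease=0.8 value are placeholders (the preprocessing script may use different settings).

```python
import librosa
import noisereduce as nr
import soundfile as sf

# Load a raw recording (path and sampling rate are placeholders).
wav, sr = librosa.load("raw_data/8505/utterance.wav", sr=22050)

# Non-stationary noise reduction; prop_decrease < 1.0 removes only part of the
# estimated noise, which helps avoid distorting the speech (valid range 0.0 ~ 1.0).
denoised = nr.reduce_noise(y=wav, sr=sr, stationary=False, prop_decrease=0.8)

sf.write("denoised.wav", denoised, sr)
```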
Running preprocess.py gives you the pitch, energy, duration, and phones from the TextGrid files.
python preprocess.py config/LibriTTS/preprocess.yaml
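As a rough sketch of the duration/phone part of this step (assuming the common FastSpeech2-style preprocessing; the path, tier name, and STFT settings are placeholders, not necessarily what this repository uses), a TextGrid produced by the Montreal Forced Aligner can be read with the tgt library like this:

```python
import tgt

sampling_rate, hop_length = 22050, 256  # illustrative STFT settings

# Read an MFA TextGrid and pull out the phone tier (path is a placeholder).
textgrid = tgt.io.read_textgrid("preprocessed_data/TextGrid/8505/utterance.TextGrid")
phone_tier = textgrid.get_tier_by_name("phones")

phones, durations = [], []
for interval in phone_tier._objects:
    phones.append(interval.text)
    # Phone duration measured in mel frames.
    start = int(round(interval.start_time * sampling_rate / hop_length))
    end = int(round(interval.end_time * sampling_rate / hop_length))
    durations.append(end - start)

print(phones)
print(durations)
```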
First, you should log in to wandb with your token key in the CLI.
wandb login --relogin '##### Token Key #######'
Next, you can set up your training environment with the following command.
accelerate config
With this command, you can start training.
accelerate launch train.py --n_epochs 990 --save_epochs 50 --synthesis_logging_epochs 30 --try_name T4_MoRrgetda
Also, you can train your TTS model on specific GPUs with this command.
CUDA_VISIBLE_DEVICES=0,3 accelerate launch train.py --n_epochs 990 --save_epochs 50 --synthesis_logging_epochs 30 --try_name T4_MoRrgetda
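Under the hood, `accelerate launch` expects a training loop written against the 🤗 accelerate API; a minimal sketch of that pattern (with a placeholder model and data, not the actual train.py) looks like this:

```python
import torch
from accelerate import Accelerator
from torch.utils.data import DataLoader, TensorDataset

# Placeholder model and data standing in for FastSpeech2 and the real dataset.
model = torch.nn.Linear(80, 80)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loader = DataLoader(TensorDataset(torch.randn(64, 80), torch.randn(64, 80)), batch_size=8)

accelerator = Accelerator()  # reads the settings chosen via `accelerate config`
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for epoch in range(2):
    for mel_in, mel_target in loader:
        loss = torch.nn.functional.mse_loss(model(mel_in), mel_target)
        accelerator.backward(loss)  # handles gradient sync across GPUs
        optimizer.step()
        optimizer.zero_grad()
```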
You can synthesize speech from the CLI with this command:
python synthesize.py --raw_texts <Text to synthesize to speech> --restore_step 53100
Also, you can check this Jupyter notebook when you want to synthesize.