TTS (Text-To-Speech) model for study and research. This repository is mainly based on ming024/FastSpeech2, with some code modified or added. We use the AI-HUB Multi-Speaker-Speech dataset, the MLS (Multilingual LibriSpeech) dataset, and the LJSpeech dataset for training.
- AI-HUB: Multi-Speaker-Speech
  - Language: Korean 🇰🇷
  - sample_rate: 48kHz
- MLS (Multilingual LibriSpeech)
  - Language: German 🇩🇪
  - sample_rate: 16kHz
- LJSpeech
  - Language: English 🇺🇸
  - sample_rate: 22.05kHz
We trained the FastSpeech2 model on the following languages, embedding each language's phoneme set. We used the Montreal Forced Aligner tool to obtain the alignments between the utterances and the phoneme sequences, as described in the paper. As you can see below, we embedded the IPA phoneme set for each language.
🇰🇷 Korean
'b d dʑ e eː h i iː j k kʰ k̚ k͈ m n o oː p pʰ p̚ p͈ s sʰ s͈ t tɕ tɕʰ tɕ͈ tʰ t̚ t͈ u uː w x ç ŋ ɐ ɕʰ ɕ͈ ɛ ɛː ɡ ɣ ɥ ɦ ɨ ɨː ɭ ɰ ɲ ɸ ɾ ʌ ʌː ʎ ʝ β'
🇩🇪 German
a aj aw aː b c cʰ d eː f h iː j k kʰ l l̩ m m̩ n n̩ oː p pf pʰ s t ts tʃ tʰ uː v x yː z ç øː ŋ œ ɐ ɔ ɔʏ ə ɛ ɟ ɡ ɪ ɲ ʁ ʃ ʊ ʏ
🇺🇸 English(US)
a aj aw aː b bʲ c cʰ cʷ d dʒ dʲ d̪ e ej f fʲ fʷ h i iː j k kp kʰ kʷ l m mʲ m̩ n n̩ o ow p pʰ pʲ pʷ s t tʃ tʰ tʲ tʷ t̪ u uː v vʲ vʷ w z æ ç ð ŋ ɐ ɑ ɑː ɒ ɒː ɔ ɔj ə əw ɚ ɛ ɛː ɜ ɜː ɝ ɟ ɟʷ ɡ ɡb ɡʷ ɪ ɫ ɫ̩ ɲ ɹ ɾ ɾʲ ɾ̃ ʃ ʉ ʉː ʊ ʎ ʒ ʔ θ
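For illustration, here is a minimal sketch of how such a phoneme set can be embedded: each IPA symbol is mapped to an integer index and looked up in an `nn.Embedding` table that feeds the FastSpeech2 encoder. The symbols shown are only an excerpt of the Korean set above, and the special tokens, dimensions, and mapping are assumptions, not this repository's exact implementation.

```python
import torch
import torch.nn as nn

# Excerpt of the Korean IPA set above; the special tokens are hypothetical.
korean_phones = "b d dʑ e eː h i iː j k kʰ m n o oː".split()
symbols = ["<pad>", "<unk>"] + korean_phones
symbol_to_id = {s: i for i, s in enumerate(symbols)}

# Embedding table that would feed the FastSpeech2 encoder (dimension is illustrative).
phone_embedding = nn.Embedding(len(symbols), 256, padding_idx=symbol_to_id["<pad>"])

phones = ["k", "oː", "m"]
ids = torch.tensor([[symbol_to_id.get(p, symbol_to_id["<unk>"]) for p in phones]])
print(phone_embedding(ids).shape)  # torch.Size([1, 3, 256])
```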
If you want to see the training status, you can check it here. You can check the following things via the wandb link above:
- Listen to the samples (= label speech & predicted speech)
  - Available only in some experiments in 🇩🇪 German.
  - You can hear the samples at:
    - the Tables section in the dashboard
    - the Hidden Panels section at the bottom of each run's board (a minimal logging sketch is shown after the experiment list below)
  - Experiments with samples available: T4MR_4_x_summed_1800k_BS1, T4MR_6_x_summed_max_ ..., T4MR_10_rs_22k_msl_ ..., T4MR_15_hate_energy_ ..., T4MR_17_basic_but_bs64.
  - We wanted to keep collecting samples during training in 🇰🇷 Korean as well, but couldn't (we had to save storage).
- Training / eval mel-spectrograms
  - T27_Hope_that_u_can_replace_that_with_sth_better
    - FastSpeech2 + PostNet | 🇺🇸 English | Single Speaker
    - Batch_Size: 64, Epochs: 800
  - T25_END_Game
    - FastSpeech2 + PostNet | 🇰🇷 Korean | Single Speaker: 8505
    - Resampled (from 48kHz to 22.05kHz)
    - Batch_Size: 64, Epochs: 600
  - T24_Thank_you_Mobius
    - FastSpeech2 | 🇰🇷 Korean | Single Speaker: 8505
    - Non-Stationary Noise Reduction -> Resampled (from 48kHz to 22.05kHz)
    - Batch_Size: 64, Epochs: 600
  - T23_You_Just_Chosse_ur_Burden
    - FastSpeech2 | 🇰🇷 Korean | Single Speaker: 8505
    - Resampled (from 48kHz to 22.05kHz) -> Non-Stationary Noise Reduction
    - Batch_Size: 64, Epochs: 600
  - T22_Theres_No_comfort
    - FastSpeech2 | 🇰🇷 Korean | Single Speaker: 8505
    - Resampled (from 48kHz to 22.05kHz)
    - Batch_Size: 64, Epochs: 600
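As a rough illustration of the sample and mel-spectrogram logging described above, the snippet below shows one way to push label/predicted audio into a wandb Table and a mel-spectrogram image into a run's panels. The arrays, run name, step value, and log keys are placeholders, not the exact logging code used for these experiments.

```python
import numpy as np
import wandb

# Placeholder data standing in for real label/predicted waveforms and a mel-spectrogram.
label_wav = np.zeros(22050, dtype=np.float32)
pred_wav = np.zeros(22050, dtype=np.float32)
mel = np.random.rand(80, 400)

wandb.init(project="FastSpeech2", name="T4MR_4_x_summed_1800k_BS1")

# Audio samples go into a Table (shown in the dashboard's Tables section).
table = wandb.Table(columns=["step", "label_speech", "predicted_speech"])
table.add_data(1_800_000,
               wandb.Audio(label_wav, sample_rate=22050, caption="label"),
               wandb.Audio(pred_wav, sample_rate=22050, caption="predicted"))

# Mel-spectrograms can be logged as images and appear in each run's panels.
wandb.log({"samples": table, "eval/mel_spectrogram": wandb.Image(mel)})
wandb.finish()
```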
- 🤗 accelerate allows easy multi-GPU training: trained on 2 x NVIDIA GeForce RTX 4090 GPUs.
- torchmalloc.py and 🌈 colorama can show your resource usage in real time (during training) like the example below.
- 🔇 noisereduce is available when you run preprocessor.py (see the sketch after this list).
  - Non-Stationary Noise Reduction
  - prop_decrease can avoid data distortion. (0.0 ~ 1.0)
- wandb is used instead of Tensorboard. wandb is compatible with 🤗 accelerate and with 🔥 pytorch.
- 🔥 [Pytorch-Hub] NVIDIA/HiFi-GAN is used as the vocoder.
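For example, a non-stationary noise-reduction pass with noisereduce might look like the sketch below; the file paths, sampling rate, and the prop_decrease=0.8 value are placeholders (the preprocessing script may use different settings).

```python
import librosa
import noisereduce as nr
import soundfile as sf

# Load a raw recording (path and sampling rate are placeholders).
wav, sr = librosa.load("raw_data/8505/utterance.wav", sr=22050)

# Non-stationary noise reduction; prop_decrease < 1.0 removes only part of the
# estimated noise, which helps avoid distorting the speech (valid range 0.0 ~ 1.0).
denoised = nr.reduce_noise(y=wav, sr=sr, stationary=False, prop_decrease=0.8)

sf.write("denoised.wav", denoised, sr)
```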
Running preprocess.py gives you the pitch, energy, duration, and phones from the TextGrid files.
python preprocess.py config/LibriTTS/preprocess.yaml
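As a rough sketch of the duration/phone part of this step (assuming the common FastSpeech2-style preprocessing; the path, tier name, and STFT settings are placeholders, not necessarily what this repository uses), a TextGrid produced by the Montreal Forced Aligner can be read with the tgt library like this:

```python
import tgt

sampling_rate, hop_length = 22050, 256  # illustrative STFT settings

# Read an MFA TextGrid and pull out the phone tier (path is a placeholder).
textgrid = tgt.io.read_textgrid("preprocessed_data/TextGrid/8505/utterance.TextGrid")
phone_tier = textgrid.get_tier_by_name("phones")

phones, durations = [], []
for interval in phone_tier._objects:
    phones.append(interval.text)
    # Phone duration measured in mel frames.
    start = int(round(interval.start_time * sampling_rate / hop_length))
    end = int(round(interval.end_time * sampling_rate / hop_length))
    durations.append(end - start)

print(phones)
print(durations)
```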
First, you should log in to wandb with your token key in the CLI.
wandb login --relogin '##### Token Key #######'
Next, you can set up your training environment with the following command.
accelerate config
With this command, you can start training.
accelerate launch train.py --n_epochs 990 --save_epochs 50 --synthesis_logging_epochs 30 --try_name T4_MoRrgetda
Also, you can train your TTS model on specific GPUs with this command.
CUDA_VISIBLE_DEVICES=0,3 accelerate launch train.py --n_epochs 990 --save_epochs 50 --synthesis_logging_epochs 30 --try_name T4_MoRrgetda
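Under the hood, `accelerate launch` expects a training loop written against the 🤗 accelerate API; a minimal sketch of that pattern (with a placeholder model and data, not the actual train.py) looks like this:

```python
import torch
from accelerate import Accelerator
from torch.utils.data import DataLoader, TensorDataset

# Placeholder model and data standing in for FastSpeech2 and the real dataset.
model = torch.nn.Linear(80, 80)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loader = DataLoader(TensorDataset(torch.randn(64, 80), torch.randn(64, 80)), batch_size=8)

accelerator = Accelerator()  # reads the settings chosen via `accelerate config`
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for epoch in range(2):
    for mel_in, mel_target in loader:
        loss = torch.nn.functional.mse_loss(model(mel_in), mel_target)
        accelerator.backward(loss)  # handles gradient sync across GPUs
        optimizer.step()
        optimizer.zero_grad()
```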
You can synthesize speech from the CLI with this command:
python synthesize.py --raw_texts <Text to synthesize to speech> --restore_step 53100
Also, you can check this Jupyter notebook when you want to synthesize.