FastSpeech2

A TTS (= Text-To-Speech) model for study and research. This repository is mainly based on ming024/FastSpeech2, with some code modified or added. We use the AI-HUB Multi-Speaker-Speech dataset and the MLS (= Multilingual LibriSpeech) dataset for training.

Dataset

Languages

We trained the FastSpeech2 model on the following languages, embedding each language's phoneme set. We used the Montreal Forced Aligner (MFA) tool to obtain the alignments between the utterances and the phoneme sequences, as described in the paper. As you can see, we embedded the IPA phoneme sets listed below (a minimal embedding sketch follows the list).

🇰🇷 Korean
Korean MFA dictionary v2.0.0a : b d dʑ e eː h i iː j k kʰ k̚ k͈ m n o oː p pʰ p̚ p͈ s sʰ s͈ t tɕ tɕʰ tɕ͈ tʰ t̚ t͈ u uː w x ç ŋ ɐ ɕʰ ɕ͈ ɛ ɛː ɡ ɣ ɥ ɦ ɨ ɨː ɭ ɰ ɲ ɸ ɾ ʌ ʌː ʎ ʝ β
🇩🇪 German
German MFA dictionary v2.0.0a : a aj aw aː b c cʰ d eː f h iː j k kʰ l l̩ m m̩ n n̩ oː p pf pʰ s t ts tʃ tʰ uː v x yː z ç øː ŋ œ ɐ ɔ ɔʏ ə ɛ ɟ ɡ ɪ ɲ ʁ ʃ ʊ ʏ
🇺🇸 English(US)
English MFA dictionary v2.2.1 : a aj aw aː b bʲ c cʰ cʷ d dʒ dʲ d̪ e ej f fʲ fʷ h i iː j k kp kʰ kʷ l m mʲ m̩ n n̩ o ow p pʰ pʲ pʷ s t tʃ tʰ tʲ tʷ t̪ u uː v vʲ vʷ w z æ ç ð ŋ ɐ ɑ ɑː ɒ ɒː ɔ ɔj ə əw ɚ ɛ ɛː ɜ ɜː ɝ ɟ ɟʷ ɡ ɡb ɡʷ ɪ ɫ ɫ̩ ɲ ɹ ɾ ɾʲ ɾ̃ ʃ ʉ ʉː ʊ ʎ ʒ ʔ θ
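
Each phoneme set above is mapped to integer IDs and embedded by the model. Here is a minimal sketch of the idea, with a truncated symbol list and an illustrative embedding size rather than the repo's exact configuration:

```python
# Sketch: how an IPA phoneme set like the ones above becomes embedding IDs.
# The symbol list is truncated and the 256-dim size is illustrative.
import torch

korean_phones = "b d dʑ e eː h i iː j k kʰ".split()   # truncated for brevity
symbols = ["<pad>", "<unk>"] + korean_phones
phone_to_id = {p: i for i, p in enumerate(symbols)}

embedding = torch.nn.Embedding(len(symbols), 256, padding_idx=0)
ids = torch.tensor([[phone_to_id.get(p, 1) for p in ["h", "e", "j"]]])
vectors = embedding(ids)   # shape (1, 3, 256), fed to the FastSpeech2 encoder
```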

wandb

If you want to see the training status, you can check it here. You can find the following at the wandb link:

  • Listen to the samples (label speech & predicted speech; see the logging sketch after this list)
    • Available only in some experiments in 🇩🇪 German.
      • You can hear samples at:
        • the Tables section in the dashboard
        • the Hidden Panels section at the bottom of each run's board.
      • Runs with samples available to listen to:
        • T4MR_4_x_summed_1800k_BS1, T4MR_6_x_summed_max_ ..., T4MR_10_rs_22k_msl_ ...,
          T4MR_15_hate_energy_ ..., T4MR_17_basic_but_bs64.
    • We wanted to keep collecting samples during training in 🇰🇷 Korean, but couldn't (we had to conserve storage).
  • Training / eval mel-spectrograms
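
For reference, audio samples appear in those dashboards because they are logged with wandb.Audio. Here is a minimal sketch of that kind of logging, with placeholder arrays and run names rather than the repo's exact code:

```python
# Sketch: logging label / predicted audio pairs to wandb so they show up
# in the dashboard's Tables / panels. Names and arrays are placeholders.
import numpy as np
import wandb

wandb.init(project="FastSpeech2", name="T4MR_17_basic_but_bs64")

label_wav = np.zeros(22050, dtype=np.float32)      # stand-in: ground-truth audio
predicted_wav = np.zeros(22050, dtype=np.float32)  # stand-in: vocoder output

wandb.log({
    "samples/label": wandb.Audio(label_wav, sample_rate=22050),
    "samples/predicted": wandb.Audio(predicted_wav, sample_rate=22050),
})
```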

Recent Experiments

  • T27_Hope_that_u_can_replace_that_with_sth_better
    • FastSpeech2 + PostNet | 🇺🇸 English | Single_Speaker
    • Batch_Size: 64
    • Epochs: 800
  • T25_END_Game
    • FastSpeech2 + PostNet | 🇰🇷 Korean | Single_Speaker: 8505
    • Resampled (from 48kHz to 22.05kHz)
    • Batch_Size: 64
    • Epochs: 600
  • T24_Thank_you_Mobius
    • FastSpeech2 | 🇰🇷 Korean | Single_Speaker: 8505
    • Non-Stationary Noise Reduction -> Resampled (from 48kHz to 22.05kHz)
    • Batch_Size: 64
    • Epochs: 600
  • T23_You_Just_Chosse_ur_Burden
    • FastSpeech2 | 🇰🇷 Korean | Single_Speaker: 8505
    • Resampled (from 48kHz to 22.05kHz) -> Non-Stationary Noise Reduction
    • Batch_Size: 64
    • Epochs: 600
  • T22_Theres_No_comfort
    • FastSpeech2 | 🇰🇷 Korean | Single_Speaker: 8505
    • Resampled (from 48kHz to 22.05kHz)
    • Batch_Size: 64
    • Epochs: 600

Features (Differences?)

  • 🤗accelerate allows easy multi-GPU training: trained on 2 x NVIDIA GeForce RTX 4090 GPUs.
  • torchmalloc.py and 🌈colorama can show your resource usage in real time during training, as in the example below:
    (example screenshot)
  • 🔇noisereduce is available when you run preprocessor.py (see the sketch after this list).
    • Non-Stationary Noise Reduction
    • prop_decrease (0.0 ~ 1.0) can help avoid data distortion.
  • wandb is used instead of TensorBoard; wandb is compatible with 🤗accelerate and with 🔥PyTorch.
  • 🔥[PyTorch Hub] NVIDIA/HiFi-GAN is used as the vocoder.
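
As a concrete example of the 🔇noisereduce feature above, here is a minimal sketch of non-stationary noise reduction with a partial prop_decrease; the file paths and the 0.8 value are illustrative, not preprocessor.py's defaults:

```python
# Sketch: non-stationary noise reduction with a partial prop_decrease.
# Paths and the 0.8 value are illustrative, not preprocessor.py's defaults.
import librosa
import noisereduce as nr
import soundfile as sf

wav, sr = librosa.load("raw/8505_0001.wav", sr=22050)  # resample to 22.05 kHz

# stationary=False enables non-stationary noise estimation;
# prop_decrease < 1.0 removes noise only partially, limiting distortion.
denoised = nr.reduce_noise(y=wav, sr=sr, stationary=False, prop_decrease=0.8)

sf.write("denoised/8505_0001.wav", denoised, sr)
```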

Preprocess

This preprocess.py extracts pitch, energy, duration, and phonemes from the TextGrid files (a sketch of the duration step follows the command below).

python preprocess.py config/LibriTTS/preprocess.yaml 
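
Under the hood, the duration part of this step reads each utterance's MFA TextGrid. Here is a minimal sketch using the tgt library, assuming a "phones" tier and illustrative sample-rate/hop-length values (the real ones come from the preprocess config):

```python
# Sketch of the per-utterance duration extraction preprocess.py performs,
# assuming an MFA TextGrid with a "phones" tier. The sample rate and hop
# length are illustrative; the real values come from preprocess.yaml.
import numpy as np
import tgt

SAMPLE_RATE, HOP_LENGTH = 22050, 256

def phones_and_durations(textgrid_path):
    tier = tgt.io.read_textgrid(textgrid_path).get_tier_by_name("phones")
    phones, durations = [], []
    for interval in tier.intervals:
        phones.append(interval.text)
        # Duration in mel frames, so it aligns with pitch/energy features.
        start = int(np.round(interval.start_time * SAMPLE_RATE / HOP_LENGTH))
        end = int(np.round(interval.end_time * SAMPLE_RATE / HOP_LENGTH))
        durations.append(end - start)
    return phones, durations
```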

Train

First, you should log in to wandb with your token key in the CLI.

wandb login --relogin '##### Token Key #######'

Next, you can set up your training environment with the following command.

accelerate config

With this command, you can start training.

accelerate launch train.py --n_epochs 990 --save_epochs 50 --synthesis_logging_epochs 30 --try_name T4_MoRrgetda

You can also select which GPUs to train on with CUDA_VISIBLE_DEVICES.

CUDA_VISIBLE_DEVICES=0,3 accelerate launch train.py --n_epochs 990 --save_epochs 50 --synthesis_logging_epochs 30 --try_name T4_MoRrgetda
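
For reference, the multi-GPU support in train.py comes from 🤗accelerate's Accelerator. Here is a minimal sketch of that pattern with a stand-in model, data, and hyperparameters; it is not the repo's actual training loop:

```python
# Sketch of the accelerate + wandb training pattern train.py follows.
# Model, data, and hyperparameters are stand-ins, not the repo's code.
import torch
import wandb
from accelerate import Accelerator

accelerator = Accelerator()
if accelerator.is_main_process:
    wandb.init(project="FastSpeech2", name="T4_MoRrgetda")

model = torch.nn.Linear(80, 80)   # stand-in for the FastSpeech2 model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
dataset = torch.utils.data.TensorDataset(torch.randn(64, 80))
loader = torch.utils.data.DataLoader(dataset, batch_size=8)

# prepare() moves everything to the right device(s) and wraps for DDP.
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for epoch in range(3):            # --n_epochs in the real script
    for (batch,) in loader:
        loss = (model(batch) - batch).pow(2).mean()   # stand-in loss
        accelerator.backward(loss)                    # replaces loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    if accelerator.is_main_process:
        wandb.log({"train/loss": loss.item(), "epoch": epoch})
```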

Synthesize

You can synthesize speech in the CLI with this command:

python synthesize.py --raw_texts <Text to synthesize to speech> --restore_step 53100

You can also check this Jupyter notebook when you try to synthesize.
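
The final waveform comes from the HiFi-GAN vocoder. Here is a minimal sketch of vocoding a predicted mel-spectrogram; the torch.hub entrypoint and its return values follow NVIDIA's published Hub example (treat them as an assumption), and the random mel stands in for synthesize.py's output:

```python
# Sketch: vocoding a predicted mel-spectrogram with NVIDIA's HiFi-GAN from
# PyTorch Hub. The entrypoint and return values follow NVIDIA's published
# Hub example (an assumption); the random mel is a stand-in for the
# FastSpeech2 output.
import torch

hifigan, vocoder_train_setup, denoiser = torch.hub.load(
    "NVIDIA/DeepLearningExamples:torchhub", "nvidia_hifigan"
)
hifigan.eval()

mel = torch.randn(1, 80, 200)        # stand-in: (batch, n_mels, frames)
with torch.no_grad():
    audio = hifigan(mel).squeeze(1)  # waveform at the model's sample rate

print(audio.shape)  # about frames * hop_length samples per utterance
```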

References