This repo is mainly based on 🍵 Matcha-TTS Official Github, with some of the code modified. The purpose of this repository is to study 🍵 Matcha-TTS: A fast TTS architecture with conditional flow matching.
- 🔥 Pytorch, ⚡ Lightning, 🐉🐲🐲 hydra-core
- 🤗 wandb
While studying 🍵 Matcha-TTS Official Github, I modified some of the code to make it simpler.
- Logger: 🤗 wandb (more comfortable and easier to access)
- Vocoder: 🔥 [Pytorch-Hub] NVIDIA/HiFi-GAN
- Alignment: resemble-ai/monotonic_align
I ran this code and synthesized the example speech in my VS Code environment. I moved the Jupyter notebooks to Colab to share the synthesized examples below:
- 😲 trim_butterfly_16.ipynb | BS: 16 | NVIDIA GeForce RTX 4080 (x1)
- 😵 decent_meadow_46.ipynb | BS: 32 | LR: 2e-5 | NVIDIA GeForce RTX 4080 (x1)
- ⭐ wobbly_frog_53.ipynb | BS: 16 | bf16-mixed | NVIDIA GeForce RTX 4080 (x1)
- 👽 wobbly_serenity_54.ipynb | BS: 32 | bf16-mixed | NVIDIA GeForce RTX 4080 (x1)
- 😣 jolly_frog_47.ipynb | BS: 32 | LR: 2e-5 | NVIDIA GeForce RTX 4090 (x1)
- 🌟 eager_frost_50.ipynb | BS: 16 | NVIDIA GeForce RTX 4090 (x1)
- ✨ royal_grass_56.ipynb | BS: 16 | bf16-mixed | NVIDIA GeForce RTX 4090 (x1)
import gc

import torch
import lightning as L


class MemoryCleanupCallback(L.Callback):
    def on_train_epoch_end(self, trainer, pl_module):
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
        gc.collect()

    def on_validation_epoch_end(self, trainer, pl_module):
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
        gc.collect()
monotonic_align is not included in requirements.txt. You can install MAS (Monotonic Alignment Search) with the following command:
pip install git+https://github.com/resemble-ai/monotonic_align.git
You can then import it like this:
import monotonic_align
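To make the idea behind MAS concrete, here is a minimal NumPy sketch of the dynamic program it solves: given a matrix of log-likelihoods between text tokens and mel frames, find the best monotonic path that assigns every frame to exactly one token. This is only an illustration and not the implementation in resemble-ai/monotonic_align; the function name `monotonic_alignment_search` and the toy input are assumptions made for this example.

```python
import numpy as np


def monotonic_alignment_search(value: np.ndarray) -> np.ndarray:
    """Illustrative MAS sketch (hypothetical helper, not the library API).

    value: (t_x, t_y) log-likelihood of mel frame y given text token x.
    Returns a 0/1 matrix of the same shape marking the best monotonic path.
    """
    t_x, t_y = value.shape
    if t_x > t_y:
        raise ValueError("need at least one mel frame per text token")

    # q[x, y] = best cumulative score of a path ending at token x, frame y.
    q = np.full((t_x, t_y), -np.inf)
    q[0, 0] = value[0, 0]
    for y in range(1, t_y):
        for x in range(min(t_x, y + 1)):
            stay = q[x, y - 1]                             # same token as previous frame
            move = q[x - 1, y - 1] if x > 0 else -np.inf   # advance to the next token
            q[x, y] = value[x, y] + max(stay, move)

    # Backtrack from the last token and last frame.
    path = np.zeros((t_x, t_y), dtype=np.int64)
    x = t_x - 1
    for y in range(t_y - 1, -1, -1):
        path[x, y] = 1
        if y > 0 and x > 0 and (x == y or q[x - 1, y - 1] > q[x, y - 1]):
            x -= 1
    return path


# Toy example: 3 tokens, 4 frames; token 1 should cover frames 1 and 2.
toy_value = np.array(
    [
        [0.0, -5.0, -5.0, -5.0],
        [-5.0, 0.0, 0.0, -5.0],
        [-5.0, -5.0, -5.0, 0.0],
    ]
)
toy_path = monotonic_alignment_search(toy_value)
print(toy_path)
```

Summing the returned path over the frame axis gives the per-token durations, which is the quantity a duration predictor is typically trained on.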
Dataset: LJSpeech
- Language: English 🇺🇸
- Speaker: Single Speaker
- sample_rate: 22.05kHz
Let's assume we are training with LJ Speech
- Download the dataset from here, extract it to your own data dir (in my case: data/LJSpeech/ljs/LJSpeech-1.1), and prepare the file lists to point to the extracted data, as in item 5 of the setup of the NVIDIA Tacotron 2 repo.
- Go to configs/data/ljspeech.yaml and change:
train_filelist_path: data/filelists/ljs_audio_text_train_filelist.txt
valid_filelist_path: data/filelists/ljs_audio_text_val_filelist.txt
- Generate normalisation statistics with the yaml file of dataset configuration
PYTHONPATH=. python matcha/utils/generate_data_statistics.py
- Update these values in configs/data/ljspeech.yaml under the data_statistics key.
data_statistics: # Computed for ljspeech dataset
  mel_mean: -5.5170512199401855
  mel_std: 2.0643811225891113
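As a rough illustration of what these two numbers mean, the sketch below computes a global mean and standard deviation over all mel-spectrogram values and uses them to normalize and denormalize a spectrogram. It is a sketch under assumptions, not the code in matcha/utils/generate_data_statistics.py: the helper names and the random stand-in data are hypothetical.

```python
import numpy as np


def compute_mel_statistics(mel_batches):
    """Global mean/std over every value of every mel (hypothetical helper).

    mel_batches: iterable of arrays shaped (n_mels, frames).
    """
    total_sum = 0.0
    total_sq_sum = 0.0
    total_count = 0
    for mel in mel_batches:
        total_sum += float(mel.sum())
        total_sq_sum += float((mel ** 2).sum())
        total_count += mel.size
    mean = total_sum / total_count
    # Var[x] = E[x^2] - E[x]^2, clamped at zero against rounding error.
    variance = max(total_sq_sum / total_count - mean ** 2, 0.0)
    return mean, variance ** 0.5


def normalize(mel, mel_mean, mel_std):
    """Shift and scale a mel so the dataset has zero mean and unit variance."""
    return (mel - mel_mean) / mel_std


def denormalize(mel, mel_mean, mel_std):
    """Undo normalize(), e.g. before passing a predicted mel to the vocoder."""
    return mel * mel_std + mel_mean


# Random stand-in data; real values would come from the LJSpeech mels.
rng = np.random.default_rng(0)
mels = [rng.normal(-5.5, 2.0, size=(80, n)) for n in (120, 95, 140)]
mel_mean, mel_std = compute_mel_statistics(mels)
normalized = normalize(mels[0], mel_mean, mel_std)
restored = denormalize(normalized, mel_mean, mel_std)
```

Accumulating sums rather than concatenating all spectrograms keeps memory use constant, which matters when the statistics are computed over a whole dataset.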
Now you are ready to train!
First, log in to wandb with your API token from the CLI.
wandb login --relogin '<your-wandb-api-token>'
And you can run training with one of these commands:
PYTHONPATH=. python matcha/train.py experiment=ljspeech
# If you run training on a certain gpu_id:
CUDA_VISIBLE_DEVICES=2 PYTHONPATH=. python matcha/train.py experiment=ljspeech
You can also run multi-GPU training:
# If you run multi-gpu training:
CUDA_VISIBLE_DEVICES=2,3 PYTHONPATH=. python matcha/train.py experiment=ljspeech trainer.devices=[0,1]
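The multi-GPU command uses trainer.devices=[0,1] even though the physical GPUs are 2 and 3, because CUDA renumbers the visible devices starting from 0. The small sketch below shows that mapping. The helper name is hypothetical, and it assumes CUDA_VISIBLE_DEVICES holds integer ids rather than GPU UUIDs.

```python
from typing import List


def physical_gpu_ids(cuda_visible_devices: str, logical_devices: List[int]) -> List[int]:
    """Map logical device indices (as seen by the trainer) to physical GPU ids.

    Hypothetical helper for illustration; assumes integer ids in the env value.
    """
    visible = [int(tok) for tok in cuda_visible_devices.split(",") if tok.strip()]
    return [visible[i] for i in logical_devices]


# CUDA_VISIBLE_DEVICES=2,3 with trainer.devices=[0,1]
mapped = physical_gpu_ids("2,3", [0, 1])
print(mapped)  # [2, 3]
```

The same reasoning explains the single-GPU command: with CUDA_VISIBLE_DEVICES=2, the process sees only one device, and it appears as index 0.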
I ran this code and synthesized the example speech in my VS Code environment, then moved the Jupyter notebook to Colab to share the synthesized examples.
- You can check more samples in the Colab notebooks (Examples) above.
- You can refer to the code for synthesis:
matcha/utils/synthesize_utils.py
- This notebook is also in this GitHub repo:
notebooks/Samples_wobbly_frog_53.ipynb
CLI Arguments: Will be updated!
- 🍵 Paper: Matcha-TTS: A fast TTS architecture with conditional flow matching
  └ Github: 🍵 Matcha-TTS Official Github
- MAS (Monotonic Alignment Search)
  └ resemble-ai/monotonic_align
- 🔥 Pytorch
- ⚡ Lightning
- 🐉🐲🐲 hydra-core