elu-lab/matcha_tts_e

🍵 matcha_tts_e

This repo is mainly based on the :octocat: 🍵 Matcha-TTS Official Github, with some of the code modified. Its purpose is to study 🍵 Matcha-TTS: A fast TTS architecture with conditional flow matching.

Trying to make the code simpler

While studying the :octocat: 🍵 Matcha-TTS Official Github, I modified some of the code to make it simpler.

Colab notebooks (Examples):

I ran this code and synthesized the example speech in my VS Code environment, then moved the Jupyter notebooks to Colab to share the synthesized samples below:

  • 😲 trim_butterfly_16.ipynb Open In Colab | BS: 16 | NVIDIA GeForce RTX 4080 (x1)
  • 😵 decent_meadow_46.ipynb Open In Colab | BS: 32 | LR: 2e-5 | NVIDIA GeForce RTX 4080 (x1)
  • wobbly_frog_53.ipynb Open In Colab | BS: 16 | bf16-mixed | NVIDIA GeForce RTX 4080 (x1)
  • 👽 wobbly_serenity_54.ipynb Open In Colab | BS: 32 | bf16-mixed | NVIDIA GeForce RTX 4080 (x1)
  • 😣 jolly_frog_47.ipynb Open In Colab | BS: 32 | LR: 2e-5 | NVIDIA GeForce RTX 4090 (x1)
  • 🌟 eager_frost_50.ipynb Open In Colab | BS: 16 | NVIDIA GeForce RTX 4090 (x1)
  • royal_grass_56.ipynb Open In Colab | BS: 16 | bf16-mixed | NVIDIA GeForce RTX 4090 (x1)

MemoryCleanupCallback Added!

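import gc
import torch
import lightning as L

class MemoryCleanupCallback(L.Callback):
    """Free cached GPU memory and run garbage collection after each epoch."""

    def on_train_epoch_end(self, trainer, pl_module):
        # Release unused cached GPU memory, then collect Python garbage.
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
        gc.collect()

    def on_validation_epoch_end(self, trainer, pl_module):
        # Same cleanup after every validation epoch.
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
        gc.collect()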

MAS (Monotonic Alignment Search) Installation

MAS is not included in requirements.txt. You can install it with the command below:

:octocat: resemble-ai/monotonic_align

pip install git+https://github.com/resemble-ai/monotonic_align.git

After installation, you can import it like this:

import monotonic_align
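
For intuition about what MAS computes, here is a minimal pure-Python sketch of the dynamic programming behind it. It is an illustration only, not the implementation in the monotonic_align package; the function name `monotonic_alignment_search` and the toy scores are hypothetical. It assumes a score matrix `log_p[i][j]` (text token i against mel frame j) and returns, for each frame, the token it is aligned to, with the path starting at the first token, ending at the last, and never moving backwards or skipping a token.

```python
def monotonic_alignment_search(log_p):
    """Return the best monotonic token index for every mel frame.

    log_p[i][j] is the score of aligning text token i to mel frame j.
    """
    n_text, n_mel = len(log_p), len(log_p[0])
    if n_mel < n_text:
        raise ValueError("need at least one mel frame per text token")

    neg_inf = float("-inf")
    # q[i][j]: best total score of any valid path that ends at (token i, frame j)
    q = [[neg_inf] * n_mel for _ in range(n_text)]
    q[0][0] = log_p[0][0]
    for j in range(1, n_mel):
        for i in range(min(j + 1, n_text)):
            stay = q[i][j - 1]                            # same token as previous frame
            move = q[i - 1][j - 1] if i > 0 else neg_inf  # advance by one token
            q[i][j] = log_p[i][j] + max(stay, move)

    # Backtrack from the last token at the last frame.
    path = [0] * n_mel
    i = n_text - 1
    for j in range(n_mel - 1, -1, -1):
        path[j] = i
        if j > 0 and i > 0 and q[i - 1][j - 1] >= q[i][j - 1]:
            i -= 1
    return path


# Toy example: 3 text tokens, 4 mel frames.
toy_scores = [
    [0.0, -0.5, -5.0, -5.0],
    [-5.0, -1.0, 0.0, -5.0],
    [-5.0, -5.0, -1.0, 0.0],
]
alignment = monotonic_alignment_search(toy_scores)
durations = [alignment.count(i) for i in range(len(toy_scores))]
print(alignment)  # [0, 0, 1, 2]
print(durations)  # [2, 1, 1]
```

Counting how many frames each token receives gives per-token durations, which is how an alignment like this can supervise a duration predictor.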

Dataset: LJSpeech

  • Language: English 🇺🇸
  • Speaker: Single Speaker
  • sample_rate: 22.05kHz

Compute mel_mean, mel_std of ljspeech dataset

Let's assume we are training with LJ Speech.

  1. Download the dataset from here, extract it to your own data directory (in my case: data/LJSpeech/ljs/LJSpeech-1.1), and prepare the file lists to point to the extracted data, as in item 5 of the setup in the NVIDIA Tacotron 2 repo.
  2. Go to configs/data/ljspeech.yaml and change the file list paths:
train_filelist_path: data/filelists/ljs_audio_text_train_filelist.txt
valid_filelist_path: data/filelists/ljs_audio_text_val_filelist.txt
  3. Generate normalisation statistics using the dataset configuration yaml file:
PYTHONPATH=. python matcha/utils/generate_data_statistics.py
  4. Update these values in configs/data/ljspeech.yaml under the data_statistics key:
data_statistics:  # Computed for ljspeech dataset 
  mel_mean: -5.5170512199401855
  mel_std: 2.0643811225891113
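
Since mel_mean and mel_std are single scalars, they can be read as a global mean and standard deviation over every mel bin of every frame in the training set. The sketch below shows that computation in pure Python on a toy input. It is an illustration under that assumption, not the code of matcha/utils/generate_data_statistics.py; the names `mel_statistics` and `normalise` and the toy values are hypothetical.

```python
import math


def mel_statistics(mels):
    """Global mean and std over all values.

    mels: list of utterances, each a list of frames, each a list of mel-bin values.
    """
    total = 0.0
    total_sq = 0.0
    count = 0
    for utterance in mels:
        for frame in utterance:
            for value in frame:
                total += value
                total_sq += value * value
                count += 1
    mean = total / count
    # Var[x] = E[x^2] - E[x]^2; clamp at 0 to guard against rounding error.
    std = math.sqrt(max(0.0, total_sq / count - mean * mean))
    return mean, std


def normalise(frame, mean, std):
    """Apply the statistics to one frame: (x - mean) / std."""
    return [(value - mean) / std for value in frame]


# Toy data: one utterance, two frames, two mel bins per frame.
toy_mels = [[[-6.0, -5.0], [-4.0, -7.0]]]
mel_mean, mel_std = mel_statistics(toy_mels)
print(mel_mean, mel_std)
```

Accumulating a running sum and sum of squares means the whole dataset never has to be held in memory at once.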

Now you are ready to train!

Train

First, log in to wandb from the CLI with your API token:

wandb login --relogin '<your-wandb-api-token>'

And you can run training with one of these commands:

PYTHONPATH=. python matcha/train.py experiment=ljspeech
# To run training on a specific GPU (here gpu_id 2):
CUDA_VISIBLE_DEVICES=2 PYTHONPATH=. python matcha/train.py experiment=ljspeech

You can also run multi-GPU training:

# If you run multi-gpu training:
CUDA_VISIBLE_DEVICES=2,3 PYTHONPATH=. python matcha/train.py experiment=ljspeech trainer.devices=[0,1]
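
Note that the command combines CUDA_VISIBLE_DEVICES=2,3 with trainer.devices=[0,1]. CUDA renumbers the visible GPUs starting from 0 inside the process, so the trainer's indices refer to positions in the visible list, not to physical GPU ids. The small sketch below illustrates that mapping; the helper `visible_device_map` is hypothetical and assumes the variable holds comma-separated integer ids.

```python
def visible_device_map(cuda_visible_devices):
    """Map in-process (logical) GPU indices to physical GPU ids.

    Assumes a comma-separated list of integer ids, e.g. "2,3".
    """
    physical_ids = [int(tok) for tok in cuda_visible_devices.split(",") if tok.strip()]
    return dict(enumerate(physical_ids))


mapping = visible_device_map("2,3")
print(mapping)  # {0: 2, 1: 3}
```

With this mapping, trainer.devices=[0,1] selects physical GPUs 2 and 3.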

Synthesize

I ran this code and synthesized the example speech in my VS Code environment, then moved the Jupyter notebook to Colab to share the synthesized samples.

Samples_wobbly_frog_53.ipynb Open In Colab
