MelNet: PyTorch implementation

This project is a PyTorch implementation of S. Vasquez and M. Lewis, "MelNet: A Generative Model of Audio in the Frequency Domain", which aims to generate high-fidelity audio samples by using two-dimensional time-frequency representations (spectrograms) in conjunction with a highly expressive probabilistic model and a multiscale generation procedure.

For a more complete description of MelNet, this implementation, and the results achieved than the one in this README, please see the Report or the Presentation of the project.

Table of contents

  1. Results
  2. Project Structure
  3. Setup
    1. Setup with Anaconda
    2. Setup with Docker
  4. Usage
  5. Description of MelNet
  6. Notes

Results

Context

Each model's tiers were trained individually on an NVIDIA GTX 2080 with 8 GB of VRAM. The size of each tier is defined by the number of layers and the hidden size (the RNN hidden state size), with hidden size being the parameter that affects the size of a tier the most. To fit the models in GPU memory, the hidden size had to be reduced to 200 (from the 512 used in the original paper).

To describe the architectures of the trained models compactly, from now on we will follow this pattern: d(dataset)_t(number of tiers)_l(layers per tier, separated by dots)_hd(hidden size)_gmm(number of GMM mixture components).
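As an illustration, the small helper below (not part of the repository, just a reading aid) decodes one of these names:

import re

def parse_architecture(name: str) -> dict:
    """Decode a name like 'dpodcast_t6_l12.5.4.3.2.2_hd200_gmm10'."""
    m = re.fullmatch(
        r"d(?P<dataset>.+?)_t(?P<tiers>\d+)_l(?P<layers>[\d.]+)_hd(?P<hidden>\d+)_gmm(?P<gmm>\d+)",
        name,
    )
    return {
        "dataset": m.group("dataset"),
        "tiers": int(m.group("tiers")),
        "layers_per_tier": [int(n) for n in m.group("layers").split(".")],
        "hidden_size": int(m.group("hidden")),
        "gmm_components": int(m.group("gmm")),
    }

print(parse_architecture("dpodcast_t6_l12.5.4.3.2.2_hd200_gmm10"))
# {'dataset': 'podcast', 'tiers': 6, 'layers_per_tier': [12, 5, 4, 3, 2, 2],
#  'hidden_size': 200, 'gmm_components': 10}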

Initial Results

The first model was trained on a Podcast dataset (a dataset of dialogue-based podcast audio), following the architecture used by Vasquez and Lewis in MelNet for unconditional speech generation on Blizzard (Table 1), but with a hidden size of 200 instead of 512 due to memory constraints.

Spectrogram viewed at different stages, generated by the initial architecture: dpodcast_t6_l12.5.4.3.2.2_hd200_gmm10. The wav file can be found here.

The Upsampling Layers appear to be able to add detail to the spectrogram generated by previous tiers, but the initial tier was not able to dictate a coherent high-level structure.

Experiments with Upsampling Layers Only

To see how much impact the initial tier has on the final spectrogram, we modified the synthesis algorithm. In the normal synthesis algorithm, the first tier unconditionally generates a low-resolution spectrogram and the upsampling tiers add detail. In the modified synthesis algorithm, the output of the first tier is replaced by an item from the dataset (a real low-resolution spectrogram) and only the upsampling layers are used to add detail (see the sketch below).

Spectrogram viewed at different stages, generated using a real low-resolution spectrogram. Architecture: dljspeech_t6_l0.7.6.5.4.4_hd200_gmm10. The first tier has no layers because it was not used. The wav file can be found here.
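A minimal sketch of the difference between the two synthesis variants, assuming stand-in tier functions and simple interleaving (the repository's actual synthesis code lives in src/synthesis.py and src/model/):

import torch

def fake_upsampling_tier(x: torch.Tensor) -> torch.Tensor:
    # Stand-in for a trained upsampling tier: in MelNet it would generate the new
    # rows/columns conditioned on the spectrogram produced so far.
    return torch.randn_like(x)

def interleave(coarse: torch.Tensor, detail: torch.Tensor, axis: int) -> torch.Tensor:
    # Merge the existing spectrogram and the newly generated detail along one axis.
    merged = torch.stack([coarse, detail], dim=axis + 1)
    shape = list(coarse.shape)
    shape[axis] *= 2
    return merged.reshape(shape)

# Normal synthesis: the first tier unconditionally generates a low-resolution spectrogram.
spectrogram = torch.randn(4, 32)  # stand-in for the first tier's output (freq x time)
# Modified synthesis: start instead from a real low-resolution spectrogram of the dataset,
# e.g. spectrogram = load_low_resolution_item()  # hypothetical helper

# The upsampling tiers then add detail, alternating between the time and frequency axes.
for axis in (1, 0, 1, 0, 1):  # five upsampling tiers of a six-tier model
    detail = fake_upsampling_tier(spectrogram)
    spectrogram = interleave(spectrogram, detail, axis)

print(spectrogram.shape)  # the resolution doubles along one axis at every tier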

Experiments with First Tier

Knowing that the first tier is important because it dictates the high-level structure of the spectrogram, we compared the impact that the hidden size and the number of layers have on the loss of the first tier.

First tier: hidden size vs. loss and number of layers vs. loss.
Hidden size vs. loss plot: architecture dljspeech_t6_l14.5.4.3.2.2_hdX_gmm10.
Number of layers vs. loss plot: architecture dljspeech_t6_lX.5.4.3.2.2_hd64_gmm10.

From these results, we can conclude that the hidden size has a greater impact on the loss than the number of layers.
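This matches the earlier observation that hidden size is the parameter that affects the size of a tier the most. As a rough illustration (using a plain GRU as a stand-in for the actual RNN stacks in src/model/), the recurrent weight matrices grow quadratically with the hidden size but only linearly with the number of layers:

import torch.nn as nn

def n_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

for hidden_size in (64, 200, 512):
    rnn = nn.GRU(input_size=1, hidden_size=hidden_size, num_layers=2)
    print(f"hidden_size={hidden_size:3d}  params={n_params(rnn):,}")

for num_layers in (2, 4, 8):
    rnn = nn.GRU(input_size=1, hidden_size=64, num_layers=num_layers)
    print(f"num_layers={num_layers}  params={n_params(rnn):,}")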

Final Result

Finally, after seeing that the size of the tiers has an impact on the quality of the generated spectrograms, we trained the biggest model we could.

Spectrogram viewed at different stages.

Architecture: dljspeech_t6_l12.7.6.5.4.4_hd200_gmm10. The wav file can be found here.

Project Structure

SpeechGeneration-MelNet
|-- assets/           <- images used in the README.md, Report and Presentation
|-- datasets/         <- original data used to train the model (you have to create it)
|
|-- logs/             <- (you have to create it or it will be created automatically)
|   |-- general/      <- logs for general training
|   `-- tensorboard/  <- logs for displaying in tensorboard
|
|-- models/
|   |-- chkpt/     <- model weights for different runs stored in pickle format. It also stores
|   |                 the training parameters. (you have to create it or it'll be created automatically)
|   `-- params/    <- description of the parameters to train and do speech synthesis according 
|                     to the paper and the dataset
|
|-- notebooks/     <- Jupyter Notebooks explaining different parts of the data pipeline 
|                     or the model
|
|-- results/       <- spectrograms, waveforms and wav files synthesized from trained models
|
|-- src/                  <- source code for use in this project
|   |-- data/             <- scripts to download and load the data
|   |-- dataprocessing/   <- scripts to turn raw data into processed data to input to the model
|   |-- model/            <- scripts of the model presented in the paper
|   |-- utils/            <- scripts that are useful in the project
|   |-- synthesis.py      <- main program to perform synthesis (see Usage section)
|   `-- train.py          <- main program to perform training (see Usage section)
|
|-- utils/                <- files for running the model in Docker
|
|-- environment.yml      <- file for reproducing the environment (created with anaconda)
`-- Makefile             <- file with commands to run the project without effort

Setup

Setup with Anaconda

  1. Download and install Anaconda
  2. Clone the source code with git:
git clone https://github.com/jgarciapueyo/MelNet-SpeechGeneration
cd MelNet-SpeechGeneration
  3. Prepare the environment with Anaconda and activate it
conda env create --name melnet -f environment.yml
conda activate melnet

Setup with Docker

  1. Download and install Docker
  2. Clone the source code with git:
git clone https://github.com/jgarciapueyo/MelNet-SpeechGeneration
cd MelNet-SpeechGeneration
  3. Create the image
docker build -f utils/docker/Dockerfile -t melnet .

or make build-container

  4. Run the container
docker run -it --rm --gpus all --mount src="$(pwd)",target=/app,type=bind melnet

or make run-container

Usage

Training

  1. Set up the project following the instructions in Setup.
  2. Download a dataset in the folder datasets/. As an example, the datasets Librispeech and LJSpeech can be downloaded by running
make data-librispeech
make data-ljspeech
  3. Create a YAML file for training a complete model (several tiers) on a dataset. This YAML file contains information about the architecture of the model and other parameters needed to transform the audio waveforms of the dataset into mel spectrograms (a sketch of this transform appears at the end of this subsection). More information about the structure of the training YAML files can be found here.
  4. Train your MelNet model
python src/train.py -p models/params/{dataset}/{training_config_file}.yml

More options for training a model, such as resuming training or specifying which tier(s) of the model to train, can be found here.
Training a model automatically creates a log file logs/general/{modelarchitecture}/{tier}_{timestamp}, a folder for tensorboard files logs/tensorboard/{modelarchitecture}_{timestamp}_{tier}/, and a folder for the weights of the model models/chkpt/{modelarchitecture}/ (each tier is stored separately in pickle format using the .pt file extension).
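As a reference for the audio-processing parameters mentioned in step 3, here is a minimal sketch of a waveform-to-mel-spectrogram transform using torchaudio; the path and parameter values below are illustrative, and the ones actually used by this project are defined in the YAML files under models/params/:

import torchaudio

# Illustrative path: any wav file from one of the downloaded datasets works.
waveform, sample_rate = torchaudio.load("datasets/LJSpeech-1.1/wavs/LJ001-0001.wav")

to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate,  # 22050 Hz for LJSpeech
    n_fft=1024,               # STFT window size (illustrative value)
    hop_length=256,           # STFT hop size (illustrative value)
    n_mels=80,                # number of mel bands (illustrative value)
)
mel = to_mel(waveform)        # shape: (channels, n_mels, time frames)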

Synthesis

After having trained a complete model (all the tiers), you can unconditionally generate spectrograms:

  1. Create a YAML file for performing synthesis. This YAML will contain information about the path to the weights of the tiers and the output folder. More information about the synthesis YAML file can be found here.
  2. Generate spectrograms
python src/synthesis.py -p models/params/{dataset}/{training_config_file}.yml -s models/params/{dataset}/{synthesis_config_file}.yml -t {timesteps_spectrogram}

When synthesizing, the spectrogram is stored as an image and as a tensor in the path specified in the synthesis YAML file. It is also saved in tensorboard format in logs/tensorboard/{modelarchitecture}_{timestamp}_{tier}/, and a log file logs/general/{modelarchitecture}/synthesis_{timestamp} is created.
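A stored spectrogram tensor can be listened to by inverting it back to a waveform. The repository already provides synthesized wav files under results/; the snippet below is only a generic sketch using torchaudio's InverseMelScale and Griffin-Lim, with illustrative paths and parameters that must match the ones used for training:

import torch
import torchaudio

mel = torch.load("results/spectrogram.pt")  # hypothetical path to a stored tensor (n_mels x frames)
# If the spectrogram is stored in log scale, undo it first, e.g. mel = mel.exp()

n_fft, hop_length, sample_rate = 1024, 256, 22050  # must match the training YAML
to_linear = torchaudio.transforms.InverseMelScale(
    n_stft=n_fft // 2 + 1, n_mels=mel.size(0), sample_rate=sample_rate)
griffin_lim = torchaudio.transforms.GriffinLim(n_fft=n_fft, hop_length=hop_length)

waveform = griffin_lim(to_linear(mel))          # 1-D tensor with the audio samples
torchaudio.save("results/sample.wav", waveform.unsqueeze(0), sample_rate)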

Description of MelNet

A description complementing the original paper can be found in the Report of the project, with additional figures that can help in understanding the MelNet architecture.
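In short, MelNet treats a spectrogram as a two-dimensional sequence and models each time-frequency bin autoregressively with a Gaussian mixture whose parameters are emitted by the network (the gmm number in the architecture names above is the number of mixture components). The following minimal sketch, assuming a network that has already produced per-bin mixture weights, means, and standard deviations, shows how such parameters are turned into sampled bin values (the actual implementation lives in src/model/):

import torch

def sample_from_gmm(pi: torch.Tensor, mu: torch.Tensor, sigma: torch.Tensor) -> torch.Tensor:
    # pi, mu, sigma have shape (..., K); returns one sample per bin with shape (...).
    k = torch.distributions.Categorical(probs=pi).sample()          # pick a mixture component
    mu_k = torch.gather(mu, -1, k.unsqueeze(-1)).squeeze(-1)        # its mean
    sigma_k = torch.gather(sigma, -1, k.unsqueeze(-1)).squeeze(-1)  # its standard deviation
    return torch.normal(mu_k, sigma_k)                              # draw the bin value

# Illustrative shapes: 80 mel bands, 200 frames, 10 mixture components (as in gmm10).
pi = torch.softmax(torch.randn(80, 200, 10), dim=-1)
mu = torch.randn(80, 200, 10)
sigma = torch.rand(80, 200, 10) + 0.1
spectrogram = sample_from_gmm(pi, mu, sigma)  # one sampled (80 x 200) spectrogram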

Notes

This project is part of the course DD2465 Advanced, Individual Course in Computer Science during my studies at KTH.
