# Dataset
TTS requires a dataset of audio files and their corresponding transcripts. An advantage of the TTS architecture is that you do not need to align the transcriptions with the audio, since the model learns the alignment itself. Whether you use an open dataset or your own recordings, a good dataset has the following properties:
- Gaussian-like distribution of clip and text lengths. Plot the distribution of clip lengths and check that it covers enough short and long voice clips.
- Mistake-free. Remove any wrong or broken files. Check the annotations, and compare transcript and audio lengths.
- Noise-free. Background noise might prevent your model from learning a good alignment. Even if it learns the alignment, the final result might be much worse than you expected.
- Compatible tone and pitch among voice clips. For instance, if you are using an audiobook recording for your project, it might contain impersonations of different characters in the book. This kind of divergence between instances degrades model performance.
- Good phoneme coverage. Make sure your dataset covers a good portion of the phonemes, diphones, and, in some languages, triphones. If phoneme coverage is low, the model might have a hard time pronouncing novel or difficult words.
- Naturalness of recordings. Your model learns whatever is in your dataset. Therefore, if you want a voice that sounds as natural as possible, with all its tone and pitch variation (for instance, across different punctuation), your dataset should exhibit similar attributes.
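As a quick first pass on the length-distribution point above, you can summarize your clip durations with simple statistics before plotting. The sketch below uses only the standard library; the duration values are hypothetical stand-ins for what you would read from your own audio files (e.g. with the `wave` module).

```python
import statistics

def duration_stats(durations):
    """Summarize clip durations (in seconds) to judge the length distribution."""
    return {
        "count": len(durations),
        "mean": statistics.mean(durations),
        "stdev": statistics.pstdev(durations),
        "min": min(durations),
        "max": max(durations),
    }

# Hypothetical durations in seconds, as read from your dataset.
durations = [1.2, 2.5, 3.1, 4.0, 2.8, 6.5, 1.0, 3.3]
stats = duration_stats(durations)
print(stats)
```

If the minimum and maximum hug the mean, the dataset probably lacks short or long clips; plot a histogram to confirm.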
If you want to use a bespoke dataset, you should perform a couple of quality checks before training. TTS provides a couple of notebooks (CheckSpectrograms, AnalyzeDataset) to expedite this part for you.
AnalyzeDataset checks the dataset distribution in terms of clip and transcript lengths. It is good for finding outlier instances (too long, short text but long voice clip, etc.) and removing them before training. Keep in mind that we want a good balance between long and short clips to prevent any kind of bias in training. If you have only short clips (1-3 seconds), your model might struggle with long sentences at inference time; if your instances are too long, it might not learn the alignment, or training might take too long.
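One cheap way to flag the "short text but long voice clip" outliers mentioned above is to look at the characters-per-second ratio of each clip. The thresholds below are arbitrary illustrative values, not ones used by the AnalyzeDataset notebook:

```python
def find_ratio_outliers(items, low=4.0, high=30.0):
    """Flag clips whose characters-per-second ratio is implausible,
    a typical sign of a bad annotation.
    `items` is a list of (transcript, duration_seconds) pairs."""
    outliers = []
    for text, dur in items:
        cps = len(text) / dur
        if cps < low or cps > high:
            outliers.append((text, dur, cps))
    return outliers

# Hypothetical examples: a plausible clip and a suspicious one.
items = [("hello there my friend", 1.5),  # ~14 chars/sec, plausible
         ("hi", 5.0)]                     # 0.4 chars/sec: short text, long clip
print(find_ratio_outliers(items))
```

Inspect the flagged clips by hand before deleting anything; some may just contain long pauses.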
CheckSpectrograms measures the noise level of the clips and helps you find good audio processing parameters. The noise level can be observed by inspecting spectrograms. If the spectrograms look cluttered, especially in silent parts, the dataset might not be a good candidate for a TTS project. If your voice clips have too much background noise, it is harder for your model to learn the alignment, and the final result might sound different from the voice you were given. If the spectrograms look good, the next step is to find a good set of audio processing parameters, defined in `config.json`. In the notebook, you can compare different sets of parameters and see the resynthesis results against the given ground truth. Find the parameters that give the best possible synthesis performance.
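If you want a rough look at the noise floor outside the notebook, a magnitude spectrogram is easy to compute by hand. This is a minimal NumPy sketch; the frame length and hop size here are arbitrary illustrative choices, not the values from `config.json`:

```python
import numpy as np

def magnitude_spectrogram(signal, frame_len=1024, hop=256):
    """Frame the signal, apply a Hann window, and take the magnitude
    of the real FFT of each frame: shape (n_frames, frame_len // 2 + 1)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

# Sanity check with a 440 Hz sine at 22050 Hz: energy should concentrate
# around bin 440 / 22050 * 1024 ≈ 20.
sr = 22050
t = np.arange(sr) / sr
spec = magnitude_spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape, spec.mean(axis=0).argmax())
```

On real clips, plot `20 * np.log10(spec.T + 1e-8)` and look at the silent regions: a clean recording shows mostly dark background there, while noise shows up as persistent energy across frequencies.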
Another important practical detail is the sampling rate of the voice clips. If your dataset has a very high sampling rate, it might cause slow data loading and consequently slow training. It is better to resample your dataset to around 16000-22050 Hz.
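In practice you would resample with a proper tool (librosa, sox, or ffmpeg) that applies anti-aliasing filters. Purely to illustrate what resampling does to the data, here is a naive linear-interpolation sketch; do not use it for production audio:

```python
import numpy as np

def resample_linear(signal, orig_sr, target_sr):
    """Naive linear-interpolation resampler, for illustration only:
    it has no anti-aliasing filter, so use librosa/sox/ffmpeg in practice."""
    duration = len(signal) / orig_sr
    n_out = int(round(duration * target_sr))
    old_t = np.arange(len(signal)) / orig_sr
    new_t = np.arange(n_out) / target_sr
    return np.interp(new_t, old_t, signal)

x = np.random.randn(48000)            # one second of audio at 48 kHz
y = resample_linear(x, 48000, 22050)  # the same second at 22.05 kHz
print(len(y))
```

Resampling once offline, before training, avoids paying the conversion cost in every data-loading pass.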
Before training, you need to make sure that the data loader (TTSDataset.py) is compatible with your dataset. In general, it should be enough for any dataset, unless you have something specific to consider; in that case, take a look at it and edit it as necessary.
If the data loader looks fine, you then need to implement a preprocessor for your own dataset in dataset/preprocess.py. There are already example preprocessors for most of the open datasets.
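A preprocessor is typically a small function that parses your metadata file into (text, wav_path) pairs. The exact signature expected by dataset/preprocess.py may differ from this, so check the existing examples there; the LJSpeech-style format below (`wav_id|transcript` per line, audio under `wavs/`) is only an illustration.

```python
import os
import tempfile

def my_dataset(root_path, meta_file="metadata.csv"):
    """Parse an LJSpeech-style metadata file into [text, wav_path] items."""
    items = []
    with open(os.path.join(root_path, meta_file), encoding="utf-8") as f:
        for line in f:
            wav_id, text = line.strip().split("|", 1)
            wav_path = os.path.join(root_path, "wavs", wav_id + ".wav")
            items.append([text, wav_path])
    return items

# Demo with a throwaway metadata file (hypothetical content).
root = tempfile.mkdtemp()
with open(os.path.join(root, "metadata.csv"), "w", encoding="utf-8") as f:
    f.write("clip_0001|Hello world.\n")
print(my_dataset(root))
```

Whatever format your metadata uses, the goal is the same: return one entry per clip that the data loader can consume.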
`config.json` is the configuration file for everything about your model and training. After you are comfortable with all the previous steps, fill in the dataset-related parameters in this file. We try to keep `config.json` as descriptive as possible; follow the comments there for a better understanding of the parameters.
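The dataset-related part of the file usually looks something like the fragment below. The exact key names depend on your TTS version, so treat these as illustrative and defer to the comments in your own `config.json`:

```json
{
  "dataset": "ljspeech",
  "data_path": "/path/to/LJSpeech-1.1/",
  "meta_file_train": "metadata_train.csv",
  "meta_file_val": "metadata_val.csv",
  "sample_rate": 22050
}
```

Make sure `sample_rate` matches the actual sampling rate of your (possibly resampled) clips, since a mismatch silently distorts the spectrograms.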