
Unable to replicate results #624

Closed
vishal16babu opened this issue Jan 6, 2021 · 4 comments

vishal16babu commented Jan 6, 2021

I tried replicating the results published at https://google.github.io/tacotron/publications/speaker_adaptation/, but what I get is quite different in terms of voice clarity and intonation.

Original voice:
https://github.com/CorentinJ/Real-Time-Voice-Cloning/blob/master/samples/p240_00000.mp3

Result mentioned at https://google.github.io/tacotron/publications/speaker_adaptation/, example 6:
https://google.github.io/tacotron/publications/speaker_adaptation/demos/synthesized/p240_00073.wav

Result obtained by running demo_cli.py:
https://drive.google.com/file/d/1AO6vrOnXtfVoCXCrGsESLKBs9KbsA1e8/view?usp=sharing
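
For reference, this is roughly the pipeline demo_cli.py runs with the default pretrained models (a simplified sketch; the model paths and synthesized text are placeholders, and the exact API may differ between repo versions):

```python
from pathlib import Path

import numpy as np
import soundfile as sf

from encoder import inference as encoder
from synthesizer.inference import Synthesizer
from vocoder import inference as vocoder

# Load the three pretrained models (default locations used by demo_cli.py; adjust if needed).
encoder.load_model(Path("encoder/saved_models/pretrained.pt"))
synthesizer = Synthesizer(Path("synthesizer/saved_models/pretrained/pretrained.pt"))
vocoder.load_model(Path("vocoder/saved_models/pretrained/pretrained.pt"))

# 1. Compute the speaker embedding from the reference utterance.
reference_wav = encoder.preprocess_wav(Path("samples/p240_00000.mp3"))
embed = encoder.embed_utterance(reference_wav)

# 2. Synthesize a mel spectrogram conditioned on that embedding.
texts = ["This is a test sentence for voice cloning."]
specs = synthesizer.synthesize_spectrograms(texts, [embed])

# 3. Vocode the spectrogram back to a waveform and save it.
generated_wav = vocoder.infer_waveform(specs[0])
generated_wav = np.pad(generated_wav, (0, synthesizer.sample_rate), mode="constant")
sf.write("cloned_p240.wav", generated_wav.astype(np.float32), synthesizer.sample_rate)
```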

@CorentinJ what am I missing here?

ghost commented Jan 7, 2021

Hi @vishal16babu, I take it you are using Corentin's pretrained models? Those were trained on LibriSpeech, so use LibriSpeech voices for a fair comparison with the examples.

If you want to match the VCTK results, you need to follow the procedure outlined in the paper. This means training the synthesizer and vocoder on the VCTK dataset. Here are my results when I do that using this repo with default settings. The cloned voice is comparable to Google's results, though their audio quality is much better.

Encoder: Pretrained (1.56M steps)
Synthesizer: VCTK (242k steps)
Vocoder: VCTK (300k steps)

P240 validation samples: validation.zip (492KB, 7 .wav files)

vishal16babu (Author) commented:

@blue-fish thanks a lot for the update. Are the models you used to generate these results uploaded somewhere? If not, could you add them at #400 (comment)? That would be very helpful.
I was under the impression that I could use one of these pretrained models to clone unseen voices (voices that are not part of the dataset the models were trained on), but the results are much worse in that case.
Results on audio that does not belong to any of the datasets:
https://drive.google.com/drive/folders/1zFqKqkdnvG9c2c-7i7x_XHuh0YfBcr_H?usp=sharing

ghost commented Jan 7, 2021

SV2TTS is not that good for cloning unseen voices. For better results, collect 10+ minutes of transcribed speech samples, make a dataset, and finetune the pretrained synthesizer model (see #437). A GPU is not required for finetuning.
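
To give an idea of what "make a dataset" involves: the exact procedure is in #437, but as an illustration only, if you reuse the repo's existing LibriSpeech preprocessing you would lay your recordings out like a tiny LibriSpeech subset (speaker/chapter folders plus a .trans.txt transcript). The speaker/chapter IDs, file names and paths below are made up for the example:

```python
from pathlib import Path
import shutil

# Illustrative only: arrange your own recordings like a tiny LibriSpeech subset so the
# repo's existing preprocessing can pick them up. IDs, paths and transcripts are made up.
datasets_root = Path("datasets")
chapter_dir = datasets_root / "LibriSpeech" / "train-clean-100" / "911" / "1"
chapter_dir.mkdir(parents=True, exist_ok=True)

# Map each utterance ID to its transcript (10+ minutes of speech in total is the goal).
transcripts = {
    "911-1-0000": "THE FIRST SENTENCE SPOKEN BY MY TARGET SPEAKER",
    "911-1-0001": "THE SECOND SENTENCE SPOKEN BY MY TARGET SPEAKER",
}

# Copy the recordings in as <utterance_id>.flac next to a <speaker>-<chapter>.trans.txt file.
for utt_id in transcripts:
    shutil.copy(f"my_recordings/{utt_id}.flac", chapter_dir / f"{utt_id}.flac")

with open(chapter_dir / "911-1.trans.txt", "w") as f:
    for utt_id, text in transcripts.items():
        f.write(f"{utt_id} {text}\n")
```

After that, the synthesizer preprocessing and training scripts are run on this folder as usual, starting from the pretrained checkpoint; again, see #437 for the details, including how transcripts and alignments are handled.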

The VCTK model has its own problems and does not make a good TTS or voice-cloning engine. I don't plan to release it. If you find this interesting and want to explore on your own, get a GPU and start with the tutorial on training models from scratch. After that, you'll have enough experience to train your own models on the VCTK dataset.

ghost commented Jan 14, 2021

Closing due to inactivity.

ghost closed this as completed on Jan 14, 2021.