
Unable to replicate results #624

Closed
vishal16babu opened this issue Jan 6, 2021 · 4 comments

vishal16babu commented Jan 6, 2021

I tried replicating the results published at https://google.github.io/tacotron/publications/speaker_adaptation/, but what I get is quite different in terms of voice clarity and intonation.

Original voice:
https://github.com/CorentinJ/Real-Time-Voice-Cloning/blob/master/samples/p240_00000.mp3

Result mentioned at https://google.github.io/tacotron/publications/speaker_adaptation/, example 6:
https://google.github.io/tacotron/publications/speaker_adaptation/demos/synthesized/p240_00073.wav

Result obtained by running demo_cli.py:
https://drive.google.com/file/d/1AO6vrOnXtfVoCXCrGsESLKBs9KbsA1e8/view?usp=sharing
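
For reference, this is roughly the pipeline demo_cli.py runs with the default pretrained models (a simplified sketch; the model paths and synthesized text are placeholders, and the exact API may differ between repo versions):

```python
from pathlib import Path

import numpy as np
import soundfile as sf

from encoder import inference as encoder
from synthesizer.inference import Synthesizer
from vocoder import inference as vocoder

# Load the three pretrained models (default locations used by demo_cli.py; adjust if needed).
encoder.load_model(Path("encoder/saved_models/pretrained.pt"))
synthesizer = Synthesizer(Path("synthesizer/saved_models/pretrained/pretrained.pt"))
vocoder.load_model(Path("vocoder/saved_models/pretrained/pretrained.pt"))

# 1. Compute the speaker embedding from the reference utterance.
reference_wav = encoder.preprocess_wav(Path("samples/p240_00000.mp3"))
embed = encoder.embed_utterance(reference_wav)

# 2. Synthesize a mel spectrogram conditioned on that embedding.
texts = ["This is a test sentence for voice cloning."]
specs = synthesizer.synthesize_spectrograms(texts, [embed])

# 3. Vocode the spectrogram back to a waveform and save it.
generated_wav = vocoder.infer_waveform(specs[0])
generated_wav = np.pad(generated_wav, (0, synthesizer.sample_rate), mode="constant")
sf.write("cloned_p240.wav", generated_wav.astype(np.float32), synthesizer.sample_rate)
```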

@CorentinJ what am I missing here?

ghost commented Jan 7, 2021

Hi @vishal16babu, I take it you are using Corentin's pretrained models? Those were trained on LibriSpeech, so use LibriSpeech voices for a fair comparison with the examples.

If you want to match the VCTK results, you need to follow the procedure outlined in the paper. This means training the synthesizer and vocoder on the VCTK dataset. Here are my results when I do that using this repo with default settings. The cloned voice is comparable to Google's results, though their audio quality is much better.

Encoder: Pretrained (1.56M steps)
Synthesizer: VCTK (242k steps)
Vocoder: VCTK (300k steps)

P240 validation samples: validation.zip (492KB, 7 .wav files)

vishal16babu (Author) commented:

@blue-fish thanks a lot for the update. Are the models you used to generate these results uploaded somewhere? If not, could you add them at #400 (comment)? That would be very helpful.
I was under the impression that I could use one of these pretrained models to clone unseen voices (voices that are not part of the dataset the models were trained on), but the results are much worse in that case.
Results on audio that does not belong to any of the datasets:
https://drive.google.com/drive/folders/1zFqKqkdnvG9c2c-7i7x_XHuh0YfBcr_H?usp=sharing

ghost commented Jan 7, 2021

SV2TTS is not that good for cloning unseen voices. For better results, collect 10+ minutes of transcribed speech samples, make a dataset, and finetune the pretrained synthesizer model (see #437). A GPU is not required for finetuning.
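
To give an idea of what "make a dataset" involves: the exact procedure is in #437, but as an illustration only, if you reuse the repo's existing LibriSpeech preprocessing you would lay your recordings out like a tiny LibriSpeech subset (speaker/chapter folders plus a .trans.txt transcript). The speaker/chapter IDs, file names and paths below are made up for the example:

```python
from pathlib import Path
import shutil

# Illustrative only: arrange your own recordings like a tiny LibriSpeech subset so the
# repo's existing preprocessing can pick them up. IDs, paths and transcripts are made up.
datasets_root = Path("datasets")
chapter_dir = datasets_root / "LibriSpeech" / "train-clean-100" / "911" / "1"
chapter_dir.mkdir(parents=True, exist_ok=True)

# Map each utterance ID to its transcript (10+ minutes of speech in total is the goal).
transcripts = {
    "911-1-0000": "THE FIRST SENTENCE SPOKEN BY MY TARGET SPEAKER",
    "911-1-0001": "THE SECOND SENTENCE SPOKEN BY MY TARGET SPEAKER",
}

# Copy the recordings in as <utterance_id>.flac next to a <speaker>-<chapter>.trans.txt file.
for utt_id in transcripts:
    shutil.copy(f"my_recordings/{utt_id}.flac", chapter_dir / f"{utt_id}.flac")

with open(chapter_dir / "911-1.trans.txt", "w") as f:
    for utt_id, text in transcripts.items():
        f.write(f"{utt_id} {text}\n")
```

After that, the synthesizer preprocessing and training scripts are run on this folder as usual, starting from the pretrained checkpoint; again, see #437 for the details, including how transcripts and alignments are handled.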

The VCTK model has its own problems and does not make a good TTS or voice-cloning engine. I don't plan to release it. If you find this interesting and want to explore on your own, get a GPU and start with the tutorial on training models from scratch. After that, you'll have enough experience to train your own models on the VCTK dataset.

ghost commented Jan 14, 2021

Closing due to inactivity.

ghost closed this as completed on Jan 14, 2021.