Unable to replicate results #624
Comments
Hi @vishal16babu, I take it you are using Corentin's pretrained models? Use LibriSpeech for a fair comparison with the examples. If you want to match the VCTK results, you need to follow the procedure outlined in the paper. This means training the synthesizer and vocoder on the VCTK dataset. Here are my results when I do that using this repo with default settings. The cloned voice is comparable to Google's results, though their audio quality is much better.

Encoder: pretrained (1.56M steps)
P240 validation samples: validation.zip (492 KB, 7 .wav files)
@blue-fish thanks a lot for the update. Are the models you used to generate these results uploaded somewhere? If not, can you add them at #400 (comment)? That would be very helpful.
SV2TTS is not that good at cloning unseen voices. For better results, collect 10+ minutes of transcribed speech samples, make a dataset, and finetune the pretrained synthesizer model (see #437). A GPU is not required for finetuning. The VCTK model has its own problems and does not make a good TTS or voice cloning engine, so I don't plan to release it. If you find this interesting and want to do your own exploration, get a GPU and start with the tutorial on training models from scratch. After that, you'll have enough experience to train your own models on the VCTK dataset.
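For anyone gauging how closely a finetuned or cloned voice matches the reference, one objective check is speaker-embedding similarity computed with this repo's encoder. Below is a minimal sketch, not part of the repo itself; the checkpoint path is the assumed default location and the audio file names are illustrative.

```python
from pathlib import Path

import numpy as np

from encoder import inference as encoder  # speaker encoder from this repo

# Assumed default location of the pretrained encoder weights.
encoder.load_model(Path("encoder/saved_models/pretrained.pt"))

def speaker_similarity(path_a: Path, path_b: Path) -> float:
    """Cosine similarity of the speaker embeddings of two utterances.

    embed_utterance() returns an L2-normalized embedding, so the dot
    product equals the cosine similarity (closer to 1.0 = more similar).
    """
    embed_a = encoder.embed_utterance(encoder.preprocess_wav(path_a))
    embed_b = encoder.embed_utterance(encoder.preprocess_wav(path_b))
    return float(np.dot(embed_a, embed_b))

# Illustrative file names: the reference recording vs. a cloned sample.
print(speaker_similarity(Path("samples/p240_00000.mp3"), Path("cloned_p240.wav")))
```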
Closing due to inactivity. |
I tried to replicate the results shown at https://google.github.io/tacotron/publications/speaker_adaptation/, but my results are quite different in terms of voice clarity and intonation.
Original voice:
https://github.com/CorentinJ/Real-Time-Voice-Cloning/blob/master/samples/p240_00000.mp3
Result mentioned at https://google.github.io/tacotron/publications/speaker_adaptation/, example 6:
https://google.github.io/tacotron/publications/speaker_adaptation/demos/synthesized/p240_00073.wav
Result obtained by running demo_cli.py:
https://drive.google.com/file/d/1AO6vrOnXtfVoCXCrGsESLKBs9KbsA1e8/view?usp=sharing
@CorentinJ what am I missing here?
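For context, the interactive steps in demo_cli.py boil down to a three-stage pipeline: embed the reference utterance with the encoder, condition the synthesizer on that embedding to produce a mel spectrogram, and vocode it to a waveform. A minimal sketch, assuming the pretrained checkpoints sit at their default paths (the paths and the prompt text are illustrative):

```python
from pathlib import Path

import numpy as np
import soundfile as sf

from encoder import inference as encoder
from synthesizer.inference import Synthesizer
from vocoder import inference as vocoder

# Assumed default checkpoint locations; adjust to wherever the
# pretrained models were extracted.
encoder.load_model(Path("encoder/saved_models/pretrained.pt"))
synthesizer = Synthesizer(Path("synthesizer/saved_models/pretrained/pretrained.pt"))
vocoder.load_model(Path("vocoder/saved_models/pretrained/pretrained.pt"))

# 1. Embed the reference utterance (the voice to clone).
ref_wav = encoder.preprocess_wav(Path("samples/p240_00000.mp3"))
embed = encoder.embed_utterance(ref_wav)

# 2. Synthesize a mel spectrogram conditioned on that embedding.
text = "This is an example sentence."  # illustrative prompt
spec = synthesizer.synthesize_spectrograms([text], [embed])[0]

# 3. Vocode the spectrogram to a waveform and save it.
generated_wav = vocoder.infer_waveform(spec)
sf.write("cloned.wav", generated_wav.astype(np.float32), synthesizer.sample_rate)
```

The embedding is the only input that carries the target voice; the synthesizer and vocoder weights stay fixed, which is why unseen voices that the encoder represents poorly tend to come out sounding generic.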