Update on maintaining this project #364

CorentinJ · 2020-06-19T12:44:23Z

We're one year after the initial publication of this project. I've been busy with both exams and work since, and it's only last week that I passed my last exam. During that year, I have received SO many messages from people asking for help in setting up the repo and I just had no time to allocate for any of that.
I kinda wished that the popularity of this repo would have died down, but new people keep coming in at a fairly constant rate.
I have no intentions to start developing on this repo again, but I hope I can answer some questions and possibly review some PRs. Use this issue to ask me questions and to bring light upon things that you believe need to be improved, and we'll see what can be done.

CorentinJ · 2020-06-19T12:53:46Z

First things first, the biggest issue for me with this project is the hecking tensorflow code. Tensorflow sucks, and it sucks just as much to install it let alone install an older version.

I believe it would lower the entry barrier for new users if the version of that package were to be upgraded. I've seen a PR for that, but it seems to cover only the Colab version. A PR for the entire repo would be appreciated.

Ideally, we'd replace all of the synthesizer code with pytorch code (there are several open source pytorch synthesizers out there), but that's a lot of work.

If anybody is willing to pick up on either of these things, let me know.

CorentinJ · 2020-06-19T12:56:05Z

Second thing: webrtcvad. That package is hell to install on windows. There are alternatives for noise removal out there. There's also the possibility of not using it at all, but for both LibriSpeech and LibriTTS I would recommend it.
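
One possible alternative, as a minimal sketch under assumptions: a plain energy gate that drops quiet frames. This is not what webrtcvad does internally, and the function name `trim_silence`, the 30 ms frame size and the -40 dB threshold are hypothetical choices for illustration only.

```python
import math


def trim_silence(samples, sample_rate, frame_ms=30, threshold_db=-40.0):
    """Keep only frames whose RMS level is at or above threshold_db.

    `samples` is a list of floats in [-1.0, 1.0]. The threshold is
    expressed in dB relative to full scale (an assumed convention).
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    kept = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        # A frame of pure zeros has no finite dB level.
        level_db = 20 * math.log10(rms) if rms > 0 else float("-inf")
        if level_db >= threshold_db:
            kept.extend(frame)
    return kept


# Toy signal: silence, a loud segment, silence (30 ms each at 16 kHz).
audio = [0.0] * 480 + [0.5] * 480 + [0.0] * 480
voiced = trim_silence(audio, sample_rate=16000)
```

An energy gate like this will also keep loud non-speech noise, which is one reason a proper voice activity detector is still preferable for datasets such as LibriSpeech and LibriTTS.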

ghost · 2020-06-19T22:32:44Z

I'd like #331 merged to enable CPU support by default. It also simplifies the install process for those with a goal of running demo_cli.py for evaluation purposes.

Some kind of API or improved CLI would be a worthwhile and easy enhancement for the community to pursue. Good usability will help keep this repo as the focal point for development of open-source SV2TTS. This is really neat stuff, many thanks for sharing your code and pre-trained models under a permissive license.

CorentinJ · 2020-06-20T08:20:10Z

I'll give a review to #331 tomorrow and probably will make some changes as well.

ghost · 2020-06-22T08:45:08Z

Thank you for reviewing #331. In response I have submitted #366 which addresses your comments and carefully removes all unnecessary changes from the PR. When you have time please review and merge that one instead.

ghost · 2020-06-24T23:57:30Z

Opened #375 to propose a workaround for webrtcvad.

JakubKoralewski · 2020-06-30T18:49:03Z

> I kinda wished that the popularity of this repo would have died down, but new people keep coming in at a fairly constant rate.

> Use this issue to ask me questions and to bring light upon things that you believe need to be improved, and we'll see what can be done.

> there are many other better open-source implementations of neural TTS out there, and new ones keep coming every day.

It would be awesome if you could point out some alternatives, maybe people would start using them instead. I'm not knowledgeable at all in this field so I don't know how to find anything on my own and how to compare which repos are good and which ones work best for what.

I think having a load and click free GUI app is the appeal of your software.

CorentinJ · 2020-06-30T19:54:46Z

> I think having a load and click free GUI app is the appeal of your software.

Yeah, I can't say I expected it to have that big of an impact on the popularity of this repo when I wrote it. Too bad it only looks easy and is still out of reach for most people with little experience in programming.

Prior to becoming my colleague, fatchord wrote not only WaveRNN but also a Tacotron 1 implementation (which, by the way, has not been proven inferior to Tacotron 2): https://github.com/fatchord/WaveRNN

NVIDIA has a Tacotron 2 implementation: https://github.com/NVIDIA/tacotron2

Mozilla as well, with more frequent updates & features: https://github.com/mozilla/TTS

I would also check paperswithcode.com and ignore my repo and the ones above if you're looking for something else; perhaps something more recent, as neural TTS is still very much growing. https://paperswithcode.com/task/text-to-speech-synthesis

cantrell · 2020-07-02T17:25:59Z

Hi, @CorentinJ. This is a fantastic project which I've had a lot of fun playing around with.

The biggest challenge with using other projects seems to be data sets. All the other projects I've found are most easily trained on the LJSpeech data set whereas this one can generate unique results with a small sample of audio. Are you aware of any other projects that can be used to clone speech with small audio samples? Thanks!

CorentinJ · 2020-07-02T19:36:22Z

@cantrell You've got to understand the way voice cloning works in this repo. The Tacotron 2 architecture in my repo barely differs from the usual Tacotron 2. The only thing that's added is a way to condition it on a speaker's voice, which is a very minor addition. Hence why it should be simple to transfer that over to an existing Tacotron 2 implementation. The Mozilla repo has ongoing (or maybe finished?) work on that, so that's one alternative.

Do understand that it's not a matter of training the model on only 5 seconds of audio, it's an entirely different procedure which does not involve any training.
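
To make the distinction concrete, here is a minimal sketch of one common way to condition a synthesizer on a voice: a fixed speaker embedding, computed once from the reference audio, is appended to every encoder output frame. Treat the wiring as an assumption rather than a description of this repo's code; `condition_on_speaker` and the toy vectors are hypothetical.

```python
def condition_on_speaker(encoder_outputs, speaker_embedding):
    """Append the same speaker embedding to every encoder timestep.

    encoder_outputs: list of per-timestep feature vectors (lists of floats).
    speaker_embedding: one vector produced by a separate speaker encoder.
    No weights are updated here: cloning a new voice only changes the
    embedding that gets concatenated.
    """
    return [list(frame) + list(speaker_embedding) for frame in encoder_outputs]


# Toy example: 3 text timesteps with 2 features each, 3-dim embedding.
text_features = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
voice = [1.0, 2.0, 3.0]
conditioned = condition_on_speaker(text_features, voice)
```

Swapping `voice` for the embedding of another speaker changes the target voice at inference time, with no training step involved.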

cantrell · 2020-07-02T20:02:04Z

Got it. Thanks, @CorentinJ. I'll take a closer look (and/or check out the Mozilla implementation).

dathudeptrai · 2020-07-03T04:37:28Z

@CorentinJ @cantrell Can you guys take a look at our recent TTS framework here (https://github.com/TensorSpeech/TensorflowTTS)? We support Tacotron2, FastSpeech, FastSpeech2 and Multi-band MelGAN in a native Tensorflow implementation. We also plan to support other languages, tflite for mobile, and tensorrt for server deployment. Almost all supported models are real-time now.

audio samples: https://tensorspeech.github.io/TensorflowTTS/
colab demo: https://colab.research.google.com/drive/1akxtrLZHKuMiQup00tzO2olCaN-y3KiD?usp=sharing

I can make pull request if you want :D.

CorentinJ · 2020-07-03T08:15:35Z

@dathudeptrai cool, do go ahead, but remember that you'll have to ensure that the data compatibility between wavernn and the synthesizer is preserved, and that you will have to provide new pretrained weights for both these models.

dathudeptrai · 2020-07-03T08:33:33Z

@CorentinJ I think it's not hard to convert the pretrained tacotron2 here to my tensorflow2 implementation, since my implementation is based on the tacotron2 code used here.

ghost · 2020-07-05T14:00:42Z

@CorentinJ can you please take a quick look at #227 (synthesizer produces large gaps when processing very short texts) and give us a clue where that issue might be coming from, or where to start if we want to fix it?

Edit: @macriluke says it results from the training dataset. Is it really because the models are trained on medium to long utterances? #291 (comment)

macriluke · 2020-07-05T19:12:36Z

> @CorentinJ can you please take a quick look at #227 (synthesizer produces large gaps when processing very short texts) and give us a clue where that issue might be coming from, or where to start if we want to fix it?
>
> Edit: @macriluke says it results from the training dataset. Is it really because the models are trained on medium to long utterances? #291 (comment)

I was going off of this bit of the thesis:

> The prosody is however sometimes unnatural, with pauses at unexpected locations in the sentence, or the lack of pauses where they are expected. This is particularly noticeable with the embedding of some speakers who talk slowly, showing that the speaker encoder does capture some form of prosody. The lack of punctuation in LibriSpeech is partially responsible for this, forcing the model to infer punctuation from the text alone. This issue was highlighted by the authors as well, and can be heard on some of their samples of LibriSpeech speakers. The limits we imposed on the duration of utterances in the dataset (1.6s - 11.25s) are likely also problematic. Sentences that are too short will be stretched out with long pauses, and for those that are too long the voice will be rushed.

It looks like I may have misunderstood what the word "pauses" means here, as I see that #53 mentions this is an issue introduced through the code.

EDIT: I will say that while the wooshing and long pauses aren't this common on other pretrained tacotrons, I have heard them on mid-training evaluations of different synthesis models, so the real cause could potentially be both training and code here.

CorentinJ · 2020-07-06T03:30:10Z

This issue of large gaps is something that also occurred at Resemble.AI, and that I have worked on and fixed. It's a serious amount of work, so I'll give you the broad strokes:

Use LibriTTS instead of LibriSpeech in order to have punctuation.
LibriTTS needs to be curated of speakers with bad prosody.
You can lower the upper bound I put on utterance duration, which I suspect has the effect of removing long utterances that are more likely to contain pauses (I formally evaluated models trained this way and found that they generate long pauses less frequently). It also trains faster and does not have drawbacks (with a good attention paradigm, the model can generate sentences longer than those seen in training).
The attention paradigm needs to be replaced, forward attention is poor.
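
The utterance duration point can be sketched as a simple filter. This is an illustration under assumptions, not the repo's preprocessing code: the default bounds are the 1.6 s and 11.25 s limits quoted from the thesis above, while the `filter_by_duration` name, the metadata layout and the lowered bound of 8.0 s are hypothetical.

```python
def filter_by_duration(utterances, min_s=1.6, max_s=11.25):
    """Keep utterances whose duration in seconds lies within [min_s, max_s]."""
    return [u for u in utterances if min_s <= u["duration"] <= max_s]


# Hypothetical metadata: one dict per utterance.
utterances = [
    {"id": "a", "duration": 1.0},
    {"id": "b", "duration": 5.0},
    {"id": "c", "duration": 10.0},
]

default_set = filter_by_duration(utterances)             # thesis bounds
shorter_set = filter_by_duration(utterances, max_s=8.0)  # lowered upper bound
```

Lowering `max_s` shrinks the training set, so the value is a trade-off to be tuned against how much data remains.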

Liujingxiu23 · 2020-07-06T10:04:46Z

@CorentinJ
It's a pity that you have decided not to update this project any more.
I have followed your work since the latter half of 2019.
For the encoder part, I removed the ReLU activation function after the last linear layer and trained with 18k speakers (Chinese + English) for about 2 to 3 months. I used the "Resemblyzer-master" tool, as well as my own tool, to analyze the embeddings generated by the model. I guess the encoder is ready.
For the synthesizer, can you help me and give me some advice?

My target language is Chinese. I do not have enough TTS corpora to train the synthesizer; only ASR corpora can be found, for example aishell, but the quality is not so good. Do you have any suggestions for preprocessing the wavs?
When given a target wav, the end-to-end synthesized wavs have some characteristics of the target timbre, but the similarity is low. Do you have any suggestions to improve the similarity? How are your results? Could you share some of your best results?

CorentinJ · 2020-07-06T22:59:45Z

@Liujingxiu23 I don't have any suggestions regarding your data. As for the audio quality, you can improve it by finetuning both tacotron and the vocoder on a single speaker. To improve the quality of voice cloning in general, there's a lot more work to do, starting with the list I gave above.

mueller91 · 2020-07-17T12:01:44Z

Dear @CorentinJ, thank you for your amazing work and your continued support here. I have a few questions:
a) Would you still apply denoising to LibriTTS? I find that the samples are high quality, and the data itself has already been cleaned.
b) Can I train on both LibriTTS and VCTK? If so, what should I look out for?
c) When training the speaker encoder (SE), I find that there is a difference in the difficulty of the datasets: VCTK, LibriTTS and Mozilla Common Voice are 'easy' for the SE, and it achieves low loss and low EER quickly. However, VoxCeleb{1,2} are much harder.
-> Should I train on each dataset separately, and once the model has 'trained out' on the easier datasets, skip them in favor of more iterations on VoxCeleb?

CorentinJ · 2020-07-17T12:16:27Z

a) Yes I would. Having manually curated LibriTTS myself, I can definitely say that a lot of speakers are very noisy. Do a little data exploration to convince yourself of that: pick 100 random samples and listen to all of them. There are still many issues with this improved version of LibriSpeech: inconsistent volume, background noise, poor mic quality, mic bumps, ... Regarding denoising alone, here's a sample from LibriTTS and its denoised version:
https://puu.sh/G839T.wav
https://puu.sh/G839Y.wav

b) Yes you can, some gotchas:

Ensure that your preprocessed data is sampled to the same sample rate
Ensure you normalize volume
Beware of balance: compare the size of LibriTTS vs that of VCTK and compare the number of speakers. You might need to prune away some data from LibriTTS
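
Two of these gotchas can be sketched in a few lines. This is a rough illustration under assumptions, not code from this repo: `normalize_rms`, `cap_per_dataset`, the -30 dBFS target and the item counts are all hypothetical, and resampling is left out because it is best done with a dedicated audio library.

```python
import math
import random


def normalize_rms(samples, target_dbfs=-30.0):
    """Scale float samples in [-1, 1] so their RMS level matches target_dbfs."""
    rms = math.sqrt(sum(x * x for x in samples) / len(samples))
    if rms == 0:
        return list(samples)  # pure silence: nothing to scale
    gain = 10 ** (target_dbfs / 20) / rms
    # Clip to the valid range in case the gain pushes peaks past full scale.
    return [max(-1.0, min(1.0, x * gain)) for x in samples]


def cap_per_dataset(items_by_dataset, max_items, seed=0):
    """Randomly prune each dataset down to at most max_items entries."""
    rng = random.Random(seed)
    balanced = {}
    for name, items in items_by_dataset.items():
        items = list(items)
        if len(items) > max_items:
            items = rng.sample(items, max_items)
        balanced[name] = items
    return balanced


# Hypothetical sizes: the larger corpus is pruned to match the smaller one.
corpora = {"libritts": range(1000), "vctk": range(100)}
balanced = cap_per_dataset(corpora, max_items=100)
```

Capping by utterance count is only one way to balance; capping by speaker count or by total duration may fit better depending on the goal.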

c) I don't know if it's worth the effort. The voice encoder is a nice example of "throw more resources at it and it'll keep improving", if you merge your datasets (although again, balance might be an issue given the size of voxceleb) and train for long enough it should perform well anyway.

mueller91 · 2020-07-17T17:08:57Z

Thank you so much for your answer. Two follow-up questions:
a) Why would dataset balance be an issue? Assume I have 10 times more samples from LibriTTS than from VCTK. If the input format, sampling rate and preprocessing are the same, why should this imbalance matter (provided the clips are of roughly the same quality w.r.t. noise)? Same for SP.
b) You mentioned manually curating LibriTTS. Could you elaborate on what you did in a bit more detail? Are there any papers, tools, etc. you can point me to? Did you listen to all the audio files? (I cannot imagine this.)

Again, thank you so much for your answers. At my university (Munich, Germany), nobody is doing speech synthesis, so I'm a bit on my own here.

CorentinJ · 2020-07-17T18:49:00Z

a) It's a matter of what you want. If you want to reach VCTK quality, then LibriTTS samples vastly outnumbering VCTK samples is going to cancel that out due to sampling being uniform. In a classical multispeaker model with a speaker table (i.e. an embedding layer), it would still make sense to have a 10 to 1 ratio if your goal was only to encode a voice for these speakers in the speaker table.
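
The arithmetic behind the uniform sampling remark, as a small sketch: with uniform sampling over the merged data, each dataset's share of a batch equals its share of the samples. The sample counts and the `dataset_share` helper are hypothetical; only the 10 to 1 ratio comes from the question.

```python
def dataset_share(sizes):
    """Fraction of uniformly drawn samples that come from each dataset."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


# Hypothetical counts with the 10 to 1 ratio from the question.
shares = dataset_share({"libritts": 100_000, "vctk": 10_000})
# shares["vctk"] is 1/11, so about 9% of each batch is VCTK audio.
```

With roughly 91% of every batch coming from LibriTTS, the model mostly fits LibriTTS recording conditions, which is why the higher VCTK quality gets cancelled out.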

b) I can't elaborate too much, no. Just know that some of the data is of poor quality, and some is great. A bit of data exploration should give you an idea.

DanChristos · 2020-08-11T15:49:19Z

> We're one year after the initial publication of this project. I've been busy with both exams and work since, and it's only last week that I passed my last exam. During that year, I have received SO many messages from people asking for help in setting up the repo and I just had no time to allocate for any of that.
> I kinda wished that the popularity of this repo would have died down, but new people keep coming in at a fairly constant rate.
> I have no intentions to start developing on this repo again, but I hope I can answer some questions and possibly review some PRs. Use this issue to ask me questions and to bring light upon things that you believe need to be improved, and we'll see what can be done.

You wanted the popularity of this repo to go down because you couldn't handle the requests? That's kinda absurd. People's interest is a good thing: more developers means less work on one person's shoulders. ;)

mbdash · 2020-09-03T00:50:55Z

@CorentinJ

If tensorflow is entirely removed from this repo, I will change that message for sure.

I still get a lot of feedback from people who spent hours trying to set things up.

In my opinion,
It is actually not that hard to setup on Ubuntu.
On windows... well... good luck. (for now)

I hope this will help reduce complaints :

WIKI

Installation - Ubuntu-20.04

Installation - Windows-10 TODO

CodingRox82 · 2020-09-28T04:05:01Z

I want to implement something like this for voice-to-voice. Basically, I want to record a voice and then use this as a basis for masking N voices, where N >> 1. Some questions:

"If you're planning to work on a serious project, my strong advice: find another TTS repo." @CorentinJ, would this comment still apply if I don't need the part that reads and creates audio from a given text?
I understand that the impressive part of this repo is that it can clone a voice given only 5 seconds of audio, but in general does the output improve with training on more (and more diverse) data? What if I had a professional speaker record hours of data to serve as input audio: would the output improve in quality?

mbdash · 2020-09-28T14:28:40Z

@CodingRox82 Hi,
if you are seriously interested in Voice to Voice / Voice changer / Voice Transfer / "insert any other description that involves converting the audio from 1 speaker to another without passing through TTS";

Would you be interested in joining a small group with common interest?
We are currently working on creating a polished dataset.
Our small group have different but overlapping interests for the good of this repo and others that can provide voice to voice, bypassing TTS.

If you are interested, leave a comment in #474

ghost · 2020-10-12T20:29:22Z

@CorentinJ Thanks for providing the statement of direction in #543 (comment)

In that context it's not worth my time continuing to provide technical support as I have the last few months. It was initially helpful to identify common pain points but now it's mainly down to getting rid of tensorflow, and people asking for an exe. To help potential developers I suggest disallowing the use of the issues board for tech support and requests for help with projects since it dilutes the development effort. I've donated a lot of my time trying to build some sense of community, but unfortunately it is not attracting and retaining the type of people who can push this project forward.

Tensorflow has this issue policy, and it could help to implement something similar. I realize this will be unpopular because a lot of individuals want help and tech support, but it needs to be understood that you get what you pay for with open source.

> If you open a GitHub Issue, here is our policy: 1. It must be a bug/performance issue or a feature request or a build issue or a documentation issue (for small doc fixes please send a PR instead). 2. Make sure the Issue Template is filled out. 3. The issue should be related to the repo it is created in.
>
> Here's why we have this policy: We want to focus on the work that benefits the whole community, e.g., fixing bugs and adding features. Individual support should be sought on Stack Overflow or other non-GitHub channels. It helps us to address bugs and feature requests in a timely manner.

CodingRox82 · 2020-10-12T21:23:05Z

Sorry for the late reply @blue-fish . I'm definitely interested in using this. I like your idea of creating a pre-compiled version to give people to test out. I'm going to start tinkering around with this to try to get it to work and if I find the time to learn how to create a distributable precompiled version I'll give it a shot.

CorentinJ · 2020-10-12T21:30:56Z

@blue-fish Thanks a lot for your valuable help and time. I did come to the same conclusions as you. A lot of the users coming through are highly inexperienced.

I have been wanting to make things simpler just for the sake of reducing the number of technical support requests, but my awkward position makes it hard for me to stay involved.

macriluke · 2020-10-13T13:01:39Z

Let me know if I'm up to date on this-

blue-fish finished the effort to implement and train in pytorch in his fork.
on review it was decided that the overall quality of the tensorflow model was still better.
sometimes with the tensorflow model the stop token prediction fails and results in large gaps in the synthesis.
sometimes the pytorch model will quit in the middle of synthesis, something to do with the attention model?

CorentinJ · 2020-10-13T13:25:52Z

The stop token prediction (whether the model knows when to end the generation) on the tensorflow model is usually good; the long pauses are more of a dataset/data representation and attention mechanism issue.

The pytorch model is the one to fail at predicting stop tokens - indeed due to its attention mechanism - and hence why it stops during generation.
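
For readers unfamiliar with stop tokens, a minimal sketch of a Tacotron-style decoding loop follows. It is an illustration under assumptions and is taken from neither model: `decode`, `step_fn`, the 0.5 threshold and the toy step functions are hypothetical.

```python
def decode(step_fn, max_steps=1000, stop_threshold=0.5):
    """Generate frames until the predicted stop probability crosses the threshold.

    step_fn(t) stands in for one decoder step and returns
    (frame, stop_probability). max_steps is a safety cap for the case
    where the stop token never fires.
    """
    frames = []
    for t in range(max_steps):
        frame, stop_prob = step_fn(t)
        frames.append(frame)
        if stop_prob > stop_threshold:
            break
    return frames


# Healthy case: the stop token fires once the sentence is finished.
healthy = decode(lambda t: (t, 1.0 if t >= 4 else 0.0))

# Failure described above: the stop token fires too early, so the
# output is cut off in the middle of the sentence.
truncated = decode(lambda t: (t, 1.0 if t >= 1 else 0.0))
```

The opposite failure, a stop token that never fires, would run until `max_steps` and produce trailing noise or silence rather than pauses in the middle of a sentence.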

macriluke · 2020-10-13T13:44:58Z

Ah okay I had it almost exactly backwards.

So following blue-fish's instructions in #538 to retrain tensorflow on LibriTTS/LibriSpeech should resolve the long pauses and also won't have the stop token issue?

CorentinJ · 2020-10-29T08:23:49Z

Recent similar projects:
https://github.com/Tomiinek/Multilingual_Text_to_Speech
https://github.com/espnet/espnet

CorentinJ · 2021-01-11T15:06:50Z

Another similar project:
https://github.com/nii-yamagishilab/multi-speaker-tacotron

eyewebs · 2021-01-15T00:45:20Z

> Recent similar projects:
> https://github.com/Tomiinek/Multilingual_Text_to_Speech
> https://github.com/espnet/espnet

Can I also clone voices with these repos using a small audio clip of 3-5 minutes? This repo needs a 5 second audio clip, but for Resemble.AI a larger voice sample is better. Resemble.AI now asks for voice verification, which is something I can't do.

Are there repos that can also use a longer voice sample of, for example, 5 minutes, and that sound better than this repo? If so, which ones give the best results?

I would like to pay the person who can help me make good voice clones from 3-5 minute samples. I really need it. blue-fish, I see you're very active here. Help me? :)

pablodz · 2021-01-24T21:48:44Z

Could you add some maintainers to the repo, create an announcement and ask for help? It has happened before with other repositories.

BrentonBadGoy · 2021-01-27T14:19:50Z

That's very good work, congrats.
I don't know if I'm in the right place to post this, but it gives an American accent to the cloned voice although the speaker I want to clone has a British accent. Is it the encoder, the synthesizer, the vocoder, or all three? Is there a way to change this without having an Nvidia GPU to train the models? Or are there already models trained with a British accent available?
Also, I noticed the pronunciation is sometimes wrong, and it even misses some words entirely. Is there a way to change this? Maybe it's due to the punctuation not being taken into account?

CorentinJ pinned this issue Jun 19, 2020
