Update on maintaining this project #364
Comments
First things first, the biggest issue for me with this project is the hecking tensorflow code. Tensorflow sucks, and it sucks just as much to install it, let alone install an older version. I believe it would lower the entry barrier for new users if the version of that package were upgraded. I've seen a PR for that, but it seems to be only for the Colab version. A PR for the entire repo would be appreciated. Ideally, we'd replace all of the synthesizer code with PyTorch code (there are several open source PyTorch synthesizers out there), but that's a lot of work. If anybody is willing to pick up either of these things, let me know. |
Second thing: webrtcvad. That package is hell to install on Windows. There are alternatives for noise removal out there. There's also the possibility of not using it at all, but for both LibriSpeech and LibriTTS I would recommend it. |
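For readers wondering what "not using it at all" could look like in practice, here is a minimal sketch of a dependency-free fallback: trimming leading and trailing silence with a plain energy threshold. This is not the repo's method and it is not equivalent to webrtcvad, which is a trained voice activity detector. The function names, `frame_len` and `threshold` are all assumptions made for illustration, and samples are assumed to be floats in [-1, 1].

```python
import math

def frame_rms(samples, frame_len):
    """Split samples into fixed-size frames and return the RMS energy of each."""
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    return [math.sqrt(sum(x * x for x in f) / len(f)) for f in frames]

def trim_silence(samples, frame_len=160, threshold=0.01):
    """Drop leading and trailing frames whose RMS is below the threshold.

    frame_len and threshold are illustrative values, not tuned ones.
    """
    energies = frame_rms(samples, frame_len)
    voiced = [i for i, e in enumerate(energies) if e >= threshold]
    if not voiced:
        return []  # nothing above the threshold: treat the clip as silence
    start = voiced[0] * frame_len
    end = min((voiced[-1] + 1) * frame_len, len(samples))
    return samples[start:end]
```

A simple threshold like this will let through steady background noise that a real VAD would reject, which is probably why the recommendation above still stands for LibriSpeech and LibriTTS.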
I'd like #331 merged to enable CPU support by default. It also simplifies the install process for those with a goal of running demo_cli.py for evaluation purposes. Some kind of API or improved CLI would be a worthwhile and easy enhancement for the community to pursue. Good usability will help keep this repo as the focal point for development of open-source SV2TTS. This is really neat stuff, many thanks for sharing your code and pre-trained models under a permissive license. |
I'll give a review to #331 tomorrow and probably will make some changes as well. |
Opened #375 to propose a workaround for webrtcvad. |
It would be awesome if you could point out some alternatives, maybe people would start using them instead. I'm not knowledgeable at all in this field so I don't know how to find anything on my own and how to compare which repos are good and which ones work best for what. I think having a load and click free GUI app is the appeal of your software. |
Yeah, I can't say I expected it to have that big of an impact on the popularity of this repo when I wrote it. Too bad it only looks easy, but still is out of reach for most people with little experience in programming. Prior to becoming my colleague, fatchord wrote not only WaveRNN but also a Tacotron 1 implementation (which, by the way, is not proved inferior to Tacotron 2): https://github.com/fatchord/WaveRNN NVIDIA has a Tacotron 2 implementation: https://github.com/NVIDIA/tacotron2 Mozilla as well, with more frequent updates & features: https://github.com/mozilla/TTS I would also check paperswithcode.com and ignore my repo and the ones above if you're looking for something else; perhaps something more recent, as neural TTS is still very much growing. https://paperswithcode.com/task/text-to-speech-synthesis |
Hi, @CorentinJ. This is a fantastic project which I've had a lot of fun playing around with. The biggest challenge with using other projects seems to be data sets. All the other projects I've found are most easily trained on the LJSpeech data set whereas this one can generate unique results with a small sample of audio. Are you aware of any other projects that can be used to clone speech with small audio samples? Thanks! |
@cantrell You've got to understand the way voice cloning works in this repo. The Tacotron 2 architecture in my repo barely differs from the usual Tacotron 2. The only thing that's added is a way to condition it on a speaker's voice, which is a very minor addition. Hence why it should be simple to transfer that over to an existing Tacotron 2 implementation. The Mozilla repo has ongoing (or maybe finished?) work on that, so that's one alternative. Do understand that it's not a matter of training the model on only 5 seconds of audio, it's an entirely different procedure which does not involve any training. |
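To make the "very minor addition" concrete, here is a small sketch of one common way to condition a synthesizer on a speaker: the same fixed-size speaker embedding is concatenated to the encoder output at every time step. Whether an existing Tacotron 2 implementation should do it exactly this way is an assumption; the function name and the toy numbers are hypothetical.

```python
def condition_on_speaker(encoder_outputs, speaker_embedding):
    """Concatenate the same speaker embedding to every encoder time step."""
    return [frame + speaker_embedding for frame in encoder_outputs]

# Toy values: 3 encoder time steps with 2 features each.
encoder_outputs = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
# Fixed-size voice vector, computed once from a short reference clip.
speaker_embedding = [0.9, -0.9, 0.0]

conditioned = condition_on_speaker(encoder_outputs, speaker_embedding)
```

The point of the sketch is that the embedding is an input computed at inference time, so cloning a new voice changes the input and not the model weights.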
Got it. Thanks, @CorentinJ. I'll take a closer look (and/or check out the Mozilla implementation). |
@CorentinJ @cantrell Can you guys take a look at our recent TTS framework here (https://github.com/TensorSpeech/TensorflowTTS)? We support Tacotron2, FastSpeech, FastSpeech2 and Multi-band MelGAN in a native Tensorflow implementation. We also plan to support other languages, tflite for mobile, and tensorrt for server deployment. Almost all supported models run in real time now. Audio samples: https://tensorspeech.github.io/TensorflowTTS/ I can make a pull request if you want :D. |
@dathudeptrai cool, do go ahead, but remember that you'll have to ensure that data compatibility between wavernn and the synthesizer is preserved, and that you will have to provide new pretrained weights for both these models. |
@CorentinJ I think it's not hard to convert the pretrained tacotron2 here to my tensorflow2 implementation, since my implementation is based on the tacotron2 code used here. |
@CorentinJ can you please take a quick look at #227 (synthesizer produces large gaps when processing very short texts) and give us a clue where that issue might be coming from, or where to start if we want to fix it? Edit: @macriluke says it results from the training dataset. Is it really because the models are trained on medium to long utterances? #291 (comment) |
I was going off of this bit of the thesis:
It looks like maybe I made the wrong assumption about the meaning of the word "pauses" here, as I see that #53 mentions this is an issue introduced through the code. EDIT: I will say that while the wooshing and long pauses aren't this common on other pretrained tacotrons, I have heard them on mid-training evaluations of different synthesis models, so the real cause could potentially be both training and code here. |
This issue of large gaps is something that also occurred at Resemble.AI, and that I have worked on and fixed. It's a serious amount of work, I'll give you the big lines:
|
@CorentinJ
|
@Liujingxiu23 I don't have any suggestion regarding your data. As for the audio quality, you can improve it by finetuning both tacotron and the vocoder on a single speaker. To improve the quality of voice cloning in general, there's a lot more work to do, starting with the list I gave above. |
Dear @CorentinJ , thank you for your amazing work and your continued support here. I have a few questions: |
a) Yes I would. Having manually curated LibriTTS myself, I can definitely say that a lot of speakers are very noisy. Do a little data exploration to convince yourself of that: pick 100 random samples and listen to all of them. There are still many issues with this improved version of LibriSpeech: inconsistent volume, background noise, poor mic quality, mic bumps, ... Regarding denoising alone, here's a sample from LibriTTS and its denoised version: b) Yes you can, some gotchas:
c) I don't know if it's worth the effort. The voice encoder is a nice example of "throw more resources at it and it'll keep improving", if you merge your datasets (although again, balance might be an issue given the size of voxceleb) and train for long enough it should perform well anyway. |
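The data exploration suggested in (a) above, picking 100 random samples and listening to them, can be scripted in a few lines. This is only a sketch: the function name, the `*.wav` pattern and the fixed seed are assumptions, and the dataset path is whatever your local copy uses.

```python
import random
from pathlib import Path

def pick_random_clips(dataset_root, n=100, pattern="*.wav", seed=0):
    """Return up to n audio files chosen uniformly at random under dataset_root."""
    # Sort first so that the same seed always yields the same selection.
    files = sorted(Path(dataset_root).rglob(pattern))
    rng = random.Random(seed)
    return rng.sample(files, min(n, len(files)))
```

Sampling over files rather than over speakers means prolific speakers show up more often; if you care about per-speaker quality, you may prefer to sample speakers first and then one clip from each.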
Thank you so much for your answer. Two follow-up questions: Again, thank you so much for your answers. At my university (Munich, Germany), nobody is doing speech synthesis - I'm a bit on my own here. |
a) It's a matter of what you want. If you want to reach VCTK quality, then LibriTTS samples vastly outnumbering VCTK samples is going to cancel that out due to sampling being uniform. In a classical multispeaker model with a speaker table (i.e. an embedding layer), it would still make sense to have a 10 to 1 ratio if your goal was only to encode a voice for these speakers in the speaker table; b) I can't elaborate too much, no. Just know that some of the data is of poor quality, and some is great. A bit of data exploration should give you an idea. |
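The effect of uniform sampling described in (a) is easy to quantify. The sketch below uses hypothetical sample counts in a 10 to 1 ratio (not the real dataset sizes) and shows the expected share of each dataset in a batch, along with per-sample weights that would equalize the shares. Whether rebalancing is desirable is a separate question that depends on your goal.

```python
def uniform_share(dataset_sizes):
    """Expected fraction of a batch from each dataset when sampling uniformly over samples."""
    total = sum(dataset_sizes.values())
    return {name: size / total for name, size in dataset_sizes.items()}

def balancing_weights(dataset_sizes):
    """Per-sample weights that give every dataset the same expected share."""
    k = len(dataset_sizes)
    return {name: 1.0 / (k * size) for name, size in dataset_sizes.items()}

# Hypothetical counts with a 10 to 1 ratio, for illustration only.
sizes = {"LibriTTS": 10000, "VCTK": 1000}
shares = uniform_share(sizes)       # VCTK gets 1/11 of each batch on average
weights = balancing_weights(sizes)  # each dataset's weights sum to 1/2
```

Weights of this kind could be passed to a weighted sampler such as `random.choices(population, weights=...)`.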
You wanted the popularity of this repo to go down because you couldn't handle the requests? That's kinda absurd, people's interest is a good thing, more developers means less work on one person's shoulders. ;) |
I hope this will help reduce complaints. Wiki: Installation - Ubuntu-20.04, Installation - Windows-10, TODO |
I want to implement something like this for voice-to-voice. Basically, I want to record a voice and then use this as a basis for masking N voices, where N >> 1. Some questions:
|
@CodingRox82 Hi, Would you be interested in joining a small group with common interest? If you are interested, leave a comment in #474 |
@CorentinJ Thanks for providing the statement of direction in #543 (comment) In that context it's not worth my time continuing to provide technical support as I have the last few months. It was initially helpful to identify common pain points but now it's mainly down to getting rid of tensorflow, and people asking for an exe. To help potential developers I suggest disallowing the use of the issues board for tech support and requests for help with projects since it dilutes the development effort. I've donated a lot of my time trying to build some sense of community, but unfortunately it is not attracting and retaining the type of people who can push this project forward. Tensorflow has this issue policy, and it could help to implement something similar. I realize this will be unpopular because a lot of individuals want help and tech support, but it needs to be understood that you get what you pay for with open source.
|
Sorry for the late reply @blue-fish . I'm definitely interested in using this. I like your idea of creating a pre-compiled version to give people to test out. I'm going to start tinkering around with this to try to get it to work and if I find the time to learn how to create a distributable precompiled version I'll give it a shot. |
@blue-fish Thanks a lot for your valuable help and time. I did come to the same conclusions as you. A lot of the users coming through are highly inexperienced. I have been wanting to make things simpler just for the sake of reducing the number of technical support requests, but my awkward position makes it hard for me to stay involved. |
Let me know if I'm up to date on this-
|
The stop token prediction (whether the model knows when to end the generation) on the tensorflow model is usually good; the long pauses are more of a dataset/data representation and attention mechanism issue. The pytorch model is the one that fails at predicting stop tokens, indeed due to its attention mechanism, which is why it stops during generation. |
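For readers unfamiliar with stop tokens, here is a generic sketch of how an autoregressive decoder loop typically uses one. It is not taken from either the tensorflow or the pytorch model discussed here: `step_fn`, `max_steps` and `stop_threshold` are hypothetical names, and `step_fn(t)` stands in for one decoder step that returns a frame and a stop probability.

```python
def decode(step_fn, max_steps=1000, stop_threshold=0.5):
    """Run an autoregressive decoder until it predicts a stop or hits max_steps.

    step_fn(t) is a hypothetical stand-in for one decoder step; it returns
    (frame, stop_probability).
    """
    frames = []
    for t in range(max_steps):
        frame, stop_prob = step_fn(t)
        frames.append(frame)
        if stop_prob > stop_threshold:
            return frames, True   # the model decided to stop
    return frames, False          # safety cap reached, stop never predicted
```

In a loop like this, a stop probability that fires too early cuts the utterance short, and one that never fires runs to the cap. Long pauses inside an utterance are a different failure, since the loop is still emitting frames as instructed.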
Ah okay, I had it almost exactly backwards. So following blue-fish's instructions in #538 to retrain tensorflow on LibriTTS/LibriSpeech should resolve the long pauses and also won't have the stop token issue? |
Recent similar projects: |
Another similar project: |
Can I also clone voices with these repos using a small audio clip of 3-5 minutes? This repo needs a 5 second audio clip, but for Resemble.AI a larger voice sample is better. Resemble now asks for voice verification, which is something I can't do. Are there repos that can also use a longer voice sample of, for example, 5 minutes, and that sound better than this repo? If so, which ones have the best results? I would like to pay the person who can help me make good voice clones from 3-5 minute samples. I really need it. blue-fish, I see you're very active here. Help me? :) |
Could you add some maintainers to the repo, create an announcement and ask for help? It has happened before with other repositories. |
That's very good work, congrats. |
We're one year after the initial publication of this project. I've been busy with both exams and work since, and it's only last week that I passed my last exam. During that year, I have received SO many messages from people asking for help in setting up the repo and I just had no time to allocate for any of that.
I kinda wished that the popularity of this repo would have died down, but new people keep coming in at a fairly constant rate.
I have no intentions to start developing on this repo again, but I hope I can answer some questions and possibly review some PRs. Use this issue to ask me questions and to bring light upon things that you believe need to be improved, and we'll see what can be done.