Fixing the synthesizer's gaps in spectrograms #53

Closed
TheButlah opened this issue Jul 19, 2019 · 23 comments
Labels
bug Something isn't working

Comments

@TheButlah

TheButlah commented Jul 19, 2019

Hello, and thank you for the great work! One of the limitations that I have noticed is that the synthesizer starts to have long gaps in speech if the input text length is short. @CorentinJ do you have any ideas why this is or how I could fix it? I'll also probably ask on Rayhane's repo if I can reproduce the issue on his synthesizer.

Am I correct in assuming that the issue is caused by the stop prediction in Taco2 not having a high enough activation, which results in long spectrograms?

@CorentinJ
Owner

CorentinJ commented Jul 19, 2019

I doubt it's because of the stop prediction. The stop prediction only occurs after the spectrogram is generated. Yes, this is an issue of the synthesizer. It would have to be replaced by a better one (eliminating other problems with that) such as fatchord's, but I just don't have the time to do it.

@TheButlah
Author

TheButlah commented Jul 19, 2019

I was referring to the stop prediction in Tacotron 2 (the synthesizer, not the vocoder). I wasn't aware that stop prediction was used in WaveRNN, since it can just stop outputting when it runs out of spectrogram frames to condition on.

What do you mean by "the stop prediction only occurs after the spectrogram is generated"?

@CorentinJ
Owner

I wasn't talking about the vocoder. Since Tacotron's decoder is autoregressive, the first stop token above the threshold value is predicted when the spectrogram is done being generated, by definition. Thus it has no impact on previous frames; in fact, its output is not fed back to the model, IIRC. I don't see how the stop token could be the issue.
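
The argument above can be illustrated with a toy decoding loop. This is only a sketch, not the repository's Tacotron code: `decode`, `make_toy_step`, the threshold of 0.5, and `max_steps` are hypothetical names and values chosen for illustration, and the loop assumes (as the comment recalls) that the stop probability is never fed back into the decoder.

```python
from typing import Callable, List, Tuple

Frame = List[float]


def decode(step: Callable[[Frame], Tuple[Frame, float]],
           first_frame: Frame,
           stop_threshold: float = 0.5,
           max_steps: int = 1000) -> List[Frame]:
    """Generate frames until the stop probability exceeds the threshold."""
    frames: List[Frame] = []
    prev = first_frame
    for _ in range(max_steps):
        frame, stop_prob = step(prev)
        frames.append(frame)
        if stop_prob > stop_threshold:
            break  # the stop token only ends the loop
        prev = frame  # only the frame is fed back, never stop_prob
    return frames


def make_toy_step(n_frames: int) -> Callable[[Frame], Tuple[Frame, float]]:
    """Hypothetical stand-in for a decoder: fires the stop token at step n_frames."""
    state = {"t": 0}

    def step(prev: Frame) -> Tuple[Frame, float]:
        state["t"] += 1
        stop_prob = 1.0 if state["t"] >= n_frames else 0.0
        return [prev[0] + 1.0], stop_prob

    return step


frames = decode(make_toy_step(5), [0.0])                      # stops after 5 frames
runaway = decode(make_toy_step(10**9), [0.0], max_steps=20)   # stop never fires
```

In this sketch the stop token cannot alter frames that were already generated. The only way it changes the output is by firing late or not at all, in which case the loop runs until `max_steps`.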

@TheButlah
Author

TheButlah commented Jul 19, 2019

Ah yes I see what you mean. That makes sense, I agree that it has to be another issue.

One idea that I had was annealing the level of teacher forcing that takes place during training. I suspect that, because the synthesizer is autoregressive, any errors (deviations from the true mel frame) compound as they get fed into the predictions for the next mel frame. Teacher forcing accelerates training convergence because it removes the ability of these errors to propagate, but I would expect that the network never learns to account for its own errors because it was always fed real data during training. Hence annealing the probability that the spectrogram frame is teacher forced might get the best of both worlds.
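
For illustration, the annealing idea (often called scheduled sampling) could look roughly like the sketch below. The linear schedule, the function names, and the default `decay_steps` are assumptions made for this example; nothing here is taken from Rayhane's or this repository's training code.

```python
import random


def teacher_forcing_prob(step: int,
                         start: float = 1.0,
                         end: float = 0.0,
                         decay_steps: int = 100_000) -> float:
    """Linearly anneal the teacher-forcing probability from `start` to `end`."""
    frac = min(max(step / decay_steps, 0.0), 1.0)
    return start + (end - start) * frac


def choose_decoder_input(true_prev, predicted_prev, step: int, rng: random.Random):
    """Pick the ground-truth frame with the annealed probability, else the model's own output."""
    if rng.random() < teacher_forcing_prob(step):
        return true_prev
    return predicted_prev


rng = random.Random(0)
early = choose_decoder_input("true_frame", "predicted_frame", step=0, rng=rng)
late = choose_decoder_input("true_frame", "predicted_frame", step=200_000, rng=rng)
```

Early in training the decoder always sees the real previous frame. Once the schedule has fully decayed it only sees its own predictions, which is the regime it faces at inference time.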

What do you think?

@CorentinJ
Owner

I think the issue is elsewhere, most likely a bug on my end or in Rayhane's work. I've talked with someone else whose work also stems from Rayhane's and he's got the same problem. Meanwhile, other implementations of tacotron/tacotron2 (mozilla, nvidia, fatchord) do not have that issue.

@TheButlah
Author

Where is fatchord's implementation? I don't see it on his github

@CorentinJ
Owner

It's included with his WaveRNN, the same I use: https://github.com/fatchord/WaveRNN

@TheButlah
Author

TheButlah commented Jul 20, 2019

Oh, I thought that was just a fork of keithito's. Regardless, I'll look into using a different implementation and/or try to figure out what's wrong with Rayhane's. Thanks for the help!

Repository owner deleted a comment from mrgloom Jul 25, 2019
Repository owner deleted a comment from mrgloom Jul 25, 2019
@TheButlah
Author

TheButlah commented Aug 15, 2019

For what it's worth, I've been working extensively on @fatchord's repo, adding improvements to it. I've trained models on it and no longer experience the gaps in the audio we have observed using Rayhane's repo. However, the synthesizer is still somewhat sensitive to sentence length, particularly long sentences. Sentences four words or more in length are fine, but once sentences start to get really long, you get the same stammering you can observe in @CorentinJ's repo. So yes, switching to @fatchord's synthesizer would probably be a big improvement, but you would also have to add multi-speaker training to it, as right now it only has single-speaker capability.

I can also confirm that it's an issue with the attention mechanism, not the stop token or anything else. @fatchord's repo just stops generating when the spectrogram frame is below a certain audio threshold. No stop tokens involved. You can also look at the attention graph and clearly see that the failure cases are due to the attention getting stuck on a particular time step and never progressing.
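
The "attention getting stuck" failure can also be checked programmatically instead of by eye. The sketch below is a hypothetical heuristic, not code from @fatchord's repo: it assumes the alignment is available as a list of per-decoder-step attention weights over encoder positions, and the `patience` value is an arbitrary choice.

```python
from typing import List, Optional


def find_stuck_attention(alignment: List[List[float]],
                         patience: int = 10) -> Optional[int]:
    """Return the first decoder step at which the attention peak has stayed on
    the same encoder position for `patience` consecutive steps, else None.
    Staying on the last encoder position is treated as a normal end of text."""
    prev_peak = None
    run_length = 0
    for t, weights in enumerate(alignment):
        peak = max(range(len(weights)), key=weights.__getitem__)
        if peak == prev_peak:
            run_length += 1
        else:
            prev_peak, run_length = peak, 1
        at_last_input = peak == len(weights) - 1
        if run_length >= patience and not at_last_input:
            return t
    return None


def one_hot(i: int, n: int) -> List[float]:
    """Toy attention row that puts all weight on encoder position i."""
    return [1.0 if j == i else 0.0 for j in range(n)]


healthy = [one_hot(i, 3) for i in (0, 0, 1, 1, 2, 2)]  # moves steadily forward
stuck = [one_hot(i, 3) for i in (0, 1, 1, 1, 1)]       # stalls on position 1
healthy_result = find_stuck_attention(healthy, patience=3)
stuck_result = find_stuck_attention(stuck, patience=3)
```

A real alignment is softer than these one-hot rows and long phonemes legitimately hold attention for several frames, so `patience` would need tuning against known-good outputs.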

@TheButlah
Author

@CorentinJ actually, on going back through my synthesized recordings from @Rayhane-mamah's repo, I haven't been able to observe any of the gaps I observe in your repo. I think it's actually unique to this repository.

@TheButlah TheButlah reopened this Aug 21, 2019
@TheButlah
Author

TheButlah commented Aug 21, 2019

200K-logs-eval.zip (Rayhane Taco2, Griffin-Lim)
Archive.zip (Fatchord Taco1, Fatchord WaveRNN)
Neither @fatchord's nor @Rayhane-mamah's repo exhibits gaps in the middle of spectrograms like this repo does.

They both exhibit failure in the case of especially long sentences, which is expected. Taco 2 appears to fare much better in this case.

@CorentinJ
Owner

Oh I'm well aware the issue is present in this repo only. It's something I must have introduced while modifying rayhane's tacotron. Considering I hate to work with that codebase, I have in mind to switch to fatchord's tacotron to try and fix this bug at the same time. But as I said, I really don't have the time to work on that now, as I have work and university projects that take priority. If someone wants to work on that in a separate branch, I can definitely look over that from time to time.

As for long sentences, it's just a matter of the attention mechanism implemented. By splitting sentences on punctuation, you're fine with most sentences anyway.
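
A minimal sketch of splitting on punctuation before synthesis, so that each chunk stays short enough for the attention mechanism. The choice of punctuation marks and the function name are assumptions made for this example, not the repository's preprocessing.

```python
import re
from typing import List


def split_on_punctuation(text: str) -> List[str]:
    """Split text after sentence or clause punctuation, keeping the punctuation."""
    parts = re.split(r"(?<=[.!?;:,])\s+", text.strip())
    return [p for p in parts if p]


chunks = split_on_punctuation("Hello there, this is a test. It has two sentences!")
# chunks == ["Hello there,", "this is a test.", "It has two sentences!"]
```

Each chunk would then be synthesized separately and the resulting audio concatenated.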

@TheButlah
Author

Makes sense! I agree that I like fatchord's synthesizer more, as it's easier to work with, although I think it would perform better qualitatively if it were tacotron 2 instead of taco1. Maybe someone will fork it at some point to upgrade it.

@ghost

ghost commented Jul 5, 2020

Thank you for referencing the issue @macriluke. I am going to reopen this issue since I have some interest in fixing it. Another possibility is that it goes away in #370 when @dathudeptrai modifies the tensorflowTTS/tacotron2 code to work with this repo.

@ghost

ghost commented Jul 19, 2020

I found a very low-tech fix for this, which is to always run "trim_long_silences" on the vocoder output. The function uses webrtcvad and is found in encoder/audio.py. Will submit a PR when I get a chance.
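
To show the idea of trimming long silences from the output, here is a simplified stand-in. It is not the repository's `trim_long_silences` and does not use webrtcvad: it replaces voice activity detection with a plain RMS energy threshold, and every name and default value in it is an assumption for illustration.

```python
from typing import List


def drop_long_silences(samples: List[float],
                       sample_rate: int = 16000,
                       frame_ms: int = 30,
                       threshold: float = 0.01,
                       max_silence_frames: int = 6) -> List[float]:
    """Keep voiced frames and at most `max_silence_frames` consecutive quiet frames."""
    frame_len = sample_rate * frame_ms // 1000
    out: List[float] = []
    silent_run = 0
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        rms = (sum(x * x for x in frame) / len(frame)) ** 0.5
        if rms < threshold:
            silent_run += 1
            if silent_run > max_silence_frames:
                continue  # drop the rest of a long quiet stretch
        else:
            silent_run = 0
        out.extend(frame)
    return out


# Toy signal: 2 voiced frames, 10 silent frames, 2 voiced frames (10 samples each).
signal = [0.5] * 20 + [0.0] * 100 + [0.5] * 20
trimmed = drop_long_silences(signal, sample_rate=1000, frame_ms=10,
                             max_silence_frames=3)
```

Short pauses survive, so natural rhythm is kept, while a long gap is cut down to `max_silence_frames` frames.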

@Choons

Choons commented Aug 6, 2020

An even lower-tech solution I use: insert "scat" words/syllables at the beginning and end of the sentence, and somehow it fixes the gaps. For instance, the sentence "I have something important to tell you" gaps terribly on its own, but "skee diddly bop I have something important to tell you action jackson" renders perfectly. Then I just trim the "scat" off in Audacity. Perhaps that can provide a hint about what is wrong in the code.

@ghost ghost added the bug Something isn't working label Aug 13, 2020
@ghost

ghost commented Aug 25, 2020

Confirm that the issue of gaps in spectrograms will be resolved if we merge fatchord's tacotron1 in #472. The presence of gaps depends on the training data. I get no gaps when training with VCTK, and plenty of gaps with LibriTTS.

@ghost

ghost commented Sep 2, 2020

> The presence of gaps depends on the training data. I get no gaps when training with VCTK, and plenty of gaps with LibriTTS.

As mentioned in #472 (comment), the gaps in LibriSpeech/TTS can be resolved by using voice activity detection to trim silences. See #501 for the process.

@ghost

ghost commented Sep 30, 2020

Would like to highlight this again:

> The presence of gaps depends on the training data.

Trained a new synthesizer with a curated dataset, in #538 (tensorflow) and #472 (comment) (pytorch). This fixes the issue with gaps.

@ghost ghost closed this as completed Sep 30, 2020
@Choons

Choons commented Sep 30, 2020

Wow, bluefish, you have done some incredible work on this! Can you clarify-- do we need to add BOTH the code from #538 and #472 , or do we choose just one of either? ie a tensorflow solution versus a pytorch solution.

And if it's a choice between the two solutions, which one do you recommend as best performing?

@ghost

ghost commented Sep 30, 2020

> Can you clarify-- do we need to add BOTH the code from #538 and #472 , or do we choose just one of either? ie a tensorflow solution versus a pytorch solution.

Most users today will want #538 because we haven't formally switched to the pytorch synthesizer. Once #472 is merged we will update the pretrained models wiki page to point to pytorch.

> And if it's a choice between the two solutions, which one do you recommend as best performing?

They're about the same in performance. They have different quirks since the tacotron is different (tacotron 1 vs 2). In tensorflow (Rayhane-taco2) the stop token prediction sometimes fails and it synthesizes a huge silence until the decoder limit is reached. In pytorch (fatchord-taco1) the attention may get stuck on a certain character and make inference quit suddenly. Pick your poison. The attention mechanism needs to be improved.

@Choons

Choons commented Sep 30, 2020

Understood. I'm glad you have taken on improving this voice project. I have tried to use other voice cloning implementations, but could never get them working as well as this one, even with the gap problem. I will experiment with both of your solutions and report back in this post how well they work for me.

@ghost

ghost commented Sep 30, 2020

Feedback is appreciated @Choons , it's always helpful to hear from those who are using the software and models.

This issue was closed.