Can I remove the dropout on forward function? #481

Open
EuphoriaCelestial opened this issue May 4, 2021 · 17 comments

@EuphoriaCelestial

As rafaelvalle mentioned here #336 (comment), the dropout causes the Tacotron model to "say the same phrase in multiple ways". In theory this is a very interesting, innovative idea for making the voice more human-like.
But I found that it also causes problems: because of the randomness, for a single input sentence the model sometimes produces errors like skipping words, failing to end the audio, or repeating part of the sentence. It doesn't happen all the time, maybe 2-3 times out of 10 inferences, which makes it impossible to debug because I don't know when it will break.
So, the main point is that I want to remove this feature. How can I do this safely? Because rafaelvalle said I can't just set p=0 to remove it.

@ntdat017

ntdat017 commented May 4, 2021

In my opinion, you could pick one good fixed random mask to replace the dropout, so that it produces a consistent mel.

@EuphoriaCelestial

> In my opinion, you could pick one good fixed random mask to replace the dropout, so that it produces a consistent mel.

@ntdat017 can you please explain this in more detail? How can I do it?

@m-toman

m-toman commented May 5, 2021

I would say this is actually a bug turned feature, and there have been multiple attempts to get rid of dropout during inference.
See for example here: mozilla/TTS#50 (comment)
This "dropping out the dropout" (randomizing the dropout probability during training) worked for me when I tried it back then, but the results were still not really convincing. As also shown in that thread, there seems to be a batch-norm approach that works.

But honestly, I just moved on; even Google now runs experiments without attention: https://arxiv.org/abs/2010.04301
Most others already have (DurIAN, the IBM system, FastSpeech, FastPitch, ForwardTacotron, etc.), and I feel that's much more robust than messing around with attention plots and trying all kinds of monotonic attention mechanisms with obscure tricks.

@EuphoriaCelestial

EuphoriaCelestial commented May 7, 2021

> https://arxiv.org/abs/2010.04301

@m-toman where can I find an implementation of this paper? Or a TTS project without attention, as you mentioned?

@ntdat017

ntdat017 commented May 7, 2021

> https://arxiv.org/abs/2010.04301
>
> @m-toman where can I find an implementation of this paper? Or a TTS project without attention, as you mentioned?

I don't think that paper from Google has a public implementation yet.

> In my opinion, you could pick one good fixed random mask to replace the dropout, so that it produces a consistent mel.
>
> @ntdat017 can you please explain this in more detail? How can I do it?

In my approach, I sample a random boolean mask with ~50% keep probability, then swap the dropout layer in the prenet (at link) for that mask during the inference phase; of course, the boolean mask has to be chosen carefully. That way I get a consistent mel at inference time and can debug easily.
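Roughly like this (a minimal, untested sketch of what I mean, assuming this repo's Prenet with its default two 256-unit layers; the seed and names are my own):

```python
import torch
import torch.nn.functional as F

# Sample the masks ONCE and reuse them for every inference call, so the
# prenet becomes deterministic. Rescaling by 1/keep_prob mimics the usual
# inverted-dropout scaling that F.dropout applies during training.
torch.manual_seed(1234)  # arbitrary; audition a few seeds and keep the best
fixed_masks = [(torch.rand(dim) < 0.5).float() / 0.5 for dim in (256, 256)]

def prenet_forward_fixed(prenet, x):
    # Drop-in replacement for Prenet.forward at inference time.
    for linear, mask in zip(prenet.layers, fixed_masks):
        x = F.relu(linear(x)) * mask.to(x.device)
    return x
```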

@m-toman

m-toman commented May 7, 2021

Well, I like https://github.com/as-ideas/ForwardTacotron as it's rather simple and slim: no transformers, no attention, etc.

But there's also https://github.com/NVIDIA/Nemo, implementing different methods,

https://github.com/espnet/espnet a few,

and also https://github.com/TensorSpeech/TensorFlowTTS.

Most have FastSpeech though. Glow-TTS is also quite interesting.

Oh, and https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/FastPitch

Personally, I do alignment using HTK (for example, there's a script in Merlin), but there are different options.

@EuphoriaCelestial

@m-toman Thanks for those links. I want to ask a few more questions.
I've tried FastSpeech (from this repo: https://github.com/xcmyz/FastSpeech ) before, and got this error:

```
File "/FastSpeech/modules.py", line 72, in LR
    output = alignment @ x
RuntimeError: invalid argument 6: wrong matrix size at /pytorch/aten/src/THC/generic/THCTensorMathBlas.cu:499
```

Also, I don't fully understand what the alignments.zip file does. How can I generate those alignments myself, since I am working with another language? And should I use FastSpeech or FastSpeech 2? What is the difference between the two?

@EuphoriaCelestial

> In my approach, I sample a random boolean mask with ~50% keep probability, then swap the dropout layer in the prenet (at link) for that mask during the inference phase; of course, the boolean mask has to be chosen carefully. That way I get a consistent mel at inference time and can debug easily.

@ntdat017 Can I PM you for more detail on how to do this? This is a little beyond my level xD

@m-toman

m-toman commented May 7, 2021

> @m-toman Thanks for those links. I want to ask a few more questions.
> I've tried FastSpeech (from this repo: https://github.com/xcmyz/FastSpeech ) before, and got this error:
>
> File "/FastSpeech/modules.py", line 72, in LR
>     output = alignment @ x
> RuntimeError: invalid argument 6: wrong matrix size at /pytorch/aten/src/THC/generic/THCTensorMathBlas.cu:499
>
> Also, I don't fully understand what the alignments.zip file does. How can I generate those alignments myself, since I am working with another language? And should I use FastSpeech or FastSpeech 2? What is the difference between the two?

I think this repo expects alignments from an external source, e.g. extracted from a trained Tacotron 2. Not sure how the others do it; I think the ForwardTacotron repo now has another method.
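For context, that failing line is FastSpeech's length regulator. A toy example of the shapes it expects (illustrative only; all dimensions made up):

```python
import torch

B, T_text, T_mel, D = 1, 6, 20, 256
x = torch.randn(B, T_text, D)               # encoder outputs, one vector per input token
alignment = torch.zeros(B, T_mel, T_text)   # one-hot rows: which token each frame comes from

durations = [3, 4, 2, 5, 3, 3]              # per-token durations, must sum to T_mel
frame = 0
for tok, dur in enumerate(durations):
    alignment[0, frame:frame + dur, tok] = 1.0
    frame += dur

output = alignment @ x                       # (B, T_mel, D): token `tok` repeated `dur` times
```

So the RuntimeError usually means the precomputed alignment's T_text doesn't match the text your pipeline produced, which is why external alignments for English won't fit another language.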

@ntdat017

ntdat017 commented May 7, 2021

> In my approach, I sample a random boolean mask with ~50% keep probability, then swap the dropout layer in the prenet (at link) for that mask during the inference phase; of course, the boolean mask has to be chosen carefully. That way I get a consistent mel at inference time and can debug easily.
>
> @ntdat017 Can I PM you for more detail on how to do this? This is a little beyond my level xD

@EuphoriaCelestial Sure, you can PM me at ntdat017@gmail.com.

> And should I use FastSpeech or FastSpeech 2? What is the difference between the two?

I think you could use FastSpeech 2; it's easier to train.

> But honestly, I just moved on; even Google now runs experiments without attention: https://arxiv.org/abs/2010.04301
> Most others already have (DurIAN, the IBM system, FastSpeech, FastPitch, ForwardTacotron, etc.), and I feel that's much more robust than messing around with attention plots and trying all kinds of monotonic attention mechanisms with obscure tricks.

@m-toman In my experiments, most current non-autoregressive models perform worse than autoregressive ones. What has your experience been?

@EuphoriaCelestial

> @m-toman In my experiments, most current non-autoregressive models perform worse than autoregressive ones. What has your experience been?

I have a question: which models are non-autoregressive and which are autoregressive?

@Syed044

Syed044 commented May 21, 2021

> In my opinion, you could pick one good fixed random mask to replace the dropout, so that it produces a consistent mel.

Can you reply to my question?

Hi,

I'm new to deep learning, and I need to understand three things about this project. Please excuse my clumsy questions, but I need to know the answers.

1. Can I train on my own dataset, which is in the Hindi language with the text in Latin script (Hindi written in English)?
2. python train.py --output_directory=outdir --log_directory=logdir (what path for the dataset? Where do I define the path to my dataset?)
3. After completing the training, which I assume will give me a checkpoint file, how do I use it or get a pretrained .pt file?
4. One last question, since I'm new to this: I have 2 RTX 3090s with NVLink, and I'm using Windows 10 and Anaconda. How do I use both GPUs to train?

Please answer these questions.

Regards,
Sid


@EuphoriaCelestial

> Can I train on my own dataset, which is in the Hindi language with the text in Latin script (Hindi written in English)?

Of course. Just change the character list in text/symbols.py and text/cmudict.py to make sure every character in your dataset is included, and change the cleaner and some file paths in hparams.py (just start with the basic cleaner). Create a dataset with the same folder structure as LJSpeech and you are good to go.
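For example, the edit to text/symbols.py could look something like this (a hypothetical sketch, not the repo's actual file; _extra is a placeholder for whatever characters your transcripts use):

```python
# text/symbols.py (sketch) -- extend the character set for your language
_pad = '_'
_punctuation = "!'(),.:;? "
_letters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
_extra = 'ñāīū'  # placeholder: add every extra character your dataset uses

symbols = [_pad] + list(_punctuation) + list(_letters) + list(_extra)
```

Each line of the train/validation filelists then follows the LJSpeech format: path/to/audio.wav|transcript.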

> python train.py --output_directory=outdir --log_directory=logdir (what path for the dataset? Where do I define the path to my dataset?)

In hparams.py.
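Specifically these fields (names as in this repo's hparams.py, if I recall correctly; the filelist paths are just examples):

```python
# hparams.py (excerpt, example values)
training_files='filelists/my_dataset_train_filelist.txt',
validation_files='filelists/my_dataset_val_filelist.txt',
text_cleaners=['basic_cleaners'],
```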

> After completing the training, which I assume will give me a checkpoint file, how do I use it or get a pretrained .pt file?

Just use the checkpoint file; there's no need to export a .pt file, they are basically the same format.
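For example, loading one for inference looks roughly like this (a sketch; the 'state_dict' key follows this repo's checkpoint layout, and the checkpoint path is made up):

```python
import torch
from hparams import create_hparams
from model import Tacotron2

hparams = create_hparams()
model = Tacotron2(hparams).cuda().eval()

# Checkpoints saved by train.py are plain torch dicts with the weights
# stored under 'state_dict'.
checkpoint = torch.load('outdir/checkpoint_10000', map_location='cpu')
model.load_state_dict(checkpoint['state_dict'])
```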

> I have 2 RTX 3090s with NVLink, and I'm using Windows 10 and Anaconda. How do I use both GPUs to train?

Enable distributed training in hparams.py.
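That is, set distributed_run=True (the flag name in this repo's hparams.py) and launch training through the repo's multiproc wrapper, as in the README:

```
python -m multiproc train.py --output_directory=outdir --log_directory=logdir --hparams=distributed_run=True
```

One caveat: as far as I know, this repo's distributed setup uses the NCCL backend, which doesn't support Windows, so multi-GPU training may require Linux (or WSL) rather than Windows 10.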

@Syed044

Syed044 commented May 21, 2021

Thank you very much for the quick reply. I really appreciate it.

@sabat84

sabat84 commented Apr 11, 2022

@EuphoriaCelestial Hi. I am using NVIDIA Tacotron 2 to train on my own data (20 hours of Kurdish speech), which is a different language from English. I have some questions:

  1. Should I use the pre-trained English model to train my model, or do I have to train from scratch? I tried 3 times to train from scratch with batch size 40, but the model didn't converge.
  2. I changed the character list in text/symbols.py but didn't change the valid_symbols list in text/cmudict.py. Is that a problem?

@EuphoriaCelestial

> Kurdish

  1. Technically, you can use the pre-trained English model to start with any language. In some rare situations you will encounter audio quality problems, word skipping, loops, ... but most of the time it has worked well for me.
  2. I'm not sure what you mean, but you only need to change the character list in text/symbols.py; the text/cmudict.py file is just an extra, only required for English in this case.
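If you do warm-start from the released English checkpoint, the repo's README uses the --warm_start flag, something like this (checkpoint filename as in the README):

```
python train.py --output_directory=outdir --log_directory=logdir -c tacotron2_statedict.pt --warm_start
```

As far as I remember, --warm_start loads only the model weights (not the optimizer state) and skips the layers listed in ignore_layers, so the embedding can be reinitialized for a new character set.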
