
Cannot use WaveGAN with Glow-TTS and Nividia's Tacotron2 #169

Closed
Charlottecuc opened this issue Jun 23, 2020 · 31 comments
Labels
question Further information is requested

Comments

@Charlottecuc

Hi. I trained the Tacotron2 model (https://github.com/NVIDIA/tacotron2) and the Glow-TTS model (https://github.com/jaywalnut310/glow-tts) on the LJ Speech dataset and can successfully synthesize speech using WaveGlow as the vocoder.
However, when I switched to Parallel WaveGAN, the synthesized waveform is quite strange:

[Screenshots: synthesized waveform and spectrograms]

(At training time, the hop_size, sample_rate, and window_size were set to the same values for the Tacotron2, WaveGlow, and WaveGAN models.)

I successfully synthesized speech using WaveGAN with ESPnet's FastSpeech, but I failed to synthesize intelligible speech with WaveGAN from any model derived from Nvidia's Tacotron2 implementation (e.g. Glow-TTS). Could you please give me any advice?
(Because Nvidia's Tacotron2 applies no CMVN to the input mel-spectrogram features, I did not compute CMVN statistics over the training waves and did not invert them at inference time.)

Thank you very much!

@kan-bayashi added the question label (Further information is requested) on Jun 24, 2020
@kan-bayashi
Owner

Let me confirm some points:

  • Did you check the range of the mel basis? We use 80-7600.
  • I'm not sure, but I use librosa to extract the mel-spectrogram. Nvidia's extractor may be different, and this could cause the mismatch.
  • At inference time, how did you perform normalization? txt -> [Taco2] -> Mel -> ? -> [PWG]?

@Charlottecuc
Author

Charlottecuc commented Jun 24, 2020

Thank you for your reply.

  • Yes, I checked the range of the mel basis. It's 80-7600.

  • Nvidia uses scipy.io.wavfile.read to read wave files, and librosa is not used if input_data.device.type == "cuda" (otherwise, librosa is used). (https://github.com/jaywalnut310/glow-tts/blob/00a482d06ebbffbd3518a43480cd79e7b47ebbe2/stft.py#L78)
    [Screenshots of the relevant code]

  • Since I used different datasets for the Text2Mel and Mel2Wav models, I first read the stats.h5 of the pretrained WaveGAN, took new_mean and new_square from it, and then computed (Mel - new_mean) / new_square.

Thanks!

@kan-bayashi
Owner

kan-bayashi commented Jun 24, 2020

I checked the following code:
https://github.com/jaywalnut310/glow-tts/blob/00a482d06ebbffbd3518a43480cd79e7b47ebbe2/commons.py#L164-L181
It seems that it does not apply log10 to the mel-spectrogram; it performs dynamic range compression instead, which uses a natural log. We use log10:

return np.log10(np.maximum(eps, np.dot(spc, mel_basis.T)))

Why don't you try the following procedure?

txt -> [Taco2] -> Mel -> [de-compression] -> [log10] -> [cmvn] -> [PWG]
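
As a side note, since Nvidia's dynamic range compression is essentially a natural log with a small clamp, the [de-compression] -> [log10] pair amounts to a change of base. A minimal sketch, assuming mel_ln holds the natural-log-compressed mel from Taco2/Glow-TTS:

import numpy as np

# log10(exp(mel_ln)) == mel_ln / ln(10), up to the clamp epsilon used in the compression
mel_log10 = mel_ln / np.log(10)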

@seantempesta

I'm having the same problem, but I don't understand the [cmvn] step you referenced above. I tried de-compressing the mel and then applying log10, and I can kind of make out words in the audio, but it's super noisy. Would you mind elaborating?

Here's what I've got:

import numpy as np
import torch

from audio_processing import dynamic_range_decompression

# generate the MEL using Glow-TTS
with torch.no_grad():
    noise_scale = .667
    length_scale = 1.0
    (c, *r), attn_gen, *_ = model(x_tst, x_tst_lengths, gen=True, noise_scale=noise_scale, length_scale=length_scale)
   
# Decompress and log10 the output
decompressed = dynamic_range_decompression(c)
decompressed_log10 = np.log10(decompressed.cpu()).cuda()

# Run the PWG vocoder and play the output
with torch.no_grad():
    xx = (decompressed_log10,)
    y = pqmf.synthesis(vocoder(*xx)).view(-1)      
    
from IPython.display import display, Audio
display(Audio(y.view(-1).cpu().numpy(), rate=config["sampling_rate"]))

@kan-bayashi
Owner

[cmvn] means mean-variance normalization using the stats.h5 of the PWG model.
In your case:

import numpy as np
from parallel_wavegan.utils import read_hdf5

# load PWG statistics
mu = read_hdf5("/path/to/stats.h5", "mean")
var = read_hdf5("/path/to/stats.h5", "scale")
sigma = np.sqrt(var)

# mean-var normalization
decompressed_log10_norm = (decompressed_log10 - mu) / sigma

# then input to vocoder
...  

@seantempesta

seantempesta commented Jun 25, 2020

@kan-bayashi You are amazing! For anyone else running into this, you have to change the tensor shapes for mu and var to get this to work. This is what I did (please let me know if this isn't right):

import numpy as np
import torch

from audio_processing import dynamic_range_decompression
from parallel_wavegan.utils import read_hdf5

# config
stats_path = '/path/to/stats.h5'

# generate the MEL using Glow-TTS
with torch.no_grad():
    noise_scale = .667
    length_scale = 1.0
    (c, *r), attn_gen, *_ = model(x_tst, x_tst_lengths, gen=True, noise_scale=noise_scale, length_scale=length_scale)
   
# Decompress and log10 the output
decompressed = dynamic_range_decompression(c)
decompressed_log10 = np.log10(decompressed.cpu()).cuda()

# mean-var normalization
mu = read_hdf5(stats_path, "mean")
var = read_hdf5(stats_path, "scale")
sigma = np.sqrt(var)
decompressed_log10_norm = (decompressed_log10 - torch.from_numpy(mu).view(1, -1, 1).cuda()) / torch.from_numpy(sigma).view(1, -1, 1).cuda()

# Run the PWG vocoder and play the output
with torch.no_grad():
    xx = (decompressed_log10_norm,)
    y = pqmf.synthesis(vocoder(*xx)).view(-1)      
    
from IPython.display import display, Audio
display(Audio(y.view(-1).cpu().numpy(), rate=config["sampling_rate"]))
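
(The .view(1, -1, 1) reshape simply broadcasts the per-mel-bin statistics over the (batch, bins, frames) layout of the Glow-TTS output.)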

@kan-bayashi
Owner

@seantempesta Great!

@kan-bayashi
Owner

kan-bayashi commented Jun 25, 2020

Now we can combine with Nvidia's Tacotron2-based models.
I will close this issue and write a note in the README.

@Charlottecuc changed the title from "Cannot use WaveGAN with Glow-TTS and Nivida's Tacotron2" to "Cannot use WaveGAN with Glow-TTS and Nividia's Tacotron2" on Jun 28, 2020
@Charlottecuc
Author

Charlottecuc commented Jun 28, 2020

Hi @seantempesta. I copied your code and got the following error:

TypeError                                 Traceback (most recent call last)
<ipython-input-60-39f0bece3260> in <module>
     18 with torch.no_grad():
     19     xx = (decompressed_log10_norm,)
---> 20     y = pqmf.synthesis(vocoder(*xx)).view(-1)

/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    545             result = self._slow_forward(*input, **kwargs)
    546         else:
--> 547             result = self.forward(*input, **kwargs)
    548         for hook in self._forward_hooks.values():
    549             hook_result = hook(self, input, result)

TypeError: forward() missing 1 required positional argument: 'c'

Could you give me some advice? (e.g. In pqmf.synthesis(vocoder(*xx)), what does your vocoder function look like? Why don't you use use_noise_input as in https://github.com/kan-bayashi/ParallelWaveGAN/blob/1f5899732f78aac3883441c191b0870466a420f0/parallel_wavegan/bin/decode.py?)
Thanks!

@kan-bayashi
Owner

@Charlottecuc There are three vocoder models in this repository: PWG, MelGAN, and multi-band MelGAN.
The input is different for each model:

  • PWG: mel-spectrogram and noise
  • MelGAN and multi-band MelGAN: mel-spectrogram only

And only multi-band MelGAN needs the PQMF filter as post-processing to convert the 4-channel signal into a 1-channel signal.

@seantempesta used multi-band MelGAN, so the input is only c and pqmf.synthesis is applied.
In your case, @Charlottecuc, you want to use PWG, so you do not need PQMF.
Please remove pqmf.synthesis.
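
For reference, a minimal sketch of PWG inference with the noise input (assuming, as in decode.py, that the generator's forward takes (noise, mel), that c_norm is the normalized mel of shape (1, 80, T), and that hop_size matches the vocoder config):

import torch

with torch.no_grad():
    # PWG also conditions on a noise sequence of length T * hop_size
    z = torch.randn(1, 1, c_norm.size(-1) * hop_size).to(c_norm.device)
    y = vocoder(z, c_norm).view(-1)  # single-band PWG: no pqmf.synthesis needed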

@Charlottecuc
Author

@kan-bayashi Great!!!!! Thank you for your advice.

@kan-bayashi
Owner

I've never encountered such a problem.
Maybe something is wrong on your side (e.g., the sampling rate).

@Charlottecuc
Author

Solved. Thank you :)

@ly1984

ly1984 commented Jul 24, 2020

@kan-bayashi the inferred speech has lots of noise, could you please take a look?

@kan-bayashi
Owner

kan-bayashi commented Jul 24, 2020

@ly1984 Please check the hyperparameters of the mel-spectrogram extraction. Maybe you are using different fmax and fmin values.

@ly1984

ly1984 commented Jul 24, 2020

I've found that WaveGAN uses fmin: 80, fmax: 7600 while Glow-TTS uses "mel_fmin": 0.0, "mel_fmax": 8000.0.
Should I retrain the Glow-TTS model with the same parameters as WaveGAN?

@kan-bayashi
Owner

kan-bayashi commented Jul 24, 2020

Yes. You need to retrain PWG or Glow-TTS to match the configuration.
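
To see why a simple rescaling cannot fix this, here is a hypothetical check (assuming an LJSpeech-style 22.05 kHz / 1024-point FFT setup): the mel filterbanks built with the two fmin/fmax settings are different matrices, so the resulting features live in different spaces:

import librosa
import numpy as np

mel_pwg = librosa.filters.mel(sr=22050, n_fft=1024, n_mels=80, fmin=80, fmax=7600)
mel_glow = librosa.filters.mel(sr=22050, n_fft=1024, n_mels=80, fmin=0.0, fmax=8000.0)
print(np.abs(mel_pwg - mel_glow).max())  # clearly non-zero: the filterbanks do not match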

@rijulg

rijulg commented Aug 19, 2020

@kan-bayashi I am trying to use your models with the mel-spectrogram output from Nvidia's models, and although the methods suggested above produce some results, they are rather lackluster. Here's a Colab notebook of the experiment:
https://colab.research.google.com/drive/1uOLIzWHF4FbRScuIeUQiMQyCE_zpphBe?usp=sharing
In the experiment I have the same audio processed in the following ways:

  • audio -> pwgan_mel -> pwgan_inference (vctk_multi_band_melgan.v2) -> audio [OK]
  • audio -> pwgan_mel -> pwgan_inference (libritts_parallel_wavegan.v1.long) -> audio [Good]
  • audio -> tacotron2_mel -> pwgan_inference (libritts_parallel_wavegan.v1.long) -> audio [Terrible]
  • audio -> tacotron2_mel -> mean-var normalization -> pwgan_inference (libritts_parallel_wavegan.v1.long) -> audio [Poor]

I observe the following from the experiment:

  • vctk_multi_band_melgan.v2 trained model performs slightly worse than libritts_parallel_wavegan.v1.long on my selected audio.
  • tacotron2_mel output does not work with the trained models, as expected
  • tacotron2_mel output with mean-var normalization does not produce high quality outputs
  • tacotron2_mel lengths are (1200/1024) times that of pwgan_mel (this is expected because of the window size difference)

Can you please comment on, and help me identify, any problems with:

  1. the cause of the quality difference between the two trained models
    • I suspect it is a combination of the dataset used for training and the total number of training steps
  2. the reason for the poor quality of tacotron2_mel even after mean-var normalization
    • I suspect the conversion cannot compensate for the difference in fft_size and window_size

@kan-bayashi
Owner

kan-bayashi commented Aug 19, 2020

@rijulg Did you check this comment?
#169 (comment)
The base of log is different, so you need to convert [taco2_mel] -> exp -> log10 -> mean-var norm.

And of course, for the best quality, you need to match the feature extraction settings (e.g. FFT size, shift).

@rijulg

rijulg commented Aug 19, 2020

@kan-bayashi yes, I am indeed doing the log base conversion; I guess I (mistakenly) considered the log conversion to be part of your mean-var normalization process, so I did not mention it separately.

@kan-bayashi
Owner

In your code, the range of the mel basis is different.
That is the reason for the quality degradation.

@rijulg

rijulg commented Aug 19, 2020

Ah, alright. Just to confirm, there is no way of scaling, right? Leaving retraining the models as the only option?

@kan-bayashi
Owner

Unfortunately, you need to retrain :(

@Zarbuvit

Zarbuvit commented Oct 1, 2020

@seantempesta I tried your fix for Glow-TTS inference with multi-band MelGAN using the Mozilla-TTS multi-band MelGAN, and though it did take away the noisy background, it left me with garbled words.
Did this ever happen to you?
I tried, and would prefer, to use the multi-band MelGAN model provided by kan-bayashi, but I couldn't figure out how to load the model properly and kept running into tensor size incompatibility issues. If you could share how you got vocoder in your inference line y = pqmf.synthesis(vocoder(*xx)).view(-1), I would be very grateful - it will probably solve what I have been struggling with for the past week.

@Zarbuvit

Zarbuvit commented Oct 8, 2020

I ended up trying another method provided in the repo https://github.com/rishikksh20/melgan and got the same garbled voice results.
I then realized that I had mixed up my training and inference models, and that was the cause of the garbling. Once I fixed this, everything worked perfectly!
This is all to say: I am not sure whether that would have solved the garbling problem I had here (I didn't check), but I am almost certain that @seantempesta's code did actually work for me and any problem I had was on my side of things.
I am sorry for any problems or wasted time I may have caused.

@lucashueda

Since the sound itself is not affected by wav normalization (audio /= (1 << (16 - 1))), is there a way to use a PWGAN trained without wav norm to synthesize the output of a Tacotron2 model trained with wav norm?

Additionally, does anyone know whether wav norm is needed for Tacotron2 to converge? I tried training without wav norm to match an internal PWGAN trained without wav norm, but Tacotron2 ran for 30k steps without attention alignment.

@kan-bayashi
Owner

@lucashueda I did not understand the 'wav norm' you mentioned. Did you use wavs on the raw 16-bit integer scale (-32768 to 32767)?

@lucashueda

lucashueda commented Nov 23, 2020

@lucashueda I did not understand the 'wav norm' you mentioned. Did you use wavs on the raw 16-bit integer scale (-32768 to 32767)?

With "wav norm" I mean audio /= (1 << (16 - 1)) to scale a 16-bit PCM file to between -1 and +1. But I realized that different wav readers read these files differently. I was just confused by the bin/preprocess.py file: if I pass the input_dir argument it simply calls load_wav to read a wav file, but if I pass a Kaldi-style file it performs the wav norm. As I saw, though, the soundfile package already performs the normalization.

@kan-bayashi
Owner

Both cases will normalize the audio to between -1 and 1.
The 'wav norm' is needed since kaldiio output is not normalized.
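
As a minimal illustration (a hypothetical snippet, not from the repo): soundfile already returns floats in [-1, 1], while raw int16 samples such as those returned by kaldiio still need the 1 / 2**15 scaling:

import numpy as np
import soundfile as sf

audio_float, sr = sf.read("sample.wav")                         # float64, already in [-1, 1]
audio_int16 = (audio_float * (2 ** 15)).astype(np.int16)        # simulate unnormalized int16 samples
audio_norm = audio_int16.astype(np.float32) / (1 << (16 - 1))   # the 'wav norm' step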

@wizardk

wizardk commented Apr 27, 2021

[cmvn] means mean-variance normalization using the stats.h5 of the PWG model.
In your case:

# load PWG statistics
mu = read_hdf5("/path/to/stats.h5", "mean")
var = read_hdf5("/path/to/stats.h5", "scale")
sigma = np.sqrt(var)

# mean-var normalization
decompressed_log10_norm = (decompressed_log10 - mu) / sigma

# then input to vocoder
...  

Hi @kan-bayashi, why do you need to do np.sqrt(var)? In compute_statistics.py, you saved scale_ instead of var_.

@kan-bayashi
Owner

kan-bayashi commented Apr 28, 2021

Oh, scale is the standard deviation, so we should use scale directly as sigma here.
Thank you for pointing that out.
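
With that correction, the normalization sketched earlier simplifies to using scale directly (a minimal sketch in the same style as above; reshape mu and sigma to (1, n_mels, 1) as in @seantempesta's snippet when normalizing a (B, C, T) tensor):

from parallel_wavegan.utils import read_hdf5

# load PWG statistics; scale is already the standard deviation, so no np.sqrt is needed
mu = read_hdf5("/path/to/stats.h5", "mean")
sigma = read_hdf5("/path/to/stats.h5", "scale")

# mean-var normalization
decompressed_log10_norm = (decompressed_log10 - mu) / sigma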
