Not compatible with nvidia-tacotron? #175

Closed
chazo1994 opened this issue Jun 30, 2020 · 11 comments
Labels
question Further information is requested

Comments

@chazo1994

chazo1994 commented Jun 30, 2020

I trained a Multiband-MelGAN model and integrated it with the NVIDIA Tacotron 2 model; I also used this comment to make it work. But the resulting voice is bad, with discontinuous pitch. The mel-spectrograms below show the difference between the output wave files of tacotron2+waveglow and tacotron2+MB_melgan (tacotron2+waveglow has great audio output). I tried replacing the preprocessing of this repo with that of the nvidia-tacotron2 repo, but the result is the same.

Tacotron2+waveglow:
waveglow

Tacotron2+Mb_Melgan:
mbmelgan

Tacotron2+Mb_Melgan (Replaced preprocess):
mbmelgan_replaced_preprocess

I have also attached the resulting audio.
results.zip

@kan-bayashi Do you have any idea how to fix this problem?

@kan-bayashi kan-bayashi added the question Further information is requested label Jun 30, 2020
@kan-bayashi
Owner

I want to clarify what the problem is.
When you synthesize the audio with natural features, how is the quality?
If the quality is still bad, we need to tune the hyperparameters of the MB-MelGAN training.

@chazo1994
Author

When you synthesize the audio with natural features, how is the quality?

The quality of the audio generated from natural features is very good.

@kan-bayashi
Owner

Could you share a sample of MB-MelGAN with natural features?
If the audio sounds good, I think there is something mismatched between the models.
Please describe the feature extraction settings.

@chazo1994
Author

@kan-bayashi OK, I will report back tomorrow. I hope you can help me.

@chazo1994
Author

chazo1994 commented Jul 1, 2020

@kan-bayashi
Here is a sample of MB-MelGAN with natural features (the audio sounds very good):
sample.zip
Mel-spectrogram of the sample:
melFromnatural
I also compared the audio and mel-spectrogram output of nvidia-tacotron2 in the training phase with the input of MB-MelGAN in the training phase. Both the audio and the mel-spectrogram are the same between nvidia-tacotron2 and MB-MelGAN if I replace the MB-MelGAN preprocessing stage.
diff

The Feature extraction setting of MB-MelGAN:

###########################################################
#                FEATURE EXTRACTION SETTING               #
###########################################################
sampling_rate: 22050     # Sampling rate.
fft_size: 1024           # FFT size.
hop_size: 256            # Hop size.
win_length: 1024         # Window length.
                         # If set to null, it will be the same as fft_size.
window: "hann"           # Window function.
num_mels: 80             # Number of mel basis.
fmin: 80                 # Minimum freq in mel basis calculation.
fmax: 7600               # Maximum frequency in mel basis calculation.
global_gain_scale: 1.0   # Will be multiplied to all of waveform.
trim_silence: true       # Whether to trim the start and end of silence.
trim_threshold_in_db: 60 # Need to tune carefully if the recording is not good.
trim_frame_size: 2048    # Frame size in trimming.
trim_hop_size: 512       # Hop size in trimming.
format: "hdf5"           # Feature file format. "npy" or "hdf5" is supported.
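
Since the question at this point is whether the text2mel model and the vocoder see the same features, one way to check is to diff the two feature settings key by key. The sketch below is only illustrative: the `tacotron2` values are assumed to be the default hparams of the NVIDIA Tacotron 2 repository (renamed to this repo's keys) and should be verified against your own checkout, and `diff_settings` is a hypothetical helper that belongs to neither project.

```python
# Feature settings copied from the MB-MelGAN config quoted above.
mb_melgan = {
    "sampling_rate": 22050,
    "fft_size": 1024,
    "hop_size": 256,
    "win_length": 1024,
    "num_mels": 80,
    "fmin": 80,
    "fmax": 7600,
}

# ASSUMPTION: default NVIDIA Tacotron 2 hparams, renamed to the keys above.
# Check these against the hparams file you actually trained with.
tacotron2 = {
    "sampling_rate": 22050,
    "fft_size": 1024,    # filter_length
    "hop_size": 256,     # hop_length
    "win_length": 1024,
    "num_mels": 80,      # n_mel_channels
    "fmin": 0.0,         # mel_fmin
    "fmax": 8000.0,      # mel_fmax
}


def diff_settings(a, b):
    """Return {key: (a_value, b_value)} for every key whose values differ."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


if __name__ == "__main__":
    for key, (voc, t2m) in diff_settings(mb_melgan, tacotron2).items():
        print(f"{key}: vocoder={voc} text2mel={t2m}")
```

Any key reported here means the vocoder was trained on features the text2mel model does not produce.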

@kan-bayashi
Owner

kan-bayashi commented Jul 1, 2020

I also compared the audio and mel-spectrogram output of nvidia-tacotron2 in the training phase with the input of MB-MelGAN in the training phase. Both the audio and the mel-spectrogram are the same between nvidia-tacotron2 and MB-MelGAN if I replace the MB-MelGAN preprocessing stage.

OK. How did you perform normalization?
Did you use normalized features for both the text2mel and vocoder models?

@chazo1994
Author

OK. How did you perform normalization?

In the case where I use this comment, I keep the normalized features in the training phase of MB-MelGAN and use the original preprocessing of nvidia-tacotron2. In the inference phase, I generate a mel-spectrogram from tacotron2, convert it with that code to make it compatible with MelGAN, and finally generate audio from the converted mel-spectrogram.

In the case where I replace the preprocessing of MB-MelGAN with the nvidia-tacotron2 preprocessing, I remove the normalization procedure of MB-MelGAN in both the training and inference stages.
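
The two setups described here differ in whether the vocoder expects standardized features. A minimal sketch of that step follows, assuming per-bin mean and scale statistics computed on the vocoder's training set. The function names and the numbers in the usage example are hypothetical, and this is not the conversion code referenced in the linked comment.

```python
def standardize(frames, mean, scale):
    """Apply per-bin (x - mean) / scale to a list of mel frames."""
    return [
        [(x - m) / s for x, m, s in zip(frame, mean, scale)]
        for frame in frames
    ]


def destandardize(frames, mean, scale):
    """Invert standardize(): x * scale + mean per mel bin."""
    return [
        [x * s + m for x, m, s in zip(frame, mean, scale)]
        for frame in frames
    ]


if __name__ == "__main__":
    # Hypothetical statistics for a 2-bin "mel" frame, for illustration only.
    mean, scale = [-2.0, -3.0], [0.5, 2.0]
    normalized = standardize([[-1.0, 1.0]], mean, scale)
    print(normalized)
    print(destandardize(normalized, mean, scale))
```

If the vocoder was trained on standardized features, text2mel output has to go through the same statistics before synthesis. If normalization was removed from training, it has to be removed from inference as well.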

In addition, I generated audio from one mel-spectrogram output of nvidia-tacotron2 with both WaveGlow and MB-MelGAN, and I see that the pulse of the MB-MelGAN output audio is not continuous:
pulsembmelgan-waveglow

@kan-bayashi
Owner

kan-bayashi commented Jul 1, 2020

In the case where I replace the preprocessing of MB-MelGAN with the nvidia-tacotron2 preprocessing, I remove the normalization procedure of MB-MelGAN in both the training and inference stages.

OK. Then, did you use the same files to train the vocoder and the model?
If you just replaced the function, please try to generate audio using the exact mel-spectrogram file that was used for training tacotron2.

In addition, I generated audio from one mel-spectrogram output of nvidia-tacotron2 with both WaveGlow and MB-MelGAN, and I see that the pulse of the MB-MelGAN output audio is not continuous:

What is the difference compared to the sample you shared?
When I listened to your samples, the audio quality was clearly different between GT and generated features.
So I suspect that there is a bug in your code.
But if the quality degradation is reasonable, it may be a problem with MB-MelGAN.

@chazo1994
Author

If you just replaced the function, please try to generate audio using the exact mel-spectrogram file that was used for training tacotron2.

I did it this way and got bad audio quality (the same quality as the tacotron2+MB-MelGAN result), so I will debug this point and report the results here. Please wait for my response.

When I listened to your samples, the audio quality was clearly different between GT and generated features.

The quality of the GT audio and of the audio generated by MB-MelGAN from natural features is the same.

@kan-bayashi
Owner

I did it this way and got bad audio quality (the same quality as the tacotron2+MB-MelGAN result), so I will debug this point and report the results here. Please wait for my response.

Then, there is a bug in your code.
Please carefully check the difference (e.g., log_e vs log_10).
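
A log-base mismatch of the kind mentioned here is a constant scale factor, so it can be converted exactly. The sketch below assumes that one pipeline emits natural-log mel values and the other expects log10. Which repository uses which base is an assumption to verify in your own preprocessing code, and the function names are hypothetical.

```python
import math

LN10 = math.log(10.0)


def ln_to_log10(frames):
    """Convert natural-log mel values to log10: log10(x) = ln(x) / ln(10)."""
    return [[v / LN10 for v in frame] for frame in frames]


def log10_to_ln(frames):
    """Convert log10 mel values to natural log: ln(x) = log10(x) * ln(10)."""
    return [[v * LN10 for v in frame] for frame in frames]


if __name__ == "__main__":
    # A single frame holding ln(100) and ln(0.001).
    frame_ln = [[math.log(100.0), math.log(0.001)]]
    print(ln_to_log10(frame_ln))
```

This only fixes the log base. Differences in clipping floors, mel filterbank range, or normalization need to be checked separately.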

@chazo1994
Author

The problem is resolved. I realized that the cause was that I had kept fmin and fmax at the default configuration of this repo, while the fmin and fmax of nvidia-tacotron2 are different.
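
To illustrate why differing fmin and fmax break compatibility, the sketch below computes the center frequencies of 80 mel bands for two ranges. The 80 to 7600 Hz range comes from the config quoted earlier in this thread. The 0 to 8000 Hz range is assumed to be the nvidia-tacotron2 default and should be checked against your hparams. The HTK mel formula is used for simplicity, and the actual filterbank implementations may use a different mel scale, so the exact numbers are only indicative.

```python
import math


def hz_to_mel(f):
    """HTK-style Hz to mel conversion (a simplifying assumption)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_centers(num_mels, fmin, fmax):
    """Center frequencies (Hz) of num_mels triangular bands in [fmin, fmax]."""
    lo, hi = hz_to_mel(fmin), hz_to_mel(fmax)
    step = (hi - lo) / (num_mels + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(num_mels)]


# Range from the MB-MelGAN config in this thread.
vocoder_centers = mel_centers(80, 80.0, 7600.0)
# ASSUMPTION: nvidia-tacotron2 default range; verify in your hparams.
tacotron_centers = mel_centers(80, 0.0, 8000.0)

if __name__ == "__main__":
    for i in (0, 39, 79):
        print(f"bin {i}: vocoder={vocoder_centers[i]:.1f} Hz, "
              f"text2mel={tacotron_centers[i]:.1f} Hz")
```

Because bin i covers a different frequency band in each setting, a vocoder trained on one range misreads mel-spectrograms produced with the other, even when every other setting matches.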
