
confused about source speaker id in style and rhythm transfer #18

Open
JeffpanUK opened this issue Dec 11, 2019 · 4 comments

@JeffpanUK

Hi, I'm a little confused about the speaker id in the reference audio and text. When doing style and rhythm transfer, the reference speaker ids given in the filelist are re-ordered as 0, 1, 2, ... See data_utils.py and the inference script:

file_idx = 0
print(dataloader.audiopaths_and_text)
audio_path, text, sid = dataloader.audiopaths_and_text[file_idx]

# get audio path, encoded text, pitch contour and mel for gst
text_encoded = torch.LongTensor(text_to_sequence(text, hparams.text_cleaners, arpabet_dict))[None, :].cuda()    
pitch_contour = dataloader[file_idx][3][None].cuda()
mel = load_mel(audio_path)
print(audio_path, text)

# load source data to obtain rhythm using tacotron 2 as a forced aligner
x, y = mellotron.parse_batch(datacollate([dataloader[file_idx]]))
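For context, the re-ordering happens when TextMelLoader builds its speaker lookup table from the filelist. I believe it is roughly this (a sketch, the exact helper in data_utils.py may differ slightly):

import numpy as np

def create_speaker_lookup_table(audiopaths_and_text):
    # collect the raw speaker ids from the "path|text|sid" entries;
    # np.unique also sorts them, and they get mapped onto 0, 1, 2, ...
    speaker_ids = np.unique([int(x[2]) for x in audiopaths_and_text])
    return {int(sid): i for i, sid in enumerate(speaker_ids)}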

In that case, take the same entry, e.g.
"audio_10|text_10|10"
appearing in two different filelists:

A.txt
audio_10|text_10|10
audio_0|text_0|0
B.txt
audio_10|text_10|10
audio_20|text_20|20

The reference speaker id (10) will be mapped to mellotron_id=1 and mellotron_id=0 respectively, which is bound to make the attention map (a.k.a. the rhythm in Mellotron) different.
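To make it concrete with that lookup logic (hypothetical numbers matching the two filelists above):

raw_ids_A = [10, 0]     # speaker ids appearing in A.txt
raw_ids_B = [10, 20]    # speaker ids appearing in B.txt

lookup_A = {sid: i for i, sid in enumerate(sorted(set(raw_ids_A)))}
lookup_B = {sid: i for i, sid in enumerate(sorted(set(raw_ids_B)))}

print(lookup_A[10])  # 1 -> mellotron id for speaker 10 when loading A.txt
print(lookup_B[10])  # 0 -> mellotron id for speaker 10 when loading B.txt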

with torch.no_grad():
    # get rhythm (alignment map) using tacotron 2
    mel_outputs, mel_outputs_postnet, gate_outputs, rhythm = mellotron.forward(x)
    rhythm = rhythm.permute(1, 0, 2)

Is this expected, or have I misunderstood something?

@rafaelvalle
Contributor

rafaelvalle commented Dec 11, 2019

What you mentioned could've happened during training, for example, when the training and validation filelists have different numbers of speakers. We circumvent this by first building a mellotron speaker ids dictionary from the training data and then using it for the validation data.
https://github.com/NVIDIA/mellotron/blob/master/train.py#L44
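Roughly, the relevant lines do something like this (paraphrasing, see the link for the exact code):

# the training filelist defines the speaker-id lookup table ...
trainset = TextMelLoader(hparams.training_files, hparams)
# ... and the validation loader reuses it, so the ids stay consistent
valset = TextMelLoader(hparams.validation_files, hparams,
                       speaker_ids=trainset.speaker_ids)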

@JeffpanUK
Author

JeffpanUK commented Dec 12, 2019

> What you mentioned could've happened during training, for example, when the training and validation filelists have different number of speakers. We circumvent this by first getting a mellotron speaker ids dictionary from the training data and using it for the validation data.
> https://github.com/NVIDIA/mellotron/blob/master/train.py#L44

Thanks for your reply. I've noticed that part of the code in training.
However, my concern is the inference stage: when we try to get the rhythm from a reference audio, we need to load the reference filelist with TextMelLoader, and I found that no speaker id dictionary is passed to TextMelLoader:

arpabet_dict = cmudict.CMUDict('data/cmu_dictionary')
audio_paths = 'data/examples_filelist.txt'
dataloader = TextMelLoader(audio_paths, hparams)
datacollate = TextMelCollate(1)

file_idx = 0
print(dataloader.audiopaths_and_text)
audio_path, text, sid = dataloader.audiopaths_and_text[file_idx]

# get audio path, encoded text, pitch contour and mel for gst
text_encoded = torch.LongTensor(text_to_sequence(text, hparams.text_cleaners, arpabet_dict))[None, :].cuda()    
pitch_contour = dataloader[file_idx][3][None].cuda()
mel = load_mel(audio_path)
print(audio_path, text)

# load source data to obtain rhythm using tacotron 2 as a forced aligner
x, y = mellotron.parse_batch(datacollate([dataloader[file_idx]]))

In this case, the mellotron speaker ids depend on which speakers appear in the reference filelist. We then call mellotron.forward to get the reference rhythm as below:

with torch.no_grad():
    # get rhythm (alignment map) using tacotron 2
    mel_outputs, mel_outputs_postnet, gate_outputs, rhythm = mellotron.forward(x)
    rhythm = rhythm.permute(1, 0, 2)

where x contains ref_text, ref_mel, ref_f0 and ref_mellotron_speaker_ids, and the generated rhythm will change for the same reference audio whenever the set of speakers in the reference filelist changes.
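I assume one could pin the mapping at inference the same way train.py does, i.e. build the lookup table from the training filelist once and pass it to the reference loader (a sketch, assuming TextMelLoader accepts the same speaker_ids argument as in train.py):

# build the lookup table from the training filelist once ...
train_loader = TextMelLoader(hparams.training_files, hparams)
# ... and reuse it for the reference filelist, instead of letting the
# reference filelist define its own 0, 1, 2, ... mapping
dataloader = TextMelLoader(audio_paths, hparams,
                           speaker_ids=train_loader.speaker_ids)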

@rafaelvalle
Contributor

During experiments, we noticed that the rhythm (alignment map) we get from Tacotron seems to be largely independent of whether the correct speaker id is provided. You can try, for example, providing different speaker ids while using Tacotron as a forced aligner and checking whether there is a significant difference.
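Something like this should work for that check (a sketch; the speaker-id tensor is assumed to sit at index 5 of x, as returned by parse_batch, so adjust the index if your version differs):

import torch

def rhythm_for_speaker(mellotron, x, speaker_id):
    # overwrite the speaker-id entry of x with a constant id,
    # then run the forced alignment and return the alignment map
    x = list(x)
    x[5] = torch.full_like(x[5], speaker_id)
    with torch.no_grad():
        _, _, _, rhythm = mellotron.forward(tuple(x))
    return rhythm.permute(1, 0, 2)

r_a = rhythm_for_speaker(mellotron, x, 0)
r_b = rhythm_for_speaker(mellotron, x, 1)
print('max abs difference:', (r_a - r_b).abs().max().item())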

@rafaelvalle
Contributor

Closing due to inactivity.
