
confused about source speaker id in style and rhythm transfer #18

Open
JeffpanUK opened this issue Dec 11, 2019 · 4 comments

@JeffpanUK

Hi, I'm a little confused about the speaker id in the reference audio and text. When doing style and rhythm transfer, the reference speaker ids given in the filelist are re-ordered as 0, 1, 2, ... See data_utils.py and the inference script:

file_idx = 0
print(dataloader.audiopaths_and_text)
audio_path, text, sid = dataloader.audiopaths_and_text[file_idx]

# get audio path, encoded text, pitch contour and mel for gst
text_encoded = torch.LongTensor(text_to_sequence(text, hparams.text_cleaners, arpabet_dict))[None, :].cuda()    
pitch_contour = dataloader[file_idx][3][None].cuda()
mel = load_mel(audio_path)
print(audio_path, text)

# load source data to obtain rhythm using tacotron 2 as a forced aligner
x, y = mellotron.parse_batch(datacollate([dataloader[file_idx]]))
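For context, the re-ordering happens when TextMelLoader builds its speaker lookup table from the filelist. I believe it is roughly this (a sketch, the exact helper in data_utils.py may differ slightly):

import numpy as np

def create_speaker_lookup_table(audiopaths_and_text):
    # collect the raw speaker ids from the "path|text|sid" entries;
    # np.unique also sorts them, and they get mapped onto 0, 1, 2, ...
    speaker_ids = np.unique([int(x[2]) for x in audiopaths_and_text])
    return {int(sid): i for i, sid in enumerate(speaker_ids)}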

In that case, take the same entry, e.g.
"audio_10|text_10|10"
appearing in two different filelists:

A.txt
audio_10|text_10|10
audio_0|text_0|0
B.txt
audio_10|text_10|10
audio_20|text_20|20

The reference speaker id (10) will be mapped to mellotron_id=1 and mellotron_id=0 respectively, which is bound to make the attention map (a.k.a. the rhythm in Mellotron) different.
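To make it concrete with that lookup logic (hypothetical numbers matching the two filelists above):

raw_ids_A = [10, 0]     # speaker ids appearing in A.txt
raw_ids_B = [10, 20]    # speaker ids appearing in B.txt

lookup_A = {sid: i for i, sid in enumerate(sorted(set(raw_ids_A)))}
lookup_B = {sid: i for i, sid in enumerate(sorted(set(raw_ids_B)))}

print(lookup_A[10])  # 1 -> mellotron id for speaker 10 when loading A.txt
print(lookup_B[10])  # 0 -> mellotron id for speaker 10 when loading B.txt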

with torch.no_grad():
    # get rhythm (alignment map) using tacotron 2
    mel_outputs, mel_outputs_postnet, gate_outputs, rhythm = mellotron.forward(x)
    rhythm = rhythm.permute(1, 0, 2)

Is this expected, or have I misunderstood something?

@rafaelvalle
Contributor

rafaelvalle commented Dec 11, 2019

What you mentioned could've happened during training, for example, when the training and validation filelists have different numbers of speakers. We circumvent this by first building a mellotron speaker ids dictionary from the training data and then using it for the validation data.
https://github.com/NVIDIA/mellotron/blob/master/train.py#L44
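Roughly, the relevant lines do something like this (paraphrasing, see the link for the exact code):

# the training filelist defines the speaker-id lookup table ...
trainset = TextMelLoader(hparams.training_files, hparams)
# ... and the validation loader reuses it, so the ids stay consistent
valset = TextMelLoader(hparams.validation_files, hparams,
                       speaker_ids=trainset.speaker_ids)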

@JeffpanUK
Author

JeffpanUK commented Dec 12, 2019

> What you mentioned could've happened during training, for example, when the training and validation filelists have different number of speakers. We circumvent this by first getting a mellotron speaker ids dictionary from the training data and using it for the validation data.
> https://github.com/NVIDIA/mellotron/blob/master/train.py#L44

Thanks for your reply. I've noticed that part of the code in training.
However, my concern is the inference stage: when we try to get the rhythm from a reference audio, we need to load the reference filelist with TextMelLoader, and I found that no speaker id dictionary is passed to TextMelLoader:

arpabet_dict = cmudict.CMUDict('data/cmu_dictionary')
audio_paths = 'data/examples_filelist.txt'
dataloader = TextMelLoader(audio_paths, hparams)
datacollate = TextMelCollate(1)

file_idx = 0
print(dataloader.audiopaths_and_text)
audio_path, text, sid = dataloader.audiopaths_and_text[file_idx]

# get audio path, encoded text, pitch contour and mel for gst
text_encoded = torch.LongTensor(text_to_sequence(text, hparams.text_cleaners, arpabet_dict))[None, :].cuda()    
pitch_contour = dataloader[file_idx][3][None].cuda()
mel = load_mel(audio_path)
print(audio_path, text)

# load source data to obtain rhythm using tacotron 2 as a forced aligner
x, y = mellotron.parse_batch(datacollate([dataloader[file_idx]]))

In this case, the mellotron speaker ids depend on which speakers appear in the reference filelist. We then call mellotron.forward to get the reference rhythm as below:

with torch.no_grad():
    # get rhythm (alignment map) using tacotron 2
    mel_outputs, mel_outputs_postnet, gate_outputs, rhythm = mellotron.forward(x)
    rhythm = rhythm.permute(1, 0, 2)

where x contains ref_text, ref_mel, ref_f0 and ref_mellotron_speaker_ids, and the generated rhythm will change for the same reference audio whenever the set of speakers in the reference filelist changes.
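I assume one could pin the mapping at inference the same way train.py does, i.e. build the lookup table from the training filelist once and pass it to the reference loader (a sketch, assuming TextMelLoader accepts the same speaker_ids argument as in train.py):

# build the lookup table from the training filelist once ...
train_loader = TextMelLoader(hparams.training_files, hparams)
# ... and reuse it for the reference filelist, instead of letting the
# reference filelist define its own 0, 1, 2, ... mapping
dataloader = TextMelLoader(audio_paths, hparams,
                           speaker_ids=train_loader.speaker_ids)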

@rafaelvalle
Contributor

During experiments, we noticed that the rhythm (alignment map) we get from Tacotron seems to be largely independent of whether the correct speaker id is provided. You can try, for example, providing different speaker ids while using Tacotron as a forced aligner and checking whether there is a significant difference.
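Something like this should work for that check (a sketch; the speaker-id tensor is assumed to sit at index 5 of x, as returned by parse_batch, so adjust the index if your version differs):

import torch

def rhythm_for_speaker(mellotron, x, speaker_id):
    # overwrite the speaker-id entry of x with a constant id,
    # then run the forced alignment and return the alignment map
    x = list(x)
    x[5] = torch.full_like(x[5], speaker_id)
    with torch.no_grad():
        _, _, _, rhythm = mellotron.forward(tuple(x))
    return rhythm.permute(1, 0, 2)

r_a = rhythm_for_speaker(mellotron, x, 0)
r_b = rhythm_for_speaker(mellotron, x, 1)
print('max abs difference:', (r_a - r_b).abs().max().item())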

@rafaelvalle
Contributor

Closing due to inactivity.
