Due diligence
I have done my due diligence in trying to find the answer myself.
Topic
The paper
Question
My question is about how the maximum sequence length is set and how the masked loss is computed during batched training of Moshi. Specifically, I assume that the batched input and target of an SFT training step are in the following format:
My questions are as follows:
Are losses at position ① summed into the final loss? If so, what are the ground-truth labels at these positions? (We guess that there may be three types of padding tokens in the text layer, and that the choice of the final target token may affect how Moshi determines the ending point of its response.)
Are losses at position ② summed into the final loss? (I.e., did you apply a padding mask during batched training, and if so, how? A generic sketch of the kind of masking I have in mind appears after this list.)
Are losses at position ③ summed into the final loss?
How did you set the maximum length T of a batched sequence (say, when training on an 80 GB GPU)?
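
To make question ② concrete: a common approach is to compute a per-token cross-entropy, zero it out under a padding mask, and renormalize by the number of valid tokens. The sketch below is a generic PyTorch illustration of that pattern, not Moshi's actual training code; `PAD_ID`, the toy shapes, and the mask definition are my own assumptions.

```python
# Minimal sketch of masked cross-entropy under a padding mask.
# NOT Moshi's training code; PAD_ID and shapes are illustrative assumptions.
import torch
import torch.nn.functional as F

PAD_ID = 0          # hypothetical padding token id
B, T, V = 2, 8, 32  # batch size, max sequence length, vocab size (toy values)

logits = torch.randn(B, T, V)          # model outputs
targets = torch.randint(1, V, (B, T))  # ground-truth next tokens
targets[0, 5:] = PAD_ID                # pretend sequence 0 is shorter than T

# Per-token loss, then zero out padded positions so they do not
# contribute to the final loss, and average over valid tokens only.
per_token = F.cross_entropy(
    logits.reshape(B * T, V), targets.reshape(B * T), reduction="none"
).reshape(B, T)
mask = (targets != PAD_ID).float()
loss = (per_token * mask).sum() / mask.sum().clamp(min=1)
print(loss)
```

Equivalently, passing `ignore_index=PAD_ID` to `F.cross_entropy` handles the masking internally; the explicit mask version just makes the mechanism visible. My question is whether Moshi's training does something along these lines at positions ② (and, relatedly, ① and ③), or something else entirely.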
Thanks!