Due diligence
I have done my due diligence in trying to find the answer myself.
Topic
The paper
Question
My question is about how the maximum sequence length is set and how the masked loss is computed during batched training of Moshi. Specifically, I assume that the batched input and target of an SFT training step are in the following format:
My questions are as follows:
Are losses at position ① summed into the final loss? If so, what are the ground-truth labels at these positions? (We guess that there may be three types of padding tokens in the text layer, and that the choice of the final target token may affect how Moshi determines the ending point of its response.)
Are losses at position ② summed into the final loss? (I.e., did you apply a padding mask during batched training, and if so, how? A generic sketch of the kind of masking I have in mind appears after this list.)
Are losses at position ③ summed into the final loss?
How did you set the maximum length T of a batched sequence (say, when training on an 80 GB GPU)?
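
To make question ② concrete: a common approach is to compute a per-token cross-entropy, zero it out under a padding mask, and renormalize by the number of valid tokens. The sketch below is a generic PyTorch illustration of that pattern, not Moshi's actual training code; `PAD_ID`, the toy shapes, and the mask definition are my own assumptions.

```python
# Minimal sketch of masked cross-entropy under a padding mask.
# NOT Moshi's training code; PAD_ID and shapes are illustrative assumptions.
import torch
import torch.nn.functional as F

PAD_ID = 0          # hypothetical padding token id
B, T, V = 2, 8, 32  # batch size, max sequence length, vocab size (toy values)

logits = torch.randn(B, T, V)          # model outputs
targets = torch.randint(1, V, (B, T))  # ground-truth next tokens
targets[0, 5:] = PAD_ID                # pretend sequence 0 is shorter than T

# Per-token loss, then zero out padded positions so they do not
# contribute to the final loss, and average over valid tokens only.
per_token = F.cross_entropy(
    logits.reshape(B * T, V), targets.reshape(B * T), reduction="none"
).reshape(B, T)
mask = (targets != PAD_ID).float()
loss = (per_token * mask).sum() / mask.sum().clamp(min=1)
print(loss)
```

Equivalently, passing `ignore_index=PAD_ID` to `F.cross_entropy` handles the masking internally; the explicit mask version just makes the mechanism visible. My question is whether Moshi's training does something along these lines at positions ② (and, relatedly, ① and ③), or something else entirely.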
Thanks!