Hi,
I was trying MoE with the MoE example in Megatron-DeepSpeed (microsoft) and saw these comments:
## Model parallelism, 1 is no MP
## Currently MoE models have divergence issue when MP > 1.
MP_SIZE=1
## Pipeline parallelism
## Currently we don't support PP for MoE. To disable PP, set PP_SIZE
## to 1 and use the "--no-pipeline-parallel" arg.
PP_SIZE=1
NUM_GPUS=64
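For context, this is roughly how I understand those variables turn into CLI flags further down the example script (a sketch from memory, not a verbatim quote; the megatron_options / deepspeed_options variable names and the exact wiring are my assumption):

MP_SIZE=1   # tensor (model) parallel degree
PP_SIZE=1   # pipeline parallel degree; 1 means PP is disabled

# MP_SIZE feeds Megatron's tensor-parallel flag.
megatron_options="--tensor-model-parallel-size ${MP_SIZE}"

# Per the comment above, PP is disabled by PP_SIZE=1 plus --no-pipeline-parallel.
deepspeed_options=""
if [ "${PP_SIZE}" -le 1 ]; then
    deepspeed_options="${deepspeed_options} --no-pipeline-parallel"
fi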
Also, the MoE loss is ignored in ParallelTransformerLayerPipe:
class ParallelTransformerLayerPipe(ParallelTransformerLayer):
    def forward(self, inputs, **kwargs):
        ...
            # HACK: currently MoE model does not support pipeline parallel, so
            # here we just ignore the moe_loss returned by forward()
            return super().forward(hidden_states, attention_mask, **kwargs)[0]
        elif len(inputs) == 2:
            ...
            # HACK: currently MoE model does not support pipeline parallel, so
            # here we just ignore the moe_loss returned by forward()
            return super().forward(*inputs, **kwargs)[0], attention_mask
        else:
            raise RuntimeError('Received more inputs than understood.')
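To make sure I'm reading that HACK correctly, here is a minimal toy sketch (my own illustration, not the repo's actual classes) of what I think is being dropped: when MoE is enabled, the parent forward() returns a tuple whose second element is the auxiliary load-balancing loss, and the [0] in the pipe wrapper discards it, so under PP that loss would never be added to the training loss.

import torch

# Toy stand-in for ParallelTransformerLayer with MoE enabled
# (class and variable names here are illustrative, not Megatron-DeepSpeed's API).
class ToyMoELayer(torch.nn.Module):
    def forward(self, hidden_states, attention_mask):
        output = hidden_states         # pretend attention + MoE MLP ran here
        moe_loss = torch.tensor(0.01)  # auxiliary load-balancing loss
        return output, moe_loss

class ToyMoELayerPipe(ToyMoELayer):
    def forward(self, inputs):
        hidden_states, attention_mask = inputs
        # Same pattern as the HACK above: [0] keeps the activations and
        # silently drops moe_loss, so it never reaches the optimizer.
        return super().forward(hidden_states, attention_mask)[0]

layer = ToyMoELayerPipe()
out = layer((torch.randn(4, 8), None))
print(out.shape)  # only the activations survive; the 0.01 aux loss is gone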
Here are my questions:
1. MoE can run with TP (MP > 1), but it has the divergence issue and does not converge, right?
2. Is PP unsupported for MoE because MoE checkpointing does not support dividing the model into multiple pipeline stages?
3. Given these issues (no usable TP/PP), only models of <=1.3B parameters can be trained with MoE in the MoE example, right?
Waiting for reply, thanks!