Hi,
I was trying MoE with the MoE example in Megatron-DeepSpeed (microsoft) and saw these comments:
## Model parallelism, 1 is no MP
## Currently MoE models have divergence issue when MP > 1.
MP_SIZE=1
## Pipeline parallelism
## Currently we don't support PP for MoE. To disable PP, set PP_SIZE
## to 1 and use the "--no-pipeline-parallel" arg.
PP_SIZE=1
NUM_GPUS=64
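For context, this is roughly how I understand those variables turn into CLI flags further down the example script (a sketch from memory, not a verbatim quote; the megatron_options / deepspeed_options variable names and the exact wiring are my assumption):

MP_SIZE=1   # tensor (model) parallel degree
PP_SIZE=1   # pipeline parallel degree; 1 means PP is disabled

# MP_SIZE feeds Megatron's tensor-parallel flag.
megatron_options="--tensor-model-parallel-size ${MP_SIZE}"

# Per the comment above, PP is disabled by PP_SIZE=1 plus --no-pipeline-parallel.
deepspeed_options=""
if [ "${PP_SIZE}" -le 1 ]; then
    deepspeed_options="${deepspeed_options} --no-pipeline-parallel"
fi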
Also, the MoE loss is ignored in ParallelTransformerLayerPipe:
class ParallelTransformerLayerPipe(ParallelTransformerLayer):
    def forward(self, inputs, **kwargs):
        ...
            # HACK: currently MoE model does not support pipeline parallel, so
            # here we just ignore the moe_loss returned by forward()
            return super().forward(hidden_states, attention_mask, **kwargs)[0]
        elif len(inputs) == 2:
            ...
            # HACK: currently MoE model does not support pipeline parallel, so
            # here we just ignore the moe_loss returned by forward()
            return super().forward(*inputs, **kwargs)[0], attention_mask
        else:
            raise RuntimeError('Received more inputs than understood.')
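To make sure I'm reading that HACK correctly, here is a minimal toy sketch (my own illustration, not the repo's actual classes) of what I think is being dropped: when MoE is enabled, the parent forward() returns a tuple whose second element is the auxiliary load-balancing loss, and the [0] in the pipe wrapper discards it, so under PP that loss would never be added to the training loss.

import torch

# Toy stand-in for ParallelTransformerLayer with MoE enabled
# (class and variable names here are illustrative, not Megatron-DeepSpeed's API).
class ToyMoELayer(torch.nn.Module):
    def forward(self, hidden_states, attention_mask):
        output = hidden_states         # pretend attention + MoE MLP ran here
        moe_loss = torch.tensor(0.01)  # auxiliary load-balancing loss
        return output, moe_loss

class ToyMoELayerPipe(ToyMoELayer):
    def forward(self, inputs):
        hidden_states, attention_mask = inputs
        # Same pattern as the HACK above: [0] keeps the activations and
        # silently drops moe_loss, so it never reaches the optimizer.
        return super().forward(hidden_states, attention_mask)[0]

layer = ToyMoELayerPipe()
out = layer((torch.randn(4, 8), None))
print(out.shape)  # only the activations survive; the 0.01 aux loss is gone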
Here are my questions:
1. MoE can run with TP (MP > 1), but it has the divergence issue and does not converge, right?
2. Is PP unsupported for MoE because MoE checkpointing does not support dividing the model into multiple pipeline stages?
3. Given these issues (no usable TP/PP), only models of <=1.3B parameters can be trained with MoE in the MoE example, right?
Waiting for reply, thanks!