Enabled Qwen2-MoE Tensor Parallelism (TP) inference #6551

gyou2021 · 2024-09-18T10:15:59Z

Modified _replace_module in auto_tp.py:
The modification keeps the 'shared_expert_gate' and 'gate' layers in Qwen2-MoE as their original type, torch.nn.Linear, instead of converting them to LinearLayer. As a result, their weights are not split across multiple HPU/GPU cards, and Qwen2-MoE can then run on multiple HPU/GPU cards.
Because the weights of 'gate' are not split across cards, no all-gather operations are needed for them, which may improve performance.
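The idea described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the actual auto_tp.py code: `Linear`, `ShardedLinear`, `Module`, `shard_rows`, and `replace_linears` are hypothetical stand-ins (no torch or DeepSpeed is imported), and matching on the exact leaf name is an assumption about how the skip could be implemented.

```python
from dataclasses import dataclass, field

# Layer names that stay replicated on every rank (taken from the PR description).
KEEP_REPLICATED = ("shared_expert_gate", "gate")


@dataclass
class Linear:
    """Hypothetical stand-in for torch.nn.Linear: full [out][in] weight."""
    weight: list


@dataclass
class ShardedLinear:
    """Hypothetical stand-in for a TP layer holding only this rank's rows."""
    weight: list
    rank: int
    world_size: int


@dataclass
class Module:
    """Hypothetical container of named child layers or sub-modules."""
    children: dict = field(default_factory=dict)


def shard_rows(weight, rank, world_size):
    """Return the contiguous block of output rows owned by `rank`."""
    rows = len(weight) // world_size
    return weight[rank * rows:(rank + 1) * rows]


def replace_linears(module, rank, world_size):
    """Shard every Linear except those whose leaf name is in KEEP_REPLICATED."""
    for name, child in module.children.items():
        if isinstance(child, Module):
            replace_linears(child, rank, world_size)
        elif isinstance(child, Linear) and name not in KEEP_REPLICATED:
            # Reassigning an existing key is safe while iterating over the dict.
            module.children[name] = ShardedLinear(
                shard_rows(child.weight, rank, world_size), rank, world_size
            )
    return module
```

With an exact-name match, a router layer named 'gate' keeps its full weight on every rank, while an expert projection named 'gate_proj' is still sharded. Each rank can then compute the complete routing logits locally, which is why no all-gather is needed for the router output.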

delock · 2024-09-19T02:23:37Z

Hi @Yejing-Lai, would you like to comment on this PR for Qwen2-MoE AutoTP support?

Yejing-Lai · 2024-09-19T14:02:54Z

Could you try modifying this line and see whether it meets your needs? https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/module_inject/auto_tp.py#L336

gyou2021 requested review from awan-10 and arashb as code owners September 18, 2024 10:15

Enabled Qwen2-MoE Tensor Parallelism (TP) inference

08f728d

delock mentioned this pull request Sep 20, 2024: [TRACKER] Customer support related PR tracker for Intel devices #6556 (Open, 23 tasks)
