Enabled Qwen2-MoE Tensor Parallelism (TP) inference #6551

Open
wants to merge 1 commit into base: master

Conversation

gyou2021

Modified `_replace_module` in auto_tp.py:
The change keeps the 'shared_expert_gate' and 'gate' layers in Qwen2-MoE as their original torch.nn.Linear type instead of converting them to LinearLayer. As a result, their weights are not sharded across multiple HPU/GPU cards, and Qwen2-MoE can run with tensor parallelism on multiple HPU/GPU cards.
Because the 'gate' weights are kept unsharded, no all-gather operation is needed for them, which may improve performance.
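For illustration, below is a minimal sketch of the idea described above, not the actual `_replace_module` code in auto_tp.py. The keep-list and the `make_tp_linear` factory are assumptions introduced only to show the control flow of leaving the gate layers untouched.

```python
# Minimal sketch (not DeepSpeed's implementation): when auto-TP walks the model
# and swaps nn.Linear layers for tensor-parallel replacements, modules whose
# attribute names match a keep-list are left as plain nn.Linear so their
# weights stay replicated on every card instead of being sharded.
import torch.nn as nn

# Hypothetical keep-list: gate layers in Qwen2-MoE that should stay nn.Linear.
KEEP_AS_LINEAR = ("shared_expert_gate", "gate")

def should_keep_original(module_name: str) -> bool:
    """Return True if this module's weights should not be split across cards."""
    return any(module_name.endswith(keep) for keep in KEEP_AS_LINEAR)

def replace_linears(model: nn.Module, make_tp_linear):
    """Recursively replace nn.Linear children, skipping the kept gate layers.

    `make_tp_linear` stands in for whatever builds the tensor-parallel
    replacement (e.g. DeepSpeed's LinearLayer) and is an assumption here.
    """
    for name, child in model.named_children():
        if isinstance(child, nn.Linear) and not should_keep_original(name):
            setattr(model, name, make_tp_linear(child))
        else:
            replace_linears(child, make_tp_linear)
```

In the actual PR the equivalent name check lives inside `_replace_module`; the sketch only illustrates skipping 'shared_expert_gate' and 'gate' so their weights remain whole on each device.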

@delock
Collaborator

delock commented Sep 19, 2024

Hi @Yejing-Lai , do you want to provide some comments on this PR for Qwen2-MoE AutoTP support?

@Yejing-Lai
Contributor

Could you try modifying this line and see if it meets your needs? https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/module_inject/auto_tp.py#L336
