Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[NewComm] No.10 compatiable upgrade for distributed_fused_lamb op #57424

Merged
merged 3 commits into from
Sep 21, 2023

Conversation

BeingGod
Copy link
Contributor

PR types

Others

PR changes

APIs

Description

compatiable upgrade for distributed_fused_lamb op
#57102

@BeingGod
Copy link
Contributor Author

单测目前存在问题,distributed_fused_lamb_test_base.py中采用fleet.init初始化并行环境,导致comm_context_manager找不到ring_id。麻烦看一下是应该在fleet.init中加入新的通信库初始化还是修改单测? @GhostScreaming

@paddle-bot paddle-bot bot added the contributor External developers label Sep 17, 2023
@luotao1 luotao1 added the HappyOpenSource 快乐开源活动issue与PR label Sep 18, 2023
@GhostScreaming
Copy link
Contributor

换一个初始化方式。如果使用新通信库,使用paddle.distributed.collective._init_parallel_env("nccl")的方式进行初始化。

@@ -270,7 +270,10 @@ def setUpClass(cls):
paddle.enable_static()
paddle.set_flags({'FLAGS_cudnn_deterministic': True})
_clip_by_global_norm_using_mp_type(True)
fleet.init(role_maker=get_role_maker())
if os.environ.get("FLAGS_dynamic_static_unified_comm") == "1":
fleet.init(role_maker=get_role_maker())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

判断写反了吧,设置FLAGS_dynamic_static_unified_comm = 1 的时候,应该用paddle.distributed.collective._init_parallel_env("nccl")的方式初始化。

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -228,5 +231,19 @@ void NCCLCommContext::GroupStart() {
}
void NCCLCommContext::GroupEnd() { NCCL_CHECK(phi::dynload::ncclGroupEnd()); }

#if NCCL_VERSION_CODE >= 21100
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里加一点注释信息吧,解释一下这个函数是干啥的,直接看名字很难弄懂功能。可以附上链接:https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/ops.html,把里面Op功能的解释,整点到注释里。

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link

@hitywt hitywt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@GhostScreaming GhostScreaming left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@luotao1 luotao1 merged commit 3fd69fa into PaddlePaddle:develop Sep 21, 2023
iosmers pushed a commit to iosmers/Paddle that referenced this pull request Sep 21, 2023
…ddlePaddle#57424)

* [NewComm] No.10 compatiable upgrade for distributed_fused_lamb op

* fix
@BeingGod BeingGod deleted the comm_distributed_fused_lamb branch September 25, 2023 11:39
Frida-a pushed a commit to Frida-a/Paddle that referenced this pull request Oct 14, 2023
…ddlePaddle#57424)

* [NewComm] No.10 compatiable upgrade for distributed_fused_lamb op

* fix
jiahy0825 pushed a commit to jiahy0825/Paddle that referenced this pull request Oct 16, 2023
…ddlePaddle#57424)

* [NewComm] No.10 compatiable upgrade for distributed_fused_lamb op

* fix
danleifeng pushed a commit to danleifeng/Paddle that referenced this pull request Nov 14, 2023
…ddlePaddle#57424)

* [NewComm] No.10 compatiable upgrade for distributed_fused_lamb op

* fix
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
contributor External developers HappyOpenSource 快乐开源活动issue与PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants