Support both use_calc_stream and sync_op in send recv APIs #46023

HermitSun · 2022-09-14T03:55:56Z

PR types

New features

PR changes

APIs

Describe

In the new communication library, ProcessGroup is designed to manage different communication groups. Each process group has its own stream, and all communications in that group run on this stream. For high-level APIs such as distributed.all_reduce, use_calc_stream indicates whether the operation is synchronous. Note that frequently adding unnecessary CUDA events may degrade performance on some models. To achieve high performance, this PR adds a new API named distributed.stream.all_reduce, which provides both use_calc_stream and sync_op.

sync_op: indicates whether the communication is synchronous.
use_calc_stream: performs the communication on the calculation stream, saving the time of switching streams. Only works when sync_op is true.
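
The constraint between the two flags can be sketched as a small check. This is a minimal illustration of the rule stated above, not Paddle's implementation: `resolve_stream_mode` and the strings it returns are hypothetical names introduced only for this example.

```python
def resolve_stream_mode(sync_op: bool = True, use_calc_stream: bool = False) -> str:
    """Hypothetical helper: classify a (sync_op, use_calc_stream) combination."""
    if use_calc_stream and not sync_op:
        # Per the PR description, use_calc_stream only works when sync_op is true.
        raise ValueError("use_calc_stream requires sync_op=True")
    if use_calc_stream:
        # Communicate directly on the calculation stream, no stream switch.
        return "sync on calc stream"
    if sync_op:
        # Communicate on the group's own stream and wait for completion.
        return "sync on comm stream"
    # Communicate on the group's own stream without waiting.
    return "async on comm stream"
```

Of the four flag combinations, only `sync_op=False, use_calc_stream=True` is rejected in this sketch.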

paddle/fluid/distributed/collective/ProcessGroupNCCL.cc

LiYuRio · 2022-09-14T06:54:21Z

paddle/fluid/pybind/distributed_py.cc

+                int numel = (*dense).numel();
+                int send_numel = numel / nranks;
+                int offset = send_numel * rank_id;


This should return an int64; with a larger tensor the value goes out of range here. Let's fix these uniformly later.
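
The concern in this review comment (an element count stored in a 32-bit `int`) can be illustrated with a short sketch. It assumes a 32-bit C `int` and emulates it with `ctypes.c_int32`; `shard_offset` and `as_int32` are hypothetical helper names, and the tensor size is an illustrative value rather than one taken from the PR.

```python
import ctypes

INT32_MAX = 2**31 - 1  # largest value a 32-bit signed int can hold


def shard_offset(numel: int, nranks: int, rank_id: int) -> int:
    """Mirror the arithmetic in the diff using Python's unbounded integers."""
    send_numel = numel // nranks
    return send_numel * rank_id


def as_int32(value: int) -> int:
    """Emulate storing a value in a 32-bit signed int (wraps on overflow)."""
    return ctypes.c_int32(value).value


# Illustrative tensor with 3 billion elements split across 4 ranks.
numel = 3_000_000_000
offset = shard_offset(numel, nranks=4, rank_id=3)

print(numel > INT32_MAX)   # True: the element count itself does not fit
print(as_int32(numel))     # wrapped value a 32-bit int would hold
print(offset > INT32_MAX)  # True: the last rank's offset does not fit either
```

Holding `numel`, `send_numel`, and `offset` in a 64-bit integer type avoids the wraparound in this scenario.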

gongweibao

LGTM

XieYunshen

LGTM
Unit test LABEL settings

XiaoguangHu01

LGTM

…dle#46023)

* Support both use_calc_stream and sync_op in send recv APIs (#46023)
* add batch_norm prim2orig rule

Co-authored-by: Wen Sun <35923278+HermitSun@users.noreply.github.com>

…dle#46023)

* Support both use_calc_stream and sync_op in send recv APIs (#46023)
* Support both use_calc_stream and sync_op in allgather API (#46295)
* Support both use_calc_stream and sync_op in collective communication API (#46761)
* Move group and all reduce from collective to communication (#45848)
* Completes bfloat16 dtype for collective api in eager mode (#45844)
* Fix collective APIs cannot be recognized when building docs (#46962)

Co-authored-by: LiYuRio <63526175+LiYuRio@users.noreply.github.com>

feat(distributed/communication/stream): add send recv api

f9ac56d

LiYuRio reviewed Sep 14, 2022

paddle/fluid/distributed/collective/ProcessGroupNCCL.cc (three review threads, outdated and resolved)

refactor(distributed/communication/stream): update stream usage

da329a4

LiYuRio reviewed Sep 14, 2022

paddle/fluid/distributed/collective/ProcessGroupNCCL.cc (resolved)

LiYuRio reviewed Sep 14, 2022

paddle/fluid/distributed/collective/ProcessGroupNCCL.cc (outdated, resolved)

fix(distributed/communication/stream): fix incorrect stream usage

6ec6e2e

LiYuRio reviewed Sep 14, 2022

refactor(distributed/communication/stream): remove useless code

45a54f1

HermitSun force-pushed the collective-stream-sendrecv branch from fd9e228 to 45a54f1 Compare September 14, 2022 07:32

refactor(distributed/communication/stream): update py module import

a84f06f

LiYuRio approved these changes Sep 15, 2022

gongweibao approved these changes Sep 15, 2022

XieYunshen approved these changes Sep 15, 2022

liuTINA0907 approved these changes Sep 15, 2022

XiaoguangHu01 approved these changes Sep 15, 2022

FeixLiu merged commit ae00f42 into PaddlePaddle:develop Sep 15, 2022

HermitSun deleted the collective-stream-sendrecv branch September 16, 2022 00:28

cxxly pushed a commit to cxxly/Paddle that referenced this pull request Sep 26, 2022

Support both use_calc_stream and sync_op in send recv APIs (PaddlePad…

9c0ce9b

…dle#46023)

HermitSun mentioned this pull request Oct 11, 2022

Update collective ops docs PaddlePaddle/docs#5237

Merged

HermitSun added a commit to HermitSun/Paddle that referenced this pull request Oct 12, 2022

Support both use_calc_stream and sync_op in send recv APIs (PaddlePad…

587e1c2

…dle#46023)

HermitSun mentioned this pull request Oct 12, 2022

[Cherry-pick] Collective communication APIs #46922

Merged
