Completes bfloat16 dtype for collective api in eager mode #45844
Conversation
Your PR was submitted successfully. Thank you for contributing to this open-source project!
@@ -88,6 +88,9 @@ namespace distributed {
    case experimental::DataType::BOOL:      \
      func<bool>(args);                     \
      break;                                \
    case experimental::DataType::BFLOAT16:  \
      func<bfloat16>(args);                 \
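For context, the hunk above extends a switch-based dtype dispatch macro: each dtype case instantiates the same function template with the matching C++ type. Below is a minimal, self-contained sketch of that pattern; the names are simplified (this is not the actual Paddle macro), and aliasing bfloat16 to uint16_t is an assumption drawn from the discussion that follows.

#include <cstdint>
#include <cstdio>

enum class DataType { BOOL, BFLOAT16 };
using bfloat16 = uint16_t;  // assumption: host-side bf16 carries a 16-bit payload

template <typename T>
void PrintElemSize(const char* name) {
  std::printf("%s: %zu bytes\n", name, sizeof(T));
}

// Same shape as the macro in the hunk: one case per dtype, each
// instantiating func<T> with the corresponding C++ type.
#define DISPATCH_BY_DTYPE(dtype, func, ...)  \
  switch (dtype) {                           \
    case DataType::BOOL:                     \
      func<bool>(__VA_ARGS__);               \
      break;                                 \
    case DataType::BFLOAT16:                 \
      func<bfloat16>(__VA_ARGS__);           \
      break;                                 \
  }

int main() {
  DISPATCH_BY_DTYPE(DataType::BFLOAT16, PrintElemSize, "bfloat16");  // prints "bfloat16: 2 bytes"
  return 0;
}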
Isn't this problematic? Does gloo support bfloat16?
Without this line it errors out, and with it everything runs 🤔 Judging from yesterday's test results, it seems fine.
No, without this line it will definitely error out. But as far as I can tell, gloo doesn't support bf16 internally, so I'm curious why this passes the tests.
Could it be that Paddle's bf16 tensors actually hold uint16 values? All signs suggest it doesn't really use bf16 on the host. Since the data is actually uint16, it runs.
Then what's the difference between NCCL's native bfloat16 support and just transferring the data as uint16?
NCCL seems to check the CUDA version to decide whether it can use bf16, whereas gloo may just use uint16 directly?
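To make the concern in this thread concrete: treating bf16 payloads as uint16 is bit-exact for pure data movement (broadcast, allgather, send/recv), but a SUM reduction computed on the raw uint16 bits differs from one computed on the bf16 values. A small self-contained sketch, not Paddle code:

#include <cstdint>
#include <cstring>
#include <cstdio>

// bf16 is the high 16 bits of an IEEE float32 bit pattern.
static uint16_t to_bf16(float v) {
  uint32_t bits;
  std::memcpy(&bits, &v, sizeof(bits));
  return static_cast<uint16_t>(bits >> 16);
}

static float from_bf16(uint16_t x) {
  uint32_t bits = static_cast<uint32_t>(x) << 16;
  float v;
  std::memcpy(&v, &bits, sizeof(v));
  return v;
}

int main() {
  uint16_t a = to_bf16(1.0f);  // 0x3F80
  uint16_t b = to_bf16(2.0f);  // 0x4000
  // Reducing the raw bits as integers, as a uint16-only backend would:
  uint16_t bitwise_sum = static_cast<uint16_t>(a + b);        // 0x7F80 -> +inf
  // Reducing the values with bf16-aware arithmetic:
  uint16_t value_sum = to_bf16(from_bf16(a) + from_bf16(b));  // 3.0
  std::printf("bitwise: %f, value: %f\n",
              from_bf16(bitwise_sum), from_bf16(value_sum));  // inf vs 3.000000
  return 0;
}

So if a unit test only exercises data movement, or casts back to float before reducing, a uint16-only backend could still appear to pass.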
LGTM
@@ -756,6 +756,9 @@ void* GetPointerByOffset(void* raw_pointer,
  } else if (type == experimental::DataType::BOOL) {
    return reinterpret_cast<void*>(reinterpret_cast<bool*>(raw_pointer) +
                                   offset);
  } else if (type == experimental::DataType::BFLOAT16) {
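The helper extended above advances a raw pointer by an offset counted in elements, so the pointer must be cast to the element type before the arithmetic. A simplified sketch of the idea (not the actual Paddle source; bfloat16 is again assumed to alias uint16_t on the host):

#include <cstddef>
#include <cstdint>

using bfloat16 = uint16_t;  // assumption: matches the host-side bf16 payload

void* GetPointerByOffsetSketch(void* raw_pointer, size_t offset, bool is_bf16) {
  if (is_bf16) {
    // +offset advances by offset * sizeof(bfloat16) == 2 * offset bytes
    return reinterpret_cast<void*>(
        reinterpret_cast<bfloat16*>(raw_pointer) + offset);
  }
  // bool is 1 byte, so here +offset advances by exactly offset bytes
  return reinterpret_cast<void*>(
      reinterpret_cast<bool*>(raw_pointer) + offset);
}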
AllReduce uint16 data?
As the code below shows, they use uint16 to represent bf16 for some reason 🤔
Paddle/paddle/phi/common/bfloat16.h
Lines 74 to 79 in 75528ad
#if defined(PADDLE_CUDA_BF16)
  __nv_bfloat16 tmp = __float2bfloat16(val);
  x = *reinterpret_cast<uint16_t*>(&tmp);
#else
  std::memcpy(&x, reinterpret_cast<char*>(&val) + 2, 2);
#endif
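A small sketch of why the #else branch above works, assuming a little-endian host: copying the two high bytes of a float is the same as shifting its bit pattern right by 16, which is exactly the bf16 payload.

#include <cassert>
#include <cstdint>
#include <cstring>

int main() {
  float val = 3.14159f;

  // The host path above: copy bytes 2..3 of the float (little-endian).
  uint16_t by_memcpy;
  std::memcpy(&by_memcpy, reinterpret_cast<char*>(&val) + 2, 2);

  // Endian-explicit equivalent: take the high 16 bits of the float's bits.
  uint32_t bits;
  std::memcpy(&bits, &val, sizeof(bits));
  uint16_t by_shift = static_cast<uint16_t>(bits >> 16);

  assert(by_memcpy == by_shift);  // both hold the bf16 payload 0x4049
  return 0;
}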
And it seems that we cannot use to_tensor or cast to get a uint16 tensor.
This issue mentioned the uint16 problem: #34927
LGTM
Shortened the timeouts of the communication-library unit tests.
LGTM
* Support both use_calc_stream and sync_op in send recv APIs (#46023)
* Support both use_calc_stream and sync_op in allgather API (#46295)
* Support both use_calc_stream and sync_op in collective communication API (#46761)
* Move group and all reduce from collective to communication (#45848)
* Completes bfloat16 dtype for collective api in eager mode (#45844)
* Fix collective APIs cannot be recognized when building docs (#46962)

Co-authored-by: LiYuRio <63526175+LiYuRio@users.noreply.github.com>
PR types
New features
PR changes
OPs
Describe
This PR further completes the basic functionality of the communication framework: collective communication operations now support a richer set of data types.