Completes bfloat16 dtype for collective api in eager mode #45844

Merged
merged 7 commits into PaddlePaddle:develop from HermitSun:collective-bf16 on Oct 11, 2022

Conversation

@HermitSun HermitSun (Contributor) commented Sep 7, 2022

PR types

New features

PR changes

OPs

Describe

This PR completes the basic functionality of the communication framework by supporting more data types.

  • Supports bfloat16 dtype for collective ops in NCCL and GLOO backends.

The communication framework is further completed: collective operations can now transfer a richer set of data types.

@paddle-bot paddle-bot bot commented Sep 7, 2022

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

paddle/fluid/distributed/collective/ProcessGroupNCCL.cc (outdated)
@@ -88,6 +88,9 @@ namespace distributed {
     case experimental::DataType::BOOL:      \
       func<bool>(args);                     \
       break;                                \
+    case experimental::DataType::BFLOAT16:  \
+      func<bfloat16>(args);                 \
+      break;                                \
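For readers outside the codebase, here is a minimal self-contained sketch of the dtype-dispatch pattern this macro follows (the names DataType, DISPATCH_BY_DTYPE, and PrintElemSize are illustrative, not Paddle's actual API):

#include <cstdint>
#include <cstdio>

enum class DataType { BOOL, BFLOAT16 };

// Prints the element size a collective would use for this dtype.
template <typename T>
void PrintElemSize(const char* name) {
  std::printf("%s: %zu bytes\n", name, sizeof(T));
}

// Each dtype case instantiates the callee with the matching C++ type;
// bf16 is dispatched here with a 16-bit payload type, as discussed below.
#define DISPATCH_BY_DTYPE(dtype, func, arg)  \
  switch (dtype) {                           \
    case DataType::BOOL:                     \
      func<bool>(arg);                       \
      break;                                 \
    case DataType::BFLOAT16:                 \
      func<uint16_t>(arg);                   \
      break;                                 \
  }

int main() {
  DISPATCH_BY_DTYPE(DataType::BFLOAT16, PrintElemSize, "bf16");  // bf16: 2 bytes
  return 0;
}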
Contributor:

Isn't this a problem? Does Gloo support bfloat16?

@HermitSun HermitSun (Contributor, Author) commented Sep 8, 2022

Without this line it errors out; with it, everything runs 🤔 Judging from yesterday's test results, there seems to be no problem.

Contributor:

No, without this line it will definitely error out. From what I can see, Gloo does not support bf16 internally, so I'm curious why this passes the tests.

@HermitSun HermitSun (Contributor, Author) commented Sep 8, 2022

Could it be that Paddle's bf16 tensors actually hold uint16 values? All signs suggest that on the host it doesn't really use bf16. Because the underlying type is actually uint16, it runs.
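A minimal sketch of this point, assuming (as the conversion code quoted further down suggests) that paddle::bfloat16 is a thin wrapper around a uint16_t payload; the struct here is illustrative, not Paddle's full definition:

#include <cstdint>

// Illustrative stand-in for paddle::bfloat16: the only storage is the
// raw 16-bit pattern, so copying or transporting it behaves exactly
// like uint16.
struct bfloat16 {
  uint16_t x;
};

static_assert(sizeof(bfloat16) == sizeof(uint16_t),
              "bf16 values are byte-compatible with uint16 on the wire");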

Contributor:

Then is there any difference between NCCL's native bfloat16 support and just transmitting uint16?

@HermitSun HermitSun (Contributor, Author) commented Sep 8, 2022

NCCL seems to check the CUDA version to decide whether bf16 can be used; Gloo probably just uses uint16 directly?
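A rough sketch of that gating logic (NCCL added ncclBfloat16 in 2.10, which requires CUDA 11+; the helper and names below are illustrative assumptions, not Paddle's actual code):

// Illustrative helper: decide how to put bf16 on the wire depending on
// whether the NCCL/CUDA stack has native bfloat16 support.
enum class WireType { kNativeBfloat16, kRawUint16 };

WireType PickBf16WireType(int nccl_version_code, int cuda_version) {
  // ncclBfloat16 exists from NCCL 2.10 (version code 21000) and needs
  // CUDA 11+; older stacks can still move the raw 16-bit payload,
  // which is effectively what Gloo does on the host.
  if (nccl_version_code >= 21000 && cuda_version >= 11000) {
    return WireType::kNativeBfloat16;
  }
  return WireType::kRawUint16;
}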

LiYuRio previously approved these changes Sep 9, 2022
LiYuRio previously approved these changes Oct 10, 2022
@gongweibao gongweibao (Contributor) left a comment

LGTM

@@ -756,6 +756,9 @@ void* GetPointerByOffset(void* raw_pointer,
   } else if (type == experimental::DataType::BOOL) {
     return reinterpret_cast<void*>(reinterpret_cast<bool*>(raw_pointer) +
                                    offset);
+  } else if (type == experimental::DataType::BFLOAT16) {
+    return reinterpret_cast<void*>(reinterpret_cast<uint16_t*>(raw_pointer) +
+                                   offset);
Contributor:

AllReduce uint16 data?

HermitSun (Contributor, Author):

As the code below shows, they use uint16 to represent bf16 for some reason 🤔

#if defined(PADDLE_CUDA_BF16)
  // With native CUDA bf16 support, convert through __nv_bfloat16 and
  // store the resulting 16-bit pattern in the uint16_t member `x`.
  __nv_bfloat16 tmp = __float2bfloat16(val);
  x = *reinterpret_cast<uint16_t*>(&tmp);
#else
  // Host fallback: keep only the upper 16 bits of the float32
  // representation (truncation, assuming a little-endian host).
  std::memcpy(&x, reinterpret_cast<char*>(&val) + 2, 2);
#endif
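To see why moving bf16 around as uint16 is lossless, here is a small self-contained round-trip demo of the same truncation trick (a sketch assuming a little-endian host; FloatToBf16Bits and Bf16BitsToFloat are illustrative names, not Paddle's API):

#include <cstdint>
#include <cstdio>
#include <cstring>

// bf16 keeps the upper 16 bits of an IEEE-754 float32.
uint16_t FloatToBf16Bits(float val) {
  uint16_t bits = 0;
  std::memcpy(&bits, reinterpret_cast<char*>(&val) + 2, 2);
  return bits;
}

float Bf16BitsToFloat(uint16_t bits) {
  float val = 0.0f;
  std::memcpy(reinterpret_cast<char*>(&val) + 2, &bits, 2);
  return val;
}

int main() {
  float v = 3.14159f;
  uint16_t bits = FloatToBf16Bits(v);
  // Only the float->bf16 truncation loses precision; the uint16 payload
  // itself round-trips bit-exactly, so a collective can transport bf16
  // as uint16 without corrupting values.
  std::printf("%f -> 0x%04x -> %f\n", v, bits, Bf16BitsToFloat(v));
  return 0;
}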

HermitSun (Contributor, Author):

And it seems that we cannot use to_tensor or cast to get a uint16 tensor.

HermitSun (Contributor, Author):

This issue mentions the uint16 problem: #34927

@XieYunshen XieYunshen (Contributor) left a comment

LGTM
Shortened the timeouts of the communication-library unit tests.

@XiaoguangHu01 XiaoguangHu01 (Contributor) left a comment

LGTM

@sljlp sljlp merged commit e4eb8d3 into PaddlePaddle:develop Oct 11, 2022
@HermitSun HermitSun deleted the collective-bf16 branch October 11, 2022 11:11
HermitSun added a commit to HermitSun/Paddle that referenced this pull request Oct 12, 2022
XiaoguangHu01 pushed a commit that referenced this pull request Oct 17, 2022
* Support both use_calc_stream and sync_op in send recv APIs (#46023)

* Support both use_calc_stream and sync_op in allgather API (#46295)

* Support both use_calc_stream and sync_op in collective communication API (#46761)

* Move group and all reduce from collective to communication (#45848)

* Completes bfloat16 dtype for collective api in eager mode (#45844)

* Fix collective APIs cannot be recognized when building docs (#46962)

Co-authored-by: LiYuRio <63526175+LiYuRio@users.noreply.github.com>