Completes bfloat16 dtype for collective api in eager mode #45844
Conversation
Your PR was submitted successfully. Thank you for contributing to this open-source project!
@@ -88,6 +88,9 @@ namespace distributed {
    case experimental::DataType::BOOL:      \
      func<bool>(args);                     \
      break;                                \
    case experimental::DataType::BFLOAT16:  \
      func<bfloat16>(args);                 \
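For context, the hunk above extends a switch-based dtype dispatch macro: each dtype case instantiates the same function template with the matching C++ type. Below is a minimal, self-contained sketch of that pattern; the names are simplified (this is not the actual Paddle macro), and aliasing bfloat16 to uint16_t is an assumption drawn from the discussion that follows.

#include <cstdint>
#include <cstdio>

enum class DataType { BOOL, BFLOAT16 };
using bfloat16 = uint16_t;  // assumption: host-side bf16 carries a 16-bit payload

template <typename T>
void PrintElemSize(const char* name) {
  std::printf("%s: %zu bytes\n", name, sizeof(T));
}

// Same shape as the macro in the hunk: one case per dtype, each
// instantiating func<T> with the corresponding C++ type.
#define DISPATCH_BY_DTYPE(dtype, func, ...)  \
  switch (dtype) {                           \
    case DataType::BOOL:                     \
      func<bool>(__VA_ARGS__);               \
      break;                                 \
    case DataType::BFLOAT16:                 \
      func<bfloat16>(__VA_ARGS__);           \
      break;                                 \
  }

int main() {
  DISPATCH_BY_DTYPE(DataType::BFLOAT16, PrintElemSize, "bfloat16");  // prints "bfloat16: 2 bytes"
  return 0;
}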
Isn't this problematic? Does gloo support bfloat16?
Without this line it errors out, and with it everything runs 🤔 Judging from yesterday's test results, it seems fine.
No, without this line it will definitely error out. But as far as I can tell, gloo doesn't support bf16 internally, so I'm curious why this passes the tests.
Could it be that Paddle's bf16 tensors actually hold uint16 values? All signs suggest it doesn't really use bf16 on the host. Since the data is actually uint16, it runs.
Then what's the difference between NCCL's native bfloat16 support and just transferring the data as uint16?
NCCL seems to check the CUDA version to decide whether it can use bf16, whereas gloo may just use uint16 directly?
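To make the concern in this thread concrete: treating bf16 payloads as uint16 is bit-exact for pure data movement (broadcast, allgather, send/recv), but a SUM reduction computed on the raw uint16 bits differs from one computed on the bf16 values. A small self-contained sketch, not Paddle code:

#include <cstdint>
#include <cstring>
#include <cstdio>

// bf16 is the high 16 bits of an IEEE float32 bit pattern.
static uint16_t to_bf16(float v) {
  uint32_t bits;
  std::memcpy(&bits, &v, sizeof(bits));
  return static_cast<uint16_t>(bits >> 16);
}

static float from_bf16(uint16_t x) {
  uint32_t bits = static_cast<uint32_t>(x) << 16;
  float v;
  std::memcpy(&v, &bits, sizeof(v));
  return v;
}

int main() {
  uint16_t a = to_bf16(1.0f);  // 0x3F80
  uint16_t b = to_bf16(2.0f);  // 0x4000
  // Reducing the raw bits as integers, as a uint16-only backend would:
  uint16_t bitwise_sum = static_cast<uint16_t>(a + b);        // 0x7F80 -> +inf
  // Reducing the values with bf16-aware arithmetic:
  uint16_t value_sum = to_bf16(from_bf16(a) + from_bf16(b));  // 3.0
  std::printf("bitwise: %f, value: %f\n",
              from_bf16(bitwise_sum), from_bf16(value_sum));  // inf vs 3.000000
  return 0;
}

So if a unit test only exercises data movement, or casts back to float before reducing, a uint16-only backend could still appear to pass.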
LGTM
@@ -756,6 +756,9 @@ void* GetPointerByOffset(void* raw_pointer,
  } else if (type == experimental::DataType::BOOL) {
    return reinterpret_cast<void*>(reinterpret_cast<bool*>(raw_pointer) +
                                   offset);
  } else if (type == experimental::DataType::BFLOAT16) {
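The helper extended above advances a raw pointer by an offset counted in elements, so the pointer must be cast to the element type before the arithmetic. A simplified sketch of the idea (not the actual Paddle source; bfloat16 is again assumed to alias uint16_t on the host):

#include <cstddef>
#include <cstdint>

using bfloat16 = uint16_t;  // assumption: matches the host-side bf16 payload

void* GetPointerByOffsetSketch(void* raw_pointer, size_t offset, bool is_bf16) {
  if (is_bf16) {
    // +offset advances by offset * sizeof(bfloat16) == 2 * offset bytes
    return reinterpret_cast<void*>(
        reinterpret_cast<bfloat16*>(raw_pointer) + offset);
  }
  // bool is 1 byte, so here +offset advances by exactly offset bytes
  return reinterpret_cast<void*>(
      reinterpret_cast<bool*>(raw_pointer) + offset);
}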
AllReduce uint16 data?
As the code below shows, they use uint16 to represent bf16 for some reason 🤔
Paddle/paddle/phi/common/bfloat16.h
Lines 74 to 79 in 75528ad
#if defined(PADDLE_CUDA_BF16)
  __nv_bfloat16 tmp = __float2bfloat16(val);
  x = *reinterpret_cast<uint16_t*>(&tmp);
#else
  std::memcpy(&x, reinterpret_cast<char*>(&val) + 2, 2);
#endif
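A small sketch of why the #else branch above works, assuming a little-endian host: copying the two high bytes of a float is the same as shifting its bit pattern right by 16, which is exactly the bf16 payload.

#include <cassert>
#include <cstdint>
#include <cstring>

int main() {
  float val = 3.14159f;

  // The host path above: copy bytes 2..3 of the float (little-endian).
  uint16_t by_memcpy;
  std::memcpy(&by_memcpy, reinterpret_cast<char*>(&val) + 2, 2);

  // Endian-explicit equivalent: take the high 16 bits of the float's bits.
  uint32_t bits;
  std::memcpy(&bits, &val, sizeof(bits));
  uint16_t by_shift = static_cast<uint16_t>(bits >> 16);

  assert(by_memcpy == by_shift);  // both hold the bf16 payload 0x4049
  return 0;
}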
And it seems that we cannot use to_tensor or cast to get a uint16 tensor.
This issue mentioned the uint16 problem: #34927
LGTM
Shortened the timeouts of the communication-library unit tests.
LGTM
* Support both use_calc_stream and sync_op in send recv APIs (#46023)
* Support both use_calc_stream and sync_op in allgather API (#46295)
* Support both use_calc_stream and sync_op in collective communication API (#46761)
* Move group and all reduce from collective to communication (#45848)
* Completes bfloat16 dtype for collective api in eager mode (#45844)
* Fix collective APIs cannot be recognized when building docs (#46962)

Co-authored-by: LiYuRio <63526175+LiYuRio@users.noreply.github.com>
PR types
New features
PR changes
OPs
Describe
This PR further completes the basic functionality of the communication framework: collective communication operations now support a richer set of data types.