Support reduce_sum_op float16 #32966
Conversation
Thanks for your contribution!
paddle/fluid/operators/kron_op.h
Outdated
@@ -301,7 +301,10 @@ template <typename T>
struct IdentityFunctor {
  HOSTDEVICE explicit inline IdentityFunctor() {}

  HOSTDEVICE inline T operator()(const T& x) const { return x; }
  template <typename T2>
Why use T2? Could a different letter be considered? And what is T still used for?
@@ -38,7 +38,10 @@ template <typename T>
struct IdentityFunctor {
  HOSTDEVICE explicit inline IdentityFunctor() {}

  HOSTDEVICE inline T operator()(const T& x) const { return x; }
  template <typename T2>
Same as above.
LGTM.
LGTM for op benchmark ci
@@ -241,7 +241,10 @@ template <typename T>
struct IdentityFunctor {
  HOSTDEVICE explicit inline IdentityFunctor() {}

  HOSTDEVICE inline T operator()(const T& x) const { return x; }
  template <typename U>
One question: is the template type T in the class definition now unused?
Yes. The U was added because without it operator() could only accept float16 arguments, so compilation fails when float is used for accumulation. T is kept for compatibility: removing it would require changing every place that calls IdentityFunctor.
LGTM for op benchmark ci
PR types
New features
PR changes
OPs
Describe
Background
Pure-fp16 mixed-precision training does not support optimizers with gradient clipping. One of the underlying problems is that reduce_sum_op fails to compile when it encounters the paddle::platform::float16 type.

Investigation
Analyzing the build log, the compilation error occurs inside cub::Reduce; in other words, cub::Reduce does not support the paddle::platform::float16 type. Digging further, the error comes from the paddle::platform::float16-to-float conversion: in isolated experiments, float(float16_num) and static_cast<float>(float16_num) both compile, but float num = float16_num fails, with the identical error message. We can therefore conclude that cub::Reduce fails because somewhere in its implementation it assigns float16 data directly (with =) to a float variable.
Solution

Option 1
Since the failure is a type conversion, the input and output types must differ: the input is paddle::platform::float16 and the output is float. Looking at the code, in TensorReduceFunctor::apply the input type is Tx; TransformOp reads the data, processes it, and returns Tx as well, while the output type is Ty, so a conversion happens whenever Ty differs from Tx. A simple fix is to wrap TransformOp in a conversion layer, so that it reads Tx but returns Ty.
Option 2
Option 1 certainly works, and it requires no changes to other code, which is convenient. The problem is that when the input and output are both float16, the result carries a large precision error, so the best approach is to accumulate in the higher-precision float type whenever the input is float16. In code, this means adding an MPType used as the intermediate computation type: float when the input is float16, and the input type itself otherwise. Of course, since we cannot modify the internals of cub::Reduce, and cub::Reduce offers no MPType-like parameter, we have to implement our own ReduceKernel1D, which calls cub::BlockReduce internally but computes in MPType. For consistency, MPType is also added to the other hand-written kernels, ReduceKernel and ReduceKernel2D, to guarantee precision under float16. The change is implemented in commit 5596a88.
Remaining issues
1. TransformOp has to be changed so that operator() accepts inputs of other types. Because many ops use the TensorReduce function, the TransformOp::operator() of every op that calls TensorReduce must be modified, which is a very large change.
2. The hand-written kernel is unlikely to be faster than cub::Reduce, so compared with cub::Reduce there may be a performance loss.
3. Besides the new ReduceKernel1D, the existing ReduceKernel2D and ReduceKernel also need to be changed, which is another sizeable modification.
Is there any way to avoid these three problems? For example, could we specialize at TensorReduceFunctor::apply: when the input and output are both float16, first force TensorReduce to output float, and then cast the result to float16?