-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support Div and FloorDiv functor in elementwise system #33053
Conversation
Thanks for your contribution! |
…Paddle into Adding_div_functors
Sorry to inform you that bc2b805's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually. |
5e62170
to
e2692ee
Compare
d610983
to
3b5c6b2
Compare
3b5c6b2
to
2f72406
Compare
PADDLE_ENFORCE(argv[1] != 0, | ||
"InvalidArgument: divide by zero " | ||
"encountered in floor-divide ops, please check.\n"); | ||
return argv[0] / argv[1]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
不用调trunc
函数了吗?那和div计算就没区别了啊
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Floor_div 处理的数据类型仅有 int32和 int64,返回值如果也是 int32 或 int64的话,会有自动截断操作。本地也搭建了测试脚本发现两者的计算结果相同。
import numpy as np
import paddle as pd
npx = np.array([9, -11, -8, 7])
npy = np.array([2, 3, 3, 2])
x = pd.to_tensor(npx, dtype='int32')
y = pd.to_tensor(npy, dtype='int32')
z2 = pd.divide(x, y)
z1 = pd.floor_divide(x, y)
result = pd.subtract(z1, z2) # result = [0, 0, 0, 0]
目前对比的结果显示,针对 floor_div 的计算,Paddle和Pytorch的计算结果是一样的,具体表现是在针对 floor_div(-8, 3)
这种计算时,
paddle和 pytorch计算的规则生成的结果是 -2,向中心点floor;
Numpy 和 tensorflow的结果是 -3,向x负半轴方向floor。
- 感觉可以再讨论一下是否需要修改Paddle中floor_div的计算规则。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
找op负责人确认下,或者当前pr先保留trunc
。
if (numel / (sm_nums << 1) < ELEMENTWISE_BLOCK_SIZE) { | ||
threads = platform::RoundToPowerOfTwo(numel / (sm_nums << 1)); | ||
} else if (numel / (sm_nums << 2) < ELEMENTWISE_BLOCK_SIZE) { | ||
threads = platform::RoundToPowerOfTwo(numel / (sm_nums << 1)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
L41和L43写的一样?另外这2行分别加个注释说明下。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
L43出现笔误,下个commit修改掉。
* than number of SMs. Hence, SMs is took into consideration within this | ||
* function to determine the right number of threads per block, | ||
*/ | ||
template <typename Enable = void> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
为什么要用模板呢?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这里用的是偷懒的做法,增加模板或者函数头增加static 关键字,防止了编译过程出现函数重定义的问题。实际修改方法应该是将新开一个.cu文件,然后把函数放进.cu里面完成实例化。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
不应该啊,文件开头不是有#pragma once
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
查了一下stackoverflow 这个问题不能通过 #pragma once 解决。总结出来的问题有以下两种:
(1)仍旧另起一个.cu文件实例化函数体;
(2)加inline 或者 static 关键字;
偷了个懒,选择加了个inline 关键字解决。
if (address % vec4 == 0) { | ||
return 4; | ||
return std::min(vec4, valid_vec_size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这里,float16
的向量化长度可能返回8
,但是你没有检查地址是否满足vec8
对齐的要求。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这点确实考虑漏掉了,按照建议修改,下个Commit提交。
#pragma unroll | ||
for (int j = 0; j < ET; ++j) { | ||
ins[j] = ins_ptr[j][i]; | ||
} | ||
out_ptr[i] = func(ins); | ||
out_vec.val[i] = func(ins); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
我们的AlignedVector
或许可以改成继承Array
类,定义AlignedArray
。
Paddle/paddle/fluid/framework/array.h
Lines 24 to 35 in 42c1297
template <typename T, size_t N> | |
class Array { | |
public: | |
static constexpr size_t kSize = N; | |
HOSTDEVICE inline Array() {} | |
template <typename... Args> | |
HOSTDEVICE inline explicit Array(const T &val, Args... args) { | |
static_assert(N == sizeof...(Args) + 1, "Invalid argument"); | |
UnrollVarArgsAssign<T>::Run(data_, val, args...); | |
} |
3a107ae
to
85d954c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm for paddle enforce
PR types
Performance optimization
PR changes
OPs
Describe
Basing on new elementwise + broadcast system support binary functors below :
Div
Floor_div
The performance variation is below:
The explicit comparison of floor_div is below: