Support Div and FloorDiv functor in elementwise system #33053

Merged
merged 29 commits into from
Jun 12, 2021

JamesLim-sy
Contributor

@JamesLim-sy JamesLim-sy commented May 21, 2021

PR types

Performance optimization

PR changes

OPs

Describe

  1. Based on the new elementwise + broadcast system, this PR adds support for the following binary functors:
    Div
    Floor_div

  2. The performance changes are shown below:
    [Screenshot 2021-06-01, 8:06:18 PM]

The detailed comparison for floor_div is below:

| x.shape | y.shape | data type | Paddle dev (µs) | Paddle opti (µs) | PyTorch (µs) | Perf diff (relative to PyTorch) |
|---|---|---|---|---|---|---|
| [16, 128, 8] | [16, 128, 8] | int64 | 2.0700 | 1.814 | 2.006 | slow 3.19% -> fast 9.57% |
| [300, 128, 100] | [300, 128, 100] | int32 | 57.514 | 56.785 | 56.587 | slow 1.64% -> slow 0.35% |
| [300, 128, 100] | [1] | int64 | 78.224 | 76.359 | 76.550 | slow 2.19% -> fast 0.25% |
  3. As the table shows, the time cost of most test cases improves for these ops after the optimization of the Elementwise and Broadcast ops.
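The "Perf diff" column appears to be the relative difference between the Paddle time and the PyTorch time, with PyTorch as the baseline. A minimal sketch of that arithmetic follows; the function name `RelativeDiffPercent` is hypothetical and the formula is an assumption that happens to reproduce the reported percentages.

```cpp
#include <cassert>

// Hypothetical helper: relative difference of a Paddle timing against the
// PyTorch baseline, in percent. Positive means Paddle is slower, negative
// means Paddle is faster. The formula is an assumption, not taken from the PR.
inline double RelativeDiffPercent(double paddle_us, double pytorch_us) {
  assert(pytorch_us > 0.0);
  return (paddle_us - pytorch_us) / pytorch_us * 100.0;
}
```

For the first row, `RelativeDiffPercent(2.0700, 2.006)` is about 3.19 (slower) and `RelativeDiffPercent(1.814, 2.006)` is about -9.57 (faster), which matches "slow 3.19% -> fast 9.57%".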

@paddle-bot-old

Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

@JamesLim-sy JamesLim-sy changed the title Support Div binary functors in elementwise system Support Div Floor_div binary functors in elementwise system May 23, 2021
@paddle-bot-old

Sorry to inform you that bc2b805's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.

@JamesLim-sy JamesLim-sy changed the title Support Div Floor_div binary functors in elementwise system Support Pow Div Floor_div functors in elementwise system Jun 1, 2021
@JamesLim-sy JamesLim-sy changed the title Support Pow Div Floor_div functors in elementwise system Support Div Floor_div functors in elementwise system Jun 1, 2021
@JamesLim-sy JamesLim-sy changed the title Support Div Floor_div functors in elementwise system Support Div functor in elementwise system Jun 5, 2021
@JamesLim-sy JamesLim-sy changed the title Support Div functor in elementwise system Support Div and FloorDiv functor in elementwise system Jun 8, 2021
PADDLE_ENFORCE(argv[1] != 0,
               "InvalidArgument: divide by zero "
               "encountered in floor-divide ops, please check.\n");
return argv[0] / argv[1];
Contributor

Don't we need to call the trunc function anymore? Then this is no different from the div computation.

Contributor Author

@JamesLim-sy JamesLim-sy Jun 10, 2021

Floor_div only handles the int32 and int64 data types. When the return value is also int32 or int64, the result is truncated automatically. I also set up a local test script and found that the two computations give the same results.

import numpy  as np
import paddle as pd

npx = np.array([9, -11, -8, 7])
npy = np.array([2,   3,  3, 2])
x = pd.to_tensor(npx, dtype='int32')
y = pd.to_tensor(npy, dtype='int32')

z2 = pd.divide(x, y)
z1 = pd.floor_divide(x, y)
result = pd.subtract(z1, z2)  # result = [0, 0, 0, 0]

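The comparison so far shows that Paddle and PyTorch produce the same results for floor_div. Specifically, for a computation such as floor_div(-8, 3):
Paddle and PyTorch return -2, rounding toward zero;
NumPy and TensorFlow return -3, rounding toward negative infinity.

  • It may be worth discussing further whether the floor_div computation rule in Paddle should be changed.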
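The two rounding rules discussed in this comment can be sketched in plain C++ as follows. The names `TruncDiv` and `FloorDiv` are hypothetical and are not the functors in this PR; the sketch only illustrates why plain integer `/` gives -2 for (-8, 3) while a floor-style division gives -3.

```cpp
#include <cassert>
#include <cstdint>

// Truncating division: what the C++ `/` operator does for integers
// (the quotient is rounded toward zero).
inline int64_t TruncDiv(int64_t a, int64_t b) {
  assert(b != 0);
  return a / b;
}

// Floor division: the quotient is rounded toward negative infinity.
inline int64_t FloorDiv(int64_t a, int64_t b) {
  assert(b != 0);
  int64_t q = a / b;
  // When there is a remainder and the operands have opposite signs, the
  // truncated quotient is one too large, so step it down by one.
  if ((a % b != 0) && ((a < 0) != (b < 0))) {
    --q;
  }
  return q;
}
```

With these definitions, `TruncDiv(-8, 3)` is -2 and `FloorDiv(-8, 3)` is -3, while both return 4 for (9, 2). The two rules only differ when the exact quotient is negative and not an integer.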

Contributor

Please confirm with the op owner, or keep trunc in this PR for now.

if (numel / (sm_nums << 1) < ELEMENTWISE_BLOCK_SIZE) {
  threads = platform::RoundToPowerOfTwo(numel / (sm_nums << 1));
} else if (numel / (sm_nums << 2) < ELEMENTWISE_BLOCK_SIZE) {
  threads = platform::RoundToPowerOfTwo(numel / (sm_nums << 1));
Contributor

Are L41 and L43 identical? Also, please add a comment to each of these two lines to explain it.

Contributor Author

L43 is a typo; I will fix it in the next commit.
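The thread does not spell out the corrected line. The sketch below assumes the intended fix is for the second branch to divide by `sm_nums << 2`, matching its own condition. `kElementwiseBlockSize` (value 512), `RoundUpToPowerOfTwo`, and `GetThreadsPerBlock` are hypothetical stand-ins for `ELEMENTWISE_BLOCK_SIZE`, `platform::RoundToPowerOfTwo`, and the function under review; their real definitions may differ.

```cpp
#include <cassert>
#include <cstdint>

// Assumed value for illustration only.
constexpr int kElementwiseBlockSize = 512;

// Hypothetical stand-in for platform::RoundToPowerOfTwo: the smallest power
// of two that is >= n (returns 1 for n <= 1).
inline int RoundUpToPowerOfTwo(int n) {
  int p = 1;
  while (p < n) p <<= 1;
  return p;
}

// Sketch of the thread-count selection with the assumed fix applied.
inline int GetThreadsPerBlock(int64_t numel, int sm_nums) {
  assert(numel >= 0 && sm_nums > 0);
  // Elements per block if 2 blocks (resp. 4 blocks) are launched per SM.
  const int64_t per_block_x2 = numel / (static_cast<int64_t>(sm_nums) << 1);
  const int64_t per_block_x4 = numel / (static_cast<int64_t>(sm_nums) << 2);
  if (per_block_x2 < kElementwiseBlockSize) {
    return RoundUpToPowerOfTwo(static_cast<int>(per_block_x2));
  } else if (per_block_x4 < kElementwiseBlockSize) {
    // Assumed fix: use the same divisor as the condition of this branch.
    return RoundUpToPowerOfTwo(static_cast<int>(per_block_x4));
  }
  return kElementwiseBlockSize;
}
```

Under these assumptions, an input of 16 * 128 * 8 = 16384 elements on a device with 80 SMs gives 16384 / 160 = 102, which rounds up to 128 threads per block.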

* than number of SMs. Hence, the number of SMs is taken into consideration within this
* function to determine the right number of threads per block.
*/
template <typename Enable = void>
Contributor

Why use a template here?

Contributor Author

This is the lazy approach: adding a template, or adding the static keyword to the function header, prevents function redefinition errors during compilation. The proper fix would be to create a new .cu file and put the function there so that it is instantiated in one place.

Contributor

That shouldn't happen. Isn't there a #pragma once at the top of the file?

Contributor Author

I looked this up on Stack Overflow: the problem cannot be solved by #pragma once. The available fixes fall into two kinds:
(1) create a separate .cu file and define the function body there;
(2) add the inline or static keyword.
I took the lazy route and added the inline keyword.
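For context, `#pragma once` only stops a header from being included twice within one translation unit. A non-template function defined in a header without `inline` or `static` still yields one external definition per source file that includes it, which the linker rejects. A minimal sketch of option (2) follows; the header and the function `CeilDiv` are hypothetical and unrelated to the function in this PR.

```cpp
// Contents of a hypothetical header that begins with `#pragma once` and is
// included by several .cu / .cc files.
#include <cassert>

// `inline` allows each translation unit that includes the header to contain
// this definition; the linker treats the copies as one function instead of
// reporting a multiple-definition error. `static` would also link, but each
// translation unit would then keep its own private copy.
inline int CeilDiv(int numerator, int denominator) {
  assert(numerator >= 0 && denominator > 0);
  return (numerator + denominator - 1) / denominator;
}
```

Function templates are exempt from the one-definition restriction in the same way, which is why the earlier `template <typename Enable = void>` workaround also compiled.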

if (address % vec4 == 0) {
  return 4;
  return std::min(vec4, valid_vec_size);
Contributor

Here, the vectorization length for float16 may come back as 8, but you do not check whether the address satisfies the vec8 alignment requirement.

Contributor Author

I did overlook this. I will change it as suggested and submit it in the next commit.
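One way to cover the vec8 case is to test the address against each candidate width, widest first, and return the first one that is both allowed and aligned. The sketch below is a hypothetical illustration, not the code that was committed; `GetVectorizedSize` and its `max_vec_size` parameter are assumed names.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: the largest vector width (in elements) that can be
// used to load from `ptr`, capped at `max_vec_size` (for example 8 for a
// 2-byte type such as float16, 4 for a 4-byte type).
template <typename T>
int GetVectorizedSize(const T* ptr, int max_vec_size) {
  assert(max_vec_size >= 1);
  const auto address = reinterpret_cast<std::uintptr_t>(ptr);
  // Check 8, then 4, then 2 elements; a width is usable only if the address
  // is a multiple of the vector's size in bytes.
  for (int vec = 8; vec > 1; vec /= 2) {
    if (vec <= max_vec_size && address % (sizeof(T) * vec) == 0) {
      return vec;
    }
  }
  return 1;  // Fall back to scalar access.
}
```

With a 2-byte element type and a 16-byte-aligned buffer, the helper returns 8 for the buffer start and 4 for a pointer offset by four elements, so a width of 8 is never reported for an address that is only 8-byte aligned.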

#pragma unroll
for (int j = 0; j < ET; ++j) {
  ins[j] = ins_ptr[j][i];
}
out_ptr[i] = func(ins);
out_vec.val[i] = func(ins);
Contributor

Perhaps our AlignedVector could be changed to inherit from the Array class, defining an AlignedArray.

template <typename T, size_t N>
class Array {
 public:
  static constexpr size_t kSize = N;
  HOSTDEVICE inline Array() {}
  template <typename... Args>
  HOSTDEVICE inline explicit Array(const T &val, Args... args) {
    static_assert(N == sizeof...(Args) + 1, "Invalid argument");
    UnrollVarArgsAssign<T>::Run(data_, val, args...);
  }

Contributor

@Xreki Xreki left a comment

LGTM

Contributor

@chenwhql chenwhql left a comment

lgtm for paddle enforce

@Xreki Xreki merged commit fcd93b3 into PaddlePaddle:develop Jun 12, 2021

3 participants