
Optimize check_finite_and_unscale_op #31954

Merged

Conversation

thisjiang
Contributor

@thisjiang thisjiang commented Mar 30, 2021

PR types

Performance optimization

PR changes

OPs

Describe

Motivation:

CheckFiniteAndUnscale accounts for as much as 5.7% of the timeline of the new ernie_doc model. The timeline shows that check_finite_and_unscale_op calls CheckFiniteAndUnscale many times in a single run, up to 300 times, and each call consists of several small kernels, so there is room for optimization.

Code analysis:

The original code contains a for loop:

for (size_t i = 0; i < xs.size(); ++i) {
  ...
  CheckFiniteAndUnscale<<<...>>>(xs[i]->data<T>(), ..., outs[i]->data<T>());
  ...
}

`xs` and `outs` are both `vector<Tensor*>`. Regardless of how large each tensor is, the for loop launches CheckFiniteAndUnscale once for every tensor.

Optimization

Optimization method 1:

commit id: b2eba11
Clearly, the most effective approach is to fuse the kernel: remove the outer for loop so that the kernel is launched only once, regardless of xs.size().

Difficulties:

  1. `xs` and `outs` are host-side `vector<Tensor*>` variables and must be copied to the device (see the staging sketch after this list).
  2. The tensor data in `xs` and `outs` is not contiguous in memory, so how do we determine which element of which tensor the current thread is processing?
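
A minimal host-side sketch of how the per-tensor pointers might be staged on the device, using plain cudaMalloc/cudaMemcpy in place of Paddle's memory::Alloc/memory::Copy wrappers; the Tensor struct, the (n + 1)-entry starts array, and all names here are illustrative assumptions, not the PR's actual code:

```cpp
#include <cuda_runtime.h>
#include <cstdint>
#include <vector>

// Simplified stand-in for the framework tensor (assumption).
struct Tensor {
  float* data;    // device pointer to this tensor's elements
  int64_t numel;  // number of elements
};

// Stage per-tensor data pointers and flattened start indices on the device.
void StagePointers(const std::vector<Tensor*>& xs, float*** d_xs,
                   int64_t** d_starts, int64_t* total_num) {
  const size_t n = xs.size();
  std::vector<float*> h_xs(n);
  std::vector<int64_t> h_starts(n + 1, 0);  // starts[i] = total size of tensors before i
  for (size_t i = 0; i < n; ++i) {
    h_xs[i] = xs[i]->data;
    h_starts[i + 1] = h_starts[i] + xs[i]->numel;
  }
  cudaMalloc(d_xs, n * sizeof(float*));
  cudaMalloc(d_starts, (n + 1) * sizeof(int64_t));
  cudaMemcpy(*d_xs, h_xs.data(), n * sizeof(float*), cudaMemcpyHostToDevice);
  cudaMemcpy(*d_starts, h_starts.data(), (n + 1) * sizeof(int64_t),
             cudaMemcpyHostToDevice);
  *total_num = h_starts[n];  // flattened total element count
}
```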

Optimization details:

  1. Allocate two pointer arrays of size xs.size() via memory::Alloc, storing the starting address of each tensor's data in `xs` and `outs` respectively, and copy them to the device via memory::Copy.
  2. Treat the data of all tensors as flattened into one contiguous 1-D sequence, laid out the same way as the kernel's threads, and record each tensor's starting index in that sequence, i.e. the sum of the sizes of all tensors before it. A thread id can then be mapped to a data location: if the current thread index is no less than one tensor's starting index and less than the next tensor's starting index, the thread is processing that tensor, and the index within the tensor is the thread index minus the tensor's starting index.
  3. Concretely, allocate an int64_t array `starts` of size xs.size() via memory::Alloc, where each element records a tensor's starting index, and copy it to the device via memory::Copy. Since this array is read frequently, the kernel stages it in shared memory to avoid the latency of repeated global-memory accesses.
  4. Likewise, to avoid the cost of repeated global-memory accesses, `found_inf` and `scale` are kept in registers during the computation.
  5. Because the total size of all tensors is very large, having each thread process a single element often makes the launch grid extremely large, with much of the time spent switching between blocks. Each thread therefore processes 20 elements instead. A minimal sketch of the fused kernel follows this list.
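
A minimal sketch of a fused kernel combining points 2–5 above, assuming the staging helper from the earlier sketch; the signature, the linear scan over s_starts, and the precomputed inverse scale are simplifications for illustration, not the PR's real implementation:

```cuda
// One launch processes every tensor in xs (points 2-5 above).
__global__ void FusedCheckFiniteAndUnscale(float** xs, float** outs,
                                           const int64_t* starts, int n,
                                           int64_t num,
                                           const float* inverse_scale,
                                           bool* found_inf) {
  // Point 3: stage the frequently-read starts array in shared memory.
  extern __shared__ int64_t s_starts[];  // n + 1 entries
  for (int i = threadIdx.x; i <= n; i += blockDim.x) s_starts[i] = starts[i];
  __syncthreads();

  // Point 4: accumulate the inf/nan flag in a register.
  bool local_found_inf = false;
  const float inv_scale = *inverse_scale;

  const int64_t tid = blockIdx.x * (int64_t)blockDim.x + threadIdx.x;
  const int64_t stride = (int64_t)gridDim.x * blockDim.x;

  // Point 5: grid-stride loop, so each thread handles several elements
  // and the grid stays small even when the flattened total `num` is huge.
  for (int64_t idx = tid; idx < num; idx += stride) {
    // Point 2: map the flattened index to (tensor, offset). A linear scan
    // is used here for clarity; the PR tracks a running index instead.
    int t = 0;
    while (idx >= s_starts[t + 1]) ++t;
    const int64_t offset = idx - s_starts[t];

    const float val = xs[t][offset] * inv_scale;
    outs[t][offset] = val;
    local_found_inf |= !isfinite(val);
  }
  if (local_found_inf) *found_inf = true;  // benign race: all writers store true
}
```

Launching with roughly num / (block_size * 20) blocks and (n + 1) * sizeof(int64_t) bytes of dynamic shared memory gives each thread about 20 elements via the grid-stride loop.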

Optimization results:

| ernie_doc model speed (V100-SXM2-16GB) | FP32 | AMP | Speedup |
| --- | --- | --- | --- |
| Before optimization | 4.34 steps/s | 8.49 steps/s | 1.95 |
| Optimization 1 | 4.34 steps/s | 8.91 steps/s | 2.05 |

| ResNet50 AMP model speed (V100-SXM2-32GB) | Before optimization | Optimization 1 |
| --- | --- | --- |
| ips | 1339 images/sec | 1347 images/sec |

| Timeline share | Before optimization | Optimization 1 |
| --- | --- | --- |
| ernie_doc AMP (BS=2048) | 5.7% | 1.5% |
| ResNet50 AMP | 0.3% | 0.2% |

ResNet50 convergence verification

Model script: ResNet50_fp16.sh
[Convergence curves comparing runs before and after the optimization: train loss, train avg loss, test avg loss, test acc1, test acc5]

@paddle-bot-old

Thanks for your contribution!
Please wait for the result of CI first. See the Paddle CI Manual for details.

Contributor

@wzzju wzzju left a comment


LGTM.

@wzzju wzzju merged commit fdf63b4 into PaddlePaddle:develop Apr 13, 2021
@thisjiang thisjiang deleted the optimize-check_finite_and_unscale branch April 13, 2021 03:25
for (int64_t idx = tid; idx < num; idx += gridDim.x * blockDim.x) {
// find which tensor in xs this thread's index falls into
int xs_index = pre_xs_index;
while (idx < s_starts[xs_index]) xs_index++;
Contributor

The code at line 48 may never be triggered.

Contributor Author

This line has been removed in PR #32554.
