
[Cherry-pick] Optimize update_loss_scaling_op (#32554) #32606

Conversation

@thisjiang (Contributor) commented Apr 27, 2021

PR types

Performance optimization

PR changes

OPs

Describe

Background:

Similar to CheckFiniteAndUnscale, the timeline shows that update_loss_scaling_op calls FillIf many times within a single run (up to 300 calls), and the op consists of several small kernels, so there is room for optimization.

Code analysis:

As before, the original code contains a for loop:

```cuda
for (size_t i = 0; i < xs.size(); ++i) {
  ...
  FillIf<<<...>>>(outs[i]->mutable_data<T>(), ...);
  ...
}
```

outs is a vector<Tensor*>, and regardless of how large each tensor is, the for loop must launch FillIf once for every tensor in it.

Optimization

Optimization 1:

commit id: ad79dff
Clearly, fusing the kernel and removing the outer for loop, so that only a single kernel launch is needed regardless of xs.size(), should give the most noticeable improvement.

The basic idea is the same as in PR31954 and is not repeated here. One additional note: since FillIf simply assigns value to each element of outs, having each thread process a single element would launch far too many threads and leave the compute resources underutilized. To mitigate this, each thread here processes 50 elements, which reduces the warp-switching overhead.
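As a concrete illustration, below is a minimal CUDA sketch of this fusion idea. It is a hypothetical reconstruction, not the exact Paddle kernel: the names FusedFillIf, starts, and found_inf are assumed, and starts is taken to be the exclusive prefix sum of the tensor sizes (so starts[num_tensors] == total_num).

```cuda
// Hedged sketch of the fused fill: one launch covers every output tensor,
// replacing the per-tensor FillIf launches in the original loop.
// Assumptions (not taken from the PR): `outs` is an array of device
// pointers, `starts` is the exclusive prefix sum of tensor sizes with
// num_tensors + 1 entries, and `found_inf` is a device-side flag.
template <typename T>
__global__ void FusedFillIf(T** outs, const int64_t* starts, int64_t total_num,
                            T value, const bool* found_inf) {
  if (!(*found_inf)) return;  // nothing to overwrite in this step
  const int kElementsPerThread = 50;  // one thread handles 50 elements
  int64_t tid = static_cast<int64_t>(blockIdx.x) * blockDim.x + threadIdx.x;
  int64_t begin = tid * kElementsPerThread;
  int64_t end = begin + kElementsPerThread;
  if (end > total_num) end = total_num;
  int index = 0;
  for (int64_t i = begin; i < end; ++i) {
    // Advance to the tensor that owns flattened element i.
    while (i >= starts[index + 1]) ++index;
    outs[index][i - starts[index]] = value;
  }
}
```

With this layout the grid only needs roughly total_num / (50 * block_size) blocks instead of one thread per element, which is where the reduced warp-switching overhead comes from.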

Optimization 2:

commit id: 527779a

  1. Removed the dead line `while (id < s_starts[index]) index++;` from the check_finite_and_unscale and update_loss_scaling_op kernels; it was verified that this line is never reached in either kernel (see the sketch after this list).
  2. Renamed the variables in the check_finite_and_unscale and update_loss_scaling_op kernels to make them clearer.
  3. Added several comments to make the code easier for later readers to understand and maintain.
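To see why the deleted line is unreachable, here is a hedged reconstruction of the index search (the function name and surrounding structure are assumed; only s_starts and the deleted line come from the PR description). Once the forward scan establishes s_starts[index] <= id, and since s_starts is non-decreasing, the condition id < s_starts[index] can never hold afterwards.

```cuda
// Hypothetical reconstruction, for illustration only.
// s_starts is the non-decreasing prefix-sum array of tensor sizes.
__device__ int FindTensorIndex(const int64_t* s_starts, int index, int64_t id) {
  // Forward scan: on exit, s_starts[index] <= id < s_starts[index + 1].
  while (id >= s_starts[index + 1]) {
    ++index;
  }
  // The removed dead line sat here:
  //   while (id < s_starts[index]) index++;
  // The loop above already guarantees s_starts[index] <= id, and
  // incrementing index only makes s_starts[index] larger, so this
  // backward check can never be entered (and would spin forever if it were).
  return index;
}
```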

Performance results:

| ernie_doc model speed (single V100-SXM2-16GB card) | FP32 | AMP | Speedup |
| --- | --- | --- | --- |
| Before (BS=2048) | 4.48 sequence/s | 9.78 sequence/s | 2.18 |
| Optimization 1 (BS=2048) | 4.48 sequence/s | 9.85 sequence/s | 2.19 |

| ernie_doc op cost | Before | Optimization 1 |
| --- | --- | --- |
| update_loss_scaling_op | 1.406 ms | 0.685 ms |

| ResNet50 AMP model speed (single V100-SXM2-16GB card) | Before | Optimization 1 |
| --- | --- | --- |
| Average ips over steps 10~510 (BS=208) | 1415 images/sec | 1416 images/sec |
| Average ips over steps 10~510 (BS=128) | 1331 images/sec | 1331 images/sec |

| Timeline share | Before | Optimization 1 |
| --- | --- | --- |
| ernie_doc AMP (BS=2048) | 1% | 0.7% |
| ResNet50 AMP (BS=208) | 0.2% | <0.1% |
| ResNet50 AMP (BS=128) | 0.4% | <0.1% |

ResNet50 convergence verification

Model script: [ResNet50_fp16.sh]. Convergence curves (images not reproduced here): train loss, train avg loss, test avg loss, test acc 1, test acc 5.

@paddle-bot-old commented:

Thanks for your contribution!
Please wait for the CI result first. See the Paddle CI Manual for details.

@wzzju (Contributor) left a comment:

LGTM.

@lanxianghit lanxianghit merged commit 33703da into PaddlePaddle:release/2.1 Apr 28, 2021
@thisjiang thisjiang deleted the cherrypick-optimize-update_loss_scaling branch April 28, 2021 08:32