[Cherry-pick] Optimize update_loss_scaling_op(#32554) #32606
Merged
PR types
Performance optimization
PR changes
OPs
Describe
Motivation:
Similar to CheckFiniteAndUnscale, the timeline shows that update_loss_scaling_op calls FillIf many times within a single run, up to 300 times, all of them small kernels, so there is room for optimization.

Code analysis:
As in that case, the original code contains a for loop: outs is a vector<Tensor*>, and no matter how large each tensor is, the loop has to launch FillIf once for every tensor in it.
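For illustration, here is a minimal sketch of that launch pattern. It is written under assumptions rather than copied from the Paddle source: FillIfSketch, FillAllOuts, and found_inf are hypothetical names, and the launch parameters are placeholders.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One small kernel per output tensor: conditionally fills the buffer with `value`.
__global__ void FillIfSketch(float* out, int64_t num, float value,
                             const bool* found_inf) {
  int64_t idx = blockIdx.x * (int64_t)blockDim.x + threadIdx.x;
  if (idx < num && *found_inf) {
    out[idx] = value;
  }
}

// Host-side loop matching the pattern described above: regardless of tensor size,
// every output gets its own kernel launch, up to ~300 launches in one run.
void FillAllOuts(const std::vector<float*>& outs, const std::vector<int64_t>& nums,
                 float value, const bool* found_inf, cudaStream_t stream) {
  for (size_t i = 0; i < outs.size(); ++i) {
    const int threads = 256;
    const int blocks = static_cast<int>((nums[i] + threads - 1) / threads);
    FillIfSketch<<<blocks, threads, 0, stream>>>(outs[i], nums[i], value, found_inf);
  }
}
```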
Optimization

Optimization 1: commit id: ad79dff
Clearly, the most effective change is to fuse the kernels: drop the outer for loop so that only a single kernel launch is needed regardless of xs.size(). The basic idea is the same as in PR #31954 and is not repeated here. One extra point is worth mentioning: since FillIf only assigns value to the elements of the tensors in outs one by one, having each thread handle a single element would spawn too many threads and leave compute resources under-utilized. To improve this, each thread is set to process 50 elements, which lowers the warp-switching overhead.
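A minimal sketch of this fused, thread-coarsened idea follows. It is written under assumptions rather than taken from the PR: FusedFillIfSketch, outs_addr, starts, and kNumPerThread are hypothetical names, starts is assumed to hold prefix sums of the element counts, and 50 elements per thread matches the figure mentioned above.

```cuda
#include <cuda_runtime.h>

// Fused version: a single launch fills every output tensor.
// `outs_addr` holds the device pointers of all outputs; `starts` holds prefix sums
// of their element counts (starts[0] == 0, starts[out_num] == total element count).
__global__ void FusedFillIfSketch(float** outs_addr, const int64_t* starts,
                                  int out_num, float value, const bool* found_inf) {
  if (!(*found_inf)) return;
  const int64_t kNumPerThread = 50;  // one thread owns 50 elements instead of 1
  const int64_t total = starts[out_num];
  int64_t begin =
      (blockIdx.x * (int64_t)blockDim.x + threadIdx.x) * kNumPerThread;
  int index = 0;
  for (int64_t id = begin; id < begin + kNumPerThread && id < total; ++id) {
    // Locate which output tensor this global element id falls into;
    // `index` only moves forward as `id` grows within the thread's chunk.
    while (id >= starts[index + 1]) ++index;
    outs_addr[index][id - starts[index]] = value;
  }
}
```

With 50 elements per thread, filling one million elements in total needs only about 20,000 threads instead of one million, so far fewer warps have to be scheduled.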
Optimization 2: commit id: 527779a
Removed the unused line while (id < s_starts[index]) index++; from the check_finite_and_unscale and update_loss_scaling_op kernels; it was verified that this line is never reached in either kernel. Also renamed the variables in both kernels to make them clearer.
Optimization results:
update_loss_scaling_op kernel timing
ResNet50 AMP model speed (single GPU on a V100-SXM2-16GB machine)
ResNet50 convergence verification
Model script: [ResNet50_fp16.sh]