[Cherry-pick] Optimize update_loss_scaling_op(#32554) #32606
Merged
PR types
Performance optimization
PR changes
OPs
Describe
Motivation:
Similar to CheckFiniteAndUnscale, the timeline shows that update_loss_scaling_op calls FillIf many times within a single run, up to 300 times, all of them small kernels, so there is room for optimization.

Code analysis:
As in that case, the original code contains a for loop: outs is a vector<Tensor*>, and no matter how large each tensor is, the loop has to launch FillIf once for every tensor in it.
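For illustration, here is a minimal sketch of that launch pattern. It is written under assumptions rather than copied from the Paddle source: FillIfSketch, FillAllOuts, and found_inf are hypothetical names, and the launch parameters are placeholders.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One small kernel per output tensor: conditionally fills the buffer with `value`.
__global__ void FillIfSketch(float* out, int64_t num, float value,
                             const bool* found_inf) {
  int64_t idx = blockIdx.x * (int64_t)blockDim.x + threadIdx.x;
  if (idx < num && *found_inf) {
    out[idx] = value;
  }
}

// Host-side loop matching the pattern described above: regardless of tensor size,
// every output gets its own kernel launch, up to ~300 launches in one run.
void FillAllOuts(const std::vector<float*>& outs, const std::vector<int64_t>& nums,
                 float value, const bool* found_inf, cudaStream_t stream) {
  for (size_t i = 0; i < outs.size(); ++i) {
    const int threads = 256;
    const int blocks = static_cast<int>((nums[i] + threads - 1) / threads);
    FillIfSketch<<<blocks, threads, 0, stream>>>(outs[i], nums[i], value, found_inf);
  }
}
```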
Optimization

Optimization 1: commit id: ad79dff
Clearly, the most effective change is to fuse the kernels: drop the outer for loop so that only a single kernel launch is needed regardless of xs.size(). The basic idea is the same as in PR #31954 and is not repeated here. One extra point is worth mentioning: since FillIf only assigns value to the elements of the tensors in outs one by one, having each thread handle a single element would spawn too many threads and leave compute resources under-utilized. To improve this, each thread is set to process 50 elements, which lowers the warp-switching overhead.
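A minimal sketch of this fused, thread-coarsened idea follows. It is written under assumptions rather than taken from the PR: FusedFillIfSketch, outs_addr, starts, and kNumPerThread are hypothetical names, starts is assumed to hold prefix sums of the element counts, and 50 elements per thread matches the figure mentioned above.

```cuda
#include <cuda_runtime.h>

// Fused version: a single launch fills every output tensor.
// `outs_addr` holds the device pointers of all outputs; `starts` holds prefix sums
// of their element counts (starts[0] == 0, starts[out_num] == total element count).
__global__ void FusedFillIfSketch(float** outs_addr, const int64_t* starts,
                                  int out_num, float value, const bool* found_inf) {
  if (!(*found_inf)) return;
  const int64_t kNumPerThread = 50;  // one thread owns 50 elements instead of 1
  const int64_t total = starts[out_num];
  int64_t begin =
      (blockIdx.x * (int64_t)blockDim.x + threadIdx.x) * kNumPerThread;
  int index = 0;
  for (int64_t id = begin; id < begin + kNumPerThread && id < total; ++id) {
    // Locate which output tensor this global element id falls into;
    // `index` only moves forward as `id` grows within the thread's chunk.
    while (id >= starts[index + 1]) ++index;
    outs_addr[index][id - starts[index]] = value;
  }
}
```

With 50 elements per thread, filling one million elements in total needs only about 20,000 threads instead of one million, so far fewer warps have to be scheduled.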
Optimization 2: commit id: 527779a
Removed the unused line while (id < s_starts[index]) index++; from the check_finite_and_unscale and update_loss_scaling_op kernels; it was verified that this line is never reached in either kernel. Also renamed the variables in both kernels to make them clearer.
Optimization results:
update_loss_scaling_op kernel timing
ResNet50 AMP model speed (single GPU on a V100-SXM2-16GB machine)
ResNet50 convergence verification
Model script: [ResNet50_fp16.sh]