
Add MultiTensorApply to calculate L2-Norm in DistributedFusedLamb optimizer #39900

Merged (6 commits) on Feb 25, 2022

Conversation

@sneaxiy (Collaborator) commented on Feb 24, 2022

PR types

Performance optimization

PR changes

OPs

Describe

Use MultiTensorApply to improve the L2-Norm calculation in DistributedFusedLamb optimizer.

Before this optimization, DistributedFusedLamb called cub::DeviceSegmentedReduce to compute the Parameter L2-Norm and the Trust Ratio Div L2-Norm. This PR switches to MultiTensorApply and tunes launch parameters such as the maximum number of tensors and the maximum number of chunks per kernel launch.
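The MultiTensorApply idea can be sketched as follows: split every tensor into fixed-size chunks, group chunks into "launches" bounded by a maximum tensor count and a maximum chunk count, accumulate per-tensor partial squared sums in each launch, then run a cleanup pass that takes the square root. This is a minimal CPU sketch of the pattern with hypothetical names and parameters, not Paddle's actual CUDA kernels:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// One chunk of one tensor (illustrative struct, not Paddle's API).
struct Chunk {
    int tensor_id;      // which tensor this chunk belongs to
    const float* data;  // start of the chunk
    size_t len;         // number of elements in the chunk
};

// Phase 1: one "launch" processes a bounded batch of chunks and accumulates
// each chunk's squared sum into the per-tensor partial buffer.
static void multi_tensor_l2_partial(const std::vector<Chunk>& launch,
                                    std::vector<double>& partial_sq_sums) {
    for (const Chunk& c : launch) {
        double sq = 0.0;
        for (size_t i = 0; i < c.len; ++i) sq += double(c.data[i]) * c.data[i];
        partial_sq_sums[c.tensor_id] += sq;
    }
}

// Driver: chunk every tensor, group chunks into launches subject to the
// max_tensor_num / max_chunk_num limits (cf. MaxTensorNumPerLaunch /
// MaxChunkNumPerLaunch in the PR), then do the cleanup sqrt pass.
std::vector<double> multi_tensor_l2_norm(
        const std::vector<std::vector<float>>& tensors,
        size_t chunk_size, size_t max_tensor_num, size_t max_chunk_num) {
    std::vector<double> partial(tensors.size(), 0.0);
    std::vector<Chunk> launch;
    size_t tensors_in_launch = 0;
    int last_tensor = -1;
    for (size_t t = 0; t < tensors.size(); ++t) {
        const auto& v = tensors[t];
        for (size_t off = 0; off < v.size(); off += chunk_size) {
            size_t len = std::min(chunk_size, v.size() - off);
            bool new_tensor = (int(t) != last_tensor);
            // Flush the current launch when either limit would be exceeded.
            if (launch.size() >= max_chunk_num ||
                (new_tensor && tensors_in_launch >= max_tensor_num)) {
                multi_tensor_l2_partial(launch, partial);
                launch.clear();
                tensors_in_launch = 0;
                last_tensor = -1;
                new_tensor = true;
            }
            if (new_tensor) { ++tensors_in_launch; last_tensor = int(t); }
            launch.push_back({int(t), v.data() + off, len});
        }
    }
    if (!launch.empty()) multi_tensor_l2_partial(launch, partial);
    // Cleanup phase: square root of the accumulated squared sums.
    std::vector<double> norms(tensors.size());
    for (size_t t = 0; t < tensors.size(); ++t) norms[t] = std::sqrt(partial[t]);
    return norms;
}
```

The benefit on GPU is that many small segmented reductions are replaced by a few large launches whose per-launch work is bounded by the two tuning knobs, which is exactly what the parameter sweep below explores.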

BERT Large(batch_size = 56, max_seq_len = 512, pure_fp16):

  • Paddle baseline (using cub::DeviceSegmentedReduce):

| Kernel | Total calls | Total time | Calls per batch | Time per batch |
| --- | --- | --- | --- | --- |
| cub::DeviceSegmentedReduce | 2648 | 7182050678 | 2 | 5424509.576 |
  • NVIDIA's numbers (MaxTensorNumPerLaunch=110, MaxChunkNumPerLaunch=320):

| Kernel | Total calls | Total time | Calls per batch | Time per batch |
| --- | --- | --- | --- | --- |
| MultiTensorApply Kernel | 1962 + 1962 | 90753225 + 67747444 | 6 | 242355.763 |
| Cleanup Kernel | 1308 | 7537783 | 2 | 11525.6621 |
| Total | - | - | - | 253881.425 |
  • Paddle's numbers with MaxTensorNumPerLaunch=110, MaxChunkNumPerLaunch=320 (matching the NVIDIA configuration): 0.7%-2.2% faster than NVIDIA, essentially on par; about 95% faster than the Paddle baseline.

| Kernel | Total calls | Total time | Calls per batch | Time per batch |
| --- | --- | --- | --- | --- |
| GPU 0 MultiTensorApply Kernel | 3972 | 158703257 | 6 | 239733.017 |
| GPU 0 Cleanup Kernel | 1324 | 6489006 | 2 | 9802.12387 |
| GPU 0 Total | - | - | - | 249535.14 |
| GPU 1 MultiTensorApply Kernel | 3972 | 158083541 | 6 | 238796.89 |
| GPU 1 Cleanup Kernel | 1324 | 6246289 | 2 | 9435.48187 |
| GPU 1 Total | - | - | - | 248232.372 |
| GPU 2 MultiTensorApply Kernel | 3972 | 158550442 | 6 | 239502.178 |
| GPU 2 Cleanup Kernel | 1324 | 6351557 | 2 | 9594.49698 |
| GPU 2 Total | - | - | - | 249096.675 |
| GPU 3 MultiTensorApply Kernel | 3972 | 160482464 | 6 | 242420.64 |
| GPU 3 Cleanup Kernel | 1324 | 6396205 | 2 | 9661.94109 |
| GPU 3 Total | - | - | - | 252082.582 |
| GPU 4 MultiTensorApply Kernel | 3972 | 157911387 | 6 | 238536.838 |
| GPU 4 Cleanup Kernel | 1324 | 6313354 | 2 | 9536.78852 |
| GPU 4 Total | - | - | - | 248073.627 |
| GPU 5 MultiTensorApply Kernel | 3972 | 158206590 | 6 | 238982.764 |
| GPU 5 Cleanup Kernel | 1324 | 6352998 | 2 | 9596.67372 |
| GPU 5 Total | - | - | - | 248579.438 |
| GPU 6 MultiTensorApply Kernel | 3972 | 158026604 | 6 | 238710.882 |
| GPU 6 Cleanup Kernel | 1324 | 6412573 | 2 | 9686.66616 |
| GPU 6 Total | - | - | - | 248397.548 |
| GPU 7 MultiTensorApply Kernel | 3972 | 158762293 | 6 | 239822.195 |
| GPU 7 Cleanup Kernel | 1324 | 6391204 | 2 | 9654.38671 |
| GPU 7 Total | - | - | - | 249476.582 |
  • Paddle's numbers with MaxTensorNumPerLaunch=50, MaxChunkNumPerLaunch=680: 14% faster than NVIDIA; about 96% faster than the Paddle baseline.

| Kernel | Total calls | Total time | Calls per batch | Time per batch |
| --- | --- | --- | --- | --- |
| GPU 0 MultiTensorApply Kernel | 1324 | 137146700 | 2 | 207170.242 |
| GPU 0 Cleanup Kernel | 1324 | 6688593 | 2 | 10103.6148 |
| GPU 0 Total | - | - | - | 217273.856 |
| GPU 1 MultiTensorApply Kernel | 1324 | 137585724 | 2 | 207833.42 |
| GPU 1 Cleanup Kernel | 1324 | 6483174 | 2 | 9793.3142 |
| GPU 1 Total | - | - | - | 217626.734 |
| GPU 2 MultiTensorApply Kernel | 1324 | 137302011 | 2 | 207404.85 |
| GPU 2 Cleanup Kernel | 1324 | 6562869 | 2 | 9913.6994 |
| GPU 2 Total | - | - | - | 217318.55 |
| GPU 3 MultiTensorApply Kernel | 1324 | 137531462 | 2 | 207751.453 |
| GPU 3 Cleanup Kernel | 1324 | 6615911 | 2 | 9993.82326 |
| GPU 3 Total | - | - | - | 217745.276 |
| GPU 4 MultiTensorApply Kernel | 1324 | 137312205 | 2 | 207420.249 |
| GPU 4 Cleanup Kernel | 1324 | 6511524 | 2 | 9836.13897 |
| GPU 4 Total | - | - | - | 217256.388 |
| GPU 5 MultiTensorApply Kernel | 1324 | 137457972 | 2 | 207640.441 |
| GPU 5 Cleanup Kernel | 1324 | 6539170 | 2 | 9877.9003 |
| GPU 5 Total | - | - | - | 217518.341 |
| GPU 6 MultiTensorApply Kernel | 1324 | 137166650 | 2 | 207200.378 |
| GPU 6 Cleanup Kernel | 1324 | 6617388 | 2 | 9996.05438 |
| GPU 6 Total | - | - | - | 217196.432 |
| GPU 7 MultiTensorApply Kernel | 1324 | 137427733 | 2 | 207594.763 |
| GPU 7 Cleanup Kernel | 1324 | 6563153 | 2 | 9914.1284 |
| GPU 7 Total | - | - | - | 217508.891 |

@paddle-bot-old commented:

Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@sneaxiy changed the title from "[WIP] Add MultiTensorApply to calculate L2-Norm" to "Add MultiTensorApply to calculate L2-Norm" on Feb 24, 2022
@sneaxiy changed the title from "Add MultiTensorApply to calculate L2-Norm" to "Add MultiTensorApply to calculate L2-Norm in DistributedFusedLamb optimizer" on Feb 25, 2022
@limin2021 (Contributor) left a comment:


LGTM.

@sneaxiy sneaxiy merged commit d32a010 into PaddlePaddle:develop Feb 25, 2022
@sneaxiy sneaxiy deleted the add_multi_tensor_apply_l2_norm branch February 25, 2022 10:57