
Optimization of elementwise CUDA kernel #30801

Merged
Merged 8 commits into PaddlePaddle:develop on Mar 10, 2021

Conversation

@JamesLim-sy (Contributor) commented on Feb 1, 2021

PR types

Performance optimization

PR changes

OPs

Describe

Design notes:

  1. Elementwise addition only reads two elements from their respective input tensors and writes the sum to a separate output tensor, so the per-element operations are independent of each other; qualifying the data pointers with the `__restrict__` keyword should therefore help the compiler optimize these accesses (a sketch illustrating this and the next point follows the table below).
  2. Writing the kernel in grid-stride loop form gives a slight additional performance improvement.
  3. The GPU forward performance comparison is shown in the table below.
| Test Case | Paddle baseline (us) | Paddle current (us) | PyTorch (us) | Perf diff (w.r.t. PyTorch) | Perf diff |
| --- | --- | --- | --- | --- | --- |
| x.shape=[32, 128, 768], y.shape=[768] | 33.857 | 33.693 | 32.228 | slower 5.05% -> slower 4.55% | perf increase ↑ |
| x.shape=[16, 2048, 7, 7], y.shape=[16, 2048, 1, 1] | 18.821 | 18.594 | 18.267 | slower 3.03% -> slower 1.79% | perf increase ↑ |
| x.shape=[16, 1, 513, 513], y.shape=[1] | 44.577 | 44.519 | 42.565 | slower 4.73% -> slower 4.83% | perf decrease ↓ |
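
As a rough illustration of points 1 and 2, here is a minimal standalone sketch (not the kernel added in this PR) of an elementwise add whose pointers are qualified with `__restrict__` and which iterates with a grid-stride loop; the kernel name, block/grid sizes, and the host-side setup are illustrative assumptions, not Paddle code.

```cuda
// Minimal sketch: elementwise add with __restrict__ pointers and a grid-stride loop.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void ElementwiseAddKernel(const float* __restrict__ x,
                                     const float* __restrict__ y,
                                     float* __restrict__ out,
                                     int n) {
  // Grid-stride loop: each thread processes indices i, i + stride, i + 2*stride, ...
  // so a fixed launch configuration covers any n.
  int stride = blockDim.x * gridDim.x;
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
    // Each element is read once from x and y and written once to out;
    // __restrict__ tells the compiler the three buffers do not alias.
    out[i] = x[i] + y[i];
  }
}

int main() {
  const int n = 1 << 20;
  float *x, *y, *out;
  cudaMallocManaged(&x, n * sizeof(float));
  cudaMallocManaged(&y, n * sizeof(float));
  cudaMallocManaged(&out, n * sizeof(float));
  for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

  const int threads = 256;  // illustrative block size
  const int blocks = 256;   // fixed grid; the grid-stride loop covers the remaining elements
  ElementwiseAddKernel<<<blocks, threads>>>(x, y, out, n);
  cudaDeviceSynchronize();

  printf("out[0] = %f\n", out[0]);  // expect 3.0
  cudaFree(x); cudaFree(y); cudaFree(out);
  return 0;
}
```

With a fixed grid, the same launch configuration handles any tensor size, and the no-aliasing guarantee from `__restrict__` is what makes the independent per-element reads and writes described in point 1 safe for the compiler to reorder and cache.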

@paddle-bot-old bot commented on Feb 1, 2021

Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

… changing the threads settings for ElementwiseKernel.
@JamesLim-sy JamesLim-sy closed this Feb 2, 2021
@JamesLim-sy JamesLim-sy reopened this Feb 2, 2021
@JamesLim-sy JamesLim-sy closed this Feb 2, 2021
@JamesLim-sy JamesLim-sy reopened this Feb 2, 2021
@paddle-bot-old

Sorry to inform you that commit 8e19ebe's CIs passed more than 7 days ago. To prevent PR conflicts, you need to re-run all CIs manually.

@JamesLim-sy JamesLim-sy changed the title [WIP]: adding the shared-memory mechanism into the elementwise CUDA OP Optimization of elementwise CUDA kernel Mar 2, 2021
@Xreki Xreki merged commit 45c7d90 into PaddlePaddle:develop Mar 10, 2021