
Optimization of elementwise CUDA kernel #30801

Merged
Merged 8 commits into PaddlePaddle:develop on Mar 10, 2021

Conversation

@JamesLim-sy (Contributor) commented on Feb 1, 2021

PR types

Performance optimization

PR changes

OPs

Describe

Design notes:

  1. Elementwise addition only reads two elements from their respective input tensors and writes the sum to a separate output tensor, so the per-element operations are independent of each other; qualifying the data pointers with the `__restrict__` keyword should therefore help the compiler optimize these accesses (a sketch illustrating this and the next point follows the table below).
  2. Writing the kernel in grid-stride loop form gives a slight additional performance improvement.
  3. The GPU forward performance comparison is shown in the table below.
| Test Case | Paddle baseline (us) | Paddle current (us) | PyTorch (us) | Perf diff (w.r.t. PyTorch) | Perf diff |
| --- | --- | --- | --- | --- | --- |
| x.shape=[32, 128, 768], y.shape=[768] | 33.857 | 33.693 | 32.228 | slower 5.05% -> slower 4.55% | perf increase ↑ |
| x.shape=[16, 2048, 7, 7], y.shape=[16, 2048, 1, 1] | 18.821 | 18.594 | 18.267 | slower 3.03% -> slower 1.79% | perf increase ↑ |
| x.shape=[16, 1, 513, 513], y.shape=[1] | 44.577 | 44.519 | 42.565 | slower 4.73% -> slower 4.83% | perf decrease ↓ |
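
As a rough illustration of points 1 and 2, here is a minimal standalone sketch (not the kernel added in this PR) of an elementwise add whose pointers are qualified with `__restrict__` and which iterates with a grid-stride loop; the kernel name, block/grid sizes, and the host-side setup are illustrative assumptions, not Paddle code.

```cuda
// Minimal sketch: elementwise add with __restrict__ pointers and a grid-stride loop.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void ElementwiseAddKernel(const float* __restrict__ x,
                                     const float* __restrict__ y,
                                     float* __restrict__ out,
                                     int n) {
  // Grid-stride loop: each thread processes indices i, i + stride, i + 2*stride, ...
  // so a fixed launch configuration covers any n.
  int stride = blockDim.x * gridDim.x;
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
    // Each element is read once from x and y and written once to out;
    // __restrict__ tells the compiler the three buffers do not alias.
    out[i] = x[i] + y[i];
  }
}

int main() {
  const int n = 1 << 20;
  float *x, *y, *out;
  cudaMallocManaged(&x, n * sizeof(float));
  cudaMallocManaged(&y, n * sizeof(float));
  cudaMallocManaged(&out, n * sizeof(float));
  for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

  const int threads = 256;  // illustrative block size
  const int blocks = 256;   // fixed grid; the grid-stride loop covers the remaining elements
  ElementwiseAddKernel<<<blocks, threads>>>(x, y, out, n);
  cudaDeviceSynchronize();

  printf("out[0] = %f\n", out[0]);  // expect 3.0
  cudaFree(x); cudaFree(y); cudaFree(out);
  return 0;
}
```

With a fixed grid, the same launch configuration handles any tensor size, and the no-aliasing guarantee from `__restrict__` is what makes the independent per-element reads and writes described in point 1 safe for the compiler to reorder and cache.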

@paddle-bot-old bot commented on Feb 1, 2021

Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

… changing the threads settings for ElementwiseKernel.
@JamesLim-sy JamesLim-sy closed this Feb 2, 2021
@JamesLim-sy JamesLim-sy reopened this Feb 2, 2021
@JamesLim-sy JamesLim-sy closed this Feb 2, 2021
@JamesLim-sy JamesLim-sy reopened this Feb 2, 2021
@paddle-bot-old

Sorry to inform you that commit 8e19ebe's CIs passed more than 7 days ago. To prevent PR conflicts, you need to re-run all CIs manually.

@JamesLim-sy JamesLim-sy changed the title [WIP]: adding the shared-memory mechanism into the elementwise CUDA OP Optimization of elementwise CUDA kernel Mar 2, 2021
@Xreki Xreki merged commit 45c7d90 into PaddlePaddle:develop Mar 10, 2021