This repository has been archived by the owner on Jan 24, 2024. It is now read-only.
[Perf]Polish UniformRandom And Split it into ScheduleBlock #1357
Merged
fix codegen refine unittest and add float64 kernel
zhhsplendid
approved these changes
May 8, 2023
LGTM
zhhsplendid
approved these changes
May 8, 2023
LGTM
zhhsplendid
pushed a commit
to PaddlePaddle/Paddle
that referenced
this pull request
May 9, 2023
[CINN]Adjust Bert unittest loss ground truth, see: PaddlePaddle/CINN#1357
BiynXu
added a commit
to BiynXu/CINN
that referenced
this pull request
May 11, 2023
…addlePaddle#1357)" This reverts commit 658615e.
Aurelius84
added a commit
to Aurelius84/CINN
that referenced
this pull request
May 11, 2023
…addlePaddle#1357)" This reverts commit 658615e.
lanxianghit
pushed a commit
that referenced
this pull request
May 12, 2023
jiahy0825
pushed a commit
to jiahy0825/CINN
that referenced
this pull request
May 25, 2023
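…dle#1357) Joint build testing with Paddle requires changes on both sides, so this PR is force-merged into CINN for now; CI will pass again once the corresponding Paddle PR is merged.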
jiahy0825
pushed a commit
to jiahy0825/CINN
that referenced
this pull request
May 25, 2023
1. Current data
2. API-level verification
Before optimization
Total time: fn_xxx + gen_seq + seed_pesudo = 180 + 81 + 12 = 273 us
After optimization
Total time: 192 us
3. Possible follow-up optimizations
3.1 Move initialization of the state variable out of the kernel
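Nvidia's official documentation explicitly calls out the following performance issues, which gives developers solid guidance for writing high-performance kernels:
- curand_init() is slower than curand() and curand_uniform().
- curand_init() is also slower with a large offset than with a small one.
- Saving and loading a state is much faster than recreating the starting state on every call.
For the third point, Nvidia recommends keeping the state in global memory and provides sample code for this in its documentation.
This approach requires that initialization of the state variable be moved out of the kernel.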
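The save/load pattern described in 3.1 can be sketched as follows. This is a minimal illustration modeled on the setup/generate structure of the cuRAND device API, not the code from this PR: the kernel names init_states and fill_uniform, the host helper generate_uniform, the seed 1234 and the launch configuration are all assumptions made for the example, and error checking is omitted for brevity.

```cuda
#include <vector>
#include <cuda_runtime.h>
#include <curand_kernel.h>

// Runs once: every thread seeds its own generator state and stores it in
// global memory, so later kernels do not pay for curand_init() again.
__global__ void init_states(curandState *states, unsigned long long seed, int n) {
  int id = blockIdx.x * blockDim.x + threadIdx.x;
  if (id < n) {
    // Same seed, a distinct subsequence per thread, offset 0.
    curand_init(seed, id, 0, &states[id]);
  }
}

// Runs many times: load the saved state, draw one value, save the state back.
__global__ void fill_uniform(curandState *states, float *out, int n) {
  int id = blockIdx.x * blockDim.x + threadIdx.x;
  if (id < n) {
    curandState local = states[id];    // load from global memory
    out[id] = curand_uniform(&local);  // one float in (0, 1]
    states[id] = local;                // save for the next launch
  }
}

// Hypothetical host helper (assumes n > 0): one init launch, then `rounds`
// generation launches that all reuse the stored states.
std::vector<float> generate_uniform(int n, int rounds) {
  const int threads = 256;
  const int blocks = (n + threads - 1) / threads;

  curandState *states = nullptr;
  float *d_out = nullptr;
  cudaMalloc(&states, n * sizeof(curandState));
  cudaMalloc(&d_out, n * sizeof(float));

  init_states<<<blocks, threads>>>(states, 1234ULL, n);
  for (int r = 0; r < rounds; ++r) {
    fill_uniform<<<blocks, threads>>>(states, d_out, n);
  }

  // cudaMemcpy waits for the preceding kernels on the default stream.
  std::vector<float> host(n);
  cudaMemcpy(host.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

  cudaFree(states);
  cudaFree(d_out);
  return host;
}
```

Calling generate_uniform(1000, 2) would return the values produced by the second launch. The cost of curand_init() is paid only in init_states, at the price of holding one curandState per thread in global memory.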
3.2 Use curand_uniform4 to reduce the number of API calls
The device API used in this PR generates only one float/double random number per call. Nvidia also provides device APIs that generate 2 or 4 values per call:
__device__ float4 curand_uniform4 (curandStatePhilox4_32_10_t *state);
__device__ float4 curand_normal4 (curandStatePhilox4_32_10_t *state);
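A minimal sketch of how curand_uniform4 could cut the number of device API calls: each thread makes a single call and writes up to four outputs. The kernel name fill_uniform4, the host helper generate_uniform4, the seed and the launch configuration are assumptions for illustration and are not taken from this PR. To keep the example self-contained, the Philox state is initialized inside the kernel; combining this with the stored-state approach from 3.1 would avoid that cost.

```cuda
#include <vector>
#include <cuda_runtime.h>
#include <curand_kernel.h>

// Each thread draws four floats with one curand_uniform4() call and writes
// them to four consecutive output slots, guarding the tail of the buffer.
__global__ void fill_uniform4(float *out, int n, unsigned long long seed) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  int base = tid * 4;
  if (base >= n) return;

  // curand_uniform4 requires the Philox generator state.
  curandStatePhilox4_32_10_t state;
  curand_init(seed, tid, 0, &state);

  float4 r = curand_uniform4(&state);  // four floats in (0, 1]
  out[base] = r.x;
  if (base + 1 < n) out[base + 1] = r.y;
  if (base + 2 < n) out[base + 2] = r.z;
  if (base + 3 < n) out[base + 3] = r.w;
}

// Hypothetical host helper (assumes n > 0): launches one thread per group of
// four outputs and copies the result back to the host.
std::vector<float> generate_uniform4(int n) {
  const int threads = 256;
  const int work_items = (n + 3) / 4;  // threads that have output to write
  const int blocks = (work_items + threads - 1) / threads;

  float *d_out = nullptr;
  cudaMalloc(&d_out, n * sizeof(float));
  fill_uniform4<<<blocks, threads>>>(d_out, n, 1234ULL);

  std::vector<float> host(n);
  cudaMemcpy(host.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
  cudaFree(d_out);
  return host;
}
```

Compared with one curand_uniform() call per element, this needs roughly a quarter as many generator calls for the same output size; whether that translates into a measurable speedup for the kernels in this PR would need to be profiled.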
Appendix: CUDA source code:
Related question: Why can't templates be within extern "C" blocks?