[opt] Cache loop-invariant global vars to local vars #6072
Conversation
By doing this, we are essentially trading memory access latency for register pressure. It might not always be better, especially when global variables are likely uniform and the GPU performs uniform-value optimizations. On most recent GPUs the latency difference to L1/L0 is not that high anyway... We might need heuristics and cost functions for this.
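To make the tradeoff concrete, here is a minimal Taichi sketch (not code from this PR; the fields and kernels are illustrative) of what caching a loop-invariant global read into a local looks like when done by hand:

```python
import taichi as ti

ti.init(arch=ti.vulkan)

n = 1024
x = ti.field(ti.f32, shape=n)
y = ti.field(ti.f32, shape=n)
z = ti.field(ti.f32, shape=n)

@ti.kernel
def uncached():
    for i in range(n):            # parallelized outer loop
        acc = 0.0
        for j in range(n):        # serial inner loop
            # x[i] is loop-invariant w.r.t. j, but without the optimization
            # it may be re-loaded from global memory on every inner iteration.
            acc += x[i] * y[j]
        z[i] = acc

@ti.kernel
def cached():
    for i in range(n):
        xi = x[i]                 # cache the loop-invariant global into a local
        acc = 0.0
        for j in range(n):
            acc += xi * y[j]      # reads a register instead of global memory
        z[i] = acc
```

The local `xi` lives in a register for the whole inner loop, which is exactly the register-pressure cost described above.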
For @YuCrazing's case in #5046, the original version runs at 19 FPS on my machine; after caching the global vars to local vars, it runs at 41 FPS. So it is useful at least for this case. We may need to test more examples.
For @turbo0628's case in #5350 (comment), it indeed becomes slower on my machine... I'm not very familiar with heuristics, so I need some time to learn about them. Maybe we can make it an option in the compile config first...
/rebase
It's pretty tricky that the performance gap (23 vs 41 FPS) is far more significant than the L1/L0 difference. It behaves as if L1 cannot properly handle the accesses, and that's why we need this optimization, at least for SPH kernels. Also, the CUDA backend doesn't have this problem, though the PTX code still uses a lot of ... Could it be an L1 problem for the Vulkan backend?
It could be that the on-GPU code optimization did this for the PTX.
Now, after #6136 and #6129, the regression has been fixed. Current FPS of Yu's SPH program on RTX3080 on Vulkan: ... So, should we enable this pass by default?
/rebase
Thanks!
Related issue = fixes #5350
Global variables can't be store-to-load forwarded after the `lower_access` pass, so we need to do `simplify` before it. This should speed up the program in all circumstances. Caching loop-invariant global vars to local vars sometimes speeds up the program yet sometimes makes it slower, so it is controlled by the compiler config.
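As a rough sketch of the store-to-load forwarding this enables (illustrative only, not code from this PR): when simplification runs while the `GlobalPtrStmt`s are still visible, a load that follows a store to the same global pointer can be replaced by the stored value.

```python
import taichi as ti

ti.init(arch=ti.vulkan)

x = ti.field(ti.f32, shape=16)

@ti.kernel
def store_then_load():
    for i in range(16):
        x[i] = i * 2.0
        # With simplification before lower_access, this load of x[i] can be
        # forwarded from the value just stored, avoiding a global memory read.
        x[i] = x[i] + 1.0
```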
FPS of Yu's program on RTX3080 on Vulkan:
- Original: 19 FPS
- Simplified before `lower_access`: 30 FPS
- Cached loop-invariant global vars to local vars: 41 FPS
This PR does the following:
1. Extract `LoopInvariantDetector` from `LoopInvariantCodeMotion`. This class maintains the information needed to detect whether a statement is loop-invariant.
2. Move `GlobalPtrStmt`, `ArgLoadStmt` and `ExternalPtrStmt` out of the loop so that they become loop-invariant.
3. Add `CacheLoopInvariantGlobalVars` to move out loop-invariant global variables that are loop-unique in the offloaded task.
4. Add `cache_loop_invariant_global_vars` after `demote_atomics` and before `demote_dense_struct_fors` (because loop-uniqueness can't be correctly detected after `demote_dense_struct_fors`), and add a compiler config flag to control it (see the sketch after this list).
5. Add `full_simplify` before `lower_access` to enable store-to-load forwarding for GlobalPtrs.
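If the new pass is exposed through the compile config as described above, toggling it would presumably look like the sketch below; the flag name is assumed to match the pass name, so check the actual `CompileConfig` for the exact spelling.

```python
import taichi as ti

# Assumption: the compile-config flag shares the pass name
# `cache_loop_invariant_global_vars`. Set it to False if the extra
# register pressure slows a particular kernel down.
ti.init(arch=ti.vulkan, cache_loop_invariant_global_vars=True)
```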