[opt] Cache loop-invariant global vars to local vars #6072

Merged 16 commits into taichi-dev:master from loop_inv on Sep 23, 2022

Conversation

@lin-hitonami (Contributor) commented on Sep 15, 2022:

Related issue: fixes #5350

Global variables can't be store-to-load forwarded after the lower-access pass, so we need to run simplify before it. This should speed up the program in all circumstances.

Caching loop-invariant global vars to local vars sometimes speeds up the program but sometimes slows it down, so I made it controllable via the compiler config.
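To make the target pattern concrete, here is a minimal Taichi kernel of the shape this pass is aimed at (an illustrative sketch, not code from this PR):

```python
import taichi as ti

ti.init(arch=ti.vulkan)

n = 1024
scale = ti.field(ti.f32, shape=())   # 0-D global field, never written in the kernel
x = ti.field(ti.f32, shape=n)
y = ti.field(ti.f32, shape=n)

@ti.kernel
def accumulate():
    for i in x:                      # parallel outer loop (one thread per i)
        s = 0.0
        for j in range(n):           # serial inner loop
            # scale[None] and x[i] are loop-invariant here: without the
            # pass, each j iteration re-loads them from global memory;
            # with it, they are read once into locals before the loop.
            s += scale[None] * x[j] * x[i]
        y[i] = s
```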

FPS of Yu's program on RTX3080 on Vulkan:
Original: 19fps
Simplified before lower access: 30fps
Cached loop-invariant global vars to local vars: 41fps

This PR does the following:

  1. Extract a base class LoopInvariantDetector from LoopInvariantCodeMotion. This class maintains the information needed to decide whether a statement is loop-invariant (a rough sketch follows this list).
  2. Let LICM move GlobalPtrStmt, ArgLoadStmt and ExternalPtrStmt out of the loop so that they become loop-invariant.
  3. Add CacheLoopInvariantGlobalVars to move loop-invariant global variables that are loop-unique in the offloaded task out of the loop.
  4. Add the cache_loop_invariant_global_vars pass after demote_atomics and before demote_dense_struct_fors (loop-uniqueness can't be correctly detected after demote_dense_struct_fors), and add a compiler config flag to control it.
  5. Add a full_simplify pass before lower_access to enable store-to-load forwarding for GlobalPtrs.
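As a rough illustration of the invariance check in item 1, consider this minimal sketch; the `Stmt` class and its fields are hypothetical stand-ins for Taichi's IR, not the PR's actual C++ code:

```python
class Stmt:
    """Hypothetical stand-in for an IR statement."""
    def __init__(self, operands=()):
        self.operands = list(operands)
        self.parent_loop = None  # innermost loop containing this stmt, if any

def is_loop_invariant(stmt: Stmt, loop) -> bool:
    # A statement is invariant w.r.t. `loop` when every operand is defined
    # outside that loop (or has itself already been hoisted out of it), so
    # its value cannot change between iterations.
    return all(op.parent_loop is not loop for op in stmt.operands)
```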


lin-hitonami force-pushed the loop_inv branch 2 times, most recently from 77a8434 to 01a3532 on September 16, 2022 09:55
lin-hitonami marked this pull request as draft on September 16, 2022 10:00
@bobcao3 (Collaborator) commented on Sep 18, 2022:

By doing this, we are essentially trading memory access latency for register pressure. It might not always be better, especially when global variables are likely uniform and the GPU will do uniform value optimizations. On most recent GPUs the latency difference to L1/L0 is not that high anyway... We might need heuristics and cost functions for this.

@lin-hitonami (Contributor, Author) commented on Sep 19, 2022:

> By doing this, we are essentially trading memory access latency for register pressure. It might not always be better, especially when global variables are likely uniform and the GPU will do uniform value optimizations. On most recent GPUs the latency difference to L1/L0 is not that high anyway... We might need heuristics and cost functions for this.

For @YuCrazing's case in #5046, the original version runs at 19 fps on my machine; after caching the global vars to local vars, it runs at 41 fps. So it is useful at least for this case. We may need to test more examples.

lin-hitonami force-pushed the loop_inv branch 2 times, most recently from 7fc61b8 to 22029e5 on September 19, 2022 08:12
lin-hitonami marked this pull request as ready for review on September 19, 2022 08:36
@lin-hitonami (Contributor, Author) commented:

> By doing this, we are essentially trading memory access latency for register pressure. It might not always be better, especially when global variables are likely uniform and the GPU will do uniform value optimizations. On most recent GPUs the latency difference to L1/L0 is not that high anyway... We might need heuristics and cost functions for this.

For @turbo0628's case in #5350 (comment), it indeed becomes slower on my machine... I'm not very familiar with heuristics, so I need some time to learn about them. Maybe we can make it an option in the compile config first...

lin-hitonami added the full-ci (Run complete set of CI tests) label on Sep 19, 2022
@lin-hitonami (Contributor, Author) commented:

/rebase

@turbo0628 (Member) commented on Sep 22, 2022:

> By doing this, we are essentially trading memory access latency for register pressure. It might not always be better, especially when global variables are likely uniform and the GPU will do uniform value optimizations. On most recent GPUs the latency difference to L1/L0 is not that high anyway... We might need heuristics and cost functions for this.

It's pretty tricky that the performance gap (23 vs 41 fps) is far larger than the L1/L0 latency difference could explain. It behaves as if L1 cannot properly handle the accesses, and that's why we need this optimization, at least for SPH kernels.

Also, the CUDA backend doesn't have this problem, though the PTX code still uses a lot of ld.global.

Could it be an L1 problem in the Vulkan backend?

@bobcao3 (Collaborator) commented on Sep 22, 2022:

It could be that the GPU's own code optimization already does this for the PTX.

@lin-hitonami (Contributor, Author) commented:

> By doing this, we are essentially trading memory access latency for register pressure. It might not always be better, especially when global variables are likely uniform and the GPU will do uniform value optimizations. On most recent GPUs the latency difference to L1/L0 is not that high anyway... We might need heuristics and cost functions for this.
>
> For @turbo0628's case in #5350 (comment), it indeed becomes slower on my machine... I'm not very familiar with heuristics, so I need some time to learn about them. Maybe we can make it an option in the compile config first...

Now, after #6136 and #6129, the regression has been fixed.

Current FPS of Yu's SPH program on RTX3080 on Vulkan:
Original: 25fps
Simplified before lower access: 41fps
Cached loop-invariant global vars to local vars: 47fps

So should we enable this pass by default?
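For reference, toggling the pass from user code would look roughly like this; the flag name below simply mirrors the pass name and is an assumption, not something confirmed in this thread:

```python
import taichi as ti

# Hypothetical flag name mirroring the pass; check CompileConfig for the
# actual option and its default value.
ti.init(arch=ti.vulkan, cache_loop_invariant_global_vars=True)
```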

lin-hitonami pushed a commit that referenced this pull request on Sep 22, 2022:
Issue: #6072
Ref #6136

### Brief Summary

We misread the spec: the `aligned` parameter takes a uint literal, not a
value. We used to feed in a value, but since SPIR-V doesn't complain, we
were actually feeding the value id in as the alignment, probably causing a
performance regression as a result.
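To see why the literal-vs-id mixup matters, here is a minimal sketch of how such an instruction is encoded (hand-rolled word packing for illustration, not Taichi's actual codegen): in SPIR-V, the word after the Aligned memory-access mask is read as a literal integer, so feeding a value <id> there makes the driver treat the id number itself as the alignment.

```python
OP_STORE = 62   # SPIR-V opcode for OpStore
ALIGNED = 0x2   # MemoryAccess "Aligned" bit

def emit_store(ptr_id: int, value_id: int, alignment: int) -> list[int]:
    # First word packs (word count << 16) | opcode; the alignment operand
    # is a raw literal word, not a reference to another instruction.
    operands = [ptr_id, value_id, ALIGNED, alignment]
    return [((len(operands) + 1) << 16) | OP_STORE] + operands

emit_store(10, 11, 4)    # intended: alignment is the literal 4
emit_store(10, 11, 42)   # the bug: a value <id> (say %42) fed as the literal
```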
@lin-hitonami (Contributor, Author) commented:

/rebase

@ailzhang (Contributor) left a comment:


Thanks!

Review threads (resolved): taichi/transforms/cache_loop_invariant_global_vars.cpp · taichi/transforms/loop_invariant_detector.h
lin-hitonami merged commit 8e9d978 into taichi-dev:master on Sep 23, 2022
lin-hitonami deleted the loop_inv branch on September 23, 2022 07:19
Labels: full-ci (Run complete set of CI tests)

Merging this pull request may close: [perf] Enable load store forwarding across loop hierarchy for vulkan performance.