[opt] Cache loop-invariant global vars to local vars #6072

Merged 16 commits into taichi-dev:master from loop_inv on Sep 23, 2022

Conversation

@lin-hitonami (Contributor) commented on Sep 15, 2022:

Related issue: fixes #5350

Global variables can't be store-to-load forwarded after the lower-access pass, so we need to run simplify before it. This should speed up the program in all circumstances.

Caching loop-invariant global vars to local vars sometimes speeds up the program but sometimes slows it down, so I made it controllable via the compiler config.
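To make the target pattern concrete, here is a minimal Taichi kernel of the shape this pass is aimed at (an illustrative sketch, not code from this PR):

```python
import taichi as ti

ti.init(arch=ti.vulkan)

n = 1024
scale = ti.field(ti.f32, shape=())   # 0-D global field, never written in the kernel
x = ti.field(ti.f32, shape=n)
y = ti.field(ti.f32, shape=n)

@ti.kernel
def accumulate():
    for i in x:                      # parallel outer loop (one thread per i)
        s = 0.0
        for j in range(n):           # serial inner loop
            # scale[None] and x[i] are loop-invariant here: without the
            # pass, each j iteration re-loads them from global memory;
            # with it, they are read once into locals before the loop.
            s += scale[None] * x[j] * x[i]
        y[i] = s
```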

FPS of Yu's program on RTX3080 on Vulkan:
Original: 19fps
Simplified before lower access: 30fps
Cached loop-invariant global vars to local vars: 41fps

This PR does the following:

  1. Extract a base class LoopInvariantDetector from LoopInvariantCodeMotion. This class maintains the information needed to decide whether a statement is loop-invariant (a rough sketch follows this list).
  2. Let LICM move GlobalPtrStmt, ArgLoadStmt and ExternalPtrStmt out of the loop so that they become loop-invariant.
  3. Add CacheLoopInvariantGlobalVars to move loop-invariant global variables that are loop-unique in the offloaded task out of the loop.
  4. Add the cache_loop_invariant_global_vars pass after demote_atomics and before demote_dense_struct_fors (loop-uniqueness can't be correctly detected after demote_dense_struct_fors), and add a compiler config flag to control it.
  5. Add a full_simplify pass before lower_access to enable store-to-load forwarding for GlobalPtrs.
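As a rough illustration of the invariance check in item 1, consider this minimal sketch; the `Stmt` class and its fields are hypothetical stand-ins for Taichi's IR, not the PR's actual C++ code:

```python
class Stmt:
    """Hypothetical stand-in for an IR statement."""
    def __init__(self, operands=()):
        self.operands = list(operands)
        self.parent_loop = None  # innermost loop containing this stmt, if any

def is_loop_invariant(stmt: Stmt, loop) -> bool:
    # A statement is invariant w.r.t. `loop` when every operand is defined
    # outside that loop (or has itself already been hoisted out of it), so
    # its value cannot change between iterations.
    return all(op.parent_loop is not loop for op in stmt.operands)
```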


lin-hitonami force-pushed the loop_inv branch 2 times, most recently from 77a8434 to 01a3532 on September 16, 2022 09:55
lin-hitonami marked this pull request as draft on September 16, 2022 10:00
@bobcao3 (Collaborator) commented on Sep 18, 2022:

By doing this, we are essentially trading memory access latency for register pressure. It might not always be better, especially when global variables are likely uniform and the GPU will do uniform value optimizations. On most recent GPUs the latency difference to L1/L0 is not that high anyway... We might need heuristics and cost functions for this.

@lin-hitonami (Contributor, Author) commented on Sep 19, 2022:

> By doing this, we are essentially trading memory access latency for register pressure. It might not always be better, especially when global variables are likely uniform and the GPU will do uniform value optimizations. On most recent GPUs the latency difference to L1/L0 is not that high anyway... We might need heuristics and cost functions for this.

For @YuCrazing's case in #5046, the original version runs at 19 fps on my machine; after caching the global vars to local vars, it runs at 41 fps. So it is useful at least for this case. We may need to test more examples.

lin-hitonami force-pushed the loop_inv branch 2 times, most recently from 7fc61b8 to 22029e5 on September 19, 2022 08:12
lin-hitonami marked this pull request as ready for review on September 19, 2022 08:36
@lin-hitonami (Contributor, Author) commented:

> By doing this, we are essentially trading memory access latency for register pressure. It might not always be better, especially when global variables are likely uniform and the GPU will do uniform value optimizations. On most recent GPUs the latency difference to L1/L0 is not that high anyway... We might need heuristics and cost functions for this.

For @turbo0628's case in #5350 (comment), it indeed becomes slower on my machine... I'm not very familiar with heuristics, so I need some time to learn about them. Maybe we can make it an option in the compile config first...

lin-hitonami added the full-ci (Run complete set of CI tests) label on Sep 19, 2022
@lin-hitonami (Contributor, Author) commented:

/rebase

@turbo0628 (Member) commented on Sep 22, 2022:

> By doing this, we are essentially trading memory access latency for register pressure. It might not always be better, especially when global variables are likely uniform and the GPU will do uniform value optimizations. On most recent GPUs the latency difference to L1/L0 is not that high anyway... We might need heuristics and cost functions for this.

It's pretty tricky that the performance gap (23 vs 41 fps) is far larger than the L1/L0 latency difference could explain. It behaves as if L1 cannot properly handle the accesses, and that's why we need this optimization, at least for SPH kernels.

Also, the CUDA backend doesn't have this problem, though the PTX code still uses a lot of ld.global.

Could it be an L1 problem in the Vulkan backend?

@bobcao3 (Collaborator) commented on Sep 22, 2022:

It could be that the GPU's own code optimization already does this for the PTX.

@lin-hitonami (Contributor, Author) commented:

> By doing this, we are essentially trading memory access latency for register pressure. It might not always be better, especially when global variables are likely uniform and the GPU will do uniform value optimizations. On most recent GPUs the latency difference to L1/L0 is not that high anyway... We might need heuristics and cost functions for this.
>
> For @turbo0628's case in #5350 (comment), it indeed becomes slower on my machine... I'm not very familiar with heuristics, so I need some time to learn about them. Maybe we can make it an option in the compile config first...

Now, after #6136 and #6129, the regression has been fixed.

Current FPS of Yu's SPH program on RTX3080 on Vulkan:
Original: 25fps
Simplified before lower access: 41fps
Cached loop-invariant global vars to local vars: 47fps

So should we enable this pass by default?
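For reference, toggling the pass from user code would look roughly like this; the flag name below simply mirrors the pass name and is an assumption, not something confirmed in this thread:

```python
import taichi as ti

# Hypothetical flag name mirroring the pass; check CompileConfig for the
# actual option and its default value.
ti.init(arch=ti.vulkan, cache_loop_invariant_global_vars=True)
```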

lin-hitonami pushed a commit that referenced this pull request on Sep 22, 2022:
Issue: #6072
Ref #6136

### Brief Summary

We misread the spec: the `aligned` parameter takes a uint literal, not a
value. We used to feed in a value, but since SPIR-V doesn't complain, we
were actually feeding the value id in as the alignment, probably causing a
performance regression as a result.
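To see why the literal-vs-id mixup matters, here is a minimal sketch of how such an instruction is encoded (hand-rolled word packing for illustration, not Taichi's actual codegen): in SPIR-V, the word after the Aligned memory-access mask is read as a literal integer, so feeding a value <id> there makes the driver treat the id number itself as the alignment.

```python
OP_STORE = 62   # SPIR-V opcode for OpStore
ALIGNED = 0x2   # MemoryAccess "Aligned" bit

def emit_store(ptr_id: int, value_id: int, alignment: int) -> list[int]:
    # First word packs (word count << 16) | opcode; the alignment operand
    # is a raw literal word, not a reference to another instruction.
    operands = [ptr_id, value_id, ALIGNED, alignment]
    return [((len(operands) + 1) << 16) | OP_STORE] + operands

emit_store(10, 11, 4)    # intended: alignment is the literal 4
emit_store(10, 11, 42)   # the bug: a value <id> (say %42) fed as the literal
```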
@lin-hitonami (Contributor, Author) commented:

/rebase

@ailzhang (Contributor) left a comment:


Thanks!

Review threads (resolved): taichi/transforms/cache_loop_invariant_global_vars.cpp · taichi/transforms/loop_invariant_detector.h
lin-hitonami merged commit 8e9d978 into taichi-dev:master on Sep 23, 2022
lin-hitonami deleted the loop_inv branch on September 23, 2022 07:19
Labels: full-ci (Run complete set of CI tests)

Merging this pull request may close: [perf] Enable load store forwarding across loop hierarchy for vulkan performance.