-
Notifications
You must be signed in to change notification settings - Fork 2.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[perf] Enable load store forwarding across loop hierarchy for vulkan performance. #5350
Comments
After some investigation, this seems not related to Ndarray in the taichi/taichi/ir/control_flow_graph.cpp Lines 264 to 294 in 233f017
Let be Ndarray or Field, the pattern will start with a The problem here is that the taichi/taichi/ir/control_flow_graph.cpp Line 291 in 233f017
For example, a dummy kernel like this can be optimized by
Nevertheless, if the access happens e.g., in a loop, the optimization no longer works:
This is regardless of Field/Ndarray. The question here becomes how to adapt the |
OK, so this is not a cause of ndarray vs field performance gap. |
@jim19930609 @strongoier Any suggestions on this? Should we update the
Similar to
|
I think this is definitely some optimization opportunity, which may or may not be taken care of by the backend compilers. However, is this the solution to the current problem you guys are facing? |
Partially yes, if this optimization is possible, Vulkan backend could benefit. (CUDA backend does not need it as the runtime opt is too powerful). |
Then feel free to hack it! |
…r is on related: taichi-dev#5350 Thanks @turbo0628 and @lin-hintonami for catching this! It turns out codegening non-semantic info spirv code isn't no-op when validation is off. It slows down the program execution so we should properly guard it.
…r is on (#6129) related: #5350 Thanks @turbo0628 and @lin-hitonami for catching this! It turns out codegening non-semantic info spirv code isn't no-op when validation is off. It slows down the program execution so we should properly guard it.
Related issue = fixes #5350 Global variables can't be store-to-load forwarded after `lower-access` pass, so we need to do `simplify` before it. It should speed up the program in all circumstances. Caching loop-invariant global vars to local vars sometimes speeds up the program yet some time lets the program run slower so I let it controlled by the compiler config. FPS of Yu's program on RTX3080 on Vulkan: Original: 19fps Simplified before lower access: 30fps Cached loop-invariant global vars to local vars: 41fps **This PR does things as follows:** 1. Extract a base class `LoopInvariantDetector` from `LoopInvariantCodeMotion`. This class maintains information to detect whether a statement is loop-invariant. 2. Let LICM move `GlobalPtrStmt`, `ArgLoadStmt` and `ExternalPtrStmt` out of the loop so that they become loop-invariant. 3. Add `CacheLoopInvariantGlobalVars` to move out loop-invariant global variables that are loop-unique in the offloaded task. 4. Add pass `cache_loop_invariant_global_vars` after `demote_atomics` before `demote_dense_struct_fors` (because loop-uniqueness can't be correctly detected after `demote_dense_struct_fors`) and add a compiler config flag to control it. 5. Add pass `full_simplify` before `lower_access` to enable store-to-load forwarding for GlobalPtrs. <!-- Thank you for your contribution! If it is your first time contributing to Taichi, please read our Contributor Guidelines: https://docs.taichi-lang.org/docs/contributor_guide - Please always prepend your PR title with tags such as [CUDA], [Lang], [Doc], [Example]. For a complete list of valid PR tags, please check out https://github.com/taichi-dev/taichi/blob/master/misc/prtags.json. - Use upper-case tags (e.g., [Metal]) for PRs that change public APIs. Otherwise, please use lower-case tags (e.g., [metal]). - More details: https://docs.taichi-lang.org/docs/contributor_guide#pr-title-format-and-tags - Please fill in the issue number that this PR relates to. - If your PR fixes the issue **completely**, use the `close` or `fixes` prefix so that GitHub automatically closes the issue when the PR is merged. For example, Related issue = close #2345 - If the PR does not belong to any existing issue, free to leave it blank. --> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
The
store_to_load_forwarding
transformation in thecfg_optimization
pass ignores ExternalPtrStmt. It causes performance issue in the SPH case. The CUDA backend is insensitive to this optimization, but Vulkan backend cannot handles this well even on the same hardware.I think we can get a ~2.2x speed up for the SPH case when this optimization is enabled.
The related source code section is here.
The text was updated successfully, but these errors were encountered: