Use full-vectorized load instructions for load vectorization #445
Conversation
@htyu Thanks for looking into this issue. However, I don't think this is the right way to solve the problem.
If there is an issue with global load vectorization in your customized kernel, we are happy to help.
Thanks for the comments.
What LLVM instructions do you expect to generate with careful address computation? A full vectorized load or a sequence of shorter loads? I'm also seeing a long load survive the LLVM backend more stably than the latter. Inline assembly should work, but I'm not sure whether it has side effects on other LLVM optimizations. What problem do you see with the long load?
I expect a sequence of shorter loads. We haven't tried full vectorized global loads at the LLVM level since we are trying to reuse as much code from the NV path as possible. There are a lot of failed tests. Can you make them pass first?
Sure, I'll work on clearing the test failures and making sure it's not affecting the NV path. P.S., the problem I was seeing is that the VectorCombine pass converted the original four i32 loads into four 2xi16 loads. Then the jump threading pass threaded the four 2xi16 loads by getting rid of the redundant mask checks for the first three loads. During the threading, the first three loads were further decomposed into six i16 loads, which were vectorized later. The fourth 2xi16 load was excluded from that vectorization because it was already vectorized. An 8xi16 load in the first place appeared to be immune to all those issues.
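For illustration only, here is a minimal LLVM IR sketch of the two shapes being discussed; the function names, pointer types, and alignments are made up, and this is not the actual IR produced by the compiler:

```llvm
; Shape A: four scalar i32 loads, relying on the LLVM load/store vectorizer
; (and on passes like VectorCombine / jump threading not getting in the way)
; to merge them into one 128-bit access later.
define <4 x i32> @segmented(ptr addrspace(1) %base) {
  %p1 = getelementptr inbounds i32, ptr addrspace(1) %base, i64 1
  %p2 = getelementptr inbounds i32, ptr addrspace(1) %base, i64 2
  %p3 = getelementptr inbounds i32, ptr addrspace(1) %base, i64 3
  %v0 = load i32, ptr addrspace(1) %base, align 16
  %v1 = load i32, ptr addrspace(1) %p1, align 4
  %v2 = load i32, ptr addrspace(1) %p2, align 8
  %v3 = load i32, ptr addrspace(1) %p3, align 4
  %r0 = insertelement <4 x i32> poison, i32 %v0, i32 0
  %r1 = insertelement <4 x i32> %r0, i32 %v1, i32 1
  %r2 = insertelement <4 x i32> %r1, i32 %v2, i32 2
  %r3 = insertelement <4 x i32> %r2, i32 %v3, i32 3
  ret <4 x i32> %r3
}

; Shape B: a single full-vectorized 128-bit load emitted up front (e.g. 8xi16),
; which already matches the width of a dwordx4 global load.
define <8 x i16> @full(ptr addrspace(1) %base) {
  %v = load <8 x i16>, ptr addrspace(1) %base, align 16
  ret <8 x i16> %v
}
```

The point of emitting shape B directly is that it no longer depends on later passes keeping the four scalar loads mergeable.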
We root-caused the bad performance of RMS norm in #422 to this issue. It seems like the combination of the for-loop and the control-flow based load masking confuses the load/store vectorizer, and we end up with a dword+dword3 load instead of a dword4 load.
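To make the failure mode concrete, the control-flow based masking mentioned above roughly takes the following per-element form (an illustrative sketch with invented names, not the compiler's actual output):

```llvm
; One guarded element load. Repeating this pattern for each element leaves
; several independent branches between the scalar loads, which is what can
; prevent the backend from forming a single dword4 load.
define i32 @masked_load_elem(ptr addrspace(1) %p, i1 %mask, i32 %other) {
entry:
  br i1 %mask, label %do_load, label %join
do_load:
  %v = load i32, ptr addrspace(1) %p, align 4
  br label %join
join:
  %r = phi i32 [ %v, %do_load ], [ %other, %entry ]
  ret i32 %r
}
```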
This is exactly the issue I'm fixing here.
CC @scxiao
This is a known issue. It can be worked around in hand-written code. For compiler-generated code, we may consider creating a new primitive-based solution later ("rely on the user to ensure correctness"). So far, we are compatible with the Triton Nvidia GPU API, and that puts correctness at a higher priority than performance.
I think in this case we should still be able to fix the compiler for performance without affecting correctness. One path I'm taking is to fix the LLVM GPU load/store vectorizer, where I saw that the scalar evolution pass was not able to infer that two addresses were consecutive. FYI, https://discourse.llvm.org/t/how-to-compare-scevs/76174 . Since the redundant load masks (which we rely on LLVM to eliminate) were causing issues, I'm also taking another route to avoid generating the redundant checks. Please see if the new version looks reasonable. I'm yet to fix the test failures. The codegen
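As a rough sketch of the "shared mask check" shape described above (again illustrative LLVM IR, not the codegen attached to the comment): a single branch on the mask guards one wide load, so there are no redundant per-element checks left for LLVM to eliminate.

```llvm
; Hypothetical example: one predicate guards the whole 128-bit load.
define <4 x i32> @guarded_vec_load(ptr addrspace(1) %p, i1 %mask, <4 x i32> %other) {
entry:
  br i1 %mask, label %do_load, label %join
do_load:
  %v = load <4 x i32>, ptr addrspace(1) %p, align 16
  br label %join
join:
  %r = phi <4 x i32> [ %v, %do_load ], [ %other, %entry ]
  ret <4 x i32> %r
}
```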
This looks good. And there may be corner cases in terms of the predicate, etc. Let's make sure it works for all of them.
Thanks. But on second thought, I'm inclined to generate a full vectorized load when possible. This should make it more immune to the LLVM uncertainty. It also reduces the size of the LLVM IR, which improves compile time. Please check my latest version and see if it looks good.
@htyu Thanks for fixing this issue. cc @scxiao
@htyu I'll land it.
I'll need to take a deeper look. NV loads come with those cache flags and I'm not sure how to express them in the LLVM dialect. But yeah, I'm in general not in favor of using asm volatiles. It'd be great to get rid of them.
@htyu Sounds good. Keep us posted!
* Stablize load vectorization
* fix test failures
* Shared one mask check when decomposing a load
* Revert "fix test failures". This reverts commit 75a461a.
* Emit vectorized loads
* Fix test failures due to using vectorized load
@htyu Since we are moving our dev work upstream and closing the perf gap between this fork and upstream, could you please upstream this PR?
Sure, will do. Do you need me to upstream other PRs I made in this repo?
Yes, that would be great. Thank you very much ~
…orization (#3609): The current implementation for load vectorization uses segmented short-vectorized loads instead of a full 128-bit load. Using multiple copies of a shorter load creates a dependency on the LLVM backend (esp. the load and store vectorizer) for full vectorization. This could be fragile, as in some cases I saw the vector combine pass and the jump threading pass screw it up and result in non-ideal vectorization. This is a backport of ROCm#445.
Upstreaming PR: triton-lang#3609
I'm not quite sure why the existing code for load vectorization uses segmented short-vectorized loads instead of a full 128-bit load. Using multiple copies of a shorter load seems to create a dependency on the LLVM backend (esp. the load and store vectorizer) for full vectorization. This might be fragile, as in some cases I saw the vector combine pass and the jump threading pass screw it up and result in non-ideal vectorization.