Optimization to put LHS operand in registers for WGMMA before elementwise ops #17
Conversation
Fixed the global coalescing issue, and the PR should be ready for full review. Prefetch will be added in another PR, since we're already seeing perf gains with the current changes.
// Can only hoist operand 0
auto alloc = dotOp.getOperand(0).getDefiningOp<LocalAllocOp>();
if (!alloc || !alloc.getSrc())
  return failure();
Prefer rewriter.notifyMatchFailure().
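For instance, the early exit in the snippet above could become something like this (a sketch; the message text is illustrative):

```cpp
// notifyMatchFailure() still returns failure(), but also records a
// human-readable reason that shows up in pattern-application debug output.
auto alloc = dotOp.getOperand(0).getDefiningOp<LocalAllocOp>();
if (!alloc || !alloc.getSrc())
  return rewriter.notifyMatchFailure(
      dotOp, "operand 0 is not a local_alloc with a register source");
```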
}
for (int k = 0; k < n1; ++k)
  if (isHopper) {
    // WGMMA.cpp expects different (m-major) ordering
Simply naming a file here is misleading, as you could have just changed the expectation in that file. Why does it expect a different ordering? Is this about the warp order being different on Hopper than on Ampere?
Afaik m-major is the hardware-expected ordering for both Ampere and Hopper; the reason for the difference here is probably just code convention. Ampere uses k-major ordering here, but there is logic when lowering MMAv2 to LLVM that remaps the ordering from k-major to m-major. WGMMA.cpp has no analogous remapping logic. Before this PR, WGMMA was already able to accept operand A as values in registers (i.e. an LLVM struct) as produced here: this code lowers convert_layout from MMA (accumulator) encoding to DotOp encoding, which was for chained MMAs. Therefore, if I were to change the expectation in WGMMA.cpp, I would also have to change the expectation for chained MMAs above. That modification would probably take a lot more effort than what I'm doing here, so I've opted to leave the current convention intact and just conditionally set the ordering here instead.

I can change the comment here to be clearer, though.
Thanks for the explanation! Just adding a comment is enough, and it looks good now, thanks :)
As a general comment, I think it would be nicer to review this, & for openai to review it, if it were separated into multiple PRs. You don't need to have a performance improvement in every PR; it's enough to have a chain of easily-readable PRs that give a performance improvement in the end. This PR as a whole is doing a lot & could have at least separated the enablement of registers in mmav3 from the pipelining improvements for more performance gains.
I've split this into two PRs as you suggested :) - #18 and #19. I will be addressing the above comments in these new PRs from now on.
UPDATE - split into two PRs per @vwbaker's suggestion: #18, #19. Comments will be addressed in these new PRs instead.
Notes
Hopper has two kinds of WGMMAs, "SS" (both operands in shmem) and "RS" (LHS operand A in registers).
In cases where we apply elementwise operations on A before the WGMMA, Triton would previously copy A from global memory (GMEM) into registers (RF), perform the elementwise ops, and then copy the result to shared memory (SMEM) to perform an SS WGMMA.
This PR adds an optimization for the case above to use an RS WGMMA instead, while keeping the benefit of SMEM pipelining: A is copied from GMEM to SMEM, then from SMEM to RF (where the elementwise ops run), and the RS WGMMA is performed from there. The GMEM-to-SMEM copy is necessary for coalesced access and allows pipelining with LDGSTS (and potentially TMA in the future).
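For illustration, a hypothetical MLIR-style pattern skeleton for the hoist described above (op and helper names are assumptions based on the snippets in this thread, not the PR's actual code, and assume a recent MLIR):

```cpp
// Hypothetical sketch: match a dot whose LHS (operand A) is a shared-memory
// local_alloc fed from register values, and rewrite it to consume those
// values directly so the lowering can emit an RS WGMMA instead of SS.
struct HoistDotLhsToRegisters : public mlir::OpRewritePattern<DotOp> {
  using OpRewritePattern::OpRewritePattern;

  mlir::LogicalResult
  matchAndRewrite(DotOp dotOp, mlir::PatternRewriter &rewriter) const override {
    // Can only hoist operand 0; operand B must stay in shared memory.
    auto alloc = dotOp.getOperand(0).getDefiningOp<LocalAllocOp>();
    if (!alloc || !alloc.getSrc())
      return rewriter.notifyMatchFailure(
          dotOp, "LHS is not a register-backed local_alloc");
    // Feed the register values straight into the dot, eliminating the
    // RF -> SMEM round trip for operand A. (A real implementation would
    // also have to ensure the source carries the DotOp operand encoding,
    // e.g. via a convert_layout.)
    rewriter.modifyOpInPlace(dotOp,
                             [&] { dotOp->setOperand(0, alloc.getSrc()); });
    return mlir::success();
  }
};
```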