Optimization to put LHS operand in registers for WGMMA before elementwise ops #17
Conversation
Fixed the global coalescing issue, and the PR should be ready for full review. Prefetch will be added in another PR, since we're already seeing perf gains with the current changes.
// Can only hoist operand 0
auto alloc = dotOp.getOperand(0).getDefiningOp<LocalAllocOp>();
if (!alloc || !alloc.getSrc())
  return failure();
Prefer rewriter.notifyMatchFailure().
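For instance, the early exit in the snippet above could become something like this (a sketch; the message text is illustrative):

```cpp
// notifyMatchFailure() still returns failure(), but also records a
// human-readable reason that shows up in pattern-application debug output.
auto alloc = dotOp.getOperand(0).getDefiningOp<LocalAllocOp>();
if (!alloc || !alloc.getSrc())
  return rewriter.notifyMatchFailure(
      dotOp, "operand 0 is not a local_alloc with a register source");
```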
}
for (int k = 0; k < n1; ++k)
  if (isHopper) {
    // WGMMA.cpp expects different (m-major) ordering
Simply naming a file here is misleading, as you could have just changed the expectation in that file. Why does it expect a different ordering? Is this about the warp order being different on Hopper than on Ampere?
Afaik m-major is the hardware-expected ordering for both Ampere and Hopper; the reason for the difference here is probably just code convention. Ampere uses k-major ordering here, but there is logic when lowering MMAv2 to LLVM that remaps the ordering from k-major to m-major. WGMMA.cpp has no analogous remapping logic. Before this PR, WGMMA was already able to accept operand A as values in registers (i.e. an LLVM struct) as produced here: this code lowers convert_layout from MMA (accumulator) encoding to DotOp encoding, which was for chained MMAs. Therefore, if I were to change the expectation in WGMMA.cpp, I would also have to change the expectation for chained MMAs above. That modification would probably take a lot more effort than what I'm doing here, so I've opted to leave the current convention intact and just conditionally set the ordering here instead.

I can change the comment here to be clearer, though.
Thanks for the explanation! Just adding a comment is enough, and it looks good now, thanks :)
As a general comment, I think it would be nicer to review this, & for openai to review it, if it were separated into multiple PRs. You don't need to have a performance improvement in every PR; it's enough to have a chain of easily-readable PRs that give a performance improvement in the end. This PR as a whole is doing a lot & could have at least separated the enablement of registers in mmav3 from the pipelining improvements for more performance gains.
I've split this into two PRs as you suggested :) - #18 and #19. I will be addressing the above comments in these new PRs from now on.
UPDATE - split into two PRs per @vwbaker's suggestion: #18, #19. Comments will be addressed in these new PRs instead.
Notes
Hopper has two kinds of WGMMAs, "SS" (both operands in shmem) and "RS" (LHS operand A in registers).
In cases where we apply elementwise operations on A before the WGMMA, Triton would previously copy A from global memory (GMEM) into registers (RF), perform the elementwise ops, and then copy the result to shared memory (SMEM) to perform an SS WGMMA.
This PR adds an optimization for the case above to use an RS WGMMA instead, while keeping the benefit of SMEM pipelining: A is copied from GMEM to SMEM, then from SMEM to RF (where the elementwise ops run), and the RS WGMMA is performed from there. The GMEM-to-SMEM copy is necessary for coalesced access and allows pipelining with LDGSTS (and potentially TMA in the future).
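For illustration, a hypothetical MLIR-style pattern skeleton for the hoist described above (op and helper names are assumptions based on the snippets in this thread, not the PR's actual code, and assume a recent MLIR):

```cpp
// Hypothetical sketch: match a dot whose LHS (operand A) is a shared-memory
// local_alloc fed from register values, and rewrite it to consume those
// values directly so the lowering can emit an RS WGMMA instead of SS.
struct HoistDotLhsToRegisters : public mlir::OpRewritePattern<DotOp> {
  using OpRewritePattern::OpRewritePattern;

  mlir::LogicalResult
  matchAndRewrite(DotOp dotOp, mlir::PatternRewriter &rewriter) const override {
    // Can only hoist operand 0; operand B must stay in shared memory.
    auto alloc = dotOp.getOperand(0).getDefiningOp<LocalAllocOp>();
    if (!alloc || !alloc.getSrc())
      return rewriter.notifyMatchFailure(
          dotOp, "LHS is not a register-backed local_alloc");
    // Feed the register values straight into the dot, eliminating the
    // RF -> SMEM round trip for operand A. (A real implementation would
    // also have to ensure the source carries the DotOp operand encoding,
    // e.g. via a convert_layout.)
    rewriter.modifyOpInPlace(dotOp,
                             [&] { dotOp->setOperand(0, alloc.getSrc()); });
    return mlir::success();
  }
};
```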