[MFMA] Support 64x4 and 4x64 tile size #469
Conversation
Force-pushed from ab65f73 to ae2eff0.
@@ -0,0 +1,800 @@
// RUN: (! triton-opt %s -split-input-file --tritonamdgpu-accelerate-matmul=arch-generation-name=gfx908 --mlir-pass-pipeline-crash-reproducer=%t 2>/dev/null) | FileCheck --check-prefixes=CHECK %s
This file (and two others) is generated. I will add the generation scripts in the next PR; I just don't want to add too many things in one PR.
Force-pushed from ae2eff0 to 02fd385.
Force-pushed from 02fd385 to edada86.
This PR enables two new MxN tile sizes: 64x4 and 4x64. Both of them use mfma 4x4 instructions.
Force-pushed from edada86 to ceece88.
  {16, 16, 4, 1, ROCDL::mfma_f32_16x16x4f32::getOperationName()}},
  // mfma_f32_4x4x1f32
- {{4, MfmaTypeId::Fp32TyId, 1},
+ {{4, 4, MfmaTypeId::Fp32TyId, 1},
  {4, 4, 16, 1, ROCDL::mfma_f32_4x4x1f32::getOperationName()}},
I think I made a mistake here. It should be 4, 4, 1, 1 instead of 4, 4, 16, 1.
Oh, I realized this is for (4x16) x (16x4) --> 4x4. nvm
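For readers following the thread: the entry being discussed lives in the table that maps a requested mfma shape to the instruction implementing it. The sketch below is a minimal reconstruction under stated assumptions (`MfmaKey`, `MfmaInsn`, and the field meanings are illustrative guesses, not the file's actual types), meant only to show why k is 16 for `mfma_f32_4x4x1f32`.

```cpp
// Illustrative sketch of the mfma selection table; type and field names
// (MfmaKey, MfmaInsn, kBase, "version") are assumptions, not PR code.
#include <map>
#include <string>
#include <tuple>

enum class MfmaTypeId { Fp32TyId };

// Lookup key: {mDim, nDim, element type, mfma version (assumed meaning)}.
using MfmaKey = std::tuple<unsigned, unsigned, MfmaTypeId, unsigned>;

struct MfmaInsn {
  unsigned m, n;
  unsigned k;     // effective reduction depth of one issued instruction
  unsigned kBase; // k-elements each thread feeds per operand (assumed)
  std::string opName;
};

// mfma_f32_4x4x1f32 executes 16 independent 4x4x1 blocks per issue; chaining
// the blocks along the reduction dimension makes one issue compute a
// (4x16) x (16x4) -> 4x4 product, hence k = 16 despite the "x1" mnemonic.
const std::map<MfmaKey, MfmaInsn> mfmaTable = {
    {{16, 16, MfmaTypeId::Fp32TyId, 1},
     {16, 16, 4, 1, "rocdl.mfma.f32.16x16x4f32"}},
    {{4, 4, MfmaTypeId::Fp32TyId, 1},
     {4, 4, 16, 1, "rocdl.mfma.f32.4x4x1f32"}},
};
```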
lib/Conversion/TritonGPUToLLVM/ConvertLayoutOpToLLVM/SharedToDotOperandMFMA.cpp (resolved review thread)
@@ -1233,10 +1239,15 @@ class ConvertTritonGPUOpToLLVMPatternBase {
  auto warpsPerCTA = mfmaLayout.getWarpsPerCTA();

  SmallVector<unsigned> numWarpsPerDim(2);
  unsigned mDim = mfmaLayout.getMDim();
  unsigned nDim = mfmaLayout.getNDim();
  assert((mDim == nDim && (mDim == 32 || mDim == 16 || mDim == 4)) ||
Do we really need this kind of assert everywhere? Can we check this only once, at the earliest point in the codegen?
I agree, this code is redundant most of the time. It just helps when adding new layouts, so we don't forget anything crucial.
You can set a new (unsupported) m/n combination in the accelerate-matmul pass, run the tests, and see where these asserts fire.
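To make the suggestion concrete, one way a single up-front check could look is sketched below. This is a hypothetical helper under assumptions (`verifyMfmaDims` is an invented name, and the skewed arm reflects the 64x4 / 4x64 shapes this PR adds), not code from the PR.

```cpp
// Hypothetical one-time validation at the entry of codegen, replacing the
// per-pattern asserts; verifyMfmaDims is an illustrative name, not PR code.
#include <cassert>

inline void verifyMfmaDims(unsigned mDim, unsigned nDim) {
  // Square shapes supported before this PR, plus the skewed shapes it adds.
  bool squareOk = (mDim == nDim) && (mDim == 32 || mDim == 16 || mDim == 4);
  bool skewedOk = (mDim == 64 && nDim == 4) || (mDim == 4 && nDim == 64);
  assert((squareOk || skewedOk) && "unsupported mfma mDim/nDim combination");
  (void)squareOk; // silence unused-variable warnings in release builds
  (void)skewedOk;
}
```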
I think this PR is good to go.
@scxiao @vgokhale
However, in the case of the FA decode kernel, the tile shape is 16x128, and mfma16 will be picked according to the heuristic. @alefimov-amd In the next PR, can you change …
Can you specify a unique value of …
Sure.
This is a useful idea, thank you!
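For readers without the pass open, the heuristic being discussed roughly prefers the largest instruction that fits the tile. The sketch below is an assumed approximation; the name `chooseMfmaDimensions` and the exact thresholds are illustrative, not the pass's actual logic.

```cpp
// Rough sketch of a shape-based mfma selection heuristic; thresholds and the
// chooseMfmaDimensions name are assumptions for illustration only.
#include <cstdint>
#include <utility>

std::pair<unsigned, unsigned> chooseMfmaDimensions(int64_t m, int64_t n) {
  if (m >= 32 && n >= 32)
    return {32, 32};
  if (m >= 16 && n >= 16)
    return {16, 16}; // a 16x128 FA decode tile lands here, as noted above
  if (m >= 64 && n < 16)
    return {64, 4};  // skewed shapes enabled by this PR
  if (m < 16 && n >= 64)
    return {4, 64};
  return {4, 4};
}
```

Under this reading, the exchange above asks for an override knob so a kernel like FA decode can force a different instruction than the one the heuristic picks.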
This PR refactors the logic of mfma instruction selection. It brings everything from ROCm#441 and parts of ROCm#469 so that we should have full support of mfma32 and mfma16 with all types. But support for mfma4 is not complete yet. We leave it to future PRs. Also in a future PR, we'll add tests for AMD f8 inputs.
This PR updates SharedToDotOperandMFMA.cpp and MFMA.cpp.
- SharedToDotOperandMFMA.cpp is up to date with triton-mlir as of today, which includes changes until ROCm#482
- Fixed an issue with opaque pointers
- Fixed the API for `getMFMAElemsPerInstrForOperands` and `getMFMARepForOperands`
- MFMA.cpp is synced with triton-mlir@6bb04d, which includes changes until ROCm#469

Note to @binarman: changes in other files from ROCm#469 are not included in this PR. We can bring up support for mfma 64x4 and 4x64 later.