
[AMD] Enable test_dot3d on AMD backend #3298

Closed

zhanglx13 wants to merge 8 commits from the enable_dot3d_amd2 branch

Conversation

zhanglx13 (Collaborator) commented Mar 6, 2024

Port #3056 onto the AMD backend.

zhanglx13 force-pushed the enable_dot3d_amd2 branch 2 times, most recently from 480a50c to 25da738, on March 6, 2024 at 22:58
zhanglx13 force-pushed the enable_dot3d_amd2 branch 2 times, most recently from 23df613 to 47c95ed, on March 8, 2024 at 02:24
zhanglx13 marked this pull request as ready for review on March 8, 2024 at 02:43
zhanglx13 requested a review from zahimoud on March 8, 2024 at 02:43
zhanglx13 force-pushed the enable_dot3d_amd2 branch from 47c95ed to 6fe12c8 on March 8, 2024 at 02:54
zhanglx13 requested a review from ThomasRaoux on March 8, 2024 at 02:57
zhanglx13 (Collaborator, Author): @ThomasRaoux @zahimoud Gentle ping for review :)

@@ -524,12 +524,13 @@ SmallVector<Value> getMultiDimOffset(Attribute layout, Location loc,
      return multiDimOffset;
    }
    if (auto mfmaLayout = layout.dyn_cast<AMDMfmaEncodingAttr>()) {
      // TODO: extend to support dot3d
Contributor: Do we still need this?

      auto multiDimBase =
          emitBaseIndexForLayout(loc, rewriter, layout, type, false);
      SmallVector<SmallVector<unsigned>> offsets;
      assert(rank == 2);
      SmallVector<Value> multiDimOffset(rank);
-     emitMfmaOffsetForCTA(mfmaLayout, offsets, multiDimCTAInRepId[0],
+     emitMfmaOffsetForCTA(mfmaLayout, offsets, 0, multiDimCTAInRepId[0],
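      // Note (assumption from context, not stated in the diff): the new third
      // argument looks like the batch-dimension offset added for dot3d,
      // hardcoded to 0 on this path.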
Contributor: Is there a way to not hardcode this?

zhanglx13 (Collaborator, Author): This part of the code is used by lowerDistributedToDistributed, which I feel is not used at all.

  • One use case is in the epilogue, when we need to convert mfma to blocked before tt.store. However, the optimize_epilogue pass removes that conversion and does the tt.store in the mfma layout directly.
  • Other distributed-layout conversions, like blocked->dotOp or mfma->dotOp, should either be decomposed or go through a shortcut.

Do you have a use case for this conversion?

Contributor: Well, at least for mma, some ops do not support the mma layout, so if we have a transpose or scan on the result of a dot, we would have to convert mma to blocked. It might be a similar case for mfma. I would keep the support.

zhanglx13 (Collaborator, Author): This is interesting.
If this conversion happens inside a loop, the traffic to and from shared memory will hurt performance a lot. So I guess in this case you'd rather pay the performance price than add mma support to those ops.
I can support it anyway, but do you have a test so that I can verify the results?

Contributor: Not sure if we have a test; maybe @ThomasRaoux knows.

Collaborator: I believe this should be tested in test_convert2d, as we test different combinations of conversions there, including mma layouts. If this is not tested, we should definitely add a case.

> If this conversion happens inside a loop, the traffic to and from shared memory will hurt performance a lot. So I guess in this case you'd rather pay the performance price than add mma support to those ops.

Well, there are always different ways to propagate layouts and reduce the cost, but we want functionality first, so we should support this no matter what. When we run into performance problems, we will address them.

zhanglx13 (Collaborator, Author): Fair enough. I get the point that we should support conversions between distributed layouts, so we should keep this file.

> I believe this should be tested in test_convert2d, as we test different combinations of conversions there, including mma layouts. If this is not tested, we should definitely add a case.

I'd suggest we do that in a future PR, since this PR is about dot3d. WDYT?

        return true;
      else
        return false;
    if ((rank == 3) && (order[0] + opIdx == 2))
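    // Note (my reading, not stated in the PR): for a rank-3 dot, A is
    // [batch, M, K] and B is [batch, K, N], so the K dim index is 2 - opIdx;
    // order[0] + opIdx == 2 therefore checks that the fastest-varying dim is K.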
Contributor: The way I got over special-casing for rank==2 and rank==3 in SharedToDotOperandMMAv2 is to create a dummy 3D tensor out of the input tensor, do the codegen, and then throw away that 3D tensor, so we only have to think about 3D codegen rather than having if/else everywhere. Can we do the same here?
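A minimal sketch of that trick (hypothetical helpers, not the actual code from #3056): promote a rank-2 shape to rank-3 with a dummy leading batch dimension of 1, run only the rank-3 codegen path, then drop the batch dimension again at the end.

    #include "llvm/ADT/ArrayRef.h"
    #include "llvm/ADT/SmallVector.h"

    // Promote a 2D shape to 3D by prepending a dummy batch dim of 1, so the
    // downstream codegen only ever has to handle rank 3.
    llvm::SmallVector<int64_t> promoteShapeTo3d(llvm::ArrayRef<int64_t> shape) {
      llvm::SmallVector<int64_t> s(shape.begin(), shape.end());
      if (s.size() == 2)
        s.insert(s.begin(), 1); // dummy batch dimension
      return s;
    }

    // After codegen, drop the dummy batch dim to recover the 2D result shape.
    llvm::SmallVector<int64_t> dropDummyBatchDim(llvm::ArrayRef<int64_t> shape) {
      return llvm::SmallVector<int64_t>(shape.begin() + 1, shape.end());
    }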

zhanglx13 (Collaborator, Author): When I checked your PR, I did not pay attention to what you did in this file, since I'm not familiar with the mma layout. I agree this is a good way to get rid of the if (rank == ...) special cases as much as possible. However, given the size of this PR, can we do it in a future PR as a refactor?

Collaborator: I think we should just do it now. We may forget about it if we delay the refactoring to a future PR :)

zhanglx13 (Collaborator, Author): Fair enough. I'll find some time next week to refactor this part.

zhanglx13 force-pushed the enable_dot3d_amd2 branch from 25951fd to 14e5b56 on March 8, 2024 at 22:36
Comment on lines +806 to +807:

    if (rank == 3)
      multiDimBase[0] = urem(warpId, i32_val(shape[0]));

Contributor: Looks like this function is generalized already, or am I missing something? If it is not, could you add an assert for this here? Something like:

Suggested change:

    if (rank == 3) {
      assert(_warpsPerCTA[1] == 1 && _warpsPerCTA[2] == 1);
      multiDimBase[0] = urem(warpId, i32_val(shape[0]));
    }

binarman (Contributor) commented Mar 11, 2024: If you mean that this particular check is not general, I think we can safely declare that dim 0 is the slowest one, i.e. the warp order is [2, 1, 0] or [2, 0, 1], so you need something like:

Suggested change:

    if (rank == 3) {
      auto singleDotWarps = _warpsPerCTA[rank - 1] * _warpsPerCTA[rank - 2];
      multiDimBase[0] = urem(udiv(warpId, i32_val(singleDotWarps)), i32_val(shape[0]));
    }
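A stand-alone scalar illustration of the suggested index math (my sketch, assuming the warp order is [2, 1, 0], i.e. dim 0 is slowest): the batch base index of a warp is its linear id divided by the number of warps in one 2D dot, wrapped by the batch extent.

    // With warpsPerCTA = {B, M, N}, warps 0..M*N-1 belong to batch 0, the
    // next M*N warps to batch 1, and so on, wrapping at the batch extent
    // shape[0], i.e. urem(udiv(warpId, singleDotWarps), shape[0]) above.
    int batchBase(int warpId, const int warpsPerCTA[3], int shape0) {
      int singleDotWarps = warpsPerCTA[1] * warpsPerCTA[2]; // warps per 2D dot
      return (warpId / singleDotWarps) % shape0;
    }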

zhanglx13 marked this pull request as draft on March 13, 2024 at 02:30

zhanglx13 (Collaborator, Author): Waiting for #3171 to land; then I'll rebase and add the refactors.

zahimoud (Contributor):

> Waiting for #3171 to land; then I'll rebase and add the refactors.

Are you still working on this?

zahimoud (Contributor): Closing until @zhanglx13 can pick this up again.

zahimoud closed this on Mar 26, 2024
alefimov-amd: @zahimoud FYI, I'll continue this task.
