[AMD][WMMA] Support dot3d #3674
Conversation
+cc @joviliast
lib/Dialect/TritonGPU/IR/Dialect.cpp
Outdated
@@ -1649,7 +1676,7 @@ AMDWmmaEncodingAttr::getShapePerCTATileForDotOperands(ArrayRef<int64_t> shape,
 unsigned AMDWmmaEncodingAttr::getTotalElemsPerThreadForOperands(
     ArrayRef<int64_t> shape, Type eltTy, int kWidth, int opIdx) const {
   auto rep = getWMMARepForOperands(shape, eltTy, kWidth, opIdx);
-  return rep[0] * rep[1] * kWidth;
+  return rep[0] * rep[1] * rep[2] * kWidth;
Could we use something like return product(rep) * kWidth; ?
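For illustration, a minimal sketch of what such a helper could look like, assuming no product utility already exists in the codebase (LLVM-style projects often provide one):

#include <cstdint>
#include <functional>
#include <iterator>
#include <numeric>

// Hypothetical helper: multiplies all entries of a rep vector, so the call
// site reads the same whether or not rep carries a batch dimension.
template <typename Range>
uint64_t product(const Range &r) {
  return std::accumulate(std::begin(r), std::end(r), uint64_t{1},
                         std::multiplies<uint64_t>());
}

With rep = {repBatch, repM, repK} for a dot3d operand, product(rep) * kWidth then matches the rep[0] * rep[1] * rep[2] * kWidth form in the diff above.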
unsigned mfmaInstrNonK = elemsPerInstr[opIdx == 0 ? 0 : 1];
unsigned mfmaInstrK = elemsPerInstr[opIdx == 0 ? 1 : 0];
Since this has become common logic, could you please rename it?
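A possible shape of that rename, as a sketch only (the names below are illustrative, not taken from the PR):

// The mfma prefix is misleading now that WMMA shares this path; a
// layout-neutral name keeps the ternary selection logic unchanged.
unsigned instrNonK = elemsPerInstr[opIdx == 0 ? 0 : 1];
unsigned instrK = elemsPerInstr[opIdx == 0 ? 1 : 0];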
LGTM. Have you run test_dot locally on Navi?
Force-pushed from 9bb5b68 to 85a1379
Yes.
Force-pushed from f27d70c to ae72bf9
Thanks. LGTM
Can we add some lit tests? At the moment we don't have CI for RDNA GPUs, so test_core.py is effectively not checked and may regress at any time. Lit tests check the compiler transformation and give us some guarantee. They are also easier to read and fix than full-blown integration runtime tests, so lit tests are typically the first line of defense for quality.
@antiagainst @joviliast is working on the same code, so even if I add a lit test, it will probably break in the near future, adding redundant work for him (or me, depending on who merges first). I can implement some basic test that checks there are no crashes, but in my opinion such a test does not guarantee much. P.S. We have a basic LLIR interpreter that could help check the changes from this PR, but at this point it requires massive work. I would prefer to invest time in that task if correctness on Navi aligns with our team's priorities.
Force-pushed from 40c142e to 8c9ba47
@antiagainst PTAL
Force-pushed from 8c9ba47 to 55988f4
if triton.runtime.driver.active.get_current_target().arch == "gfx1100":
    if in_dtype_str == "int8" or in_dtype_str == "float32":
        pytest.skip(f"{in_dtype_str} is not supported in WMMA dot")
    if out_dtype_str == "float16":
There are float16 accumulate wmma ops? Are they not matching the precision w.r.t. reference pytorch?
I need to check this. At some point they did not match, but maybe that is no longer the case, since a lot of time has passed since I implemented this.
Yes, the precision issue is still there.
I suspect this is a hardware problem, though it requires more investigation of WMMA behavior.
Okay, thanks. Worth understanding more. I think we can also promote to f32 and then downcast if necessary.
@binarman Are we currently using V_WMMA_F16_16X16X16_F16 and seeing an accuracy mismatch with pytorch? If so, can we use V_WMMA_F32_16X16X16_F16 and then cast to fp16 as @antiagainst mentioned?
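A scalar sketch of why the two paths can differ, assuming a compiler and target with _Float16 support (e.g. recent clang); this only illustrates the rounding behavior, it is not the actual WMMA codegen:

#include <cstdio>

int main() {
  // f16 accumulation rounds the running sum to half precision after every
  // step, roughly what V_WMMA_F16_16X16X16_F16 does with an f16 accumulator.
  _Float16 accF16 = 0;
  // f32 accumulation keeps extra precision and downcasts once at the end,
  // matching the promote-to-f32-then-truncate path suggested above.
  float accF32 = 0.0f;
  for (int k = 0; k < 16; ++k) {
    _Float16 a = (_Float16)0.1f;
    _Float16 b = (_Float16)0.1f;
    accF16 = accF16 + a * b;       // rounds to f16 on every iteration
    accF32 += (float)a * (float)b; // accumulates in f32
  }
  printf("f16 accumulate:           %f\n", (double)accF16);
  printf("f32 accumulate, downcast: %f\n", (double)(_Float16)accF32);
  return 0;
}

The two accumulated values drift apart as K grows, which is the kind of mismatch a test comparing against an f32 pytorch reference would flag.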
assert(shape[0] % mnkDim[0] == 0);
multiDimWarpId[0] =
    urem(multiDimWarpId[0], i32_val(ceil<unsigned>(shape[0], mnkDim[0])));
if (shape[rank - 2] >= mnkDim[0]) {
We have quite a few duplicated shape[rank - N] references. What about using some self-documenting local variables for them? Then there is less chance of being inconsistent.
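For instance, a hypothetical form of that suggestion (variable names are illustrative):

// Hoist the repeated rank-relative lookups into named locals so the M/N
// dimensions are computed in exactly one place.
int64_t shapeM = shape[rank - 2];
int64_t shapeN = shape[rank - 1];
if (shapeM >= mnkDim[0]) {
  // ... index math below reads in terms of shapeM / shapeN ...
}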
This PR is intended to support dot3d; I suggest refactoring this code as a separate task (FYI, we discussed this some time ago: #3600 (comment)).
A lot of this code is the same on the MFMA side, and it will be better to refactor both MFMA and WMMA at the same time.
We have two ideas for how to refactor this code (see the sketch after this list):
- always assume we have a batch dimension in dot
- use a structure with named fields, i.e. M/N/K/B, instead of indexes
Choosing one of these paths is a separate task, which will be the next step after test bringup.
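A minimal sketch of the second option, under the assumption that nothing like it exists yet (names are illustrative, not from the PR):

// Named dimension indices instead of raw shape[rank - N] arithmetic.
struct DimIdx {
  int b; // batch index; only meaningful for rank-3 (dot3d) shapes
  int m;
  int n;
};

inline DimIdx dotDimIdx(int rank) {
  bool hasBatch = rank == 3; // dot3d carries a leading batch dimension
  return {/*b=*/0, /*m=*/0 + hasBatch, /*n=*/1 + hasBatch};
}

// Call sites would then read shape[idx.m] instead of shape[rank - 2].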
SG to follow up on this later.
for (unsigned elem = 0; elem < elemsPerThreadPerGroup; elem++) {
  offsets.push_back(
      {ctaOffsetX * shapePerCta[0] + 2 * elem, ctaOffsetY * shapePerCta[1]});
  elemOffset[rank - 2] = ctaOffsetX * shapePerCta[rank - 2] + 2 * elem;
Minus is less mentally straightforward than plus. I'd suggest doing bool hasBatch = rank == 3; and then using [0 + hasBatch] for the M index and [1 + hasBatch] for the N index.
I would like to not change this now: all the other places like this use the minus style.
My suggestion is to make this refactoring a separate task.
Sure, works for me.
(Sorry, clicked the wrong button before.)
Regarding tests, I treat them as an investment to guard against future breakages. We don't have RDNA CI, so this can easily regress. Compared to the effort of writing some tests now (which is mostly one-time), I'm more concerned about the potential time lost in the future debugging all this complex logic only through integration Python tests. And we don't know how many regressions we will see throughout the journey. Also, btw, lit tests don't need to be super detailed and cover every line; we can just cover the important parts so they don't become a change detector. I don't think it's a lot of effort to update them, given that the index calculation doesn't change frequently, I believe. And whatever we change there is deliberate; it can also help folks touching the code to verify their changes. (Keep in mind that there are contributors who only work on the MFMA parts; they will not run their changes on RDNA cards to verify things pass, let alone folks only touching the NVIDIA support. But lit tests run everywhere and can provide us guarantees.)
This PR has extensive indexing calculation, so + @zhanglx13 to double check too.
This PR enables support of 3d dot and fixes tests in test_core.py
Force-pushed from 55988f4 to c824575
This PR enables support of 3d dot for RDNA GPUs. (cherry picked from commit 100e2aa)
Cherry picks for release/3.0.x

General:
- e8bc45d [BACKEND][AMD] Disable linear layout due to perf regression (#4126)
- 9a0a7c2 [AMD] Add basic verification to MFMA encoding (#4117)

For RDNA:
- 100e2aa [AMD][WMMA] Support dot3d (#3674)
- 4a1ea8e [AMD][gfx11] Fix BF16 wmma instr generation (#4135)

Proton HIP PRs:
- 328b86d [PROTON] Refactor GPU profilers (#4056)
- 60613fb [PROTON] Roctracer: convert agent id to gpu id for gpu ops (#4090)
- c1776fa [PROTON][AMD] Add Proton HIP GPU Utilization Metrics (#4119)

Co-authored-by: Lei Zhang <antiagainst@gmail.com>
Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com>
Co-authored-by: Ilya V <152324710+joviliast@users.noreply.github.com>
Co-authored-by: Keren Zhou <kerenzhou@openai.com>
Co-authored-by: mwootton <michael.wootton@amd.com>
Co-authored-by: Corbin Robeck <corbin.robeck@amd.com>