[OPTIMIZER] Take numWarps into account for Hopper mma op #2956
Conversation
Force-pushed from 5491dbb to 768f822
That makes sense to me. Could you add a simple lit test?
Thanks for the review! I will work on adding a test, yes, but it seems some of the existing tests are broken: https://github.com/openai/triton/actions/runs/7556918956/job/20575017710#step:10:28148. I understand the assertion that checks whether the PTX contains the correct wgmma instruction needs to be updated, but there are also tests failing with incorrect results here: https://github.com/openai/triton/blob/main/python/test/unit/hopper/test_gemm.py#L475. Any ideas how this could be?
Our Hopper CI is currently broken due to environment problems in the CI bot; this should be fixed soon.
CI should be fixed. Can you restore the changes?
Done, still seems to be failing with:
Those are not related to your changes?
I think they are; I was just curious if you had a hunch as to why this change would affect that :/ I will take another look tomorrow if not.
Ah, not sure, I can't tell from just the log.
Force-pushed from 2b24b11 to c70573a
I was finally able to reproduce it locally, and it actually only happens when ENABLE_TMA=1, so I am looking at what this does and seeing if there's something that needs to be updated there.
Force-pushed from 2abd29f to 47564cf
This changes the wgmma instruction based on the total number of warps. Using m's shape, it calculates how many warps will be used in the m dimension, then sees how many are left for the n dimension. It then chooses the largest N such that the work is still evenly distributed. This resolves issue triton-lang#2662.
LGTM.
@ThomasRaoux, could you take a look and see if this is good to be merged?
Regarding the failing test we discussed above:
- It was TMA-specific, so the test was removed along with TMA support.
- We were concerned that this PR was somehow still triggering the reduction bug, but Tori figured out that it is actually completely orthogonal to this PR (see Reduction Op on MMA Layout produces incorrect results #3467 (comment)) and happens just the same on main. We'll work on that one separately, as we discussed on the issue.
Sorry for the delay on this PR. The change looks fine to me; however, I'm wondering whether this will cause performance regressions when we have a chain of dot ops, like in attention, and one of them forces the N dimension to be distributed across multiple warps. Is this something you have looked at? I think one solution is to land this for now and revert it later if such cases turn up in real-life workloads.
I'll merge this, but as mentioned above, if we end up needing a more complex heuristic we may have to revert it.
I don't think this should cause performance regressions. The only thing this does is use all the warps available in a block, instead of potentially keeping some of them idle, which is what happened before this landed. If a kernel becomes slower after this, that is an indication that it was already using too many warps, and the right fix would be to make it use fewer warps: we end up doing the same work per warp, but without idle warps needlessly consuming resources.
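To make the idle-warp argument above concrete, here is a small, hedged arithmetic sketch. The specific numbers (a Hopper warp group being 4 warps and one warp group covering 64 rows of M) are assumptions for illustration, not taken from this PR's code.

```python
# Illustrative only: rough warp accounting for a 64x128 tile with numWarps=8,
# assuming a Hopper warp group is 4 warps and covers 64 rows of M.
num_warps = 8
BLOCK_M, BLOCK_N = 64, 128

warp_groups = num_warps // 4          # 2 warp groups in the block
groups_for_m = max(1, BLOCK_M // 64)  # 1 warp group is enough for M

# Before this change: the leftover warp group had no work.
idle_before = warp_groups - groups_for_m    # 1 idle warp group

# After this change: the leftover group takes a slice of N instead,
# so a smaller wgmma N is used and no warp group sits idle.
groups_for_n = warp_groups // groups_for_m  # 2
n_per_group = BLOCK_N // groups_for_n       # 64 columns per group
```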
This changes the `wgmma` instruction shape based on the total number of warps. Instead of always using the largest version of `wgmma`, it honors the user's `numWarps` hint and uses a smaller `wgmma` shape to distribute the work to all warps, rather than having some of them idle.

Using m's shape, it calculates how many warps will be used in the m dimension, then sees how many are left for the n dimension. It then chooses the largest N such that the work is still evenly distributed.

This resolves issue #2662.
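Below is a minimal Python sketch of the selection heuristic described above, assuming a Hopper warp group is 4 warps and one warp group covers 64 rows of the m dimension. The function name and candidate shapes are illustrative; the actual logic lives in the C++ optimizer pass.

```python
def pick_wgmma_n(block_m, block_n, num_warps,
                 candidate_ns=(256, 128, 64, 32, 16, 8)):
    """Hypothetical helper mirroring the heuristic in the description above."""
    warp_groups = max(1, num_warps // 4)
    # Warp groups consumed by the m dimension, capped by what is available.
    groups_m = max(1, min(warp_groups, block_m // 64))
    # Whatever is left over is distributed along the n dimension.
    groups_n = max(1, warp_groups // groups_m)
    n_per_group = block_n // groups_n
    # Pick the largest instruction N that still divides each group's share
    # evenly, so no warp group is left idle.
    for cand in candidate_ns:
        if cand <= n_per_group and n_per_group % cand == 0:
            return cand
    return candidate_ns[-1]

# Example: an 8-warp block on a 64x128 tile splits N across two warp
# groups, so a wgmma shape with N=64 is chosen instead of N=128.
assert pick_wgmma_n(64, 128, 8) == 64
assert pick_wgmma_n(64, 128, 4) == 128
```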