
[OPTIMIZER] Take numWarps into account for Hopper mma op #2956

Merged: 3 commits into triton-lang:main from registers on Apr 2, 2024

Conversation

@vwbaker (Collaborator) commented Jan 17, 2024:

This changes the wgmma instruction shape based on the total number of warps.

Instead of always using the largest version of wgmma, it honors
the user's numWarps hint, and uses a smaller wgmma shape to
distribute the work to all warps, rather than having some of them
idle.

Using M's shape, it calculates how many warps will be used in the M
dimension, then sees how many are left for the N dimension. Then it
chooses the largest N such that the work is still evenly distributed.

This resolves issue #2662.
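
For illustration, here is a minimal sketch (not the actual code in this PR) of the heuristic described above. The function name, the 64-row warpgroup tile, the 4-warps-per-warpgroup grouping, and the assumption that wgmma supports N values in multiples of 8 up to 256 are all illustrative assumptions:

    #include <algorithm>
    #include <cstdint>

    // Hypothetical sketch of the heuristic: given the dot shape (M, N) and the
    // user's numWarps hint, pick the wgmma instruction N so that no warpgroup
    // sits idle.
    int64_t chooseWgmmaInstrN(int64_t M, int64_t N, int64_t numWarps) {
      // Warpgroups available in the block (a warpgroup is 4 warps).
      int64_t warpGroups = std::max<int64_t>(1, numWarps / 4);
      // Warpgroups that the M dimension can absorb, assuming each wgmma tile
      // covers 64 rows of M.
      int64_t warpGroupsAlongM =
          std::max<int64_t>(1, std::min(warpGroups, M / 64));
      // Warpgroups left over to split the N dimension across.
      int64_t warpGroupsAlongN =
          std::max<int64_t>(1, warpGroups / warpGroupsAlongM);
      // Largest supported instruction N (multiple of 8, capped at 256) such
      // that N is still evenly distributed across the remaining warpgroups.
      int64_t instrN = std::min<int64_t>(256, N / warpGroupsAlongN);
      instrN = std::max<int64_t>(8, instrN - instrN % 8);
      return instrN;
    }

With these assumptions, numWarps = 8, M = 64, N = 256 would give two warpgroups along N with instruction N = 128 each, instead of one warpgroup doing the full N = 256 while the other idles.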

@vwbaker force-pushed the registers branch 2 times, most recently from 5491dbb to 768f822 on January 17, 2024 14:08
@ThomasRaoux (Collaborator) left a comment:

That makes sense to me. Could you add a simple lit test?

Review comment on lib/Dialect/TritonGPU/Transforms/Utility.cpp (outdated, resolved)
@vwbaker (Collaborator, Author) commented Jan 17, 2024:

Thanks for the review! Yes, I will work on adding a test - but it seems one of the existing tests is broken:

https://github.com/openai/triton/actions/runs/7556918956/job/20575017710#step:10:28148

I understand that the assertion checking whether the PTX contains the correct wgmma instruction needs to be updated, but some tests are failing with incorrect results here, https://github.com/openai/triton/blob/main/python/test/unit/hopper/test_gemm.py#L475:

>       assert_close(z, golden, rtol=1e-2, atol=1e-3, check_dtype=False)
E       AssertionError: Tensor-likes are not close!
E       
E       Mismatched elements: 5 / 4096 (0.1%)
E       Greatest absolute difference: 1.999984860420227 at index (14, 27) (up to 0.001 allowed)
E       Greatest relative difference: 2.00000262260437 at index (14, 28) (up to 0.01 allowed)

hopper/test_gemm.py:475: AssertionError

Any idea why this could be happening?

@ThomasRaoux (Collaborator) commented:

Our Hopper CI is currently broken due to environment problems in the CI bot; this should be fixed soon.

@ThomasRaoux (Collaborator) commented:

CI should be fixed. Can you restore the changes?

@vwbaker (Collaborator, Author) commented Jan 17, 2024:

Done, but it still seems to be failing with:

>       assert_close(z, golden, rtol=1e-2, atol=1e-3, check_dtype=False)
E       AssertionError: Tensor-likes are not close!
E       
E       Mismatched elements: 5 / 4096 (0.1%)
E       Greatest absolute difference: 1.9999847412109375 at index (14, 27) (up to 0.001 allowed)
E       Greatest relative difference: 2.00000262260437 at index (14, 28) (up to 0.01 allowed)

hopper/test_gemm.py:475: AssertionError

@ThomasRaoux (Collaborator) commented:

> Done, but it still seems to be failing with:
>
> >       assert_close(z, golden, rtol=1e-2, atol=1e-3, check_dtype=False)
> E       AssertionError: Tensor-likes are not close!
> E
> E       Mismatched elements: 5 / 4096 (0.1%)
> E       Greatest absolute difference: 1.9999847412109375 at index (14, 27) (up to 0.001 allowed)
> E       Greatest relative difference: 2.00000262260437 at index (14, 28) (up to 0.01 allowed)
>
> hopper/test_gemm.py:475: AssertionError

Are those not related to your changes?

@vwbaker (Collaborator, Author) commented Jan 17, 2024:

> Are those not related to your changes?

I think they are; I was just curious whether you had a hunch as to why this change would affect that :/ I'll take another look tomorrow if not.

@ThomasRaoux (Collaborator) commented:

> > Are those not related to your changes?
>
> I think they are; I was just curious whether you had a hunch as to why this change would affect that :/ I'll take another look tomorrow if not.

Ah, not sure, I can't tell from just the log.

@vwbaker force-pushed the registers branch 2 times, most recently from 2b24b11 to c70573a on January 23, 2024 15:54
@vwbaker (Collaborator, Author) commented Jan 24, 2024:

> > > Are those not related to your changes?
> >
> > I think they are; I was just curious whether you had a hunch as to why this change would affect that :/ I'll take another look tomorrow if not.
>
> Ah, not sure, I can't tell from just the log.

I was finally able to reproduce it locally, and it actually only happens when ENABLE_TMA=1, so I am looking at what that path does and seeing whether something there needs to be updated.

vwbaker added 2 commits March 27, 2024 10:13
This changes the wgmma instruction based on the total number of warps.
Using M's shape, it calculates how many warps will be used in the M
dimension, then sees how many are left for the N dimension. Then it
chooses the largest N such that the work is still evenly distributed.

This resolves issue triton-lang#2662.
@vwbaker requested a review from gflegar on March 27, 2024 13:37
@gflegar changed the title from "Take numWarps into account for hopper mma op" to "[OPTIMIZER] Take numWarps into account for Hopper mma op" on Mar 28, 2024
@gflegar (Collaborator) left a comment:

LGTM.

@ThomasRaoux could you take a look if this is good to be merged?

Regarding the failing test that we discussed above:

  1. It was TMA-specific, so the test was removed along with TMA support.
  2. We were concerned that this PR was somehow still triggering the reduction bug, but Tori figured out that it is actually completely orthogonal to this PR (see the comment on #3467, "Reduction Op on MMA Layout produces incorrect results") and happens just the same on main as well. We'll work on that one separately, as we discussed on the issue.

@gflegar marked this pull request as ready for review on March 28, 2024 10:52
@gflegar requested a review from ptillet as a code owner on March 28, 2024 10:52
@gflegar requested a review from ThomasRaoux on March 28, 2024 10:53
@ThomasRaoux (Collaborator) commented:

Sorry for the delay on this PR. The change looks fine to me; however, I'm wondering whether this will cause performance regressions when we have a chain of dot ops, like in attention, and one of them forces the N dimension to be distributed across multiple warps.

Is this something you have looked at?

I think one option is to land this for now and revert it later if such cases turn up in real-life workloads.

@ThomasRaoux (Collaborator) commented:

I'll merge this, but as mentioned above, if we end up needing a more complex heuristic we may have to revert it.

@ThomasRaoux merged commit 7a7fa4a into triton-lang:main on Apr 2, 2024
5 checks passed
@gflegar (Collaborator) commented Apr 2, 2024:

I don't think this should cause performance regressions. The only thing this change does is use all the warps we have available in a block, instead of potentially keeping some of them idle, which is what happened before this landed.

If a kernel becomes slower after this, that is an indication that it was already using too many warps, and the right fix would be to just make it use fewer warps - we end up doing the same work per warp, but without having idle warps needlessly consume resources.
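
To make that concrete with illustrative numbers (not taken from the PR): with numWarps = 8 and a dot whose M only occupies one warpgroup, the old lowering could pick the widest wgmma and leave the second warpgroup idle, whereas after this change the N dimension is split across both warpgroups. If that ever turns out to be slower for some kernel, setting numWarps = 4 recovers the old per-warp workload while freeing the resources the idle warps used to occupy.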
