Allow pipelining loads that don't feed directly into tl.dot. #3415

Merged: 1 commit merged into main from dev-jlebar/more-pipelining on Mar 25, 2024

Conversation

@jlebar (Collaborator) commented Mar 19, 2024

Allow pipelining more loads indirectly used by a dot.

Previously if a load was used indirectly by a dot but there were non-unary
operations between the load and the dot, we couldn't pipeline it.

Also clean up some TODOs.
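
For concreteness, here is a minimal, hypothetical Triton kernel (not taken from this PR or its tests) showing the pattern being enabled: the results of the `a` and `b` loads reach `tl.dot` only through a binary add, so before this change the pipeliner would not pipeline those loads.

```python
import triton
import triton.language as tl

# Hypothetical kernel illustrating the pattern discussed in this PR: the `a` and `b`
# loads feed tl.dot only indirectly, through a non-unary (binary) add.
@triton.jit
def add_then_dot(a_ptr, b_ptr, c_ptr, d_ptr,
                 M: tl.constexpr, N: tl.constexpr, K: tl.constexpr,
                 BLOCK_K: tl.constexpr):
    offs_m = tl.arange(0, M)
    offs_n = tl.arange(0, N)
    acc = tl.zeros((M, N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        offs_k = k + tl.arange(0, BLOCK_K)
        a = tl.load(a_ptr + offs_m[:, None] * K + offs_k[None, :])
        b = tl.load(b_ptr + offs_m[:, None] * K + offs_k[None, :])
        c = tl.load(c_ptr + offs_k[:, None] * N + offs_n[None, :])
        # `a` and `b` are used only *indirectly* by the dot, via the add below.
        acc += tl.dot(a + b, c)
    tl.store(d_ptr + offs_m[:, None] * N + offs_n[None, :], acc)
```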

PR chain

  1. 👉 Allow pipelining loads that don't feed directly into tl.dot. #3415 👈 YOU ARE HERE

@jlebar jlebar requested a review from ptillet as a code owner March 19, 2024 19:55
@jlebar jlebar marked this pull request as draft March 19, 2024 19:55
@ThomasRaoux (Collaborator) commented Mar 19, 2024
I didn't look at all the changes, but there are already annotations that can be set on the loop to force pipelining. I think this is what we would want to use for this case, instead of adding to the existing logic that detects the load + dot pattern.
Here is the PR that enabled it:
#3092
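
For reference, a short sketch of that per-loop annotation, using the tl.range/num_stages mechanism mentioned later in this thread; the kernel and the value 3 are illustrative, not taken from #3092.

```python
import triton
import triton.language as tl

@triton.jit
def sum_kernel(x_ptr, out_ptr, N, BLOCK: tl.constexpr):
    acc = tl.zeros((BLOCK,), dtype=tl.float32)
    # num_stages on tl.range asks the compiler to pipeline this specific loop,
    # independent of whether its loads feed a tl.dot.
    for i in tl.range(0, N, BLOCK, num_stages=3):
        offs = i + tl.arange(0, BLOCK)
        acc += tl.load(x_ptr + offs, mask=offs < N, other=0.0)
    tl.store(out_ptr + tl.arange(0, BLOCK), acc)
```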

@jlebar (Collaborator, Author) commented Mar 19, 2024

We are already passing num_stages as an argument to the kernel. It appears this does the same thing as passing num_stages to the for loop's tl.range op. https://github.com/openai/triton/blob/8fed348280c32ef22c0e44b69bcd145420f78710/python/triton/compiler/code_generator.py#L933

To be clear, the issue isn't that the loop is not pipelined, it's that this particular load within the loop is not pipelined, because it does not feed into the dot directly.

Am I misunderstanding something?
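
As a usage sketch of the kernel-level mechanism described above: num_stages can be passed as a launch-time meta-parameter for the whole kernel. The values and the `add_then_dot` kernel are the hypothetical example from earlier in this thread, not from the PR.

```python
import torch

M, N, K, BLOCK_K = 16, 16, 64, 32
a = torch.randn((M, K), device="cuda", dtype=torch.float16)
b = torch.randn((M, K), device="cuda", dtype=torch.float16)
c = torch.randn((K, N), device="cuda", dtype=torch.float16)
d = torch.empty((M, N), device="cuda", dtype=torch.float32)

# Kernel-level num_stages: a launch-time meta-parameter covering the whole kernel.
# Even so, before this PR the `a` and `b` loads in the kernel above were not
# pipelined, because they reach tl.dot only through the add.
add_then_dot[(1,)](a, b, c, d, M=M, N=N, K=K, BLOCK_K=BLOCK_K, num_stages=4)
```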

@ThomasRaoux (Collaborator) commented:

> We are already passing num_stages as an argument to the kernel. It appears this does the same thing as passing num_stages to the for loop's tl.range op.
> https://github.com/openai/triton/blob/8fed348280c32ef22c0e44b69bcd145420f78710/python/triton/compiler/code_generator.py#L933
>
> To be clear, the issue isn't that the loop is not pipelined, it's that this particular load within the loop is not pipelined, because it does not feed into the dot directly.
>
> Am I misunderstanding something?

In addition to passing a per-loop num_stages, the idea was also to bypass the heuristic that only pipelines matmul loops, so I would expect every load that can be pipelined to be pipelined in this case. It's possible the PR I pointed out behaves differently from what I thought. I do think it would be nice not to have to rely on the load feeding into a dot here (of course, unless we mark ops explicitly, there will always be some heuristic involved), but I thought the "general" case would do what you want.
I could be wrong and the code may be missing something. If possible, I think we should avoid changing the heuristic.

For context, the heuristic exists for both historical and user-simplicity reasons: historical because we used to have only the global control, so the compiler had to pick which loop to pipeline, and simplicity because the pattern the existing checks detect almost always benefits from being pipelined.

@jlebar jlebar force-pushed the dev-jlebar/more-pipelining branch 2 times, most recently from 80296e1 to ab4bd63, on March 21, 2024 19:04
@jlebar jlebar marked this pull request as ready for review March 21, 2024 20:43
@jlebar jlebar requested a review from ThomasRaoux March 21, 2024 20:43
@jlebar (Collaborator, Author) commented Mar 21, 2024

Updated the PR per our discussion.

@jlebar (Collaborator, Author) commented Mar 25, 2024

Thank you for the review. :)

Previously if a load was used indirectly by a dot but there were non-unary
operations between the load and the dot, we couldn't pipeline it.

Also clean up some TODOs.

GPC: more-pipelining
@jlebar jlebar force-pushed the dev-jlebar/more-pipelining branch from ab4bd63 to 40cc26c on March 25, 2024 20:30
@jlebar jlebar enabled auto-merge (squash) March 25, 2024 20:31
@jlebar jlebar merged commit 15770f3 into main Mar 25, 2024
5 checks passed
@jlebar jlebar deleted the dev-jlebar/more-pipelining branch March 25, 2024 20:41
jlebar added a commit that referenced this pull request Mar 26, 2024
Effectively roll back #3415 due to
internal test failures.

GPC: rollback-3415
jlebar added a commit that referenced this pull request Mar 27, 2024
[Backend] Disable pipelining change from
#3415.

Effectively roll back #3415 due to
internal test failures.
ThomasRaoux pushed a commit that referenced this pull request Mar 27, 2024
[Backend] Disable pipelining change from
#3415.

Effectively roll back #3415 due to
internal test failures.
ptillet pushed a commit that referenced this pull request Apr 1, 2024
[Backend] Disable pipelining change from
#3415.

Effectively roll back #3415 due to
internal test failures.
karupayun pushed a commit to openxla/triton that referenced this pull request Apr 3, 2024
Allow pipelining loads that don't feed directly into tl.dot. (triton-lang#3415)

Allow pipelining more loads indirectly used by a dot.

Previously if a load was used indirectly by a dot but there were non-unary
operations between the load and the dot, we couldn't pipeline it.

Also clean up some TODOs.
jlebar added a commit that referenced this pull request Apr 3, 2024
PR #3472 partially rolled back PR #3415 due to internal test failures.  This PR
rolls forward the change as much as we currently can, allowing *most* but not
all relevant loads to be pipelined.

There is still a TritonGPU -> LLVM codegen bug in Triton that we have not been
able to fix, but now we catch it with asserts, PR #3549.

GPC: dot-pipelining
minjang pushed a commit to minjang/triton that referenced this pull request May 14, 2024
Effectively roll back triton-lang/triton#3415 due to
internal test failures.

GPC: rollback-3415