Unaligned matmul work #13104

Closed
Tracked by #13242
allieculp opened this issue Apr 14, 2023 · 12 comments

@allieculp

No description provided.

allieculp changed the title from "Unaligned matmul" to "Unaligned matmul work" on Apr 14, 2023
@allieculp (Author)

@qcolombet Adding this as a new issue to track unaligned matmul work; @mattwalsh for visibility

@allieculp (Author)

In progress: performance is now a reasonable approximation of the aligned case (alignment 1, prime-number sizes).
@manishucsd for visibility

@qcolombet (Contributor)

@manishucsd do we have the unaligned cases already tracked in your perf framework and in CI?

@qcolombet (Contributor)

Synced up with @manishucsd; the perf framework works out of the box for unaligned cases.
However, hooking it up in CI is still being discussed.

@qcolombet (Contributor)

Quick update here: we are getting closer to landing the perf improvements @nicolasvasilache grabbed with his new transform dialect strategy. (See https://github.com/openxla/iree/blob/c7925912b2f76b34335ab3d6949cd87a0c4f6071/compiler/src/iree/compiler/Codegen/TransformDialectStrategies/GPU/Common.cpp#L465)

When we turn this on we should be "only" 2-3x slower than cuBLAS (instead of ~20x). I.e., we're not out of the woods yet, but we're getting there.

To close the gap:

  • @manishucsd will look at the schedule we generate with the new strategy (we're likely missing some async copies for the padding; see the sketch after this list)
  • I'll take a look at the generated code to see if there is anything we can improve
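
As a concrete (and purely illustrative) picture of what "async copies for the padding" could look like, here is a minimal hand-written CUDA sketch. This is not IREE's generated code: the tile size, names, and layout are hypothetical, and it assumes an Ampere-class GPU where cp.async is available through cuda_pipeline.h.

```cuda
// Hypothetical sketch: stage one K-tile of A into shared memory with async
// copies, zero-filling the tail when K is unaligned (e.g. K = 2044, TILE_K = 32).
#include <cuda_pipeline.h>

template <int TILE_K>
__device__ void load_a_tile(const float* A, float* smemA,
                            int row, int k0, int K, int lda) {
  for (int k = threadIdx.x; k < TILE_K; k += blockDim.x) {
    int gk = k0 + k;
    if (gk < K) {
      // 4-byte asynchronous global->shared copy (Ampere cp.async).
      __pipeline_memcpy_async(&smemA[k], &A[row * lda + gk], sizeof(float));
    } else {
      smemA[k] = 0.0f;  // zero-fill the padded tail instead of loading it
    }
  }
  __pipeline_commit();       // close this batch of async copies
  __pipeline_wait_prior(0);  // block until the batch is visible in shared memory
}
```

Zero-filling the tail keeps the inner loop free of data-dependent branches while the in-bounds portion still gets the asynchronous copy/compute overlap.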

What we are still missing to get this part of the improvements landed:

@qcolombet (Contributor)

> To close the gap:
>
>   • @manishucsd will look at the schedule we generate with the new strategy (we're likely missing some async copies for the padding)
>   • I'll take a look at the generated code to see if there is anything we can improve

Quick update on that front:
For the generated code, I believe we have some redundant computations in the main loop: when masking for the padded dimensions, we check each row individually. If I'm not mistaken, we should only need to check one of the padded rows, because the padded rows are either all present or all masked. I.e., we should be able to do only one check instead of N (4 in this case).
Now, I don't believe this will help performance that much, since the bottleneck is on the memory side if I'm reading the profile correctly.
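
To make the "one check instead of N" idea concrete, here is a hedged sketch under the all-or-nothing assumption above (a 4-row group never straddles the M boundary); the names and group shape are hypothetical, not what IREE emits today.

```cuda
// Before: one bounds check per padded row (N checks in the hot loop).
// After: a single predicate for the whole group, valid only because the
// padded rows of a group are all in-bounds or all masked.
__device__ void accumulate_group(const float a[4], float bk, float acc[4],
                                 int row0, int M) {
  if (row0 < M) {            // one check covers rows row0 .. row0+3
    #pragma unroll
    for (int r = 0; r < 4; ++r)
      acc[r] += a[r] * bk;   // no per-row mask needed
  }
}
```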

Also, instead of doing smarter mask checking in each loop iteration, I believe we could peel the loop so that only the iteration that needs masking (hence the last one) has to do these checks.
Finally, given the gymnastics we do here, I think it is probably more valuable to do the padding at the graph level.
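
For reference, a rough sketch of the peeling idea (the helpers are placeholders, not the strategy's actual output):

```cuda
// Hypothetical sketch: peel the ragged last K-iteration out of the main loop
// so that only the epilogue pays for mask checks.
template <int TILE_K>
__device__ void k_loop(int K) {
  int kAligned = (K / TILE_K) * TILE_K;     // last full-tile boundary
  for (int k0 = 0; k0 < kAligned; k0 += TILE_K) {
    // main loop: every tile is full, so no masking at all
    // process_full_tile(k0);
  }
  if (kAligned < K) {
    // peeled epilogue: the only iteration that masks / zero-fills
    // process_masked_tile(kAligned, K - kAligned);
  }
}
```

Padding at the graph level would go one step further and materialize the zero-filled tail once, ahead of the dispatch, so the kernel only ever sees aligned shapes.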

On a different front, I found two issues related to the tensor core strategy:

  • It miscompiles when we use --td-matmul-strategy-use-mma-sync (e.g., with 2044 in the K dimension we get 2012 instead of 2044). I'll file an issue for that
  • It crashes when I use '2048x1024x2044' (MxNxK):
error.mlir:6:10: error: transform.structured.pad failed to apply
    %2 = linalg.matmul ins(%arg0, %arg1 : tensor<2048x2044xf32>, tensor<2044x1024xf32>) outs(%1 : tensor<2048x1024xf32>) -> tensor<2048x1024xf32>

I'll file an issue for that too.

@qcolombet (Contributor)

Here is the issue for the "pad failed to apply": #13448

@qcolombet (Contributor)

Filed #13451 for the miscompile.

@allieculp (Author)

PR is up for landing unaligned matmul support - @nicolasvasilache

@allieculp (Author)

@qcolombet Can this be considered closed with the 'soft' landing of unaligned matmuls? Let us know what work remains here.

@qcolombet (Contributor)

Let's wait for #13492 to land.

@qcolombet (Contributor)

#13492 landed, let's close this.
