Fusion for pad op in Linalg #2783

Open
hanhanW opened this issue Aug 5, 2020 · 7 comments

hanhanW (Contributor) commented Aug 5, 2020

One option is to have a linalg.tensor_pad op. If the linalg.tensor_pad is the consumer, we can lower it to the buffer world by adding a linalg.fill op and a subview op on top, and passing the result of the subview op to the generic op as its output argument. For example:

%0 = linalg.generic ... %arg { ... } (tensor<>) -> tensor<>
%1 = linalg.tensor_pad %0, %init
return %1

->

linalg.fill %0_buffer, %init
%sub_buffer = subview %0_buffer ...
linalg.generic ... %arg_buffer, %sub_buffer { ... } (memref<>, memref<>) -> ()

This only works when the pad is the "last op" (that is, when there is a buffer for the result tensor of the pad op), because we need to create the subview and pass it to the generic op.
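For concreteness, a hand-written sketch of what the bufferized form could look like (not actual compiler output: made-up static shapes, 8x8 padded to 12x12 with a pad of 2 on each side, and a plain copy standing in for the generic's payload), just to make the fill + subview structure explicit:

func.func @pad_consumer_fusion(%arg: memref<8x8xf32>, %out: memref<12x12xf32>) {
  %pad_value = arith.constant 0.0 : f32
  // Fill the entire padded result buffer with the padding value.
  linalg.fill ins(%pad_value : f32) outs(%out : memref<12x12xf32>)
  // Take the interior view that corresponds to the unpadded result.
  %interior = memref.subview %out[2, 2] [8, 8] [1, 1]
      : memref<12x12xf32> to memref<8x8xf32, strided<[12, 1], offset: 26>>
  // The generic op writes directly into the interior view, so no separate
  // pad op (and no extra copy) is needed afterwards.
  linalg.generic {
      indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>,
                       affine_map<(d0, d1) -> (d0, d1)>],
      iterator_types = ["parallel", "parallel"]}
      ins(%arg : memref<8x8xf32>)
      outs(%interior : memref<8x8xf32, strided<[12, 1], offset: 26>>) {
    ^bb0(%in: f32, %o: f32):
      linalg.yield %in : f32
  }
  return
}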

I don't have an idea yet for lowering when a linalg.tensor_pad op is the producer. We probably need the #2782 feature, so we can have a temporary buffer in the middle.

Tagging @nicolasvasilache for visibility; I will discuss this approach with him.

@hanhanW hanhanW added enhancement ➕ New feature or request codegen Shared code generation infrastructure and dialects labels Aug 5, 2020
@hanhanW hanhanW self-assigned this Aug 5, 2020
benvanik (Collaborator) commented

Not sure if this is a dupe of #1605, but this issue has a nice description so I figured I'd put it here :)
There's been another vkFFT update (https://www.reddit.com/r/nvidia/comments/jwrj8j/rtx_3080_and_radeon_vii_benchmark_results_in/) and I noticed the bit in there about zero padding. The classic intuition in the graphics/HPC world holds here: if you are memory bound, having a single convergent SIMD+SIMT select is a significant savings over actually filling memory with zeros and fetching that memory over the bus.

#2 here talks more about what he is doing:
https://www.reddit.com/r/nvidia/comments/jwrj8j/rtx_3080_and_radeon_vii_benchmark_results_in/gct9gul/?utm_source=reddit&utm_medium=web2x&context=3

But the code is the easiest to understand, and is as straightforward as you'd expect:
DTolm/VkFFT@2da9637#diff-4818a5305b39fc1ba27a8676063d6d5b9120fe8fe8eebb09bceaf38a66c28f4d

We have the same ability to do what he does here: our dynamic shapes are available as push constants (like his consts.zeropad), and we just need to ensure we are getting pads down to dispatch regions correctly (this may improve in the linalg-on-tensors world) so that we have the IR as in the description above. From there, verifying that linalg->vector->etc. can emit similar code would be really useful.
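As a rough sketch of what that per-element code amounts to (hypothetical IR, not anything the compiler emits today; %unpadded_dim stands in for a dynamic size that would arrive as a push constant), the read bounds-checks against the real extent and substitutes the padding value, so no zero-filled buffer is ever materialized:

func.func @read_or_pad(%src: memref<?xf32>, %idx: index, %unpadded_dim: index) -> f32 {
  %zero = arith.constant 0.0 : f32
  // Only lanes inside the unpadded extent load from memory; the rest just
  // produce the padding value.
  %in_bounds = arith.cmpi slt, %idx, %unpadded_dim : index
  %value = scf.if %in_bounds -> (f32) {
    %loaded = memref.load %src[%idx] : memref<?xf32>
    scf.yield %loaded : f32
  } else {
    scf.yield %zero : f32
  }
  return %value : f32
}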

This has the not-at-all-insignificant side effects of avoiding the need to pad out buffers in memory, using less space, and allowing direct transfer from producers to consumers without reallocs/copies when the pad cannot be fused into the producer.

benvanik (Collaborator) commented

oh and here's the graph:
[image: vkFFT benchmark graph]
That's why this must be looked at, especially for the Q1 GPU burndown, but the same thing will help us on the CPU too!

@benvanik benvanik added the performance ⚡ Performance/optimization related work across the compiler and runtime label Nov 22, 2020
benvanik (Collaborator) commented

This will be critical for the GPU burndown.

hanhanW (Contributor, Author) commented Dec 1, 2020

I feel this is related to the discussion that @MaheshRavishankar and I had today. We are thinking of having a pad_subview op in Linalg and seeing if we can handle it at vector.transfer_read/write with a mask. This avoids padding out buffers in memory (i.e., it uses less memory), and it's really good to know that this approach is that fast.
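For illustration, a minimal sketch of that idea using the padding-value mechanism vector.transfer_read already has (the explicit mask operand is a closely related variant; the function and shapes here are made up): lanes of the tile that fall outside the source bounds are filled with the padding value instead of being read from a physically padded buffer.

func.func @read_tile(%src: memref<?x?xf32>, %i: index, %j: index) -> vector<4x4xf32> {
  %pad = arith.constant 0.0 : f32
  // Out-of-bounds lanes are filled with %pad rather than loaded from a
  // pre-padded buffer.
  %tile = vector.transfer_read %src[%i, %j], %pad
      {in_bounds = [false, false]} : memref<?x?xf32>, vector<4x4xf32>
  return %tile : vector<4x4xf32>
}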

All the details are still unclear to me right now, but it's good to know that we have a similar option here.

MaheshRavishankar (Contributor) commented

PR #9194 fuses the tensor.pad operation with its producers. It has some correctness issues that need to be triaged.

allieculp commented

@MaheshRavishankar Bumping this up as a stale P1 - please update or deprioritize as needed.

MaheshRavishankar (Contributor) commented

Funny. I was just going to start working on some of this, and was looking for this bug :)
