Fusion for pad op in Linalg #2783

Open
hanhanW opened this issue Aug 5, 2020 · 7 comments

hanhanW (Contributor) commented Aug 5, 2020

One option is to have a linalg.tensor_pad op. If the linalg.tensor_pad is the consumer, we can lower it to the buffer world by adding a linalg.fill op and a subview op on top, and passing the result of the subview op to the generic op as its output argument. For example:

%0 = linalg.generic ... %arg { ... } (tensor<>) -> tensor<>
%1 = linalg.tensor_pad %0, %init
return %1

->

linalg.fill %0_buffer, %init
%sub_buffer = subview %0_buffer ...
linalg.generic ... %arg_buffer, %sub_buffer { ... } (memref<>, memref<>) -> ()

This only works when the pad is the "last op" (that is, when there is a buffer for the result tensor of the pad op), because we need to create the subview and pass it to the generic op.
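For concreteness, a hand-written sketch of what the bufferized form could look like (not actual compiler output: made-up static shapes, 8x8 padded to 12x12 with a pad of 2 on each side, and a plain copy standing in for the generic's payload), just to make the fill + subview structure explicit:

func.func @pad_consumer_fusion(%arg: memref<8x8xf32>, %out: memref<12x12xf32>) {
  %pad_value = arith.constant 0.0 : f32
  // Fill the entire padded result buffer with the padding value.
  linalg.fill ins(%pad_value : f32) outs(%out : memref<12x12xf32>)
  // Take the interior view that corresponds to the unpadded result.
  %interior = memref.subview %out[2, 2] [8, 8] [1, 1]
      : memref<12x12xf32> to memref<8x8xf32, strided<[12, 1], offset: 26>>
  // The generic op writes directly into the interior view, so no separate
  // pad op (and no extra copy) is needed afterwards.
  linalg.generic {
      indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>,
                       affine_map<(d0, d1) -> (d0, d1)>],
      iterator_types = ["parallel", "parallel"]}
      ins(%arg : memref<8x8xf32>)
      outs(%interior : memref<8x8xf32, strided<[12, 1], offset: 26>>) {
    ^bb0(%in: f32, %o: f32):
      linalg.yield %in : f32
  }
  return
}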

I don't have an idea yet for lowering when a linalg.tensor_pad op is the producer. We probably need the #2782 feature, so we can have a temporary buffer in the middle.

Tagging @nicolasvasilache for visibility; I will discuss this approach with him.

@hanhanW hanhanW added enhancement ➕ New feature or request codegen Shared code generation infrastructure and dialects labels Aug 5, 2020
@hanhanW hanhanW self-assigned this Aug 5, 2020
benvanik (Collaborator) commented

Not sure if this is a dupe of #1605, but this issue has a nice description so I figured I'd put it here :)
There's been another vkFFT update (https://www.reddit.com/r/nvidia/comments/jwrj8j/rtx_3080_and_radeon_vii_benchmark_results_in/) and I noticed the bit in there about zero padding. The classic intuition in the graphics/HPC world holds here: if you are memory bound, having a single convergent SIMD+SIMT select is a significant savings over actually filling memory with zeros and fetching that memory over the bus.

#2 here talks more about what he is doing:
https://www.reddit.com/r/nvidia/comments/jwrj8j/rtx_3080_and_radeon_vii_benchmark_results_in/gct9gul/?utm_source=reddit&utm_medium=web2x&context=3

But the code is the easiest to understand, and is as straightforward as you'd expect:
DTolm/VkFFT@2da9637#diff-4818a5305b39fc1ba27a8676063d6d5b9120fe8fe8eebb09bceaf38a66c28f4d

We have the same ability to do what he does here: our dynamic shapes are available as push constants (like his consts.zeropad), and we just need to ensure we are getting pads down to dispatch regions correctly (this may improve in the linalg-on-tensors world) so that we have the IR as in the description above. From there, verifying that linalg->vector->etc. can emit similar code would be really useful.
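As a rough sketch of what that per-element code amounts to (hypothetical IR, not anything the compiler emits today; %unpadded_dim stands in for a dynamic size that would arrive as a push constant), the read bounds-checks against the real extent and substitutes the padding value, so no zero-filled buffer is ever materialized:

func.func @read_or_pad(%src: memref<?xf32>, %idx: index, %unpadded_dim: index) -> f32 {
  %zero = arith.constant 0.0 : f32
  // Only lanes inside the unpadded extent load from memory; the rest just
  // produce the padding value.
  %in_bounds = arith.cmpi slt, %idx, %unpadded_dim : index
  %value = scf.if %in_bounds -> (f32) {
    %loaded = memref.load %src[%idx] : memref<?xf32>
    scf.yield %loaded : f32
  } else {
    scf.yield %zero : f32
  }
  return %value : f32
}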

This has the not-at-all-insignificant side effects of avoiding the need to pad out buffers in memory, using less space, and allowing direct transfer from producers to consumers without reallocs/copies when the pad cannot be fused into the producer.

benvanik (Collaborator) commented

oh and here's the graph:
[image: vkFFT benchmark graph]
That's why this must be looked at, especially for the Q1 GPU burndown, but the same thing will help us on the CPU too!

@benvanik benvanik added the performance ⚡ Performance/optimization related work across the compiler and runtime label Nov 22, 2020
benvanik (Collaborator) commented

This will be critical for the GPU burndown.

hanhanW (Contributor, Author) commented Dec 1, 2020

I feel this is related to the discussion that @MaheshRavishankar and I had today. We are thinking of having a pad_subview op in Linalg and seeing if we can handle it at vector.transfer_read/write with a mask. This avoids padding out buffers in memory (i.e., it uses less memory), and it's really good to know that this approach is that fast.
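For illustration, a minimal sketch of that idea using the padding-value mechanism vector.transfer_read already has (the explicit mask operand is a closely related variant; the function and shapes here are made up): lanes of the tile that fall outside the source bounds are filled with the padding value instead of being read from a physically padded buffer.

func.func @read_tile(%src: memref<?x?xf32>, %i: index, %j: index) -> vector<4x4xf32> {
  %pad = arith.constant 0.0 : f32
  // Out-of-bounds lanes are filled with %pad rather than loaded from a
  // pre-padded buffer.
  %tile = vector.transfer_read %src[%i, %j], %pad
      {in_bounds = [false, false]} : memref<?x?xf32>, vector<4x4xf32>
  return %tile : vector<4x4xf32>
}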

All the details are still unclear to me right now, but it's good to know that we have a similar option here.

MaheshRavishankar (Contributor) commented

PR #9194 fuses the tensor.pad operation with its producers. It has some correctness issues that need to be triaged.

allieculp commented

@MaheshRavishankar Bumping this up as a stale P1 - please update or deprioritize as needed.

MaheshRavishankar (Contributor) commented

Funny. I was just going to start working on some of this, and was looking for this bug :)
