
Fuse pad with its producers. #10184

Closed
wants to merge 2 commits into from

Conversation

@MaheshRavishankar (Contributor) commented Aug 24, 2022

Fixes #2783

@MaheshRavishankar (Contributor, Author)

https://reviews.llvm.org/D132720 contains the upstream changes needed for this PR.

@MaheshRavishankar (Contributor, Author)

Currently blocked by #11273

grypp added a commit to grypp/iree that referenced this pull request on Jan 19, 2023:

This is a WIP PR that shows a first attempt at data tiling for GPUs. It implements materialization for the encoding.

Note that this PR cannot compile any program yet, because it generates `tensor.pad` and we do not know how to tile it. iree-org#10184 can be enabled to tile `tensor.pad`, but that then runs into the bufferization problem iree-org#11273.
@MaheshRavishankar (Contributor, Author)

@KoolJBlack: rebased on main. Can you share the input model you have?

@MaheshRavishankar added labels on Apr 13, 2023: (deprecated) buildkite:benchmark-android (Deprecated. Please use benchmarks:android-*), benchmarks:cuda (Run default CUDA benchmarks), benchmarks:x86_64 (Run default x86_64 benchmarks), benchmarks:comp-stats (Run default compilation statistics benchmarks)
@iree-github-actions-bot (Contributor)

Abbreviated Android Benchmark Summary

@ commit 1939f8132b31a4fb2e226982ab4974d501998d64 (vs. base df166ed8d89ef3fe4103a27cb5f5e6f20fbc2599)

Regressed Latencies 🚩

| Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Std Dev (ms) |
|---|---|---|---|
| MobileBertSquad [fp16] (TFLite) full-inference,experimental-flags with IREE-Vulkan @ Pixel-6-Pro (GPU-Mali-G78) | 88.091 (vs. 77.980, 12.97%↑) | 88.048 | 0.278 |

Improved Latencies 🎉

| Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Std Dev (ms) |
|---|---|---|---|
| PoseNet [fp32] (TFLite) big-core,full-inference,default-flags with IREE-LLVM-CPU-Sync @ Pixel-4 (CPU-ARMv8.2-A) | 191.302 (vs. 278.063, 31.20%↓) | 191.299 | 0.117 |

@MaheshRavishankar (Contributor, Author)

@nicolasvasilache, it looks like this is in a landable state from within IREE. It gets you what you were looking for with #13042. The current draft of this PR changes the tiling-interface implementation of `tensor.pad` to not generate the `scf.if` (see the associated LLVM change here: https://github.com/iree-org/iree-llvm-fork/compare/da68d2164efcc1f5e57f090e2ae2219056b120a0...c3b15b0adbf0972bac2c6aae262337a6259214e7). Unfortunately there is no way to "conditionally load" an interface implementation (it's one or the other). For IREE, though, it looks like we can just avoid generating the `if` conditional during tiling, which seems to match what you are looking for as well. I think we can change the tiling-interface registration to allow for what we need here.

If that is not kosher, we can fork the `TilingInterface` implementation for the `tensor.pad` operation in IREE and use the variant that doesn't generate the `if` by default. I am actually leaning towards this solution.
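For readers following along, the guard under discussion can be modeled in plain Python. This is a simplified, hypothetical 1-D sketch, not IREE's or MLIR's actual implementation; the `max` clamping stands in for the `affine.max`/`affine.min` arithmetic tiled IR would use:

```python
def tile_of_pad(src, low_pad, pad_value, tile_start, tile_size, elide_guard=False):
    """Toy model of tiling a 1-D tensor.pad with `low_pad` leading pad elements.

    With the guard (the scf.if), a tile lying entirely inside the padding is
    produced as a pure fill without touching `src`; otherwise the tile is a
    slice of `src` re-padded to the tile shape.
    """
    tile_end = tile_start + tile_size
    if not elide_guard and tile_end <= low_pad:
        # scf.if "then" branch: the tile is all padding -> just a fill, no load.
        return [pad_value] * tile_size
    # "else" branch: clamp the slice into `src` (models affine.max/min),
    # then prepend whatever padding the tile still needs.
    src_begin = max(tile_start - low_pad, 0)
    src_end = max(tile_end - low_pad, 0)
    piece = src[src_begin:src_end]
    return [pad_value] * (tile_size - len(piece)) + piece
```

In this toy model, eliding the guard degenerates gracefully into a zero-sized read plus a fill because the offsets are clamped; the debate in this thread is precisely about whether real lowered code can rely on that behavior or ends up out of bounds.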

@nicolasvasilache (Contributor)

> Unfortunately there is no way to "conditionally load" an interface implementation (it's one or the other). It looks like for IREE though we can just avoid generating the if conditional during tiling. It seems to match what you are looking for as well. I think we can change the tiling interface registration to allow for what we need here.
> If that is not kosher we can fork the TilingInterface implementation for tensor.pad in IREE and use the variant that doesn't generate the if by default. I am actually leaning towards this solution.

I don't grok all the details of the IREE workarounds, but the upstream change is incorrect and will potentially miscompile to out-of-bounds (OOB) code for all possible users of the PadOp. "Knowing" that you can take only the else branch because one either 1) has done something else before or 2) will do something else after is an injection of user information.

In particular, note that on GPUs, the assumption you are always making that "tile size is always greater than the amount of padding" quickly fails to hold as one distributes the most minor dimension to threadIdx.x with vector<1xf32>, vector<2xf32>, or vector<4xf32>: the assumption actually almost never holds.

In my case of interest, I have additional information related to 2) that makes it possible to handle this properly. What works for me, and should work for you too, is to provide the user information after tiling with e.g. https://reviews.llvm.org/D148125.

Can't you just refactor the functionality you need and use it in the particular place you need it?

Also cc @qcolombet, with whom we discussed some of this, and #13042, earlier.
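The distribution scenario described above is easy to quantify. The following is a hypothetical back-of-the-envelope sketch (not tied to any real IREE configuration): once the innermost dimension is distributed to threadIdx.x in chunks of vector<1xf32>/vector<2xf32>/vector<4xf32>, the per-thread tile size is 1, 2, or 4, and any pad amount at least that large puts some tiles entirely inside the padding:

```python
def tiles_fully_in_padding(dim_size, low_pad, tile_size):
    """Start offsets of tiles that lie entirely inside the low-padding region
    when a padded dimension of dim_size + low_pad elements is tiled by
    tile_size."""
    padded_size = dim_size + low_pad
    return [start for start in range(0, padded_size, tile_size)
            if start + tile_size <= low_pad]

# vector<2xf32> per thread (tile size 2) with 3 elements of low padding:
# the first tile [0, 2) is all padding, so the guard cannot be elided.
print(tiles_fully_in_padding(dim_size=13, low_pad=3, tile_size=2))
# With a large workgroup-level tile (e.g. 64), no tile fits inside the
# padding, which is the regime where eliding the guard happens to be safe.
print(tiles_fully_in_padding(dim_size=13, low_pad=3, tile_size=64))
```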

@MaheshRavishankar (Contributor, Author)

> Unfortunately there is no way to "conditionally load" an interface implementation (it's one or the other). It looks like for IREE though we can just avoid generating the if conditional during tiling. It seems to match what you are looking for as well. I think we can change the tiling interface registration to allow for what we need here.
> If that is not kosher we can fork the TilingInterface implementation for tensor.pad in IREE and use the variant that doesn't generate the if by default. I am actually leaning towards this solution.

> I don't grok all the details of the IREE workarounds but the upstream change is incorrect and will potentially miscompile to OOB code for all possible users of the PadOp. "Knowing" that you can take only the else branch because one either 1) has done something else before or 2) will do something else after is an injection of user information.

There is no workaround here. This is handling pad operations (and fusion with producers and consumers) without having special carve-outs for the pad op. The issue is the generation of the if condition. I know the upstream changes are incorrect (that's why there is a big "Do Not Submit" in the commit message).

> In particular, note that on GPUs, the assumption you are always making that "tile size is always greater than the amount of padding" quickly fails to hold as one distributes the most minor dimension to threadIdx.x with vector<1xf32>, vector<2xf32>, or vector<4xf32>: the assumption actually almost never holds.

> In my case of interest, I have additional information related to 2) that is able to handle this properly. What works for me and should work for you too, is to provide the user information after tiling with e.g. https://reviews.llvm.org/D148125.

> Can't you just refactor the functionality you need and use it in the particular place you need it?

I don't see why using that is better. It is generating the `if` and then removing the `if`; I am just not generating the `if` to begin with. Saying that this is user control is a bit of a strange wording. There is no user here, or rather IREE is the user. What I was suggesting is that we have an IREE-specific implementation of the `TilingInterface` for the `pad` op which basically doesn't even generate the `scf.if` for IREE. So IREE, as a user, is injecting this information and taking on the burden of making sure that this assertion holds. For example, forking the implementation in IREE will effectively make it unnecessary for you to use that op in IREE.

Still, there are footguns here that make me uncomfortable landing this (I ran the benchmarks here just to check whether the assertion holds today... and it seems to, but it is very easy to fall into this hole). I think the `tensor.pad` operation needs to evolve, or tensor-based code generation is hitting limits in terms of the abstraction being stretched too far (Uday pointed this out on Discourse too w.r.t. pack operations, and I think he has a point: at the whole-program level, tensor-based operations are useful, but within codegen they seem not to play well overall).

@MaheshRavishankar (Contributor, Author)

Superseded by #13133.

@benvanik benvanik deleted the pad_fusion branch May 9, 2024 20:20
Successfully merging this pull request may close these issues.

Fusion for pad op in Linalg