
Fuse pad with its producers. #10184

Closed
wants to merge 2 commits into from

Conversation

@MaheshRavishankar (Contributor) commented Aug 24, 2022

Fixes #2783

@MaheshRavishankar (Contributor, Author)

https://reviews.llvm.org/D132720 contains the upstream changes needed for this PR.

@MaheshRavishankar (Contributor, Author)

Currently blocked by #11273

grypp added a commit to grypp/iree that referenced this pull request on Jan 19, 2023:

This is a WIP PR that shows a first attempt at data tiling for GPUs. It implements materialization for the encoding.

Note that this PR cannot compile any program yet, because it generates `tensor.pad` and we do not know how to tile it. iree-org#10184 can be enabled to tile `tensor.pad`, but that then runs into the bufferization problem iree-org#11273.
@MaheshRavishankar (Contributor, Author)

@KoolJBlack: rebased on main. Can you share the input model you have?

@MaheshRavishankar added labels on Apr 13, 2023: (deprecated) buildkite:benchmark-android (Deprecated. Please use benchmarks:android-*), benchmarks:cuda (Run default CUDA benchmarks), benchmarks:x86_64 (Run default x86_64 benchmarks), benchmarks:comp-stats (Run default compilation statistics benchmarks)
@iree-github-actions-bot (Contributor)

Abbreviated Android Benchmark Summary

@ commit 1939f8132b31a4fb2e226982ab4974d501998d64 (vs. base df166ed8d89ef3fe4103a27cb5f5e6f20fbc2599)

Regressed Latencies 🚩

| Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Std Dev (ms) |
|---|---|---|---|
| MobileBertSquad [fp16] (TFLite) full-inference,experimental-flags with IREE-Vulkan @ Pixel-6-Pro (GPU-Mali-G78) | 88.091 (vs. 77.980, 12.97%↑) | 88.048 | 0.278 |

Improved Latencies 🎉

| Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Std Dev (ms) |
|---|---|---|---|
| PoseNet [fp32] (TFLite) big-core,full-inference,default-flags with IREE-LLVM-CPU-Sync @ Pixel-4 (CPU-ARMv8.2-A) | 191.302 (vs. 278.063, 31.20%↓) | 191.299 | 0.117 |

@MaheshRavishankar (Contributor, Author)

@nicolasvasilache, it looks like this is in a landable state from within IREE. It gets you what you were looking for with #13042. The current draft of this PR changes the tiling-interface implementation of `tensor.pad` to not generate the `scf.if` (see the associated LLVM change here: https://github.com/iree-org/iree-llvm-fork/compare/da68d2164efcc1f5e57f090e2ae2219056b120a0...c3b15b0adbf0972bac2c6aae262337a6259214e7). Unfortunately there is no way to "conditionally load" an interface implementation (it's one or the other). For IREE, though, it looks like we can just avoid generating the `if` conditional during tiling, which seems to match what you are looking for as well. I think we can change the tiling-interface registration to allow for what we need here.

If that is not kosher, we can fork the `TilingInterface` implementation for the `tensor.pad` operation in IREE and use the variant that doesn't generate the `if` by default. I am actually leaning towards this solution.
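For readers following along, the guard under discussion can be modeled in plain Python. This is a simplified, hypothetical 1-D sketch, not IREE's or MLIR's actual implementation; the `max` clamping stands in for the `affine.max`/`affine.min` arithmetic tiled IR would use:

```python
def tile_of_pad(src, low_pad, pad_value, tile_start, tile_size, elide_guard=False):
    """Toy model of tiling a 1-D tensor.pad with `low_pad` leading pad elements.

    With the guard (the scf.if), a tile lying entirely inside the padding is
    produced as a pure fill without touching `src`; otherwise the tile is a
    slice of `src` re-padded to the tile shape.
    """
    tile_end = tile_start + tile_size
    if not elide_guard and tile_end <= low_pad:
        # scf.if "then" branch: the tile is all padding -> just a fill, no load.
        return [pad_value] * tile_size
    # "else" branch: clamp the slice into `src` (models affine.max/min),
    # then prepend whatever padding the tile still needs.
    src_begin = max(tile_start - low_pad, 0)
    src_end = max(tile_end - low_pad, 0)
    piece = src[src_begin:src_end]
    return [pad_value] * (tile_size - len(piece)) + piece
```

In this toy model, eliding the guard degenerates gracefully into a zero-sized read plus a fill because the offsets are clamped; the debate in this thread is precisely about whether real lowered code can rely on that behavior or ends up out of bounds.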

@nicolasvasilache (Contributor)

> Unfortunately there is no way to "conditionally load" an interface implementation (it's one or the other). It looks like for IREE though we can just avoid generating the if conditional during tiling. It seems to match what you are looking for as well. I think we can change the tiling interface registration to allow for what we need here.
> If that is not kosher we can fork the TilingInterface implementation for tensor.pad in IREE and use the variant that doesn't generate the if by default. I am actually leaning towards this solution.

I don't grok all the details of the IREE workarounds, but the upstream change is incorrect and will potentially miscompile to out-of-bounds (OOB) code for all possible users of the PadOp. "Knowing" that you can take only the else branch because one either 1) has done something else before or 2) will do something else after is an injection of user information.

In particular, note that on GPUs, the assumption you are always making that "tile size is always greater than the amount of padding" quickly fails to hold as one distributes the most minor dimension to threadIdx.x with vector<1xf32>, vector<2xf32>, or vector<4xf32>: the assumption actually almost never holds.

In my case of interest, I have additional information related to 2) that makes it possible to handle this properly. What works for me, and should work for you too, is to provide the user information after tiling with e.g. https://reviews.llvm.org/D148125.

Can't you just refactor the functionality you need and use it in the particular place you need it?

Also cc @qcolombet, with whom we discussed some of this, and #13042, earlier.
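The distribution scenario described above is easy to quantify. The following is a hypothetical back-of-the-envelope sketch (not tied to any real IREE configuration): once the innermost dimension is distributed to threadIdx.x in chunks of vector<1xf32>/vector<2xf32>/vector<4xf32>, the per-thread tile size is 1, 2, or 4, and any pad amount at least that large puts some tiles entirely inside the padding:

```python
def tiles_fully_in_padding(dim_size, low_pad, tile_size):
    """Start offsets of tiles that lie entirely inside the low-padding region
    when a padded dimension of dim_size + low_pad elements is tiled by
    tile_size."""
    padded_size = dim_size + low_pad
    return [start for start in range(0, padded_size, tile_size)
            if start + tile_size <= low_pad]

# vector<2xf32> per thread (tile size 2) with 3 elements of low padding:
# the first tile [0, 2) is all padding, so the guard cannot be elided.
print(tiles_fully_in_padding(dim_size=13, low_pad=3, tile_size=2))
# With a large workgroup-level tile (e.g. 64), no tile fits inside the
# padding, which is the regime where eliding the guard happens to be safe.
print(tiles_fully_in_padding(dim_size=13, low_pad=3, tile_size=64))
```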

@MaheshRavishankar (Contributor, Author)

> Unfortunately there is no way to "conditionally load" an interface implementation (it's one or the other). It looks like for IREE though we can just avoid generating the if conditional during tiling. It seems to match what you are looking for as well. I think we can change the tiling interface registration to allow for what we need here.
> If that is not kosher we can fork the TilingInterface implementation for tensor.pad in IREE and use the variant that doesn't generate the if by default. I am actually leaning towards this solution.

> I don't grok all the details of the IREE workarounds but the upstream change is incorrect and will potentially miscompile to OOB code for all possible users of the PadOp. "Knowing" that you can take only the else branch because one either 1) has done something else before or 2) will do something else after is an injection of user information.

There is no workaround here. This is handling pad operations (and fusion with producers and consumers) without having special carve-outs for the pad op. The issue is the generation of the if condition. I know the upstream changes are incorrect (that's why there is a big "Do Not Submit" in the commit message).

> In particular, note that on GPUs, the assumption you are always making that "tile size is always greater than the amount of padding" quickly fails to hold as one distributes the most minor dimension to threadIdx.x with vector<1xf32>, vector<2xf32>, or vector<4xf32>: the assumption actually almost never holds.

> In my case of interest, I have additional information related to 2) that is able to handle this properly. What works for me and should work for you too, is to provide the user information after tiling with e.g. https://reviews.llvm.org/D148125.

> Can't you just refactor the functionality you need and use it in the particular place you need it?

I don't see why using that is better. It is generating the `if` and then removing the `if`; I am just not generating the `if` to begin with. Saying that this is user control is a bit of a strange wording. There is no user here, or rather IREE is the user. What I was suggesting is that we have an IREE-specific implementation of the `TilingInterface` for the `pad` op which basically doesn't even generate the `scf.if` for IREE. So IREE, as a user, is injecting this information and taking on the burden of making sure that this assertion holds. For example, forking the implementation in IREE will effectively make it unnecessary for you to use that op in IREE.

Still, there are footguns here that make me uncomfortable landing this (I ran the benchmarks here just to check whether the assertion holds today... and it seems to, but it is very easy to fall into this hole). I think the `tensor.pad` operation needs to evolve, or tensor-based code generation is hitting limits in terms of the abstraction being stretched too far (Uday pointed this out on Discourse too w.r.t. pack operations, and I think he has a point: at the whole-program level, tensor-based operations are useful, but within codegen they seem not to play well overall).

@MaheshRavishankar (Contributor, Author)

Superseded by #13133.

@benvanik benvanik deleted the pad_fusion branch May 9, 2024 20:20
Successfully merging this pull request may close these issues.

Fusion for pad op in Linalg