Pad handling without changing upstream interface. #13133

Merged — 1 commit merged into iree-org:main on May 8, 2023

Conversation

@MaheshRavishankar (Contributor) commented Apr 18, 2023

The current default dispatch region formation has options to

  • disable splitting pad into fill + tensor.insert_slice
  • allow fusion of pad with producer
  • allow fusion of pad with consumer.

While none of these are on by default, this PR adds support for handling them in the CPU backend. The current state is:

  • The pad by itself in a dispatch gets vectorized.
  • Pad fused with consumer gets vectorized too.
  • Pad fused with producer does not get vectorized. This requires more work and potentially some changes to get the IR into a better state w.r.t. destination-passing style.

There are lit tests that show the handling of the different modes within the CPU backend today. To get things working, one thing to handle is that the code generated by tiling the pad operation is of the form

```
scf.if {
  ...
} else {
  ... tensor.pad
}
```

The `scf.if` here accounts for cases where a tile could be reading only the padding. This does not happen in IREE, so there is a temporary hack here that just folds the `if` away. Long term, a better solution is needed (probably requiring a rethinking of the pad specification and its tiling).
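For illustration, a hand-written sketch (all names, shapes, and values here are invented; only the structure matters) of what a tiled pad reduces to once the `scf.if` is folded away — a slice of the source, padded up to the tile size:

```mlir
// Illustrative sketch only; %source, %cst, and all sizes are made up.
%slice = tensor.extract_slice %source[%i, %j] [%sz0, %sz1] [1, 1]
    : tensor<?x?xf32> to tensor<?x?xf32>
%tile = tensor.pad %slice low[0, 0] high[%h0, %h1] {
^bb0(%arg0: index, %arg1: index):
  tensor.yield %cst : f32
} : tensor<?x?xf32> to tensor<?x?xf32>
```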

@MaheshRavishankar added labels on Apr 18, 2023: `(deprecated) buildkite:benchmark-android`, `benchmarks:cuda`, `benchmarks:x86_64`, `benchmarks:comp-stats`
@github-actions bot commented Apr 18, 2023

Abbreviated Benchmark Summary

@ commit 924d826e9fab567ce12774c81c83cf30a209a689 (vs. base cbdd7893d4ab0c883f1336f86c6c9efbf02f10b8)

No improved or regressed benchmarks 🏖️

Regressed Compilation Times 🚩

| Benchmark Name | Compilation Time (ms) |
| --- | --- |
| PoseNet_fp32(tflite) [qualcomm-adreno-vulkan_android31-vulkan_spirv][default-flags,compile-stats] | 10131 (vs. 6635, 52.69%↑) |


@iree-github-actions-bot (Contributor) commented Apr 18, 2023

Abbreviated Android Benchmark Summary

@ commit a82224eb3e31078c20b060dddafe9389b3375daf (vs. base a7d37df86aa9883c67ac420cf3036f1b129b0a86)

No improved or regressed benchmarks 🏖️


@nicolasvasilache (Contributor) left a comment:


Quick heads up, once this lands, this will become functional too: #13191

```cpp
Operation *definingOp = operand.get().getDefiningOp();
auto tilingInterfaceProducer = dyn_cast<TilingInterface>(definingOp);
if (!tilingInterfaceProducer || isa<tensor::PadOp>(definingOp) ||
```

Great, thanks for this, better to bail out ourselves than change the semantics of the op for IREE's purposes.

@hanhanW (Contributor) left a comment:


nice, just two nits!

@@ -503,26 +506,23 @@ void addConvTileAndDecomposeExpertPassPipeline(OpPassManager &passManager,

```cpp
nestedModulePM.addNestedPass<func::FuncOp>(createLLVMCPUTileAndFusePass(
    static_cast<int64_t>(TilingLevel::ParallelTiles)));
if (clEnablePadConsumerFusion) {
```

Are we able to drop the option entirely? If so, let's remove its definition, i.e., lines 46-53 in this file.

@MaheshRavishankar (Author) replied:

Yes, I think so. I was planning to do that as a follow up.

@qcolombet (Contributor) left a comment:

LGTM.
When do you think you'll merge this?
We need this fix to turn https://github.com/openxla/iree/blob/main/compiler/src/iree/compiler/Codegen/TransformDialectStrategies/GPU/Common.cpp#L37 ON, which would give a 10x boost for unaligned matmul on GPUs

@MaheshRavishankar (Author) commented:

> LGTM. When do you think you'll merge this? We need this fix to turn main/compiler/src/iree/compiler/Codegen/TransformDialectStrategies/GPU/Common.cpp#L37 ON, which would give a 10x boost for unaligned matmul on GPUs

Waiting for integrate... otherwise it is ready to go.

Comment on lines 1919 to 1932
```cpp
// Do not treat linalg ops that are all parallel as root operations in
// this sweep.
if (linalgOp.getNumLoops() == linalgOp.getNumParallelLoops()) continue;
```

Isn't this too much specialization for pad case?

@MaheshRavishankar (Author) replied:

I don't follow... the order of priority for picking the root op is:

  1. Ops that implement the tiling interface, except for simple elementwise ops.
  2. Elementwise ops if (1) is not present.
  3. If neither (1) nor (2) is present, ops like pack/pad/unpack that are data layout ops.
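To make that priority order concrete, here is a minimal, self-contained C++ sketch. `Op`, its boolean flags, and `pickRoot` are hypothetical stand-ins for IREE's actual op classification queries (TilingInterface, elementwise linalg, pack/pad/unpack), not real API:

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Hypothetical stand-in for the real op classification queries used by
// IREE's dispatch region formation; field names are illustrative only.
struct Op {
  bool tilingInterface;  // implements TilingInterface
  bool elementwise;      // simple elementwise linalg op
  bool dataLayout;       // data layout op: pack/pad/unpack
};

// Pick a root op index following the stated priority:
//   1. A TilingInterface op that is neither elementwise nor a data layout op.
//   2. An elementwise op if no op from (1) is present.
//   3. A data layout op (pack/pad/unpack) if neither (1) nor (2) exists.
std::optional<size_t> pickRoot(const std::vector<Op> &ops) {
  std::optional<size_t> elementwiseRoot, dataLayoutRoot;
  for (size_t i = 0; i < ops.size(); ++i) {
    if (ops[i].tilingInterface && !ops[i].elementwise && !ops[i].dataLayout)
      return i;  // priority 1 wins immediately
    if (ops[i].elementwise && !elementwiseRoot) elementwiseRoot = i;
    if (ops[i].dataLayout && !dataLayoutRoot) dataLayoutRoot = i;
  }
  if (elementwiseRoot) return elementwiseRoot;  // priority 2
  return dataLayoutRoot;                        // priority 3 (or empty)
}
```

For example, in a dispatch containing a pad (data layout) followed by a matmul-like TilingInterface op, the matmul is picked as root; a dispatch containing only the pad falls through to priority 3.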

@MaheshRavishankar (Author) commented:

Damn... didn't get the commit I needed on the last integrate. The wait continues.

@nicolasvasilache (Contributor) commented:

> Damn... didn't get the commit I needed on the last integrate. The wait continues.

FMI, which upstream LLVM commit is needed for this to land?

@MaheshRavishankar MaheshRavishankar enabled auto-merge (squash) May 8, 2023 17:55
@MaheshRavishankar MaheshRavishankar merged commit 3679e9c into iree-org:main May 8, 2023
NatashaKnk pushed a commit to NatashaKnk/iree that referenced this pull request Jul 6, 2023
8 participants