Pad handling without changing upstream interface. #13133

Merged — 1 commit merged into iree-org:main on May 8, 2023

Conversation

@MaheshRavishankar (Contributor) commented Apr 18, 2023

The current default dispatch region formation has options to

  • disable splitting pad into fill + tensor.insert_slice
  • allow fusion of pad with producer
  • allow fusion of pad with consumer.

While none of these are on by default, this PR adds support for handling them in the CPU backend. The current state is:

  • The pad by itself in a dispatch gets vectorized.
  • Pad fused with consumer gets vectorized too.
  • Pad fused with producer does not get vectorized. This requires more work and potentially some changes to get the IR into a better state w.r.t. destination-passing style.

There are lit tests that show the handling of the different modes within the CPU backend today. To get things working, one thing to handle is that the code generated by tiling the pad operation is of the form

```
scf.if {
  ...
} else {
  ... tensor.pad
}
```

The `scf.if` here accounts for cases where a tile could be reading only the padding. This does not happen in IREE, so there is a temporary hack here that just folds the `if` away. Long term, a better solution is needed (probably requiring a rethinking of the pad specification and its tiling).
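For illustration, a hand-written sketch (all names, shapes, and values here are invented; only the structure matters) of what a tiled pad reduces to once the `scf.if` is folded away — a slice of the source, padded up to the tile size:

```mlir
// Illustrative sketch only; %source, %cst, and all sizes are made up.
%slice = tensor.extract_slice %source[%i, %j] [%sz0, %sz1] [1, 1]
    : tensor<?x?xf32> to tensor<?x?xf32>
%tile = tensor.pad %slice low[0, 0] high[%h0, %h1] {
^bb0(%arg0: index, %arg1: index):
  tensor.yield %cst : f32
} : tensor<?x?xf32> to tensor<?x?xf32>
```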

@MaheshRavishankar added labels on Apr 18, 2023: `(deprecated) buildkite:benchmark-android`, `benchmarks:cuda`, `benchmarks:x86_64`, `benchmarks:comp-stats`
@github-actions bot commented Apr 18, 2023

Abbreviated Benchmark Summary

@ commit 924d826e9fab567ce12774c81c83cf30a209a689 (vs. base cbdd7893d4ab0c883f1336f86c6c9efbf02f10b8)

No improved or regressed benchmarks 🏖️

Regressed Compilation Times 🚩

| Benchmark Name | Compilation Time (ms) |
| --- | --- |
| PoseNet_fp32(tflite) [qualcomm-adreno-vulkan_android31-vulkan_spirv][default-flags,compile-stats] | 10131 (vs. 6635, 52.69%↑) |


@iree-github-actions-bot (Contributor) commented Apr 18, 2023

Abbreviated Android Benchmark Summary

@ commit a82224eb3e31078c20b060dddafe9389b3375daf (vs. base a7d37df86aa9883c67ac420cf3036f1b129b0a86)

No improved or regressed benchmarks 🏖️


@nicolasvasilache (Contributor) left a comment:


Quick heads up, once this lands, this will become functional too: #13191

```cpp
Operation *definingOp = operand.get().getDefiningOp();
auto tilingInterfaceProducer = dyn_cast<TilingInterface>(definingOp);
if (!tilingInterfaceProducer || isa<tensor::PadOp>(definingOp) ||
```

Great, thanks for this, better to bail out ourselves than change the semantics of the op for IREE's purposes.

@hanhanW (Contributor) left a comment:


nice, just two nits!

@@ -503,26 +506,23 @@ void addConvTileAndDecomposeExpertPassPipeline(OpPassManager &passManager,

```cpp
nestedModulePM.addNestedPass<func::FuncOp>(createLLVMCPUTileAndFusePass(
    static_cast<int64_t>(TilingLevel::ParallelTiles)));
if (clEnablePadConsumerFusion) {
```

Are we able to drop the option entirely? If so, let's remove its definition, i.e., lines 46-53 in this file.

@MaheshRavishankar (Author) replied:

Yes, I think so. I was planning to do that as a follow up.

@qcolombet (Contributor) left a comment:

LGTM.
When do you think you'll merge this?
We need this fix to turn https://github.com/openxla/iree/blob/main/compiler/src/iree/compiler/Codegen/TransformDialectStrategies/GPU/Common.cpp#L37 ON, which would give a 10x boost for unaligned matmul on GPUs

@MaheshRavishankar (Author) commented:

> LGTM. When do you think you'll merge this? We need this fix to turn main/compiler/src/iree/compiler/Codegen/TransformDialectStrategies/GPU/Common.cpp#L37 ON, which would give a 10x boost for unaligned matmul on GPUs

Waiting for integrate... otherwise it is ready to go.

Comment on lines 1919 to 1932
```cpp
// Do not treat linalg ops that are all parallel as root operations in
// this sweep.
if (linalgOp.getNumLoops() == linalgOp.getNumParallelLoops()) continue;
```

Isn't this too much specialization for pad case?

@MaheshRavishankar (Author) replied:

I don't follow... the order of priority for picking the root op is:

  1. Ops that implement the tiling interface, except for simple elementwise ops.
  2. Elementwise ops if (1) is not present.
  3. If neither (1) nor (2) is present, ops like pack/pad/unpack that are data layout ops.
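To make that priority order concrete, here is a minimal, self-contained C++ sketch. `Op`, its boolean flags, and `pickRoot` are hypothetical stand-ins for IREE's actual op classification queries (TilingInterface, elementwise linalg, pack/pad/unpack), not real API:

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Hypothetical stand-in for the real op classification queries used by
// IREE's dispatch region formation; field names are illustrative only.
struct Op {
  bool tilingInterface;  // implements TilingInterface
  bool elementwise;      // simple elementwise linalg op
  bool dataLayout;       // data layout op: pack/pad/unpack
};

// Pick a root op index following the stated priority:
//   1. A TilingInterface op that is neither elementwise nor a data layout op.
//   2. An elementwise op if no op from (1) is present.
//   3. A data layout op (pack/pad/unpack) if neither (1) nor (2) exists.
std::optional<size_t> pickRoot(const std::vector<Op> &ops) {
  std::optional<size_t> elementwiseRoot, dataLayoutRoot;
  for (size_t i = 0; i < ops.size(); ++i) {
    if (ops[i].tilingInterface && !ops[i].elementwise && !ops[i].dataLayout)
      return i;  // priority 1 wins immediately
    if (ops[i].elementwise && !elementwiseRoot) elementwiseRoot = i;
    if (ops[i].dataLayout && !dataLayoutRoot) dataLayoutRoot = i;
  }
  if (elementwiseRoot) return elementwiseRoot;  // priority 2
  return dataLayoutRoot;                        // priority 3 (or empty)
}
```

For example, in a dispatch containing a pad (data layout) followed by a matmul-like TilingInterface op, the matmul is picked as root; a dispatch containing only the pad falls through to priority 3.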

@MaheshRavishankar (Author) commented:

Damn... didn't get the commit I needed on the last integrate. The wait continues.

@nicolasvasilache (Contributor) commented:

> Damn... didn't get the commit I needed on the last integrate. The wait continues.

FMI, which upstream LLVM commit is needed for this to land?

@MaheshRavishankar MaheshRavishankar enabled auto-merge (squash) May 8, 2023 17:55
@MaheshRavishankar MaheshRavishankar merged commit 3679e9c into iree-org:main May 8, 2023
NatashaKnk pushed a commit to NatashaKnk/iree that referenced this pull request Jul 6, 2023
8 participants