[mlir][linalg] unfold projected permutation. #114704

Merged
merged 7 commits into llvm:main on Nov 9, 2024

Conversation

@javedabsar1 (Contributor) commented Nov 3, 2024

A projected permutation is effectively the folding of a mixture of transpose and broadcast into the affine map of an operand.
While folding transpose and broadcast into the affine map of a linalg.generic operand is an effective optimization, sometimes we may want to unfold that, for instance when recognizing named ops.

Example

#projection = affine_map<(d0, d1, d2, d3, d4) -> (d2, d3, d1)>
#identity   = affine_map<(d0, d1, d2, d3, d4) -> (d0, d1, d2, d3, d4)>
...
%res = linalg.generic
       { indexing_maps = [#projection, #identity, #identity],
         iterator_types = ["parallel", "parallel", "parallel",
                           "parallel", "parallel"]}
       ins(%x, %y : tensor<7x8x9xf32>, tensor<5x9x7x8x10xf32>)
       outs(%z : tensor<5x9x7x8x10xf32>) {
         ^bb0(%in: f32, %in_1: f32, %out: f32):
           %div = arith.divf %in, %in_1 : f32
           linalg.yield %div : f32
       } -> tensor<5x9x7x8x10xf32>

In the above IR, the map of operand %x is a projected permutation. This can be unfolded as:

...
%transposed = linalg.transpose
              ins(%x : tensor<7x8x9xf32>)
              outs(%e1 : tensor<9x7x8xf32>) permutation = [2, 0, 1]
...
%broadcasted = linalg.broadcast
               ins(%transposed : tensor<9x7x8xf32>)
               outs(%e2 : tensor<5x9x7x8x10xf32>) dimensions = [0, 4]
%2 = linalg.div
       ins(%broadcasted, %y : tensor<5x9x7x8x10xf32>, tensor<5x9x7x8x10xf32>)
       outs(%arg2 : tensor<5x9x7x8x10xf32>) -> tensor<5x9x7x8x10xf32>

Note that linalg.generic has been 'specialized' to linalg.div. When unfolding, it is more effective to transpose first and then broadcast. However, if the transpose is done first, the permutation map needs to be expressed in terms of the reduced dimensions (as the broadcast has not happened yet). Also, the broadcast dimensions in a linalg.generic come from the other operands (those not broadcast along that particular dimension). We work this out by computing the polytope shape of the linalg.generic from the shapes of all the operands (inputs and outputs).
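For illustration, the re-basing step for operand %x can be sketched in a few lines of standalone C++ (this is only a sketch of the idea, not the helper added by this PR; the variable names are invented for the example):

```cpp
// Derive the re-based linalg.transpose permutation for %x in the example:
// the map result (d2, d3, d1) is re-based so the referenced dims become
// contiguous from 0, then inverted into the permutation.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <map>
#include <vector>

int main() {
  // Dim positions appearing in the result of %x's map: (d2, d3, d1).
  std::vector<int64_t> result = {2, 3, 1};

  // Re-base: referenced dims {1, 2, 3} become {0, 1, 2},
  // so (d2, d3, d1) becomes (1, 2, 0).
  std::vector<int64_t> sorted(result);
  std::sort(sorted.begin(), sorted.end());
  std::map<int64_t, int64_t> rebase;
  for (size_t i = 0; i < sorted.size(); ++i)
    rebase[sorted[i]] = static_cast<int64_t>(i);

  // Invert into the linalg.transpose permutation, where
  // dim(result, i) == dim(input, permutation[i]).
  std::vector<int64_t> permutation(result.size());
  for (size_t i = 0; i < result.size(); ++i)
    permutation[rebase[result[i]]] = static_cast<int64_t>(i);

  for (int64_t p : permutation)
    std::printf("%lld ", static_cast<long long>(p)); // prints: 2 0 1
  std::printf("\n");
  return 0;
}
```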

Special thanks to Mahesh for discussions on this topic.

Identify folded 'projected permutations' (i.e. mixture of
transpose and/or broadcast) in linalg generic operands.
The 'projected permutations' are unfolded as separate
linalg.transpose and linalg.broadcast so that the generic
operates on simple identity map which is necessary to
replace generic with named op.
@llvmbot (Member) commented Nov 3, 2024

@llvm/pr-subscribers-mlir-linalg

@llvm/pr-subscribers-mlir

Author: Javed Absar (javedabsar1)

Changes

Identify folded 'projected permutations' (i.e. mixture of transpose and/or broadcast) in linalg generic operands.
The 'projected permutations' are unfolded as separate linalg.transpose and linalg.broadcast so that the generic
operates on simple identity map which is necessary to replace generic with named op.


Full diff: https://github.com/llvm/llvm-project/pull/114704.diff

6 Files Affected:

  • (modified) mlir/include/mlir/Dialect/Linalg/IR/LinalgInterfaces.td (+1-1)
  • (modified) mlir/include/mlir/Dialect/Linalg/Transforms/Transforms.h (+5)
  • (modified) mlir/lib/Dialect/Linalg/Transforms/CMakeLists.txt (+1)
  • (modified) mlir/lib/Dialect/Linalg/Transforms/Specialize.cpp (+1)
  • (added) mlir/lib/Dialect/Linalg/Transforms/UnfoldProjectedPermutation.cpp (+270)
  • (added) mlir/test/Dialect/Linalg/unfold_projected_permutation.mlir (+71)
diff --git a/mlir/include/mlir/Dialect/Linalg/IR/LinalgInterfaces.td b/mlir/include/mlir/Dialect/Linalg/IR/LinalgInterfaces.td
index 0a404194569c22..b81a4c9c8760cf 100644
--- a/mlir/include/mlir/Dialect/Linalg/IR/LinalgInterfaces.td
+++ b/mlir/include/mlir/Dialect/Linalg/IR/LinalgInterfaces.td
@@ -252,7 +252,7 @@ def LinalgStructuredInterface
       /*args=*/(ins),
       /*methodBody=*/"",
       /*defaultImplementation=*/[{
-        return getNumParallelLoops() ==  getNumParallelLoops();
+        return getNumParallelLoops() ==  getNumLoops();
       }]
     >,
     InterfaceMethod<
diff --git a/mlir/include/mlir/Dialect/Linalg/Transforms/Transforms.h b/mlir/include/mlir/Dialect/Linalg/Transforms/Transforms.h
index 0693e31b4f70af..a110eb88e9f699 100644
--- a/mlir/include/mlir/Dialect/Linalg/Transforms/Transforms.h
+++ b/mlir/include/mlir/Dialect/Linalg/Transforms/Transforms.h
@@ -1786,6 +1786,11 @@ void populateBubbleUpExtractSliceOpPatterns(RewritePatternSet &patterns);
 /// linalg.fill(%cst, tensor.extract_slice(%init)).
 void populateSwapExtractSliceWithFillPatterns(RewritePatternSet &patterns);
 
+
+/// Add patterns to make explicit broadcasts and transposes in the
+/// input operands of a genericOp.
+void populateUnfoldProjectedPermutationPatterns(RewritePatternSet &patterns);
+
 /// Patterns to apply `splitReduction` below.
 void populateSplitReductionPattern(
     RewritePatternSet &patterns,
diff --git a/mlir/lib/Dialect/Linalg/Transforms/CMakeLists.txt b/mlir/lib/Dialect/Linalg/Transforms/CMakeLists.txt
index d7c63cdd8198d7..dfe6d7a54c8f14 100644
--- a/mlir/lib/Dialect/Linalg/Transforms/CMakeLists.txt
+++ b/mlir/lib/Dialect/Linalg/Transforms/CMakeLists.txt
@@ -38,6 +38,7 @@ add_mlir_dialect_library(MLIRLinalgTransforms
   TilingInterfaceImpl.cpp
   Transforms.cpp
   TransposeConv2D.cpp
+  UnfoldProjectedPermutation.cpp
   Vectorization.cpp
   WinogradConv2D.cpp
 
diff --git a/mlir/lib/Dialect/Linalg/Transforms/Specialize.cpp b/mlir/lib/Dialect/Linalg/Transforms/Specialize.cpp
index dfafffce9d9b60..a911286d5d44b2 100644
--- a/mlir/lib/Dialect/Linalg/Transforms/Specialize.cpp
+++ b/mlir/lib/Dialect/Linalg/Transforms/Specialize.cpp
@@ -347,6 +347,7 @@ struct LinalgSpecializeGenericOpsPass
 void LinalgSpecializeGenericOpsPass::runOnOperation() {
   RewritePatternSet patterns(&getContext());
   populateLinalgGenericOpsSpecializationPatterns(patterns);
+  populateUnfoldProjectedPermutationPatterns(patterns);
 
   if (failed(applyPatternsAndFoldGreedily(getOperation(), std::move(patterns))))
     signalPassFailure();
diff --git a/mlir/lib/Dialect/Linalg/Transforms/UnfoldProjectedPermutation.cpp b/mlir/lib/Dialect/Linalg/Transforms/UnfoldProjectedPermutation.cpp
new file mode 100644
index 00000000000000..56d6bd23b2343a
--- /dev/null
+++ b/mlir/lib/Dialect/Linalg/Transforms/UnfoldProjectedPermutation.cpp
@@ -0,0 +1,270 @@
+//===- UnfoldProjectedPermutation.cpp - unfold projected permutations ----===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// This file implements a pattern to decompose the operand of a GenericOp that
+// has `transpose+broadcast` juxtaposed via its affine map into separate
+// transpose and broadcast ops.
+//
+//===----------------------------------------------------------------------===//
+//
+#include "mlir/Dialect/Linalg/Transforms/Transforms.h"
+#include <utility>
+
+#include "mlir/Dialect/Affine/IR/AffineOps.h"
+#include "mlir/Dialect/Linalg/IR/Linalg.h"
+#include <map>
+#include <optional>
+#include <vector>
+
+using namespace mlir;
+using namespace mlir::linalg;
+
+namespace {
+
+/// A projected permutation is effectively the folding of a mixture of
+/// transpose and broadcast into the affine map of an operand.
+/// While folding transpose and broadcast into the affine map of a
+/// linalg.generic operand is an effective optimization, sometimes we may
+/// want to unfold that, for instance when recognizing named ops.
+///
+///  Example
+///
+/// ```mlir
+///
+/// #projection = affine_map<(d0, d1, d2, d3, d4) -> (d2, d3, d1)>
+/// #identity   = affine_map<(d0, d1, d2, d3, d4) -> (d0, d1, d2, d3, d4)>
+/// ...
+///    %res = linalg.generic
+///       { indexing_maps = [#projection, #identity, #identity],
+///       iterator_types = ["parallel", "parallel", "parallel",
+///                         "parallel", "parallel"]}
+///       ins(%x, %y : tensor<7x8x9xf32>, tensor<5x9x7x8x10xf32>)
+///       outs(%z : tensor<5x9x7x8x10xf32>) {
+///         ^bb0(%in: f32, %in_1: f32, %out: f32):
+///              %div = arith.divf %in, %in_1 : f32
+///              linalg.yield %div : f32
+///    } -> tensor<5x9x7x8x10xf32>
+/// ```
+///
+/// In the above IR operand `%x` map is a projected-permutation. This can be
+/// unfolded as:
+///
+/// ```mlir
+///   ...
+///   %transposed = linalg.transpose ins(%x : tensor<7x8x9xf32>)
+///                    outs(%e1 : tensor<9x7x8xf32>) permutation = [2, 0, 1]
+///   ...
+///   %broadcasted = linalg.broadcast ins(%transposed : tensor<9x7x8xf32>)
+///                    outs(%e2 : tensor<5x9x7x8x10xf32>) dimensions = [0, 4]
+///   %2 = linalg.div
+///           ins(%broadcasted, %y :
+///                  tensor<5x9x7x8x10xf32>, tensor<5x9x7x8x10xf32>)
+///           outs(%arg2 : tensor<5x9x7x8x10xf32>) -> tensor<5x9x7x8x10xf32>
+/// ```
+///
+/// Note that linalg.generic has been 'specialized' to linalg.div.
+/// When unfolding, it is more effective to transpose first and then broadcast.
+/// However, if the transpose is done first, the permutation map needs to be
+/// expressed in terms of the reduced dimensions (as the broadcast hasn't
+/// happened yet). Also, the broadcast dimensions in a linalg.generic come
+/// from the other operands (those not broadcast along that particular
+/// dimension). We work this out by computing the polytope shape of the
+/// linalg.generic from the shapes of all the operands (inputs and outputs).
+
+struct UnfoldProjectedPermutation : public OpRewritePattern<GenericOp> {
+  using OpRewritePattern<GenericOp>::OpRewritePattern;
+
+  LogicalResult matchAndRewrite(GenericOp genericOp,
+                                PatternRewriter &rewriter) const override;
+};
+
+/// Calculate shape (dimensions) of the iteration space polytope.
+/// This is calculated by concatenating the indexing maps of all operands
+/// of the generic; inverting the concatenation; concatenating all the
+/// shapes of the operands; and then doing `apply map` to those two.
+SmallVector<int64_t> getPolytopeDims(GenericOp op) {
+  assert(op.hasPureTensorSemantics() && "works only on tensors");
+
+  /// Concat indexing maps of all operands and invert the mapping.
+  auto maps = op.getIndexingMapsArray();
+  auto concat = concatAffineMaps(maps);
+  auto inverse = inversePermutation(concat);
+
+  /// Concat the size of each dims of all operands.
+  SmallVector<int64_t> dims;
+  for (auto &operand : op->getOpOperands()) {
+    auto rankedType = cast<RankedTensorType>(operand.get().getType());
+    for (auto size : rankedType.getShape())
+      dims.push_back(size);
+  }
+
+  /// Match the inverse map with dims to get polytope dimensions.
+  /// Note that some may be 'kDynamic'.
+  return applyPermutationMap<int64_t>(inverse, dims);
+}
+
+/// For the given `map` determine what dimensions are transposed
+/// and what dimensions are broadcasted.
+/// Returns :
+///  `isTransposed, isBroadcast,
+///   transpose-permutation, broadcast-dimensions`
+///
+std::tuple<bool, bool, SmallVector<int64_t>, SmallVector<int64_t>>
+computeTransposeBroadcast(AffineMap &map) {
+  assert(map.isProjectedPermutation(false) && "not a projection");
+
+  // Dimensions that don't appear on result are broadcast.
+  int64_t minorSize = map.getNumResults();
+
+  // Convert affine expr to int64_t.
+  SmallVector<int64_t> minorResult;
+  for (int64_t i = 0; i < minorSize; ++i) {
+    auto expr = cast<AffineDimExpr>(map.getResults()[i]);
+    minorResult.push_back(expr.getPosition());
+  }
+
+  // If dims are not monotonically increasing then transpose is present.
+  SmallVector<int64_t> sorted(minorResult);
+  std::sort(sorted.begin(), sorted.end());
+  bool hasTranspose = !std::equal(minorResult.begin(), minorResult.end(),
+                                  sorted.begin(), sorted.end());
+
+  // Walk the sorted map result to determine which dimensions are broadcasted.
+  SmallVector<int64_t> broadcast;
+  for (int64_t i = 0, j = 0; i < map.getNumInputs(); ++i) {
+    if (j < minorSize && sorted[j] == i) {
+      j++;
+      continue;
+    }
+    broadcast.push_back(i);
+  }
+  bool hasBroadcast = broadcast.size();
+
+  /// Consider an operand `x : tensor<7x8x9>` of a genericOp that has
+  /// affine map `affine_map<(d0, d1, d2, d3, d4) -> (d2, d3, d1)>`.
+  /// `x`'s access is both transposed and broadcast. But when specifying
+  /// the `linalg.transpose(x : tensor<7x8x9>)` the dimensions need to be
+  /// specified as `affine_map<(d0, d1, d2) -> (d1, d2, d0)>` instead of
+  /// referring to d3, d4. Therefore, re-base the transpose dimensions so
+  /// that they start from d0.
+  std::map<int64_t, int64_t> minorMap;
+  for (int64_t i = 0; i < minorSize; ++i)
+    minorMap.insert({sorted[i], i});
+
+  // Re-map the dimensions.
+  SmallVector<int64_t> remappedResult(minorSize);
+  for (int64_t i = 0; i < minorSize; ++i)
+    remappedResult[i] = minorMap[minorResult[i]];
+
+  /// Calculate the permutation for the transpose.
+  SmallVector<int64_t> permutation(minorSize);
+  for (unsigned i = 0; i < minorSize; ++i) {
+    permutation[remappedResult[i]] = i;
+  }
+
+  return {hasTranspose, hasBroadcast, permutation, broadcast};
+}
+
+LogicalResult
+UnfoldProjectedPermutation::matchAndRewrite(GenericOp op,
+                                            PatternRewriter &rewriter) const {
+  if (!op.hasPureTensorSemantics() || op.isSingleInputOutput() ||
+      op.isSingleYieldOp() || !op.isAllParallelLoops())
+    return failure();
+
+  // All maps need to be projected permutations.
+  for (auto &opOperand : op->getOpOperands()) {
+    auto map = op.getMatchingIndexingMap(&opOperand);
+    if (!map.isProjectedPermutation(false))
+      return failure();
+  }
+
+  // Currently we handle only static shapes.
+  for (auto &operand : op->getOpOperands()) {
+    auto rankedType = cast<RankedTensorType>(operand.get().getType());
+    for (auto size : rankedType.getShape())
+      if (size == ShapedType::kDynamic)
+        return failure();
+  }
+
+  // Calculate polytope bounds from affine maps and operand(s) shapes.
+  auto polytope = getPolytopeDims(op);
+
+  auto loc = op.getLoc();
+  bool isChanged = false;
+  SmallVector<Value> newInitValues = op.getDpsInputs();
+  SmallVector<AffineMap> newMap = op.getIndexingMapsArray();
+
+  // Walk over each input operand and unfold if it is transposed, broadcast
+  // or mix of two via operand's affine-map.
+  for (int64_t i = 0; i < op.getNumDpsInputs(); ++i) {
+    auto &map = newMap[i];
+    auto inputRTType = cast<RankedTensorType>(newInitValues[i].getType());
+    auto elType = inputRTType.getElementType();
+
+    /// Nothing to do if map is already an identity.
+    if (map.isIdentity())
+      continue;
+
+    auto [hasTranspose, hasBroadcast, permutation, broadcastedDims] =
+        computeTransposeBroadcast(map);
+
+    if (hasTranspose) {
+      /// linalg.transpose permutes the dimensions of input using
+      /// rule: dim(result, i) = dim(input, permutation[i])
+      SmallVector<int64_t> transposedShape(map.getNumResults());
+      for (int64_t i = 0; i < map.getNumResults(); ++i)
+        transposedShape[i] = inputRTType.getShape()[permutation[i]];
+
+      Value emptyTensor =
+          rewriter.create<tensor::EmptyOp>(loc, transposedShape, elType);
+
+      auto transposeOp = rewriter.create<TransposeOp>(loc, newInitValues[i],
+                                                      emptyTensor, permutation);
+      newInitValues[i] = transposeOp->getResult(0);
+      isChanged = true;
+    }
+
+    // Does it require broadcast
+    if (hasBroadcast) {
+      assert(broadcastedDims.size() && "should have non size broadcast");
+      Value emptyTensor = rewriter.create<tensor::EmptyOp>(
+          loc, polytope, inputRTType.getElementType());
+
+      auto broadcastOp = rewriter.create<linalg::BroadcastOp>(
+          loc, newInitValues[i], emptyTensor, broadcastedDims);
+
+      newInitValues[i] = broadcastOp->getResult(0);
+      isChanged = true;
+    }
+    newMap[i] = rewriter.getMultiDimIdentityMap(map.getNumDims());
+  }
+
+  if (isChanged) {
+    SmallVector<Value> operands = op->getOperands();
+    ValueRange operandsRef(operands);
+
+    auto newOp = rewriter.create<linalg::GenericOp>(
+        /*location=*/op.getLoc(),
+        /*resultTensorTypes=*/op->getResultTypes(),
+        /*inputs=*/newInitValues,
+        /*outputs=*/operandsRef.drop_front(op.getNumDpsInputs()),
+        /*indexingMaps=*/newMap,
+        /*iteratorTypes=*/op.getIteratorTypesArray());
+
+    newOp.getRegion().takeBody(op->getRegion(0));
+    rewriter.replaceOp(op, newOp->getResults());
+  }
+  return success();
+}
+
+} // namespace
+
+void mlir::linalg::populateUnfoldProjectedPermutationPatterns(
+    RewritePatternSet &patterns) {
+  patterns.insert<UnfoldProjectedPermutation>(patterns.getContext());
+}
diff --git a/mlir/test/Dialect/Linalg/unfold_projected_permutation.mlir b/mlir/test/Dialect/Linalg/unfold_projected_permutation.mlir
new file mode 100644
index 00000000000000..4efa07b2de12e3
--- /dev/null
+++ b/mlir/test/Dialect/Linalg/unfold_projected_permutation.mlir
@@ -0,0 +1,71 @@
+// RUN: mlir-opt %s -split-input-file --linalg-specialize-generic-ops | FileCheck %s
+
+#projection = affine_map<(d0, d1, d2, d3, d4) -> (d2, d3, d1)>
+#identity   = affine_map<(d0, d1, d2, d3, d4) -> (d0, d1, d2, d3, d4)>
+
+func.func @test_mixed(%x : tensor<7x8x9xf32>, %y:  tensor<5x9x7x8x10xf32>, %z :  tensor<5x9x7x8x10xf32>) ->  tensor<5x9x7x8x10xf32> {
+  %res = linalg.generic
+     { indexing_maps = [#projection, #identity, #identity], iterator_types = ["parallel", "parallel", "parallel", "parallel", "parallel"]} 
+     ins(%x, %y : tensor<7x8x9xf32>, tensor<5x9x7x8x10xf32>) outs(%z : tensor<5x9x7x8x10xf32>) {
+     ^bb0(%in: f32, %in_1: f32, %out: f32):
+       %div = arith.divf %in, %in_1 : f32
+       linalg.yield %div : f32
+  } -> tensor<5x9x7x8x10xf32>
+  return %res : tensor<5x9x7x8x10xf32>
+}
+
+// CHECK-LABEL: test_mixed
+// CHECK-SAME: %[[X:.+]]: tensor<7x8x9xf32>, %[[Y:.+]]: tensor<5x9x7x8x10xf32>, %[[Z:.+]]: tensor<5x9x7x8x10xf32>) -> tensor<5x9x7x8x10xf32> {
+// CHECK: %[[E0:.+]] = tensor.empty() : tensor<9x7x8xf32>
+// CHECK: %[[Transposed:.+]] = linalg.transpose ins(%[[X]] : tensor<7x8x9xf32>) outs(%[[E0]] : tensor<9x7x8xf32>) permutation = [2, 0, 1]
+// CHECK: %[[E1:.+]] = tensor.empty() : tensor<5x9x7x8x10xf32>
+// CHECK: %[[Broadcasted:.+]] = linalg.broadcast ins(%[[Transposed]] : tensor<9x7x8xf32>) outs(%[[E1]] : tensor<5x9x7x8x10xf32>) dimensions = [0, 4]
+// CHECK: {{.*}} = linalg.div ins(%[[Broadcasted]], %[[Y]] : tensor<5x9x7x8x10xf32>, tensor<5x9x7x8x10xf32>) outs(%[[Z]] : tensor<5x9x7x8x10xf32>) -> tensor<5x9x7x8x10xf32>
+// CHECK-NOT: linalg.generic
+
+// -----
+
+#identity = affine_map<(d0, d1, d2) -> (d0, d1, d2)>
+#transposed = affine_map<(d0, d1, d2) -> (d2, d0, d1)>
+
+func.func @test_transposed(%x : tensor<32x2x16xf32>, %y:  tensor<2x16x32xf32>, %z :  tensor<2x16x32xf32>) ->  tensor<2x16x32xf32> {
+  %res = linalg.generic
+     { indexing_maps = [#transposed, #identity, #identity], iterator_types = ["parallel", "parallel", "parallel"]}
+     ins(%x, %y : tensor<32x2x16xf32>, tensor<2x16x32xf32>)
+     outs(%z : tensor<2x16x32xf32>) {
+     ^bb0(%in: f32, %in_1: f32, %out: f32):
+       %div = arith.divf %in, %in_1 : f32
+       linalg.yield %div : f32
+  } -> tensor<2x16x32xf32>
+  return %res : tensor<2x16x32xf32>
+}
+
+// CHECK-LABEL: test_transposed
+// CHECK-SAME: %[[X:.+]]: tensor<32x2x16xf32>, %[[Y:.+]]: tensor<2x16x32xf32>, %[[Z:.+]]: tensor<2x16x32xf32>) -> tensor<2x16x32xf32> {
+// CHECK: %[[E0:.+]] = tensor.empty() : tensor<2x16x32xf32>
+// CHECK: %[[Transposed:.+]] = linalg.transpose ins(%[[X]] : tensor<32x2x16xf32>) outs(%[[E0]] : tensor<2x16x32xf32>) permutation = [1, 2, 0]
+// CHECK: {{.*}} = linalg.div ins(%[[Transposed]], %[[Y]] : tensor<2x16x32xf32>, tensor<2x16x32xf32>) outs(%[[Z]] : tensor<2x16x32xf32>) -> tensor<2x16x32xf32>
+// CHECK-NOT: linalg.generic
+
+// -----
+
+#identity = affine_map<(d0, d1, d2) -> (d0, d1, d2)>
+#broadcast = affine_map<(d0, d1, d2) -> (d0, d2)>
+func.func @test_broadcast(%x : tensor<2x16x32xf32>, %y:  tensor<2x32xf32>, %z :  tensor<2x16x32xf32>) ->  tensor<2x16x32xf32> {
+  %res = linalg.generic
+     { indexing_maps = [#identity, #broadcast, #identity], iterator_types = ["parallel", "parallel", "parallel"]}
+     ins(%x, %y : tensor<2x16x32xf32>, tensor<2x32xf32>)
+     outs(%z : tensor<2x16x32xf32>) {
+     ^bb0(%in: f32, %in_1: f32, %out: f32):
+       %div = arith.divf %in, %in_1 : f32
+       linalg.yield %div : f32
+  } -> tensor<2x16x32xf32>
+  return %res : tensor<2x16x32xf32>
+}
+
+// CHECK-LABEL: test_broadcast
+// CHECK-SAME: %[[X:.+]]: tensor<2x16x32xf32>, %[[Y:.+]]: tensor<2x32xf32>, %[[Z:.+]]: tensor<2x16x32xf32>) -> tensor<2x16x32xf32> {
+// CHECK: %[[E0:.+]] = tensor.empty() : tensor<2x16x32xf32>
+// CHECK: %[[Broadcasted:.+]] = linalg.broadcast ins(%[[Y]] : tensor<2x32xf32>) outs(%[[E0]] : tensor<2x16x32xf32>) dimensions = [1]
+// CHECK: {{.*}} = linalg.div ins(%[[X]], %[[Broadcasted]] : tensor<2x16x32xf32>, tensor<2x16x32xf32>) outs(%arg2 : tensor<2x16x32xf32>) -> tensor<2x16x32xf32>
+// CHECK-NOT: linalg.generic

github-actions bot commented Nov 3, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@banach-space (Contributor) left a comment

I've only had a chance to skim through the comments and tests and left a few nits. Will do a proper review later.

There's a lot of threads on Discourse atm and it's tricky to co-ordinate everything 😅

@banach-space (Contributor) left a comment

Nice, thanks!

I've skimmed through the tests to gauge what's going on and mostly looks fine. But I find UnfoldProjectedPermutation to be a very misleading name. I don't think that's the main thing here. Instead, this is expanding the "specialisation" logic, right?

There's a few tricky bits here, so I'd like to take another look.

}
broadcast.push_back(i);
}
bool hasBroadcast = broadcast.size();
Contributor:
Suggested change
bool hasBroadcast = broadcast.size();
bool hasBroadcast = !broadcast.empty();

Comment on lines 85 to 88
/// Calculate shape (dimensions) of the iteration space polytope.
/// This is calculated by concatenating the indexing maps of all operands
/// of the generic; inverting the concatenation; concatenating all the
/// shapes of the operands; and then doing `apply map` to those two.
Contributor:
Is "polytope" a standard term for these things? To me this is just combining all the dims and inverting them using the corresponding maps. I don't see "polytope" being used anywhere in the context of Linalg?

Also, just based on the test, this simply returns the type of the output tensor. Why not use that instead? In what cases grabbing the output tensor would not be sufficient?

Contributor Author:
I should have called it 'convex polyhedron' or nDimRectangleDims - that would have been better. But now, as @MaheshRavishankar suggested, getStaticLoopRanges does the same job.


namespace {

/// Projected permutation are effectively folding in of a mixture of
Contributor:
This comment is very insightful, but doesn't really say what the pattern does. Also, I don't believe UnfoldProjectedPermutation is accurate. The pattern (based on this comment and the tests) specialises linalg.generic {op} into linalg.transpose + linalg.broadcast + op?

Unfolding the permutation map is just an implementation detail. A very important one, but not the ultimate goal.

Contributor Author:
changed.

/// This is calculated by concatenating the indexing maps of all operands
/// of the generic; inverting the concatenation; concatenating all the
/// shapes of the operands; and then doing `apply map` to those two.
SmallVector<int64_t> getPolytopeDims(GenericOp op) {
Contributor:
Actually, this whole thing should be the same as op.getStaticLoopRanges().

Contributor Author:
Awesome. Yes, it does. Thanks for pointing that out. Wish I had seen it earlier, but then I wouldn't have learnt the inversePermutation thingie :)
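For the running example, what getStaticLoopRanges() computes can be illustrated with a standalone sketch (no MLIR APIs; the (dim, extent) pairs below are read off the shapes and indexing maps shown in the PR description):

```cpp
// Sketch of the loop-range computation: pair each loop dim referenced by an
// operand's indexing map with that operand's extent, then gather the extents
// per loop dim.
#include <cstdint>
#include <cstdio>
#include <utility>
#include <vector>

int main() {
  // (loop dim, extent) pairs gathered from all operands:
  //   %x : tensor<7x8x9>          with map (d2, d3, d1)
  //   %y, %z : tensor<5x9x7x8x10> with the 5-d identity map
  std::vector<std::pair<int64_t, int64_t>> dimExtents = {
      {2, 7}, {3, 8}, {1, 9},                   // %x
      {0, 5}, {1, 9}, {2, 7}, {3, 8}, {4, 10},  // %y
      {0, 5}, {1, 9}, {2, 7}, {3, 8}, {4, 10}}; // %z

  std::vector<int64_t> loopRanges(5, -1);
  for (auto [dim, extent] : dimExtents)
    loopRanges[dim] = extent;

  for (int64_t r : loopRanges)
    std::printf("%lld ", static_cast<long long>(r)); // prints: 5 9 7 8 10
  std::printf("\n");
  return 0;
}
```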

@javedabsar1 (Contributor Author):
> Nice, thanks!
>
> I've skimmed through the tests to gauge what's going on and mostly looks fine. But I find UnfoldProjectedPermutation to be a very misleading name. I don't think that's the main thing here. Instead, this is expanding the "specialisation" logic, right?
>
> There's a few tricky bits here, so I'd like to take another look.

I really think UnfoldProjectedPermutation names what this pattern is doing - the transpose and broadcast can be seen as folded in via the affine map. To identify them we use the existing `isProjectedPermutation` function, and what this rewrite is doing is unfolding what (at least conceptually) got folded in.

I have re-written the intro in the top of the .cpp file to make it more explicit / helpful.

But yeah, if you have a better name in mind I would be glad to change this.

@llvm llvm deleted a comment from banach-space Nov 6, 2024
@banach-space (Contributor):
> I really think UnfoldProjectedPermutation names what this pattern is doing - the transpose and broadcast can be seen as folded in via the affine map.

I think that this is skewed towards the key implementation detail rather than the transformation itself. To me, it's more like specializeGenericByUnfoldingPermutation or decomposeGenericByUnfoldingPermutation. I need to take another look at the pattern in search for inspiration, but today it's already too late 😅 Hopefully you see what I meant :)

@javedabsar1 (Contributor Author):
> > I really think UnfoldProjectedPermutation names what this pattern is doing - the transpose and broadcast can be seen as folded in via the affine map.
>
> I think that this is skewed towards the key implementation detail rather than the transformation itself. To me, it's more like specializeGenericByUnfoldingPermutation or decomposeGenericByUnfoldingPermutation. I need to take another look at the pattern in search for inspiration, but today it's already too late 😅 Hopefully you see what I meant :)

decomposeGenericByUnfoldingPermutation works for me. Thanks for suggesting it :) Will change it everywhere.

#include "mlir/Dialect/Linalg/IR/Linalg.h"
#include <map>
#include <optional>
#include <vector>
Contributor:
Is vector needed?

You should also move <utility> to this section. I think if you remove the blank line (11), clang-format would sort the includes in the right order for you :)

Contributor Author:
done

// CHECK-LABEL: transpose_and_broadcast
// CHECK-SAME: %[[X:.+]]: tensor<7x8x9xf32>, %[[Y:.+]]: tensor<5x9x7x8x10xf32>, %[[Z:.+]]: tensor<5x9x7x8x10xf32>) -> tensor<5x9x7x8x10xf32> {
// CHECK: %[[E0:.+]] = tensor.empty() : tensor<9x7x8xf32>
// CHECK: %[[X_trans:.+]] = linalg.transpose ins(%[[X]] : tensor<7x8x9xf32>) outs(%[[E0]] : tensor<9x7x8xf32>) permutation = [2, 0, 1]
Contributor (@MaheshRavishankar):
I would have expected this to be [1, 2, 0] but I am also assuming transpose keeps the input as the basis and describes how to permute the inputs.

Contributor (@banach-space):
Hm, (7, 8, 9) -> (9, 7, 8), which corresponds to (d2, d3, d1) -> (d1, d2, d3), which gives permutation = [2, 0, 1].

Now, I managed to convince myself that this is correct, but please double check for yourself 😅

@MaheshRavishankar , you might be skewed by:

#projection = affine_map<(d0, d1, d2, d3, d4) -> (d2, d3, d1)>

I think this is the trick (IIUC, this is the actual mapping here):

  • 7 -> d2,
  • 8 -> d3,
  • 9 -> d1.

Whereas you assume that:

  • 7 -> d1,
  • 8 -> d2,
  • 9 -> d3.

Does it make sense?

@javedabsar1 (Contributor Author) commented Nov 7, 2024
Yes, of course you are right! Two parts to this:

  1. Transpose semantics:
     dim(result, i) == dim(input, permutation[i])

     Therefore, for input 7x8x9 and permutation [2,0,1] the output works out to 9x7x8.

  2. Working out the permutation from the affine map, e.g. (d2, d3, d1) -> (d1, d2, d3), and vice-versa (see the sketch below).
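A tiny standalone sketch of that rule, applied to the example (names invented for illustration):

```cpp
// Applies dim(result, i) == dim(input, permutation[i]) to the example:
// input shape 7x8x9 with permutation [2, 0, 1] gives result shape 9x7x8.
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
  std::vector<int64_t> inputShape = {7, 8, 9};
  std::vector<int64_t> permutation = {2, 0, 1};

  std::vector<int64_t> resultShape(inputShape.size());
  for (size_t i = 0; i < permutation.size(); ++i)
    resultShape[i] = inputShape[permutation[i]];

  for (int64_t d : resultShape)
    std::printf("%lld ", static_cast<long long>(d)); // prints: 9 7 8
  std::printf("\n");
  return 0;
}
```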

@banach-space (Contributor) left a comment

A few comments re implementation and then mostly typo suggestions. Sorry for being a bit lazy and going after the easy part 😅 I will do better in the next round!

All in all, much appreciated and neat effort! Don't let my pedantic comments make you think otherwise :)

return failure();
}

// Currently we handle only static shapes.
Contributor:
Could this work at all for dynamic shapes?

Contributor Author:
For a start, this will assert when trying to create tensor.empty with a dynamic shape. https://github.com/llvm/llvm-project/blob/main/mlir/lib/Dialect/Tensor/IR/TensorOps.cpp#L874

Contributor:
OK, rather than documenting what the code does, could you add a comment saying "why"? Or what's missing? From what you are saying, we'd need to add logic to compute dynamic sizes of the input tensors for ops like EmptyOp? And probably sth else as well?

Comment on lines 26 to 27
/// linalg.generic is a good optimization but sometimes we may want to unwrap
/// i.e. `unfold` them as explicit transpose and broadcast. This rewrite
Contributor:
Suggested change
/// linalg.generic is a good optimization but sometimes we may want to unwrap
/// i.e. `unfold` them as explicit transpose and broadcast. This rewrite
/// linalg.generic is a good optimization but sometimes we may want to unwrap,
/// i.e., `unfold` them as explicit transpose and broadcast. This rewrite

@javedabsar1 (Contributor Author) commented Nov 8, 2024

Hi folks,
with this latest commit I think I have addressed all your review comments. Thanks a lot for your comments, which helped improve the implementation.

@MaheshRavishankar (Contributor):
My bad on the wrong comment on the transpose permutation. The representation for transpose is somehow inverse of how my brain models it.

@javedabsar1 (Contributor Author):
Thanks @MaheshRavishankar for the review.

@banach-space - are you also OK with the changes? No rush, just asking. Thanks.

@banach-space (Contributor) left a comment

A few additional comments, but nothing major. LGTM otherwise.

op.isSingleYieldOp() || !op.isAllParallelLoops())
return failure();

// All maps need to be projected permutations.
Contributor:
Ping

Comment on lines 161 to 166
for (auto &operand : op->getOpOperands()) {
auto opType = cast<RankedTensorType>(operand.get().getType());
for (auto size : opType.getShape())
if (size == ShapedType::kDynamic)
return failure();
}
Contributor:
Ping

assert(map.isProjectedPermutation(false) && "not a projection");
int64_t minorSize = map.getNumResults();

SmallVector<int64_t> minorResult;
Contributor:
What does minor mean in this context? Apologies if this is obvious.

Contributor Author:
added comments

if (hasTranspose) {
/// Consider an operand `x : tensor<7x8x9>` of a genericOp that has
/// affine map `affine_map<(d0, d1, d2, d3, d4) -> (d2, d3, d1)>`
/// `x`s access is both transposed and brodcast. But when specifying
Contributor:
[nit]

Suggested change
/// `x`s access is both transposed and brodcast. But when specifying
/// `x`s access is both transposed and broadcast. But when specifying


SmallVector<int64_t> permutation;
if (hasTranspose) {
/// Consider an operand `x : tensor<7x8x9>` of a genericOp that has
Contributor:
[nit] Doxygen style comments are normally used for API documentation. Also, below you use //

Suggested change
/// Consider an operand `x : tensor<7x8x9>` of a genericOp that has
// Consider an operand `x : tensor<7x8x9>` of a genericOp that has

Comment on lines 164 to 165
// Decomposing linalg.generic involves creating `tensor.empty`
// which cannot have dynamic shapes.
Contributor:
tensor.empty with dynamic shapes are allowed, see e.g.:

%0 = tensor.empty(%sz) : tensor<5x?x6xf32, "foo">

What's a bit "tricky" is computing the required sizes. You could leave that as a TODO.

@javedabsar1 (Contributor Author) commented Nov 9, 2024
The tensor.empty needs to be passed a runtime value for '?'. For our case I don't see how to derive it easily from the linalg.generic. Guess it could be done by getting the tensor.dim of some operand (not all)... quite a rabbit hole.

Comment on lines 166 to 171
for (auto &operand : op->getOpOperands()) {
auto opType = cast<RankedTensorType>(operand.get().getType());
for (auto size : opType.getShape())
if (size == ShapedType::kDynamic)
return failure();
}
Contributor:
Suggested change
for (auto &operand : op->getOpOperands()) {
auto opType = cast<RankedTensorType>(operand.get().getType());
for (auto size : opType.getShape())
if (size == ShapedType::kDynamic)
return failure();
}
if (!llvm::all_of(op->getOpOperands(), [](OpOperand &oper) {
      auto opType = cast<RankedTensorType>(oper.get().getType());
      return !ShapedType::isDynamicShape(opType.getShape());
    }))
  return failure();

Originally posted here:

GitHub is really good at hiding these 😓

@javedabsar1 javedabsar1 merged commit 0ac4821 into llvm:main Nov 9, 2024
8 checks passed
Groverkss pushed a commit to iree-org/llvm-project that referenced this pull request Nov 15, 2024
Patterns to decompose the input operand(s) of a linalg.generic that has
a projected permutation affine-map -- i.e. effectively a folded `transpose`,
`broadcast`, or a mixture of the two -- into explicit transpose and broadcast.
This is useful, for instance, when trying to recognize named ops.
email: quic_mabsar@quicinc.com