
EliminateEmptyTensors pass fails #11273

Open
MaheshRavishankar opened this issue Nov 23, 2022 · 22 comments

@MaheshRavishankar
Contributor

With this input

module {
  func.func @_pad_1D_test_dispatch_0() {
    %c7 = arith.constant 7 : index
    %c0 = arith.constant 0 : index
    %c0_i32 = arith.constant 0 : i32
    %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) offset(%c0) alignment(64) : !flow.dispatch.tensor<readonly:tensor<2xi32>>
    %1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) offset(%c0) alignment(64) : !flow.dispatch.tensor<writeonly:tensor<7xi32>>
    %workgroup_id_x = hal.interface.workgroup.id[0] : index
    %workgroup_count_x = hal.interface.workgroup.count[0] : index
    scf.for %arg0 = %workgroup_id_x to %c7 step %workgroup_count_x {
      %2 = affine.max affine_map<(d0) -> (-d0 + 3, 0)>(%arg0)
      %3 = affine.max affine_map<(d0) -> (0, d0 - 3)>(%arg0)
      %4 = affine.min affine_map<(d0) -> (2, d0)>(%3)
      %5 = affine.max affine_map<(d0) -> (0, d0 - 2)>(%arg0)
      %6 = affine.min affine_map<(d0) -> (2, d0)>(%5)
      %7 = affine.apply affine_map<(d0, d1) -> (d0 - d1)>(%6, %4)
      %8 = arith.cmpi eq, %7, %c0 : index
      %9 = affine.apply affine_map<(d0, d1, d2) -> (-d0 - d1 + d2 + 1)>(%2, %6, %4)
      %10 = scf.if %8 -> (tensor<1xi32>) {
        %generated = tensor.generate  {
        ^bb0(%arg1: index):
          tensor.yield %c0_i32 : i32
        } : tensor<1xi32>
        scf.yield %generated : tensor<1xi32>
      } else {
        %11 = flow.dispatch.tensor.load %0, offsets = [%4], sizes = [%7], strides = [1] : !flow.dispatch.tensor<readonly:tensor<2xi32>> -> tensor<?xi32>
        %padded = tensor.pad %11 low[%2] high[%9] {
        ^bb0(%arg1: index):
          tensor.yield %c0_i32 : i32
        } : tensor<?xi32> to tensor<1xi32>
        scf.yield %padded : tensor<1xi32>
      }
      flow.dispatch.tensor.store %10, %1, offsets = [%arg0], sizes = [1], strides = [1] : tensor<1xi32> -> !flow.dispatch.tensor<writeonly:tensor<7xi32>>
    }
    return
  }
}
iree-opt --iree-eliminate-empty-tensors repro.mlir

fails with the following error

$ ~/iree/build_debug/tools/iree-opt --iree-eliminate-empty-tensors elimiate_empty_tensors.repro.mlir 
elimiate_empty_tensors.repro.mlir:24:9: error: operand #0 of ReturnLike op does not satisfy destination passing style
        scf.yield %generated : tensor<1xi32>
        ^
elimiate_empty_tensors.repro.mlir:24:9: note: see current operation: "scf.yield"(%16) : (tensor<1xi32>) -> ()
elimiate_empty_tensors.repro.mlir:31:9: error: operand #0 of ReturnLike op does not satisfy destination passing style
        scf.yield %padded : tensor<1xi32>
        ^
elimiate_empty_tensors.repro.mlir:31:9: note: see current operation: "scf.yield"(%17) : (tensor<1xi32>) -> ()
@matthias-springer
Contributor

The error message is confusing; here is a cleanup to make it more descriptive: https://reviews.llvm.org/D138902
It will now print as:

elimiate_empty_tensors.repro.mlir:24:9: error: operand #0 may return/yield a new buffer allocation
        scf.yield %generated : tensor<1xi32>

This is a limitation that we currently have in One-Shot Bufferize: when you yield a new buffer allocation (both tensor.pad and tensor.generate bufferize to a new allocation), One-Shot Bufferize cannot deallocate the buffer. We could support this case by running the bufferization with allowReturnAllocs = true and createDealloc = false, followed by the MLIR BufferDeallocationPass. However, the generated code can be quite inefficient and contain buffer copies, so I wouldn't recommend doing that. (Although here it's just a tensor<1xi32>, so it wouldn't matter too much.) Also, buffer deallocation has known bugs that need to be fixed.
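
To make the limitation concrete, here is a minimal, self-contained sketch (my own example, shown in already-bufferized form, not IR from this issue) of the pattern One-Shot Bufferize rejects: both branches of an scf.if yield a buffer that was allocated inside the branch, so nothing outside the scf.if knows when that buffer can be freed.

func.func @yields_new_alloc(%cond: i1, %v: i32) -> i32 {
  %c0 = arith.constant 0 : index
  %zero = arith.constant 0 : i32
  // Each branch creates and yields a fresh allocation.
  %0 = scf.if %cond -> (memref<1xi32>) {
    %a = memref.alloc() : memref<1xi32>
    memref.store %v, %a[%c0] : memref<1xi32>
    scf.yield %a : memref<1xi32>
  } else {
    %b = memref.alloc() : memref<1xi32>
    memref.store %zero, %b[%c0] : memref<1xi32>
    scf.yield %b : memref<1xi32>
  }
  %r = memref.load %0[%c0] : memref<1xi32>
  // Nothing deallocates %0 here; -buffer-deallocation would have to track
  // ownership across the scf.if, potentially inserting clones/copies.
  return %r : i32
}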

Can we write a pass to move the flow.dispatch.tensor.store into the scf.if (and duplicate it)? That would solve the problem here. As a followup, we can then rewrite the IR to operate directly on the loaded buffer.

I.e., after moving the flow.dispatch.tensor.store inside the scf.if (only showing the "then" branch here):

%generated = tensor.generate  {
^bb0(%arg1: index):
  tensor.yield %c0_i32 : i32
} : tensor<1xi32>
flow.dispatch.tensor.store %generated, %1, offsets = [%arg0], sizes = [1], strides = [1] : tensor<1xi32> -> !flow.dispatch.tensor<writeonly:tensor<7xi32>>

At this point, we still have an alloc inside of the scf.if, but at least it can be deallocated at the end of the "then" block, so the error is gone.

Then the follow-up rewrite. This is essentially what EliminateEmptyTensors does, but it could be extended to other ops that allocate. So it would turn into something like EliminateAllocs.

%loaded = flow.dispatch.tensor.load %1, offsets = [%arg0], sizes = [1], strides = [1] : !flow.dispatch.tensor<writeonly:tensor<7xi32>> -> tensor<1xi32>
%generated = linalg.generic outs(%loaded : tensor<1xi32>) {
^bb0(%out: i32):
  linalg.yield %c0_i32 : i32
}
flow.dispatch.tensor.store %generated, %1, offsets = [%arg0], sizes = [1], strides = [1] : tensor<1xi32> -> !flow.dispatch.tensor<writeonly:tensor<7xi32>>

Now the alloc is completely gone. We basically turned the tensor.generate into a destination-passing style op.

We have to do the same for the "else" branch.
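
For reference, a hedged sketch of what the "else" branch could then look like, with the tensor.pad decomposed into a fill of the loaded destination plus a tensor.insert_slice (this is my sketch, not pass output; the new SSA names %dest, %filled, %src, %padded are mine, the rest reuse values from the repro above):

%dest = flow.dispatch.tensor.load %1, offsets = [%arg0], sizes = [1], strides = [1] : !flow.dispatch.tensor<writeonly:tensor<7xi32>> -> tensor<1xi32>
%filled = linalg.fill ins(%c0_i32 : i32) outs(%dest : tensor<1xi32>) -> tensor<1xi32>
%src = flow.dispatch.tensor.load %0, offsets = [%4], sizes = [%7], strides = [1] : !flow.dispatch.tensor<readonly:tensor<2xi32>> -> tensor<?xi32>
%padded = tensor.insert_slice %src into %filled[%2] [%7] [1] : tensor<?xi32> into tensor<1xi32>
flow.dispatch.tensor.store %padded, %1, offsets = [%arg0], sizes = [1], strides = [1] : tensor<1xi32> -> !flow.dispatch.tensor<writeonly:tensor<7xi32>>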

Note: You are currently seeing this error message during EliminateEmptyTensors. That's because that pass reuses the bufferization analysis under the hood. If you were to remove that pass from your pipeline, it would still fail with the same error message when One-Shot Bufferize runs.

@MaheshRavishankar
Contributor Author

Actually I just realized that I am not using pad vectorization, so I'll circle back to this if it is still an issue then. Sorry for the noise.

@MaheshRavishankar
Contributor Author

MaheshRavishankar commented Jan 6, 2023

@matthias-springer thanks for the response earlier. I finally got around to parsing what you had. A couple of things:

  1. Moving the flow.dispatch.tensor.store around this way is fraught; it opens up a can of worms that is hard to reason about in general. Just looking at the scf.if:
      %10 = scf.if %8 -> (tensor<1xi32>) {
        %generated = tensor.generate  {
        ^bb0(%arg1: index):
          tensor.yield %c0_i32 : i32
        } : tensor<1xi32>
        scf.yield %generated : tensor<1xi32>
      } else {
        %11 = flow.dispatch.tensor.load %0, offsets = [%4], sizes = [%7], strides = [1] : !flow.dispatch.tensor<readonly:tensor<2xi32>> -> tensor<?xi32>
        %padded = tensor.pad %11 low[%2] high[%9] {
        ^bb0(%arg1: index):
          tensor.yield %c0_i32 : i32
        } : tensor<?xi32> to tensor<1xi32>
        scf.yield %padded : tensor<1xi32>
      }

I would expect %10, %generated, and %padded to be marked as "aliased" (or whatever term upstream bufferization uses). By the semantics of scf.if, these tensors are all effectively the same buffer. With that, you can then convert the tensor.generate into just a linalg.fill of 0's into the result buffer. We already know %10 writes into the result buffer, so that should avoid the temporary buffer?

  2. Soon, I am going to begin the work of dropping the use of sliced flow.dispatch.tensor.load/store. I can do this by basically using scf.foreach_thread for the first-level tile and distribute in IREE. Would that make this easier?

@matthias-springer
Contributor

That could work. There are three things we have to do:

Step 1: Decompose tensor.generate into tensor.empty + linalg.generic to fill the tensor. Decompose tensor.pad into tensor.empty + linalg.fill + tensor.insert_slice.

      %10 = scf.if %8 -> (tensor<1xi32>) {
        %empty1 = tensor.empty() : tensor<1xi32>
        %generated = linalg.generic outs(%empty1) ....
        scf.yield %generated : tensor<1xi32>
      } else {
        %11 = flow.dispatch.tensor.load %0, offsets = [%4], sizes = [%7], strides = [1] : !flow.dispatch.tensor<readonly:tensor<2xi32>> -> tensor<?xi32>
        %empty2 = tensor.empty() : tensor<1xi32>
        %filled = linalg.fill ins(%c0_i32 : i32) outs(%empty2 : tensor<1xi32>) -> tensor<1xi32>
        %padded = tensor.insert_slice %11 into %filled
        scf.yield %padded : tensor<1xi32>
      }
      flow.dispatch.tensor.store %10, %1, offsets = [%arg0], sizes = [1], strides = [1] : tensor<1xi32> -> !flow.dispatch.tensor<writeonly:tensor<7xi32>>

Step 2: Minor update to --iree-eliminate-empty-tensors so that it handles cases where a value may originate from one of multiple tensor.empty ops. Then, after running empty tensor elimination:

      %10 = scf.if %8 -> (tensor<1xi32>) {
        %empty1 = flow.dispatch.tensor.load %1 ...
        %generated = linalg.generic outs(%empty1) ....
        scf.yield %generated : tensor<1xi32>
      } else {
        %11 = flow.dispatch.tensor.load %0, offsets = [%4], sizes = [%7], strides = [1] : !flow.dispatch.tensor<readonly:tensor<2xi32>> -> tensor<?xi32>
        %empty2 = flow.dispatch.tensor.load %1 ...
        %filled = linalg.fill ins(%c0_i32 : i32) outs(%empty2 : tensor<1xi32>) -> tensor<1xi32>
        %padded = tensor.insert_slice %11 into %filled
        scf.yield %padded : tensor<1xi32>
      }
      flow.dispatch.tensor.store %10, %1, offsets = [%arg0], sizes = [1], strides = [1] : tensor<1xi32> -> !flow.dispatch.tensor<writeonly:tensor<7xi32>>

Now we bufferize.

%10 = scf.if ... -> (memref<1xi32>) {
  %empty1 = memref.subview %1
  ...
  scf.yield %empty1
} else {
  %empty2 = memref.subview %1
  ...
  scf.yield %empty2
}
%sv = memref.subview %1 ...
memref.copy %10, %sv

Step 3: Add an scf.if canonicalization that hoists the memref.subviews, CSEs them, and folds away the scf.if result. This may already be implemented (at least the folding of scf.if results). Then we get this IR:

%sv = memref.subview %1
%10 = scf.if {
  ...
} else {
  ...
}
memref.copy %sv, %sv

Now the memref.copy folds away. I can give it a try next week.

RE scf.foreach_thread: I think that won't make a difference here. But it would allow us to remove --iree-eliminate-empty-tensors and instead use the pass in MLIR core.

@MaheshRavishankar
Contributor Author

@matthias-springer any updates on this? If this is going to take longer, I can add an IREE-specific WAR for now.

@matthias-springer
Contributor

I'm working on this, but there's one detail that I overlooked. We need to extend BufferizableOpInterface to keep track of equivalent OpOperands properly. This is a pretty big change. I think I should be able to finish this by the end of this week. If you have free cycles, we also still need lowering patterns for tensor.generate and tensor.pad that could be written in parallel. (I would put it somewhere in Dialect/Tensor/Transforms in MLIR.)

@MaheshRavishankar
Contributor Author

If you have free cycles, we also still need lowering patterns for tensor.generate and tensor.pad that could be written in parallel.

How about changing the lowering itself to not use tensor.generate? Maybe pad needs to be generalized. @vmurali FYI in case you have cycles before I get to it.

@matthias-springer
Contributor

matthias-springer commented Jan 19, 2023

I finished the implementation for Step 2: https://reviews.llvm.org/D142130. It turned out to be a larger change and the entire stack is needed. Working on Step 1 now.

@matthias-springer
Contributor

How about changing the lowering itself to not use tensor.generate?

Which lowering are you referring to? Do you mean not producing the tensor.generate in IREE in the first place?

@MaheshRavishankar
Contributor Author

How about changing the lowering itself to not use tensor.generate?

Which lowering are you referring to? Do you mean not producing the tensor.generate in IREE in the first place?

The tensor.generate is produced by the upstream implementation of TilingInterface for tensor.pad.

grypp added a commit to grypp/iree that referenced this issue Jan 19, 2023
This is a WIP PR that shows a first attempt at data tiling for GPUs. It implements materialization for the encoding.

Note that the PR cannot compile any program, because it generates `tensor.pad` and we don't know how to tile it yet. iree-org#10184 can be enabled to tile `tensor.pad`, but then it results in the bufferization problem iree-org#11273
@matthias-springer
Contributor

matthias-springer commented Jan 20, 2023

Step 1 is also done: stack of https://reviews.llvm.org/D142207.

Now it should bufferize but there is still a self-copy.

E.g.,

  func.func @eleminate_multiple_ops(%arg0: memref<?xf32>, %arg1: index, %arg2: i1) -> memref<?xf32> {
    %cst = arith.constant 1.000000e+00 : f32
    %cst_0 = arith.constant 0.000000e+00 : f32
    %0 = scf.if %arg2 -> (memref<?xf32, strided<[1], offset: 42>>) {
      %subview_1 = memref.subview %arg0[42] [%arg1] [1] : memref<?xf32> to memref<?xf32, strided<[1], offset: 42>>
      linalg.fill ins(%cst_0 : f32) outs(%subview_1 : memref<?xf32, strided<[1], offset: 42>>)
      scf.yield %subview_1 : memref<?xf32, strided<[1], offset: 42>>
    } else {
      %subview_1 = memref.subview %arg0[42] [%arg1] [1] : memref<?xf32> to memref<?xf32, strided<[1], offset: 42>>
      linalg.fill ins(%cst : f32) outs(%subview_1 : memref<?xf32, strided<[1], offset: 42>>)
      scf.yield %subview_1 : memref<?xf32, strided<[1], offset: 42>>
    }
    %subview = memref.subview %arg0[42] [%arg1] [1] : memref<?xf32> to memref<?xf32, strided<[1], offset: 42>>
    memref.copy %0, %subview : memref<?xf32, strided<[1], offset: 42>> to memref<?xf32, strided<[1], offset: 42>>
    return %arg0 : memref<?xf32>
  }

Unfortunately, the memref.subview does not CSE with -cse. Actually I'm not quite sure how to simplify this. In general, hoisting the memref.subview from an scf.if may not be safe.

I could write a hoisting transformation that looks for identical ops in an scf.if on the then and the else branch. Does that sound reasonable?

@MaheshRavishankar
Contributor Author

Step 1 is also done: stack of reviews.llvm.org/D142207.

Are you waiting for me to review? Maybe someone with more context can. I can take a look if you want me to.

I could write a hoisting transformation that looks for identical ops in an scf.if on the then and the else branch. Does that sound reasonable?

Easier might be to just write a hoisting of ops from either the if or the else branch. Then CSE should be able to take care of the rest and get you to the form you want.

@matthias-springer
Contributor

Easier might be to just write a hoisting of ops from either the if or the else branch. Then CSE should be able to take care of the rest and get you to the form you want.

Is that safe in general? The scf.if may guard the memref.subview so that it is not executed under certain conditions because it may otherwise compute an invalid subview (e.g., index out of bounds). I have not followed this discussion in detail. Not sure what the semantics of memref.subview are when it computes an invalid subview; maybe it would assert at runtime.

@MaheshRavishankar
Contributor Author

subview is supposed to be just pointer arithmetic, so it should be hoistable. I think you can label it speculatable: you can always compute an out-of-bounds pointer; the error only happens when you use it. So I would say marking it speculatable and hoisting it should be fine.
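
As a rough illustration of that argument (a hand-written fragment reusing the names from the example above, not output of any pass): the memref.subview, being pure offset/stride arithmetic, is hoisted, while the potentially out-of-bounds use stays under the guard.

%subview = memref.subview %arg0[42] [%arg1] [1] : memref<?xf32> to memref<?xf32, strided<[1], offset: 42>>
scf.if %arg2 {
  // Only the *use* of the subview needs to be guarded.
  linalg.fill ins(%cst_0 : f32) outs(%subview : memref<?xf32, strided<[1], offset: 42>>)
}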

@matthias-springer
Contributor

Step 3: https://reviews.llvm.org/D142480. This is a general CSE for scf.if branches. It can hoist+CSE anything, not just memref.subview.
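
To illustrate with a hand-written before/after example (my sketch, not taken from that revision): identical leading ops in the "then" and "else" regions are hoisted in front of the scf.if and merged, regardless of what kind of op they are.

// before
func.func @branch_cse(%cond: i1, %x: index, %y: index) -> index {
  %r = scf.if %cond -> (index) {
    %a = arith.addi %x, %y : index
    %s = arith.addi %a, %x : index
    scf.yield %s : index
  } else {
    %a = arith.addi %x, %y : index
    %s = arith.muli %a, %y : index
    scf.yield %s : index
  }
  return %r : index
}

// after hoisting + CSE of the common leading arith.addi
func.func @branch_cse(%cond: i1, %x: index, %y: index) -> index {
  %a = arith.addi %x, %y : index
  %r = scf.if %cond -> (index) {
    %s = arith.addi %a, %x : index
    scf.yield %s : index
  } else {
    %s = arith.muli %a, %y : index
    scf.yield %s : index
  }
  return %r : index
}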

@MaheshRavishankar
Contributor Author

@matthias-springer while you land these upstream, are there patches I can import locally to fix this issue, so that I can look into other issues that might be happening?

@matthias-springer
Contributor

If you patch in this stack of changes (https://reviews.llvm.org/D142130), it should work. Optionally also this change (https://reviews.llvm.org/D142480) to CSE the memref.subviews.

@MaheshRavishankar
Contributor Author

@matthias-springer thanks for all the great work. I patched all your changes (I pushed them to https://github.com/iree-org/iree/tree/pad_fusion , you should be able to check those out and build it locally). I still see errors in bufferization

// -----// IR Dump Before EmptyTensorToAllocTensor (empty-tensor-to-alloc-tensor) //----- //
module {
  func.func @_pad_1D_test_dispatch_0() {
    %cst = arith.constant dense<0> : vector<1xi32>
    %c7 = arith.constant 7 : index
    %c0 = arith.constant 0 : index
    %c0_i32 = arith.constant 0 : i32
    %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<2xi32>>
    %1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c0) : !flow.dispatch.tensor<writeonly:tensor<7xi32>>
    %workgroup_id_x = hal.interface.workgroup.id[0] : index
    %workgroup_count_x = hal.interface.workgroup.count[0] : index
    scf.for %arg0 = %workgroup_id_x to %c7 step %workgroup_count_x {
      %2 = affine.max affine_map<(d0) -> (-d0 + 3, 0)>(%arg0)
      %3 = affine.max affine_map<(d0) -> (0, d0 - 3)>(%arg0)
      %4 = affine.min affine_map<(d0) -> (2, d0)>(%3)
      %5 = affine.max affine_map<(d0) -> (0, d0 - 2)>(%arg0)
      %6 = affine.min affine_map<(d0) -> (2, d0)>(%5)
      %7 = affine.apply affine_map<(d0, d1) -> (d0 - d1)>(%6, %4)
      %8 = arith.cmpi eq, %7, %c0 : index
      %9 = scf.if %8 -> (tensor<1xi32>) {
        %generated = tensor.generate  {
        ^bb0(%arg1: index):
          tensor.yield %c0_i32 : i32
        } : tensor<1xi32>
        scf.yield %generated : tensor<1xi32>
      } else {
        %10 = flow.dispatch.tensor.load %0, offsets = [%4], sizes = [%7], strides = [1] : !flow.dispatch.tensor<readonly:tensor<2xi32>> -> tensor<?xi32>
        %dim = tensor.dim %10, %c0 : tensor<?xi32>
        %11 = affine.apply affine_map<()[s0, s1] -> (s0 + s1)>()[%2, %dim]
        %12 = arith.cmpi sle, %2, %c0 : index
        %13 = arith.cmpi sgt, %11, %c0 : index
        %14 = arith.andi %12, %13 : i1
        %15 = affine.apply affine_map<()[s0, s1] -> (s0 - s1)>()[%c0, %2]
        %16 = scf.if %14 -> (vector<1xi32>) {
          %19 = vector.transfer_read %10[%15], %c0_i32 {in_bounds = [true]} : tensor<?xi32>, vector<1xi32>
          scf.yield %19 : vector<1xi32>
        } else {
          scf.yield %cst : vector<1xi32>
        }
        %17 = tensor.empty() : tensor<1xi32>
        %18 = vector.transfer_write %16, %17[%c0] {in_bounds = [true]} : vector<1xi32>, tensor<1xi32>
        scf.yield %18 : tensor<1xi32>
      }
      flow.dispatch.tensor.store %9, %1, offsets = [%arg0], sizes = [1], strides = [1] : tensor<1xi32> -> !flow.dispatch.tensor<writeonly:tensor<7xi32>>
    }
    return
  }
}

After running the empty-tensor-to-alloc-tensor pass and the iree-codegen-iree-comprehensive-bufferize pass, I get the following error:

/usr/local/google/home/ravishankarm/iree/iree/tests/e2e/tosa_ops/pad.mlir:4:15: error: operand #0 may return/yield a new buffer allocation
    %result = "tosa.pad"(%0, %1)  : (tensor<2xi32>, tensor<1x2xi32>)  -> (tensor<7xi32>)
              ^
/usr/local/google/home/ravishankarm/iree/iree/tests/e2e/tosa_ops/pad.mlir:4:15: error: operand #0 may return/yield a new buffer allocation
    %result = "tosa.pad"(%0, %1)  : (tensor<2xi32>, tensor<1x2xi32>)  -> (tensor<7xi32>)
              ^
// -----// IR Dump After IREEComprehensiveBufferize Failed (iree-codegen-iree-comprehensive-bufferize) //----- //
module {
  func.func @_pad_1D_test_dispatch_0() {
    %cst = arith.constant dense<0> : vector<1xi32>
    %c7 = arith.constant 7 : index
    %c0 = arith.constant 0 : index
    %c0_i32 = arith.constant 0 : i32
    %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<2xi32>>
    %1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c0) : !flow.dispatch.tensor<writeonly:tensor<7xi32>>
    %workgroup_id_x = hal.interface.workgroup.id[0] : index
    %workgroup_count_x = hal.interface.workgroup.count[0] : index
    scf.for %arg0 = %workgroup_id_x to %c7 step %workgroup_count_x {
      %2 = affine.max affine_map<(d0) -> (-d0 + 3, 0)>(%arg0)
      %3 = affine.max affine_map<(d0) -> (0, d0 - 3)>(%arg0)
      %4 = affine.min affine_map<(d0) -> (2, d0)>(%3)
      %5 = affine.max affine_map<(d0) -> (0, d0 - 2)>(%arg0)
      %6 = affine.min affine_map<(d0) -> (2, d0)>(%5)
      %7 = affine.apply affine_map<(d0, d1) -> (d0 - d1)>(%6, %4)
      %8 = arith.cmpi eq, %7, %c0 : index
      %9 = scf.if %8 -> (tensor<1xi32>) {
        %generated = tensor.generate  {
        ^bb0(%arg1: index):
          tensor.yield %c0_i32 : i32
        } : tensor<1xi32>
        scf.yield %generated : tensor<1xi32>
      } else {
        %10 = flow.dispatch.tensor.load %0, offsets = [%4], sizes = [%7], strides = [1] : !flow.dispatch.tensor<readonly:tensor<2xi32>> -> tensor<?xi32>
        %dim = tensor.dim %10, %c0 : tensor<?xi32>
        %11 = affine.apply affine_map<()[s0, s1] -> (s0 + s1)>()[%2, %dim]
        %12 = arith.cmpi sle, %2, %c0 : index
        %13 = arith.cmpi sgt, %11, %c0 : index
        %14 = arith.andi %12, %13 : i1
        %15 = affine.apply affine_map<()[s0, s1] -> (s0 - s1)>()[%c0, %2]
        %16 = scf.if %14 -> (vector<1xi32>) {
          %19 = vector.transfer_read %10[%15], %c0_i32 {in_bounds = [true]} : tensor<?xi32>, vector<1xi32>
          scf.yield %19 : vector<1xi32>
        } else {
          scf.yield %cst : vector<1xi32>
        }
        %17 = bufferization.alloc_tensor() : tensor<1xi32>
        %18 = vector.transfer_write %16, %17[%c0] {in_bounds = [true]} : vector<1xi32>, tensor<1xi32>
        scf.yield %18 : tensor<1xi32>
      }
      flow.dispatch.tensor.store %9, %1, offsets = [%arg0], sizes = [1], strides = [1] : tensor<1xi32> -> !flow.dispatch.tensor<writeonly:tensor<7xi32>>
    }
    return
  }
}

To help repro you can also start with

hal.executable private @repro {
hal.executable.variant public @embedded_elf_x86_64, target = <"llvm-cpu", "embedded-elf-x86_64", {cpu = "generic", cpu_features = "", data_layout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128", native_vector_size = 16 : index, target_triple = "x86_64-unknown-unknown-eabi-elf"}> {
  hal.executable.export public @_pad_1D_test_dispatch_0 ordinal(0) layout(#hal.pipeline.layout<push_constants = 0, sets = [<0, bindings = [<0, storage_buffer, ReadOnly>, <1, storage_buffer>]>]>) {
  ^bb0(%arg0: !hal.device, %arg1: index):
    %x, %y, %z = flow.dispatch.workgroup_count_from_dag_root %arg1
    hal.return %x, %y, %z : index, index, index
  }
  builtin.module {
    func.func @_pad_1D_test_dispatch_0() {
      %c0 = arith.constant 0 : index
      %c3 = arith.constant 3 : index
      %c2 = arith.constant 2 : index
      %c0_i32 = arith.constant 0 : i32
      %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<2xi32>>
      %1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c0) : !flow.dispatch.tensor<writeonly:tensor<7xi32>>
      %2 = flow.dispatch.tensor.load %0, offsets = [0], sizes = [2], strides = [1] : !flow.dispatch.tensor<readonly:tensor<2xi32>> -> tensor<2xi32>
      %padded = tensor.pad %2 low[%c3] high[%c2] {
      ^bb0(%arg0: index):
        tensor.yield %c0_i32 : i32
      } : tensor<2xi32> to tensor<7xi32>
      flow.dispatch.tensor.store %padded, %1, offsets = [0], sizes = [7], strides = [1] : tensor<7xi32> -> !flow.dispatch.tensor<writeonly:tensor<7xi32>>
      return
    }
  }
}
}

and run iree-opt --pass-pipeline='hal.executable(hal.executable.variant(iree-codegen-iree-comprehensive-bufferize))' repro.mlir

@matthias-springer
Contributor

The tensor.generate is still there. Can you run these patterns before bufferization (in particular before EmptyTensorElimination): test-linalg-transform-patterns=test-convert-to-destination-style-patterns

@MaheshRavishankar
Contributor Author

Do we have to do this with the transform dialect? Is there a populate method I can use?

@MaheshRavishankar
Contributor Author

Found it. There is still a failure in IREEComprehensiveBufferizePass. IR before the failure:

// -----// IR Dump Before IREEComprehensiveBufferize (iree-codegen-iree-comprehensive-bufferize) //----- //
module {
  func.func @_pad_1D_test_dispatch_0() {
    %cst = arith.constant dense<0> : vector<1xi32>
    %c7 = arith.constant 7 : index
    %c0 = arith.constant 0 : index
    %c0_i32 = arith.constant 0 : i32
    %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<2xi32>>
    %1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c0) : !flow.dispatch.tensor<writeonly:tensor<7xi32>>
    %workgroup_id_x = hal.interface.workgroup.id[0] : index
    %workgroup_count_x = hal.interface.workgroup.count[0] : index
    scf.for %arg0 = %workgroup_id_x to %c7 step %workgroup_count_x {
      %2 = affine.max affine_map<(d0) -> (-d0 + 3, 0)>(%arg0)
      %3 = affine.max affine_map<(d0) -> (0, d0 - 3)>(%arg0)
      %4 = affine.min affine_map<(d0) -> (2, d0)>(%3)
      %5 = affine.max affine_map<(d0) -> (0, d0 - 2)>(%arg0)
      %6 = affine.min affine_map<(d0) -> (2, d0)>(%5)
      %7 = affine.apply affine_map<(d0, d1) -> (d0 - d1)>(%6, %4)
      %8 = arith.cmpi eq, %7, %c0 : index
      %9 = scf.if %8 -> (tensor<1xi32>) {
        %10 = bufferization.alloc_tensor() : tensor<1xi32>
        %11 = linalg.generic {indexing_maps = [affine_map<(d0) -> (d0)>], iterator_types = ["parallel"]} outs(%10 : tensor<1xi32>) {
        ^bb0(%out: i32):
          linalg.yield %c0_i32 : i32
        } -> tensor<1xi32>
        scf.yield %11 : tensor<1xi32>
      } else {
        %10 = flow.dispatch.tensor.load %0, offsets = [%4], sizes = [%7], strides = [1] : !flow.dispatch.tensor<readonly:tensor<2xi32>> -> tensor<?xi32>
        %dim = tensor.dim %10, %c0 : tensor<?xi32>
        %11 = affine.apply affine_map<()[s0, s1] -> (s0 + s1)>()[%2, %dim]
        %12 = arith.cmpi sle, %2, %c0 : index
        %13 = arith.cmpi sgt, %11, %c0 : index
        %14 = arith.andi %12, %13 : i1
        %15 = affine.apply affine_map<()[s0, s1] -> (s0 - s1)>()[%c0, %2]
        %16 = scf.if %14 -> (vector<1xi32>) {
          %19 = vector.transfer_read %10[%15], %c0_i32 {in_bounds = [true]} : tensor<?xi32>, vector<1xi32>
          scf.yield %19 : vector<1xi32>
        } else {
          scf.yield %cst : vector<1xi32>
        }
        %17 = bufferization.alloc_tensor() : tensor<1xi32>
        %18 = vector.transfer_write %16, %17[%c0] {in_bounds = [true]} : vector<1xi32>, tensor<1xi32>
        scf.yield %18 : tensor<1xi32>
      }
      flow.dispatch.tensor.store %9, %1, offsets = [%arg0], sizes = [1], strides = [1] : tensor<1xi32> -> !flow.dispatch.tensor<writeonly:tensor<7xi32>>
    }
    return
  }
}

This is the error

/usr/local/google/home/ravishankarm/iree/iree/tests/e2e/tosa_ops/pad.mlir:4:15: error: operand #0 may return/yield a new buffer allocation
    %result = "tosa.pad"(%0, %1)  : (tensor<2xi32>, tensor<1x2xi32>)  -> (tensor<7xi32>)
              ^
/usr/local/google/home/ravishankarm/iree/iree/tests/e2e/tosa_ops/pad.mlir:4:15: error: operand #0 may return/yield a new buffer allocation
    %result = "tosa.pad"(%0, %1)  : (tensor<2xi32>, tensor<1x2xi32>)  -> (tensor<7xi32>)
              ^
// -----// IR Dump After IREEComprehensiveBufferize Failed (iree-codegen-iree-comprehensive-bufferize) //----- //
module {
  func.func @_pad_1D_test_dispatch_0() {
    %cst = arith.constant dense<0> : vector<1xi32>
    %c7 = arith.constant 7 : index
    %c0 = arith.constant 0 : index
    %c0_i32 = arith.constant 0 : i32
    %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<2xi32>>
    %1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c0) : !flow.dispatch.tensor<writeonly:tensor<7xi32>>
    %workgroup_id_x = hal.interface.workgroup.id[0] : index
    %workgroup_count_x = hal.interface.workgroup.count[0] : index
    scf.for %arg0 = %workgroup_id_x to %c7 step %workgroup_count_x {
      %2 = affine.max affine_map<(d0) -> (-d0 + 3, 0)>(%arg0)
      %3 = affine.max affine_map<(d0) -> (0, d0 - 3)>(%arg0)
      %4 = affine.min affine_map<(d0) -> (2, d0)>(%3)
      %5 = affine.max affine_map<(d0) -> (0, d0 - 2)>(%arg0)
      %6 = affine.min affine_map<(d0) -> (2, d0)>(%5)
      %7 = affine.apply affine_map<(d0, d1) -> (d0 - d1)>(%6, %4)
      %8 = arith.cmpi eq, %7, %c0 : index
      %9 = scf.if %8 -> (tensor<1xi32>) {
        %10 = bufferization.alloc_tensor() : tensor<1xi32>
        %11 = linalg.generic {indexing_maps = [affine_map<(d0) -> (d0)>], iterator_types = ["parallel"]} outs(%10 : tensor<1xi32>) {
        ^bb0(%out: i32):
          linalg.yield %c0_i32 : i32
        } -> tensor<1xi32>
        scf.yield %11 : tensor<1xi32>
      } else {
        %10 = flow.dispatch.tensor.load %0, offsets = [%4], sizes = [%7], strides = [1] : !flow.dispatch.tensor<readonly:tensor<2xi32>> -> tensor<?xi32>
        %dim = tensor.dim %10, %c0 : tensor<?xi32>
        %11 = affine.apply affine_map<()[s0, s1] -> (s0 + s1)>()[%2, %dim]
        %12 = arith.cmpi sle, %2, %c0 : index
        %13 = arith.cmpi sgt, %11, %c0 : index
        %14 = arith.andi %12, %13 : i1
        %15 = affine.apply affine_map<()[s0, s1] -> (s0 - s1)>()[%c0, %2]
        %16 = scf.if %14 -> (vector<1xi32>) {
          %19 = vector.transfer_read %10[%15], %c0_i32 {in_bounds = [true]} : tensor<?xi32>, vector<1xi32>
          scf.yield %19 : vector<1xi32>
        } else {
          scf.yield %cst : vector<1xi32>
        }
        %17 = bufferization.alloc_tensor() : tensor<1xi32>
        %18 = vector.transfer_write %16, %17[%c0] {in_bounds = [true]} : vector<1xi32>, tensor<1xi32>
        scf.yield %18 : tensor<1xi32>
      }
      flow.dispatch.tensor.store %9, %1, offsets = [%arg0], sizes = [1], strides = [1] : tensor<1xi32> -> !flow.dispatch.tensor<writeonly:tensor<7xi32>>
    }
    return
  }
}

MaheshRavishankar pushed a commit to MaheshRavishankar/iree that referenced this issue Feb 28, 2023
It seems like handling the code generated by the tiling of pad
operations needs more work in bufferization. To unblock the work of
handling pad operations natively in IREE,
iree-org#11273 (comment)
is implemented here as a workaround.

To ensure bufferization without allocation, yields of the then and
else branch and the result of the `scf.if` are all tied together. If
the `then` and `else` come from different bindings, then this would be
illegal (because a copy is needed). This example led to adding more
constraints on what sets can be merged during the
`BufferizationAnalysis` to avoid merging sets that have constants or
have two different `interface_bindings`.
MaheshRavishankar added a commit that referenced this issue Mar 2, 2023
qedawkins pushed a commit to qedawkins/iree that referenced this issue Apr 2, 2023
@MaheshRavishankar
Contributor Author

Moved to P2 for now.

jpienaar pushed a commit that referenced this issue May 1, 2023
rengolin pushed a commit to plaidml/iree that referenced this issue May 2, 2023
NatashaKnk pushed a commit to NatashaKnk/iree that referenced this issue Jul 6, 2023