
EliminateEmptyTensors pass fails #11273

Open
MaheshRavishankar opened this issue Nov 23, 2022 · 22 comments

@MaheshRavishankar
Contributor

With this input

module {
  func.func @_pad_1D_test_dispatch_0() {
    %c7 = arith.constant 7 : index
    %c0 = arith.constant 0 : index
    %c0_i32 = arith.constant 0 : i32
    %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) offset(%c0) alignment(64) : !flow.dispatch.tensor<readonly:tensor<2xi32>>
    %1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) offset(%c0) alignment(64) : !flow.dispatch.tensor<writeonly:tensor<7xi32>>
    %workgroup_id_x = hal.interface.workgroup.id[0] : index
    %workgroup_count_x = hal.interface.workgroup.count[0] : index
    scf.for %arg0 = %workgroup_id_x to %c7 step %workgroup_count_x {
      %2 = affine.max affine_map<(d0) -> (-d0 + 3, 0)>(%arg0)
      %3 = affine.max affine_map<(d0) -> (0, d0 - 3)>(%arg0)
      %4 = affine.min affine_map<(d0) -> (2, d0)>(%3)
      %5 = affine.max affine_map<(d0) -> (0, d0 - 2)>(%arg0)
      %6 = affine.min affine_map<(d0) -> (2, d0)>(%5)
      %7 = affine.apply affine_map<(d0, d1) -> (d0 - d1)>(%6, %4)
      %8 = arith.cmpi eq, %7, %c0 : index
      %9 = affine.apply affine_map<(d0, d1, d2) -> (-d0 - d1 + d2 + 1)>(%2, %6, %4)
      %10 = scf.if %8 -> (tensor<1xi32>) {
        %generated = tensor.generate  {
        ^bb0(%arg1: index):
          tensor.yield %c0_i32 : i32
        } : tensor<1xi32>
        scf.yield %generated : tensor<1xi32>
      } else {
        %11 = flow.dispatch.tensor.load %0, offsets = [%4], sizes = [%7], strides = [1] : !flow.dispatch.tensor<readonly:tensor<2xi32>> -> tensor<?xi32>
        %padded = tensor.pad %11 low[%2] high[%9] {
        ^bb0(%arg1: index):
          tensor.yield %c0_i32 : i32
        } : tensor<?xi32> to tensor<1xi32>
        scf.yield %padded : tensor<1xi32>
      }
      flow.dispatch.tensor.store %10, %1, offsets = [%arg0], sizes = [1], strides = [1] : tensor<1xi32> -> !flow.dispatch.tensor<writeonly:tensor<7xi32>>
    }
    return
  }
}
iree-opt --iree-eliminate-empty-tensors repro.mlir

fails with the following error

$ ~/iree/build_debug/tools/iree-opt --iree-eliminate-empty-tensors elimiate_empty_tensors.repro.mlir 
elimiate_empty_tensors.repro.mlir:24:9: error: operand #0 of ReturnLike op does not satisfy destination passing style
        scf.yield %generated : tensor<1xi32>
        ^
elimiate_empty_tensors.repro.mlir:24:9: note: see current operation: "scf.yield"(%16) : (tensor<1xi32>) -> ()
elimiate_empty_tensors.repro.mlir:31:9: error: operand #0 of ReturnLike op does not satisfy destination passing style
        scf.yield %padded : tensor<1xi32>
        ^
elimiate_empty_tensors.repro.mlir:31:9: note: see current operation: "scf.yield"(%17) : (tensor<1xi32>) -> ()
@matthias-springer
Contributor

The error message is confusing; here is a cleanup to make it more descriptive: https://reviews.llvm.org/D138902
It will now print as:

elimiate_empty_tensors.repro.mlir:24:9: error: operand #0 may return/yield a new buffer allocation
        scf.yield %generated : tensor<1xi32>

This is a limitation that we currently have in One-Shot Bufferize: when you yield a new buffer allocation (both tensor.pad and tensor.generate bufferize to a new allocation), One-Shot Bufferize cannot deallocate the buffer. We could support this case by running the bufferization with allowReturnAllocs = true and createDealloc = false, followed by the MLIR BufferDeallocationPass. However, the generated code can be quite inefficient and contain buffer copies, so I wouldn't recommend doing that. (Although here it's just a tensor<1xi32>, so it wouldn't matter too much.) Also, buffer deallocation has known bugs that need to be fixed.
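
To make the limitation concrete, here is a minimal, self-contained sketch (my own example, shown in already-bufferized form, not IR from this issue) of the pattern One-Shot Bufferize rejects: both branches of an scf.if yield a buffer that was allocated inside the branch, so nothing outside the scf.if knows when that buffer can be freed.

func.func @yields_new_alloc(%cond: i1, %v: i32) -> i32 {
  %c0 = arith.constant 0 : index
  %zero = arith.constant 0 : i32
  // Each branch creates and yields a fresh allocation.
  %0 = scf.if %cond -> (memref<1xi32>) {
    %a = memref.alloc() : memref<1xi32>
    memref.store %v, %a[%c0] : memref<1xi32>
    scf.yield %a : memref<1xi32>
  } else {
    %b = memref.alloc() : memref<1xi32>
    memref.store %zero, %b[%c0] : memref<1xi32>
    scf.yield %b : memref<1xi32>
  }
  %r = memref.load %0[%c0] : memref<1xi32>
  // Nothing deallocates %0 here; -buffer-deallocation would have to track
  // ownership across the scf.if, potentially inserting clones/copies.
  return %r : i32
}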

Can we write a pass to move the flow.dispatch.tensor.store into the scf.if (and duplicate it)? That would solve the problem here. As a followup, we can then rewrite the IR to operate directly on the loaded buffer.

I.e., after moving the flow.dispatch.tensor.store inside the scf.if (only showing the "then" branch here):

%generated = tensor.generate  {
^bb0(%arg1: index):
  tensor.yield %c0_i32 : i32
} : tensor<1xi32>
flow.dispatch.tensor.store %generated, %1, offsets = [%arg0], sizes = [1], strides = [1] : tensor<1xi32> -> !flow.dispatch.tensor<writeonly:tensor<7xi32>>

At this point, we still have an alloc inside of the scf.if, but at least it can be deallocated at the end of the "then" block, so the error is gone.

Then the follow-up rewrite. This is essentially what EliminateEmptyTensors does, but it could be extended to other ops that allocate. So it would turn into something like EliminateAllocs.

%loaded = flow.dispatch.tensor.load %1, offsets = [%arg0], sizes = [1], strides = [1] : !flow.dispatch.tensor<writeonly:tensor<7xi32>> -> tensor<1xi32>
%generated = linalg.generic outs(%loaded : tensor<1xi32>) {
^bb0(%out: i32):
  linalg.yield %c0_i32 : i32
}
flow.dispatch.tensor.store %generated, %1, offsets = [%arg0], sizes = [1], strides = [1] : tensor<1xi32> -> !flow.dispatch.tensor<writeonly:tensor<7xi32>>

Now the alloc is completely gone. We basically turned the tensor.generate into a destination-passing style op.

We have to do the same for the "else" branch.
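
For reference, a hedged sketch of what the "else" branch could then look like, with the tensor.pad decomposed into a fill of the loaded destination plus a tensor.insert_slice (this is my sketch, not pass output; the new SSA names %dest, %filled, %src, %padded are mine, the rest reuse values from the repro above):

%dest = flow.dispatch.tensor.load %1, offsets = [%arg0], sizes = [1], strides = [1] : !flow.dispatch.tensor<writeonly:tensor<7xi32>> -> tensor<1xi32>
%filled = linalg.fill ins(%c0_i32 : i32) outs(%dest : tensor<1xi32>) -> tensor<1xi32>
%src = flow.dispatch.tensor.load %0, offsets = [%4], sizes = [%7], strides = [1] : !flow.dispatch.tensor<readonly:tensor<2xi32>> -> tensor<?xi32>
%padded = tensor.insert_slice %src into %filled[%2] [%7] [1] : tensor<?xi32> into tensor<1xi32>
flow.dispatch.tensor.store %padded, %1, offsets = [%arg0], sizes = [1], strides = [1] : tensor<1xi32> -> !flow.dispatch.tensor<writeonly:tensor<7xi32>>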

Note: You are currently seeing this error message during EliminateEmptyTensors. That's because that pass reuses the bufferization analysis under the hood. If you were to remove that pass from your pipeline, it would still fail with the same error message when One-Shot Bufferize runs.

@MaheshRavishankar
Contributor Author

Actually I just realized that I am not using pad vectorization, so I'll circle back to this if it is still an issue then. Sorry for the noise.

@MaheshRavishankar
Contributor Author

MaheshRavishankar commented Jan 6, 2023

@matthias-springer thanks for the response earlier. I finally got around to parsing what you had. A couple of things:

  1. Moving the flow.dispatch.tensor.store around this way is fraught; it opens up a can of worms that is hard to reason about in general. Just looking at the scf.if:
      %10 = scf.if %8 -> (tensor<1xi32>) {
        %generated = tensor.generate  {
        ^bb0(%arg1: index):
          tensor.yield %c0_i32 : i32
        } : tensor<1xi32>
        scf.yield %generated : tensor<1xi32>
      } else {
        %11 = flow.dispatch.tensor.load %0, offsets = [%4], sizes = [%7], strides = [1] : !flow.dispatch.tensor<readonly:tensor<2xi32>> -> tensor<?xi32>
        %padded = tensor.pad %11 low[%2] high[%9] {
        ^bb0(%arg1: index):
          tensor.yield %c0_i32 : i32
        } : tensor<?xi32> to tensor<1xi32>
        scf.yield %padded : tensor<1xi32>
      }

I would expect %10, %generated, and %padded to be marked as "aliased" (or whatever term upstream bufferization uses). By the semantics of scf.if, these tensors are all effectively the same buffer. With that, you can then convert the tensor.generate into just a linalg.fill of 0's into the result buffer. We already know %10 writes into the result buffer, so that should avoid the temporary buffer?

  2. Soon, I am going to begin the work of dropping the use of sliced flow.dispatch.tensor.load/store. I can do this by basically using scf.foreach_thread for the first-level tile and distribute in IREE. Would that make this easier?

@matthias-springer
Contributor

That could work. There are three things we have to do:

Step 1: Decompose tensor.generate into tensor.empty + linalg.generic to fill the tensor. Decompose tensor.pad into tensor.empty + linalg.fill + tensor.insert_slice.

      %10 = scf.if %8 -> (tensor<1xi32>) {
        %empty1 = tensor.empty() : tensor<1xi32>
        %generated = linalg.generic outs(%empty1) ....
        scf.yield %generated : tensor<1xi32>
      } else {
        %11 = flow.dispatch.tensor.load %0, offsets = [%4], sizes = [%7], strides = [1] : !flow.dispatch.tensor<readonly:tensor<2xi32>> -> tensor<?xi32>
        %empty2 = tensor.empty() : tensor<1xi32>
        %filled = linalg.fill ins(%c0_i32 : i32) outs(%empty2 : tensor<1xi32>) -> tensor<1xi32>
        %padded = tensor.insert_slice %11 into %filled
        scf.yield %padded : tensor<1xi32>
      }
      flow.dispatch.tensor.store %10, %1, offsets = [%arg0], sizes = [1], strides = [1] : tensor<1xi32> -> !flow.dispatch.tensor<writeonly:tensor<7xi32>>

Step 2: Minor update to --iree-eliminate-empty-tensors so that it handles cases where a value may originate from one of multiple tensor.empty ops. Then, after running empty tensor elimination:

      %10 = scf.if %8 -> (tensor<1xi32>) {
        %empty1 = flow.dispatch.tensor.load %1 ...
        %generated = linalg.generic outs(%empty1) ....
        scf.yield %generated : tensor<1xi32>
      } else {
        %11 = flow.dispatch.tensor.load %0, offsets = [%4], sizes = [%7], strides = [1] : !flow.dispatch.tensor<readonly:tensor<2xi32>> -> tensor<?xi32>
        %empty2 = flow.dispatch.tensor.load %1 ...
        %filled = linalg.fill ins(%c0_i32 : i32) outs(%empty2 : tensor<1xi32>) -> tensor<1xi32>
        %padded = tensor.insert_slice %11 into %filled
        scf.yield %padded : tensor<1xi32>
      }
      flow.dispatch.tensor.store %10, %1, offsets = [%arg0], sizes = [1], strides = [1] : tensor<1xi32> -> !flow.dispatch.tensor<writeonly:tensor<7xi32>>

Now we bufferize.

%10 = scf.if ... -> (memref<1xi32>) {
  %empty1 = memref.subview %1
  ...
  scf.yield %empty1
} else {
  %empty2 = memref.subview %1
  ...
  scf.yield %empty2
}
%sv = memref.subview %1 ...
memref.copy %10, %sv

Step 3: Add an scf.if canonicalization that hoists the memref.subviews, CSEs them, and folds away the scf.if result. This may already be implemented (at least the folding of scf.if results). Then we get this IR:

%sv = memref.subview %1
%10 = scf.if {
  ...
} else {
  ...
}
memref.copy %sv, %sv

Now the memref.copy folds away. I can give it a try next week.

RE scf.foreach_thread: I think that won't make a difference here. But it would allow us to remove --iree-eliminate-empty-tensors and instead use the pass in MLIR core.

@MaheshRavishankar
Contributor Author

@matthias-springer any updates on this? If this is going to take longer, I can add an IREE-specific WAR for now.

@matthias-springer
Contributor

I'm working on this, but there's one detail that I overlooked. We need to extend BufferizableOpInterface to keep track of equivalent OpOperands properly. This is a pretty big change. I think I should be able to finish this by the end of this week. If you have free cycles, we also still need lowering patterns for tensor.generate and tensor.pad that could be written in parallel. (I would put it somewhere in Dialect/Tensor/Transforms in MLIR.)

@MaheshRavishankar
Contributor Author

If you have free cycles, we also still need lowering patterns for tensor.generate and tensor.pad that could be written in parallel.

How about changing the lowering itself to not use tensor.generate? Maybe pad needs to be generalized. @vmurali FYI in case you have cycles before I get to it.

@matthias-springer
Contributor

matthias-springer commented Jan 19, 2023

I finished the implementation for Step 2: https://reviews.llvm.org/D142130. It turned out to be a larger change and the entire stack is needed. Working on Step 1 now.

@matthias-springer
Contributor

How about changing the lowering itself to not use tensor.generate?

Which lowering are you referring to? Do you mean not producing the tensor.generate in IREE in the first place?

@MaheshRavishankar
Contributor Author

How about changing the lowering itself to not use tensor.generate?

Which lowering are you referring to? Do you mean not producing the tensor.generate in IREE in the first place?

The tensor.generate is produced by the upstream implementation of TilingInterface for tensor.pad.

grypp added a commit to grypp/iree that referenced this issue Jan 19, 2023
This is a WIP PR that shows a first attempt at data tiling for GPUs. It implements materialization for the encoding.

Note that the PR cannot compile any program, because it generates `tensor.pad` and we don't know how to tile it yet. iree-org#10184 can be enabled to tile `tensor.pad`, but then it results in the bufferization problem iree-org#11273
@matthias-springer
Contributor

matthias-springer commented Jan 20, 2023

Step 1 is also done: stack of https://reviews.llvm.org/D142207.

Now it should bufferize but there is still a self-copy.

E.g.,

  func.func @eleminate_multiple_ops(%arg0: memref<?xf32>, %arg1: index, %arg2: i1) -> memref<?xf32> {
    %cst = arith.constant 1.000000e+00 : f32
    %cst_0 = arith.constant 0.000000e+00 : f32
    %0 = scf.if %arg2 -> (memref<?xf32, strided<[1], offset: 42>>) {
      %subview_1 = memref.subview %arg0[42] [%arg1] [1] : memref<?xf32> to memref<?xf32, strided<[1], offset: 42>>
      linalg.fill ins(%cst_0 : f32) outs(%subview_1 : memref<?xf32, strided<[1], offset: 42>>)
      scf.yield %subview_1 : memref<?xf32, strided<[1], offset: 42>>
    } else {
      %subview_1 = memref.subview %arg0[42] [%arg1] [1] : memref<?xf32> to memref<?xf32, strided<[1], offset: 42>>
      linalg.fill ins(%cst : f32) outs(%subview_1 : memref<?xf32, strided<[1], offset: 42>>)
      scf.yield %subview_1 : memref<?xf32, strided<[1], offset: 42>>
    }
    %subview = memref.subview %arg0[42] [%arg1] [1] : memref<?xf32> to memref<?xf32, strided<[1], offset: 42>>
    memref.copy %0, %subview : memref<?xf32, strided<[1], offset: 42>> to memref<?xf32, strided<[1], offset: 42>>
    return %arg0 : memref<?xf32>
  }

Unfortunately, the memref.subview does not CSE with -cse. Actually I'm not quite sure how to simplify this. In general, hoisting the memref.subview from an scf.if may not be safe.

I could write a hoisting transformation that looks for identical ops in an scf.if on the then and the else branch. Does that sound reasonable?

@MaheshRavishankar
Contributor Author

Step 1 is also done: stack of reviews.llvm.org/D142207.

Are you waiting for me to review? Maybe someone with more context can. I can take a look if you want me to.

I could write a hoisting transformation that looks for identical ops in an scf.if on the then and the else branch. Does that sound reasonable?

Easier might be to just write a hoisting of ops from either the if or the else branch. Then CSE should be able to take care of the rest and get you to the form you want.

@matthias-springer
Contributor

Easier might be to just write a hoisting of ops from either the if or the else branch. Then CSE should be able to take care of the rest and get you to the form you want.

Is that safe in general? The scf.if may guard the memref.subview so that it is not executed under certain conditions because it may otherwise compute an invalid subview (e.g., index out of bounds). I have not followed this discussion in detail. Not sure what the semantics of memref.subview are when it computes an invalid subview; maybe it would assert at runtime.

@MaheshRavishankar
Contributor Author

subview is supposed to be just pointer arithmetic, so it should be hoistable. I think you can label it speculatable: you can always compute an out-of-bounds pointer; the error only happens when you use it. So I would say marking it speculatable and hoisting it should be fine.
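
As a rough illustration of that argument (a hand-written fragment reusing the names from the example above, not output of any pass): the memref.subview, being pure offset/stride arithmetic, is hoisted, while the potentially out-of-bounds use stays under the guard.

%subview = memref.subview %arg0[42] [%arg1] [1] : memref<?xf32> to memref<?xf32, strided<[1], offset: 42>>
scf.if %arg2 {
  // Only the *use* of the subview needs to be guarded.
  linalg.fill ins(%cst_0 : f32) outs(%subview : memref<?xf32, strided<[1], offset: 42>>)
}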

@matthias-springer
Contributor

Step 3: https://reviews.llvm.org/D142480. This is a general CSE for scf.if branches. It can hoist+CSE anything, not just memref.subview.
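
To illustrate with a hand-written before/after example (my sketch, not taken from that revision): identical leading ops in the "then" and "else" regions are hoisted in front of the scf.if and merged, regardless of what kind of op they are.

// before
func.func @branch_cse(%cond: i1, %x: index, %y: index) -> index {
  %r = scf.if %cond -> (index) {
    %a = arith.addi %x, %y : index
    %s = arith.addi %a, %x : index
    scf.yield %s : index
  } else {
    %a = arith.addi %x, %y : index
    %s = arith.muli %a, %y : index
    scf.yield %s : index
  }
  return %r : index
}

// after hoisting + CSE of the common leading arith.addi
func.func @branch_cse(%cond: i1, %x: index, %y: index) -> index {
  %a = arith.addi %x, %y : index
  %r = scf.if %cond -> (index) {
    %s = arith.addi %a, %x : index
    scf.yield %s : index
  } else {
    %s = arith.muli %a, %y : index
    scf.yield %s : index
  }
  return %r : index
}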

@MaheshRavishankar
Contributor Author

@matthias-springer while you land these upstream, are there patches I can import locally to fix this issue, so that I can look into other issues that might be happening?

@matthias-springer
Contributor

If you patch in this stack of changes (https://reviews.llvm.org/D142130), it should work. Optionally also this change (https://reviews.llvm.org/D142480) to CSE the memref.subviews.

@MaheshRavishankar
Contributor Author

@matthias-springer thanks for all the great work. I patched all your changes (I pushed them to https://github.com/iree-org/iree/tree/pad_fusion , you should be able to check those out and build it locally). I still see errors in bufferization

// -----// IR Dump Before EmptyTensorToAllocTensor (empty-tensor-to-alloc-tensor) //----- //
module {
  func.func @_pad_1D_test_dispatch_0() {
    %cst = arith.constant dense<0> : vector<1xi32>
    %c7 = arith.constant 7 : index
    %c0 = arith.constant 0 : index
    %c0_i32 = arith.constant 0 : i32
    %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<2xi32>>
    %1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c0) : !flow.dispatch.tensor<writeonly:tensor<7xi32>>
    %workgroup_id_x = hal.interface.workgroup.id[0] : index
    %workgroup_count_x = hal.interface.workgroup.count[0] : index
    scf.for %arg0 = %workgroup_id_x to %c7 step %workgroup_count_x {
      %2 = affine.max affine_map<(d0) -> (-d0 + 3, 0)>(%arg0)
      %3 = affine.max affine_map<(d0) -> (0, d0 - 3)>(%arg0)
      %4 = affine.min affine_map<(d0) -> (2, d0)>(%3)
      %5 = affine.max affine_map<(d0) -> (0, d0 - 2)>(%arg0)
      %6 = affine.min affine_map<(d0) -> (2, d0)>(%5)
      %7 = affine.apply affine_map<(d0, d1) -> (d0 - d1)>(%6, %4)
      %8 = arith.cmpi eq, %7, %c0 : index
      %9 = scf.if %8 -> (tensor<1xi32>) {
        %generated = tensor.generate  {
        ^bb0(%arg1: index):
          tensor.yield %c0_i32 : i32
        } : tensor<1xi32>
        scf.yield %generated : tensor<1xi32>
      } else {
        %10 = flow.dispatch.tensor.load %0, offsets = [%4], sizes = [%7], strides = [1] : !flow.dispatch.tensor<readonly:tensor<2xi32>> -> tensor<?xi32>
        %dim = tensor.dim %10, %c0 : tensor<?xi32>
        %11 = affine.apply affine_map<()[s0, s1] -> (s0 + s1)>()[%2, %dim]
        %12 = arith.cmpi sle, %2, %c0 : index
        %13 = arith.cmpi sgt, %11, %c0 : index
        %14 = arith.andi %12, %13 : i1
        %15 = affine.apply affine_map<()[s0, s1] -> (s0 - s1)>()[%c0, %2]
        %16 = scf.if %14 -> (vector<1xi32>) {
          %19 = vector.transfer_read %10[%15], %c0_i32 {in_bounds = [true]} : tensor<?xi32>, vector<1xi32>
          scf.yield %19 : vector<1xi32>
        } else {
          scf.yield %cst : vector<1xi32>
        }
        %17 = tensor.empty() : tensor<1xi32>
        %18 = vector.transfer_write %16, %17[%c0] {in_bounds = [true]} : vector<1xi32>, tensor<1xi32>
        scf.yield %18 : tensor<1xi32>
      }
      flow.dispatch.tensor.store %9, %1, offsets = [%arg0], sizes = [1], strides = [1] : tensor<1xi32> -> !flow.dispatch.tensor<writeonly:tensor<7xi32>>
    }
    return
  }
}

After running the empty-tensor-to-alloc-tensor pass and the iree-codegen-iree-comprehensive-bufferize pass, I get the following error:

/usr/local/google/home/ravishankarm/iree/iree/tests/e2e/tosa_ops/pad.mlir:4:15: error: operand #0 may return/yield a new buffer allocation
    %result = "tosa.pad"(%0, %1)  : (tensor<2xi32>, tensor<1x2xi32>)  -> (tensor<7xi32>)
              ^
/usr/local/google/home/ravishankarm/iree/iree/tests/e2e/tosa_ops/pad.mlir:4:15: error: operand #0 may return/yield a new buffer allocation
    %result = "tosa.pad"(%0, %1)  : (tensor<2xi32>, tensor<1x2xi32>)  -> (tensor<7xi32>)
              ^
// -----// IR Dump After IREEComprehensiveBufferize Failed (iree-codegen-iree-comprehensive-bufferize) //----- //
module {
  func.func @_pad_1D_test_dispatch_0() {
    %cst = arith.constant dense<0> : vector<1xi32>
    %c7 = arith.constant 7 : index
    %c0 = arith.constant 0 : index
    %c0_i32 = arith.constant 0 : i32
    %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<2xi32>>
    %1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c0) : !flow.dispatch.tensor<writeonly:tensor<7xi32>>
    %workgroup_id_x = hal.interface.workgroup.id[0] : index
    %workgroup_count_x = hal.interface.workgroup.count[0] : index
    scf.for %arg0 = %workgroup_id_x to %c7 step %workgroup_count_x {
      %2 = affine.max affine_map<(d0) -> (-d0 + 3, 0)>(%arg0)
      %3 = affine.max affine_map<(d0) -> (0, d0 - 3)>(%arg0)
      %4 = affine.min affine_map<(d0) -> (2, d0)>(%3)
      %5 = affine.max affine_map<(d0) -> (0, d0 - 2)>(%arg0)
      %6 = affine.min affine_map<(d0) -> (2, d0)>(%5)
      %7 = affine.apply affine_map<(d0, d1) -> (d0 - d1)>(%6, %4)
      %8 = arith.cmpi eq, %7, %c0 : index
      %9 = scf.if %8 -> (tensor<1xi32>) {
        %generated = tensor.generate  {
        ^bb0(%arg1: index):
          tensor.yield %c0_i32 : i32
        } : tensor<1xi32>
        scf.yield %generated : tensor<1xi32>
      } else {
        %10 = flow.dispatch.tensor.load %0, offsets = [%4], sizes = [%7], strides = [1] : !flow.dispatch.tensor<readonly:tensor<2xi32>> -> tensor<?xi32>
        %dim = tensor.dim %10, %c0 : tensor<?xi32>
        %11 = affine.apply affine_map<()[s0, s1] -> (s0 + s1)>()[%2, %dim]
        %12 = arith.cmpi sle, %2, %c0 : index
        %13 = arith.cmpi sgt, %11, %c0 : index
        %14 = arith.andi %12, %13 : i1
        %15 = affine.apply affine_map<()[s0, s1] -> (s0 - s1)>()[%c0, %2]
        %16 = scf.if %14 -> (vector<1xi32>) {
          %19 = vector.transfer_read %10[%15], %c0_i32 {in_bounds = [true]} : tensor<?xi32>, vector<1xi32>
          scf.yield %19 : vector<1xi32>
        } else {
          scf.yield %cst : vector<1xi32>
        }
        %17 = bufferization.alloc_tensor() : tensor<1xi32>
        %18 = vector.transfer_write %16, %17[%c0] {in_bounds = [true]} : vector<1xi32>, tensor<1xi32>
        scf.yield %18 : tensor<1xi32>
      }
      flow.dispatch.tensor.store %9, %1, offsets = [%arg0], sizes = [1], strides = [1] : tensor<1xi32> -> !flow.dispatch.tensor<writeonly:tensor<7xi32>>
    }
    return
  }
}

To help repro you can also start with

hal.executable private @repro {
hal.executable.variant public @embedded_elf_x86_64, target = <"llvm-cpu", "embedded-elf-x86_64", {cpu = "generic", cpu_features = "", data_layout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128", native_vector_size = 16 : index, target_triple = "x86_64-unknown-unknown-eabi-elf"}> {
  hal.executable.export public @_pad_1D_test_dispatch_0 ordinal(0) layout(#hal.pipeline.layout<push_constants = 0, sets = [<0, bindings = [<0, storage_buffer, ReadOnly>, <1, storage_buffer>]>]>) {
  ^bb0(%arg0: !hal.device, %arg1: index):
    %x, %y, %z = flow.dispatch.workgroup_count_from_dag_root %arg1
    hal.return %x, %y, %z : index, index, index
  }
  builtin.module {
    func.func @_pad_1D_test_dispatch_0() {
      %c0 = arith.constant 0 : index
      %c3 = arith.constant 3 : index
      %c2 = arith.constant 2 : index
      %c0_i32 = arith.constant 0 : i32
      %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<2xi32>>
      %1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c0) : !flow.dispatch.tensor<writeonly:tensor<7xi32>>
      %2 = flow.dispatch.tensor.load %0, offsets = [0], sizes = [2], strides = [1] : !flow.dispatch.tensor<readonly:tensor<2xi32>> -> tensor<2xi32>
      %padded = tensor.pad %2 low[%c3] high[%c2] {
      ^bb0(%arg0: index):
        tensor.yield %c0_i32 : i32
      } : tensor<2xi32> to tensor<7xi32>
      flow.dispatch.tensor.store %padded, %1, offsets = [0], sizes = [7], strides = [1] : tensor<7xi32> -> !flow.dispatch.tensor<writeonly:tensor<7xi32>>
      return
    }
  }
}
}

and run iree-opt --pass-pipeline='hal.executable(hal.executable.variant(iree-codegen-iree-comprehensive-bufferize))' repro.mlir

@matthias-springer
Contributor

The tensor.generate is still there. Can you run these patterns before bufferization (in particular before EmptyTensorElimination): test-linalg-transform-patterns=test-convert-to-destination-style-patterns

@MaheshRavishankar
Contributor Author

Do we have to do this with the transform dialect? Is there a populate method I can use?

@MaheshRavishankar
Contributor Author

Found it. There is still a failure in IREEComprehensiveBufferizePass. IR before the failure:

// -----// IR Dump Before IREEComprehensiveBufferize (iree-codegen-iree-comprehensive-bufferize) //----- //
module {
  func.func @_pad_1D_test_dispatch_0() {
    %cst = arith.constant dense<0> : vector<1xi32>
    %c7 = arith.constant 7 : index
    %c0 = arith.constant 0 : index
    %c0_i32 = arith.constant 0 : i32
    %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<2xi32>>
    %1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c0) : !flow.dispatch.tensor<writeonly:tensor<7xi32>>
    %workgroup_id_x = hal.interface.workgroup.id[0] : index
    %workgroup_count_x = hal.interface.workgroup.count[0] : index
    scf.for %arg0 = %workgroup_id_x to %c7 step %workgroup_count_x {
      %2 = affine.max affine_map<(d0) -> (-d0 + 3, 0)>(%arg0)
      %3 = affine.max affine_map<(d0) -> (0, d0 - 3)>(%arg0)
      %4 = affine.min affine_map<(d0) -> (2, d0)>(%3)
      %5 = affine.max affine_map<(d0) -> (0, d0 - 2)>(%arg0)
      %6 = affine.min affine_map<(d0) -> (2, d0)>(%5)
      %7 = affine.apply affine_map<(d0, d1) -> (d0 - d1)>(%6, %4)
      %8 = arith.cmpi eq, %7, %c0 : index
      %9 = scf.if %8 -> (tensor<1xi32>) {
        %10 = bufferization.alloc_tensor() : tensor<1xi32>
        %11 = linalg.generic {indexing_maps = [affine_map<(d0) -> (d0)>], iterator_types = ["parallel"]} outs(%10 : tensor<1xi32>) {
        ^bb0(%out: i32):
          linalg.yield %c0_i32 : i32
        } -> tensor<1xi32>
        scf.yield %11 : tensor<1xi32>
      } else {
        %10 = flow.dispatch.tensor.load %0, offsets = [%4], sizes = [%7], strides = [1] : !flow.dispatch.tensor<readonly:tensor<2xi32>> -> tensor<?xi32>
        %dim = tensor.dim %10, %c0 : tensor<?xi32>
        %11 = affine.apply affine_map<()[s0, s1] -> (s0 + s1)>()[%2, %dim]
        %12 = arith.cmpi sle, %2, %c0 : index
        %13 = arith.cmpi sgt, %11, %c0 : index
        %14 = arith.andi %12, %13 : i1
        %15 = affine.apply affine_map<()[s0, s1] -> (s0 - s1)>()[%c0, %2]
        %16 = scf.if %14 -> (vector<1xi32>) {
          %19 = vector.transfer_read %10[%15], %c0_i32 {in_bounds = [true]} : tensor<?xi32>, vector<1xi32>
          scf.yield %19 : vector<1xi32>
        } else {
          scf.yield %cst : vector<1xi32>
        }
        %17 = bufferization.alloc_tensor() : tensor<1xi32>
        %18 = vector.transfer_write %16, %17[%c0] {in_bounds = [true]} : vector<1xi32>, tensor<1xi32>
        scf.yield %18 : tensor<1xi32>
      }
      flow.dispatch.tensor.store %9, %1, offsets = [%arg0], sizes = [1], strides = [1] : tensor<1xi32> -> !flow.dispatch.tensor<writeonly:tensor<7xi32>>
    }
    return
  }
}

This is the error

/usr/local/google/home/ravishankarm/iree/iree/tests/e2e/tosa_ops/pad.mlir:4:15: error: operand #0 may return/yield a new buffer allocation
    %result = "tosa.pad"(%0, %1)  : (tensor<2xi32>, tensor<1x2xi32>)  -> (tensor<7xi32>)
              ^
/usr/local/google/home/ravishankarm/iree/iree/tests/e2e/tosa_ops/pad.mlir:4:15: error: operand #0 may return/yield a new buffer allocation
    %result = "tosa.pad"(%0, %1)  : (tensor<2xi32>, tensor<1x2xi32>)  -> (tensor<7xi32>)
              ^
// -----// IR Dump After IREEComprehensiveBufferize Failed (iree-codegen-iree-comprehensive-bufferize) //----- //
module {
  func.func @_pad_1D_test_dispatch_0() {
    %cst = arith.constant dense<0> : vector<1xi32>
    %c7 = arith.constant 7 : index
    %c0 = arith.constant 0 : index
    %c0_i32 = arith.constant 0 : i32
    %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<2xi32>>
    %1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c0) : !flow.dispatch.tensor<writeonly:tensor<7xi32>>
    %workgroup_id_x = hal.interface.workgroup.id[0] : index
    %workgroup_count_x = hal.interface.workgroup.count[0] : index
    scf.for %arg0 = %workgroup_id_x to %c7 step %workgroup_count_x {
      %2 = affine.max affine_map<(d0) -> (-d0 + 3, 0)>(%arg0)
      %3 = affine.max affine_map<(d0) -> (0, d0 - 3)>(%arg0)
      %4 = affine.min affine_map<(d0) -> (2, d0)>(%3)
      %5 = affine.max affine_map<(d0) -> (0, d0 - 2)>(%arg0)
      %6 = affine.min affine_map<(d0) -> (2, d0)>(%5)
      %7 = affine.apply affine_map<(d0, d1) -> (d0 - d1)>(%6, %4)
      %8 = arith.cmpi eq, %7, %c0 : index
      %9 = scf.if %8 -> (tensor<1xi32>) {
        %10 = bufferization.alloc_tensor() : tensor<1xi32>
        %11 = linalg.generic {indexing_maps = [affine_map<(d0) -> (d0)>], iterator_types = ["parallel"]} outs(%10 : tensor<1xi32>) {
        ^bb0(%out: i32):
          linalg.yield %c0_i32 : i32
        } -> tensor<1xi32>
        scf.yield %11 : tensor<1xi32>
      } else {
        %10 = flow.dispatch.tensor.load %0, offsets = [%4], sizes = [%7], strides = [1] : !flow.dispatch.tensor<readonly:tensor<2xi32>> -> tensor<?xi32>
        %dim = tensor.dim %10, %c0 : tensor<?xi32>
        %11 = affine.apply affine_map<()[s0, s1] -> (s0 + s1)>()[%2, %dim]
        %12 = arith.cmpi sle, %2, %c0 : index
        %13 = arith.cmpi sgt, %11, %c0 : index
        %14 = arith.andi %12, %13 : i1
        %15 = affine.apply affine_map<()[s0, s1] -> (s0 - s1)>()[%c0, %2]
        %16 = scf.if %14 -> (vector<1xi32>) {
          %19 = vector.transfer_read %10[%15], %c0_i32 {in_bounds = [true]} : tensor<?xi32>, vector<1xi32>
          scf.yield %19 : vector<1xi32>
        } else {
          scf.yield %cst : vector<1xi32>
        }
        %17 = bufferization.alloc_tensor() : tensor<1xi32>
        %18 = vector.transfer_write %16, %17[%c0] {in_bounds = [true]} : vector<1xi32>, tensor<1xi32>
        scf.yield %18 : tensor<1xi32>
      }
      flow.dispatch.tensor.store %9, %1, offsets = [%arg0], sizes = [1], strides = [1] : tensor<1xi32> -> !flow.dispatch.tensor<writeonly:tensor<7xi32>>
    }
    return
  }
}

MaheshRavishankar pushed a commit to MaheshRavishankar/iree that referenced this issue Feb 28, 2023
It seems like handling the code generated by the tiling of pad
operations needs more work in bufferization. To unblock the work of
handling pad operations natively in IREE,
iree-org#11273 (comment)
is implemented here as a workaround.

To ensure bufferization without allocation, yields of the then and
else branch and the result of the `scf.if` are all tied together. If
the `then` and `else` come from different bindings, then this would be
illegal (because a copy is needed). This example led to adding more
constraints on what sets can be merged during the
`BufferizationAnalysis` to avoid merging sets that have constants or
have two different `interface_bindings`.
MaheshRavishankar added a commit that referenced this issue Mar 2, 2023
qedawkins pushed a commit to qedawkins/iree that referenced this issue Apr 2, 2023
@MaheshRavishankar
Contributor Author

Moved to P2 for now.

jpienaar pushed a commit that referenced this issue May 1, 2023
rengolin pushed a commit to plaidml/iree that referenced this issue May 2, 2023
NatashaKnk pushed a commit to NatashaKnk/iree that referenced this issue Jul 6, 2023