ObjectFifo Matmul + Elementwise #644

jtuyls · 2024-08-05T18:50:11Z

The main failure in matmul + elementwise seems to be caused by too many connections being created. There are multiple ways this could be solved:

1. Packet routing, so streams can be reused for multiple connections.
2. Reprogramming DMAs and routes after matmul data has been moved in, to create new routes for elementwise data movement.
3. Combining connections together whenever possible. This could either be combining A and B connections, so the same streams are used for these, or combining the elementwise constant data connection with the A or B connection.

As there is no e2e support yet for the more general approaches 1 and 2, the current plan is to implement approach 3. Approach 3 may also have long-term benefits: depending on how it is implemented and which operations are targeted, it could give very good performance as well.

For approach 3, we need the following fixes and new transformations in the flow:

1. Fix access ops not at the right place (@jtuyls)
   Fixed by:
   - [DistributeCoresAndObjectFifos] Fix access op ordering in cores #625
2. Split temp memrefs in distribute pass (one 4x4x4x4x4 -> 4 separate 4x4x4x4 memrefs) (@Abhishek-Varma)
   Fixed by:
   - [AMD-AIE][ObjectFifo] Address temporary L1 buffers for Matmul+Elem #647
   - [ObjectFifo] Create a new pass to split L2 buffers #659
3. Linearize logical objectfifo memrefs (@yzhang93)
   Fixed by:
   - [ObjectFifo] Add pass to flatten the logical objectFifo #638
   - [ObjectFifo] Modify LowerToAIE pass to take flattened objectfifos #652
4. Combine DMAs and insert additional reads after the same number of accesses in both cores (@jtuyls)
   - Enable reuse of circular dmas/connections for different logical objectFifos on L3: Add logical objFifo placeholder op for connection reuse #709
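
The splitting and linearizing steps in the list above can be illustrated with a small sketch. It assumes a row-major layout (the issue does not spell out the layout), under which splitting a 4x4x4x4x4 buffer along its leading dimension yields four 4x4x4x4 pieces that each cover one contiguous range of the linearized buffer. The helper names are hypothetical and not part of iree-amd-aie.

```python
from math import prod

def row_major_strides(shape):
    """Strides (in elements) of a row-major buffer with the given shape."""
    strides = []
    running = 1
    for dim in reversed(shape):
        strides.append(running)
        running *= dim
    return list(reversed(strides))

def split_leading_dim(shape):
    """Split a buffer along dim 0.

    Returns (sub_shape, ranges), where each (start, stop) range is the
    linearized span covered by one of the resulting pieces.
    """
    sub_shape = shape[1:]
    piece = prod(sub_shape)
    return sub_shape, [(i * piece, (i + 1) * piece) for i in range(shape[0])]

shape = [4, 4, 4, 4, 4]
strides = row_major_strides(shape)
sub_shape, ranges = split_leading_dim(shape)
print(strides)    # [256, 64, 16, 4, 1]
print(sub_shape)  # [4, 4, 4, 4]
print(ranges)     # [(0, 256), (256, 512), (512, 768), (768, 1024)]
```

Each piece is a contiguous block of 256 elements, so in this model a flattened (1D) view of a piece needs only a start offset and a length.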

This PR is part of the work to achieve #644.

-- In Matmul+Elemwise we get to see the intermediate L1 buffers for matmuls: `alloc -> subview -> access (within amdaie.core)`.
-- We should replace the subview with a narrowed alloc itself for this case as well.
-- This commit therefore addresses that as part of the `--iree-amdaie-distribute-cores-and-objectfifos` pass.
-- Addresses sub-action item `2` from #644.

Signed-off-by: Abhishek Varma <abhvarma@amd.com>

-- This commit introduces a new pass `--iree-amdaie-split-buffers` to split L2 buffers for dealing with Matmul+Elementwise.
-- It addresses sub-action 2 as well from #644.

Signed-off-by: Abhishek Varma <abhvarma@amd.com>

jtuyls · 2024-08-12T04:18:05Z

@Abhishek-Varma

For item 4, see the snippet below for a sample input and expected output:

NOTE: The circular DMA objectfifos need to be decoupled from the actual underlying memref argument, so a single circular DMA can be reused for multiple different memref arguments (see ARG_NEW in the expected output). I will have a look at how to update the ops to accomplish this.

%tile_0_2 = amdaie.tile(%c0, %c2)
%tile_1_2 = amdaie.tile(%c1, %c2)
%tile_0_3 = amdaie.tile(%c0, %c3)
%tile_1_3 = amdaie.tile(%c1, %c3)

%0 = amdaie.circular_dma_cpy_nd(%arg1[] [] [], %arg0[] [] []) : (!amdaie.logicalobjectfifo<memref<2x16xi32, 1>>, !amdaie.logicalobjectfifo<memref<2x16xi32>>)
%1 = amdaie.circular_dma_cpy_nd(%arg2[] [] [], %arg1[0] [16] [1]) : (!amdaie.logicalobjectfifo<memref<1x16xi32, 2>>, !amdaie.logicalobjectfifo<memref<2x16xi32, 1>>)
%2 = amdaie.circular_dma_cpy_nd(%arg3[] [] [], %arg1[16] [16] [1]) : (!amdaie.logicalobjectfifo<memref<1x16xi32, 2>>, !amdaie.logicalobjectfifo<memref<2x16xi32, 1>>)

%3 = amdaie.circular_dma_cpy_nd(%arg5[] [] [], %arg4[] [] []) : (!amdaie.logicalobjectfifo<memref<1x16xi32, 1>>, !amdaie.logicalobjectfifo<memref<4x16xi32>>)
%4 = amdaie.circular_dma_cpy_nd(%arg6[] [] [], %arg5[0] [16] [1]) : (!amdaie.logicalobjectfifo<memref<1x16xi32, 2>>, !amdaie.logicalobjectfifo<memref<1x16xi32, 1>>)

%5 = amdaie.circular_dma_cpy_nd(%arg7[] [] [], %arg4[] [] []) : (!amdaie.logicalobjectfifo<memref<1x16xi32, 1>>, !amdaie.logicalobjectfifo<memref<4x16xi32>>)
%6 = amdaie.circular_dma_cpy_nd(%arg8[] [] [], %arg7[0] [16] [1]) : (!amdaie.logicalobjectfifo<memref<1x16xi32, 2>>, !amdaie.logicalobjectfifo<memref<1x16xi32, 1>>)

%7 = amdaie.circular_dma_cpy_nd(%arg9[] [] [], %arg4[] [] []) : (!amdaie.logicalobjectfifo<memref<1x16xi32, 1>>, !amdaie.logicalobjectfifo<memref<4x16xi32>>)
%8 = amdaie.circular_dma_cpy_nd(%arg10[] [] [], %arg9[0] [16] [1]) : (!amdaie.logicalobjectfifo<memref<1x16xi32, 2>>, !amdaie.logicalobjectfifo<memref<1x16xi32, 1>>)

%9 = amdaie.circular_dma_cpy_nd(%arg11[] [] [], %arg4[] [] []) : (!amdaie.logicalobjectfifo<memref<1x16xi32, 1>>, !amdaie.logicalobjectfifo<memref<4x16xi32>>)
%10 = amdaie.circular_dma_cpy_nd(%arg12[] [] [], %arg11[0] [16] [1]) : (!amdaie.logicalobjectfifo<memref<1x16xi32, 2>>, !amdaie.logicalobjectfifo<memref<1x16xi32, 1>>)


%core_0_2 = amdaie.core(%tile_0_2, in : [%1, %4], out : []) {
    %access_0 = amdaie.logicalobjectfifo.access(%arg2, Read) : !amdaie.logicalobjectfifo<memref<1x16xi32, 2>> -> memref<1x16xi32, 2>
    linalg.fill ins(%c0_i32 : i32) outs(%access_0: memref<1x16xi32, 2>)
    %access_1 = amdaie.logicalobjectfifo.access(%arg6, Read) : !amdaie.logicalobjectfifo<memref<1x16xi32, 2>> -> memref<1x16xi32, 2>
    linalg.fill ins(%c0_i32 : i32) outs(%access_1: memref<1x16xi32, 2>)
    amdaie.end
}
%core_1_2 = amdaie.core(%tile_1_2, in : [%1, %6], out : []) {
    %access_2 = amdaie.logicalobjectfifo.access(%arg2, Read) : !amdaie.logicalobjectfifo<memref<1x16xi32, 2>> -> memref<1x16xi32, 2>
    linalg.fill ins(%c0_i32 : i32) outs(%access_2: memref<1x16xi32, 2>)
    %access_3 = amdaie.logicalobjectfifo.access(%arg8, Read) : !amdaie.logicalobjectfifo<memref<1x16xi32, 2>> -> memref<1x16xi32, 2>
    linalg.fill ins(%c0_i32 : i32) outs(%access_3: memref<1x16xi32, 2>)
    amdaie.end
}
%core_0_3 = amdaie.core(%tile_0_3, in : [%2, %8], out : []) {
    %access_4 = amdaie.logicalobjectfifo.access(%arg3, Read) : !amdaie.logicalobjectfifo<memref<1x16xi32, 2>> -> memref<1x16xi32, 2>
    linalg.fill ins(%c0_i32 : i32) outs(%access_4: memref<1x16xi32, 2>)
    %access_5 = amdaie.logicalobjectfifo.access(%arg10, Read) : !amdaie.logicalobjectfifo<memref<1x16xi32, 2>> -> memref<1x16xi32, 2>
    linalg.fill ins(%c0_i32 : i32) outs(%access_5: memref<1x16xi32, 2>)
    amdaie.end
}
%core_1_3 = amdaie.core(%tile_1_3, in : [%2, %10], out : []) {
    %access_6 = amdaie.logicalobjectfifo.access(%arg3, Read) : !amdaie.logicalobjectfifo<memref<1x16xi32, 2>> -> memref<1x16xi32, 2>
    linalg.fill ins(%c0_i32 : i32) outs(%access_6: memref<1x16xi32, 2>)
    %access_7 = amdaie.logicalobjectfifo.access(%arg12, Read) : !amdaie.logicalobjectfifo<memref<1x16xi32, 2>> -> memref<1x16xi32, 2>
    linalg.fill ins(%c0_i32 : i32) outs(%access_7: memref<1x16xi32, 2>)
    amdaie.end
}

amdaie.controlcode {
    %npu0 = amdaie.npu.dma_cpy_nd %0([] [] [], [0] [32] [1])
    %npu1 = amdaie.npu.dma_cpy_nd %3([] [] [], [0] [16] [1])
    %npu2 = amdaie.npu.dma_cpy_nd %5([] [] [], [16] [16] [1])
    %npu3 = amdaie.npu.dma_cpy_nd %7([] [] [], [32] [16] [1])
    %npu4 = amdaie.npu.dma_cpy_nd %9([] [] [], [48] [16] [1])
}

Expected output:

%tile_0_2 = amdaie.tile(%c0, %c2)
%tile_1_2 = amdaie.tile(%c1, %c2)
%tile_0_3 = amdaie.tile(%c0, %c3)
%tile_1_3 = amdaie.tile(%c1, %c3)

%0 = amdaie.circular_dma_cpy_nd(%arg1[] [] [], %ARG_NEW[] [] []) : (!amdaie.logicalobjectfifo<memref<2x16xi32, 1>>, !amdaie.logicalobjectfifo<memref<2x16xi32>>)
%1 = amdaie.circular_dma_cpy_nd(%arg2[] [] [], %arg1[0] [16] [1]) : (!amdaie.logicalobjectfifo<memref<1x16xi32, 2>>, !amdaie.logicalobjectfifo<memref<2x16xi32, 1>>)
%2 = amdaie.circular_dma_cpy_nd(%arg3[] [] [], %arg1[16] [16] [1]) : (!amdaie.logicalobjectfifo<memref<1x16xi32, 2>>, !amdaie.logicalobjectfifo<memref<2x16xi32, 1>>)

%core_0_2 = amdaie.core(%tile_0_2, in : [%1], out : []) {
    %access_0 = amdaie.logicalobjectfifo.access(%arg2, Read) : !amdaie.logicalobjectfifo<memref<1x16xi32, 2>> -> memref<1x16xi32, 2>
    linalg.fill ins(%c0_i32 : i32) outs(%access_0: memref<1x16xi32, 2>)
    %access_1 = amdaie.logicalobjectfifo.access(%arg2, Read) : !amdaie.logicalobjectfifo<memref<1x16xi32, 2>> -> memref<1x16xi32, 2>
    linalg.fill ins(%c0_i32 : i32) outs(%access_1: memref<1x16xi32, 2>)
    // Read access objectFifo again, but don't use it as data is not needed in this core (but in core_1_2).
    %access_2 = amdaie.logicalobjectfifo.access(%arg2, Read) : !amdaie.logicalobjectfifo<memref<1x16xi32, 2>> -> memref<1x16xi32, 2>
    amdaie.end
}
%core_1_2 = amdaie.core(%tile_1_2, in : [%1], out : []) {
    %access_3 = amdaie.logicalobjectfifo.access(%arg2, Read) : !amdaie.logicalobjectfifo<memref<1x16xi32, 2>> -> memref<1x16xi32, 2>
    linalg.fill ins(%c0_i32 : i32) outs(%access_3: memref<1x16xi32, 2>)
    // Read access objectFifo, but don't use it as data is not needed in this core (but in core_0_2).
    %access_4 = amdaie.logicalobjectfifo.access(%arg2, Read) : !amdaie.logicalobjectfifo<memref<1x16xi32, 2>> -> memref<1x16xi32, 2>
    // Read access objectFifo again, but now use the data.
    %access_5 = amdaie.logicalobjectfifo.access(%arg2, Read) : !amdaie.logicalobjectfifo<memref<1x16xi32, 2>> -> memref<1x16xi32, 2>
    linalg.fill ins(%c0_i32 : i32) outs(%access_5: memref<1x16xi32, 2>)
    amdaie.end
}
%core_0_3 = amdaie.core(%tile_0_3, in : [%2], out : []) {
    %access_6 = amdaie.logicalobjectfifo.access(%arg3, Read) : !amdaie.logicalobjectfifo<memref<1x16xi32, 2>> -> memref<1x16xi32, 2>
    linalg.fill ins(%c0_i32 : i32) outs(%access_6: memref<1x16xi32, 2>)
    %access_7 = amdaie.logicalobjectfifo.access(%arg3, Read) : !amdaie.logicalobjectfifo<memref<1x16xi32, 2>> -> memref<1x16xi32, 2>
    linalg.fill ins(%c0_i32 : i32) outs(%access_7: memref<1x16xi32, 2>)
    // Read access objectFifo again, but don't use it as data is not needed in this core (but in core_1_3).
    %access_8 = amdaie.logicalobjectfifo.access(%arg3, Read) : !amdaie.logicalobjectfifo<memref<1x16xi32, 2>> -> memref<1x16xi32, 2>
    amdaie.end
}
%core_1_3 = amdaie.core(%tile_1_3, in : [%2], out : []) {
    %access_9 = amdaie.logicalobjectfifo.access(%arg3, Read) : !amdaie.logicalobjectfifo<memref<1x16xi32, 2>> -> memref<1x16xi32, 2>
    linalg.fill ins(%c0_i32 : i32) outs(%access_9: memref<1x16xi32, 2>)
    // Read access objectFifo, but don't use it as data is not needed in this core (but in core_0_3).
    %access_10 = amdaie.logicalobjectfifo.access(%arg3, Read) : !amdaie.logicalobjectfifo<memref<1x16xi32, 2>> -> memref<1x16xi32, 2>
    // Read access objectFifo again, but now use the data.
    %access_11 = amdaie.logicalobjectfifo.access(%arg3, Read) : !amdaie.logicalobjectfifo<memref<1x16xi32, 2>> -> memref<1x16xi32, 2>
    linalg.fill ins(%c0_i32 : i32) outs(%access_11: memref<1x16xi32, 2>)
    amdaie.end
}

amdaie.controlcode {
    %npu0 = amdaie.npu.dma_cpy_nd %0([] [] [], [0] [32] [1])
    %npu1 = amdaie.npu.dma_cpy_nd %0([] [] [], [0, 0] [2, 16] [32, 1])
    %npu2 = amdaie.npu.dma_cpy_nd %0([] [] [], [0, 16] [2, 16] [32, 1])
}
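
The control code in the expected output can be checked with a short sketch. It models the 4x16 source as 64 flat elements, expands an (offsets, sizes, strides) access pattern into element indices, and confirms that the two combined transfers deliver rows (0, 2) and then (1, 3), so each core ends up using the same row it received through its own DMA in the input. The sketch assumes that every offset is scaled by the stride of its dimension; the helper names and the flat-index model are illustrative assumptions, not iree-amd-aie APIs.

```python
from itertools import product

ROW = 16  # elements per row of the 4x16 source

def expand(offsets, sizes, strides):
    """Element indices addressed by an (offsets, sizes, strides) pattern."""
    base = sum(o * s for o, s in zip(offsets, strides))
    return [base + sum(i * s for i, s in zip(idx, strides))
            for idx in product(*(range(n) for n in sizes))]

def rows(indices):
    """Source rows touched by a list of element indices, in first-seen order."""
    seen = []
    for i in indices:
        if i // ROW not in seen:
            seen.append(i // ROW)
    return seen

# Input: one DMA per core, each reading 16 elements at its own offset.
input_row = {
    "core_0_2": rows(expand([0], [16], [1]))[0],
    "core_1_2": rows(expand([16], [16], [1]))[0],
    "core_0_3": rows(expand([32], [16], [1]))[0],
    "core_1_3": rows(expand([48], [16], [1]))[0],
}

# Expected output: two transfers on the shared connection %0.
npu1 = rows(expand([0, 0], [2, 16], [32, 1]))   # feeds the second read in every core
npu2 = rows(expand([0, 16], [2, 16], [32, 1]))  # feeds the third read in every core

# For each core: which transfer it uses (0 = npu1, 1 = npu2) and which half
# of that transfer reaches it (0 = via %arg2, 1 = via %arg3).
used = {
    "core_0_2": (0, 0),
    "core_1_2": (1, 0),
    "core_0_3": (0, 1),
    "core_1_3": (1, 1),
}
transfers = [npu1, npu2]
output_row = {core: transfers[t][half] for core, (t, half) in used.items()}

print(npu1, npu2)               # [0, 2] [1, 3]
print(output_row == input_row)  # True
```

The reads that a core does not use are the price of sharing one connection: every core on the same objectFifo must consume every transfer, even the ones meant for its neighbour.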

jtuyls · 2024-09-02T06:45:49Z

@Abhishek-Varma Here is an example of input and expected output for point 4 above, showing how the C DMAs are combined with the B ones and additional read accesses are inserted to accommodate broadcasted data.

Input:

%15 = affine.apply affine_map<(d0) -> (d0 * 64)>(%arg1)
%16 = affine.apply affine_map<(d0) -> (d0 * 64 + 32)>(%arg1)
%17 = affine.apply affine_map<(d0) -> (d0 * 64)>(%arg1)
%18 = affine.apply affine_map<(d0) -> (d0 * 64 + 32)>(%arg1)
%19 = affine.apply affine_map<(d0) -> (d0 * 64)>(%arg1)
%20 = affine.apply affine_map<(d0) -> (d0 * 64)>(%arg0)
%21 = affine.apply affine_map<(d0) -> (d0 * 64)>(%arg0)
%22 = affine.apply affine_map<(d0) -> (d0 * 64 + 32)>(%arg0)
%23 = affine.apply affine_map<(d0) -> (d0 * 64 + 32)>(%arg0)
%24 = affine.apply affine_map<(d0) -> (d0 * 64)>(%arg0)
%43 = amdaie.dma_cpy_nd(%6[0, 0, 0, 0] [2, 1, 32, 32] [1024, 1024, 32, 1], %8[0, 0, %24, 224] [2, 1, 32, 32] [8192, 32, 256, 1]) : (!amdaie.logicalobjectfifo<memref<2x1x32x32xi32, 1 : i32>>, !amdaie.logicalobjectfifo<memref<128x256xi32>>)
%44 = amdaie.dma_cpy_nd(%5[0, 0, 0, 0] [1, 2, 32, 32] [2048, 1024, 32, 1], %10[0, 0, 224, %19] [1, 2, 32, 32] [4096, 32, 128, 1]) : (!amdaie.logicalobjectfifo<memref<1x2x32x32xi32, 1 : i32>>, !amdaie.logicalobjectfifo<memref<256x128xi32>>)
%45 = amdaie.dma_cpy_nd(%1[0, 0, 0, 0] [1, 1, 32, 32] [2048, 1024, 32, 1], %12[0, 0, %20, %15] [1, 1, 32, 32] [4096, 32, 128, 1]) : (!amdaie.logicalobjectfifo<memref<1x1x32x32xi32, 1 : i32>>, !amdaie.logicalobjectfifo<memref<128x128xi32>>)
%46 = amdaie.dma_cpy_nd(%2[0, 0, 0, 0] [1, 1, 32, 32] [2048, 1024, 32, 1], %12[0, 0, %21, %16] [1, 1, 32, 32] [4096, 32, 128, 1]) : (!amdaie.logicalobjectfifo<memref<1x1x32x32xi32, 1 : i32>>, !amdaie.logicalobjectfifo<memref<128x128xi32>>)
%47 = amdaie.dma_cpy_nd(%3[0, 0, 0, 0] [1, 1, 32, 32] [2048, 1024, 32, 1], %12[0, 0, %22, %17] [1, 1, 32, 32] [4096, 32, 128, 1]) : (!amdaie.logicalobjectfifo<memref<1x1x32x32xi32, 1 : i32>>, !amdaie.logicalobjectfifo<memref<128x128xi32>>)
%48 = amdaie.dma_cpy_nd(%4[0, 0, 0, 0] [1, 1, 32, 32] [2048, 1024, 32, 1], %12[0, 0, %23, %18] [1, 1, 32, 32] [4096, 32, 128, 1]) : (!amdaie.logicalobjectfifo<memref<1x1x32x32xi32, 1 : i32>>, !amdaie.logicalobjectfifo<memref<128x128xi32>>)
%57 = amdaie.dma_cpy_nd(%28[0, 0, 0, 0, 0, 0] [1, 1, 8, 4, 8, 4] [1024, 1024, 128, 32, 4, 1], %5[0, 0, 0, 0, 0, 0] [1, 1, 8, 4, 8, 4] [2048, 1024, 4, 256, 32, 1]) : (!amdaie.logicalobjectfifo<memref<1x1x8x4x8x4xi32, 2 : i32>>, !amdaie.logicalobjectfifo<memref<1x2x32x32xi32, 1 : i32>>)
%58 = amdaie.dma_cpy_nd(%27[0, 0, 0, 0, 0, 0] [1, 1, 8, 4, 8, 4] [1024, 1024, 128, 32, 4, 1], %5[0, 1, 0, 0, 0, 0] [1, 1, 8, 4, 8, 4] [2048, 1024, 4, 256, 32, 1]) : (!amdaie.logicalobjectfifo<memref<1x1x8x4x8x4xi32, 2 : i32>>, !amdaie.logicalobjectfifo<memref<1x2x32x32xi32, 1 : i32>>)
%59 = amdaie.dma_cpy_nd(%30[0, 0, 0, 0, 0, 0] [1, 1, 4, 8, 4, 8] [1024, 1024, 256, 32, 8, 1], %6[0, 0, 0, 0, 0, 0] [1, 1, 4, 8, 4, 8] [1024, 1024, 8, 128, 32, 1]) : (!amdaie.logicalobjectfifo<memref<1x1x4x8x4x8xi32, 2 : i32>>, !amdaie.logicalobjectfifo<memref<2x1x32x32xi32, 1 : i32>>)
%60 = amdaie.dma_cpy_nd(%52[0, 0, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [1024, 1024, 128, 16, 4, 1], %1[0, 0, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [2048, 1024, 4, 128, 32, 1]) : (!amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>>, !amdaie.logicalobjectfifo<memref<1x1x32x32xi32, 1 : i32>>)
%61 = amdaie.dma_cpy_nd(%0[0, 0, 0, 0] [1, 1, 32, 32] [2048, 1024, 32, 1], %56[0, 0, 0, 0] [8, 4, 8, 4] [16, 4, 128, 1]) : (!amdaie.logicalobjectfifo<memref<2x2x32x32xi32, 1 : i32>>, !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>>)
%62 = amdaie.core(%tile_15, in : [%59, %57, %60], out : [%61]) {
  %74 = amdaie.logicalobjectfifo.access(%30, Read) : !amdaie.logicalobjectfifo<memref<1x1x4x8x4x8xi32, 2 : i32>> -> memref<1x1x4x8x4x8xi32, 2 : i32>
  %75 = amdaie.logicalobjectfifo.access(%28, Read) : !amdaie.logicalobjectfifo<memref<1x1x8x4x8x4xi32, 2 : i32>> -> memref<1x1x8x4x8x4xi32, 2 : i32>
  %76 = amdaie.logicalobjectfifo.access(%34, None) : !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>> -> memref<1x1x8x8x4x4xi32, 2 : i32>
  linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d2, d5, d3, d6, d8)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d2, d1, d4, d5, d8, d7)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d1, d4, d3, d6, d7)>], iterator_types = ["parallel", "parallel", "reduction", "parallel", "parallel", "reduction", "parallel", "parallel", "reduction"]} ins(%74, %75 : memref<1x1x4x8x4x8xi32, 2 : i32>, memref<1x1x8x4x8x4xi32, 2 : i32>) outs(%76 : memref<1x1x8x8x4x4xi32, 2 : i32>) attrs =  {lowering_config = #iree_codegen.lowering_config<tile_sizes = [[64, 64], [0, 0, 1], [1, 1, 0, 0, 0, 0]]>, packing_config = #amdaie.packing_config<packing_config = [{packedSizes = [32, 32, 32], transposePackIndices = [1], unpackEmpty = [false], innerPerm = [[1, 0]], outerPerm = [[0, 1]]}, {packedSizes = [0, 0, 0, 4, 4, 8], transposePackIndices = [0, 1, 2], unpackEmpty = [false, false, true], innerPerm = [[0, 1], [1, 0], [0, 1]], outerPerm = [[0, 1, 3, 2], [0, 1, 3, 2], [0, 1, 3, 2]]}]>} {
  ^bb0(%in: i32, %in_16: i32, %out: i32):
    %79 = arith.muli %in, %in_16 : i32
    %80 = arith.addi %out, %79 : i32
    linalg.yield %80 : i32
  }
  %77 = amdaie.logicalobjectfifo.access(%52, Read) : !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>> -> memref<1x1x8x8x4x4xi32, 2 : i32>
  %78 = amdaie.logicalobjectfifo.access(%56, Write) : !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>> -> memref<1x1x8x8x4x4xi32, 2 : i32>
  linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>, affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>, affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>], iterator_types = ["parallel", "parallel", "parallel", "parallel", "parallel", "parallel"]} ins(%76, %77 : memref<1x1x8x8x4x4xi32, 2 : i32>, memref<1x1x8x8x4x4xi32, 2 : i32>) outs(%78 : memref<1x1x8x8x4x4xi32, 2 : i32>) {
  ^bb0(%in: i32, %in_16: i32, %out: i32):
    %79 = arith.addi %in, %in_16 : i32
    linalg.yield %79 : i32
  }
  amdaie.end
}
%63 = amdaie.dma_cpy_nd(%50[0, 0, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [1024, 1024, 128, 16, 4, 1], %2[0, 0, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [2048, 1024, 4, 128, 32, 1]) : (!amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>>, !amdaie.logicalobjectfifo<memref<1x1x32x32xi32, 1 : i32>>)
%64 = amdaie.dma_cpy_nd(%0[0, 1, 0, 0] [1, 1, 32, 32] [2048, 1024, 32, 1], %54[0, 0, 0, 0] [8, 4, 8, 4] [16, 4, 128, 1]) : (!amdaie.logicalobjectfifo<memref<2x2x32x32xi32, 1 : i32>>, !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>>)
%65 = amdaie.core(%tile_13, in : [%59, %58, %63], out : [%64]) {
  %74 = amdaie.logicalobjectfifo.access(%30, Read) : !amdaie.logicalobjectfifo<memref<1x1x4x8x4x8xi32, 2 : i32>> -> memref<1x1x4x8x4x8xi32, 2 : i32>
  %75 = amdaie.logicalobjectfifo.access(%27, Read) : !amdaie.logicalobjectfifo<memref<1x1x8x4x8x4xi32, 2 : i32>> -> memref<1x1x8x4x8x4xi32, 2 : i32>
  %76 = amdaie.logicalobjectfifo.access(%36, None) : !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>> -> memref<1x1x8x8x4x4xi32, 2 : i32>
  linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d2, d5, d3, d6, d8)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d2, d1, d4, d5, d8, d7)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d1, d4, d3, d6, d7)>], iterator_types = ["parallel", "parallel", "reduction", "parallel", "parallel", "reduction", "parallel", "parallel", "reduction"]} ins(%74, %75 : memref<1x1x4x8x4x8xi32, 2 : i32>, memref<1x1x8x4x8x4xi32, 2 : i32>) outs(%76 : memref<1x1x8x8x4x4xi32, 2 : i32>) attrs =  {lowering_config = #iree_codegen.lowering_config<tile_sizes = [[64, 64], [0, 0, 1], [1, 1, 0, 0, 0, 0]]>, packing_config = #amdaie.packing_config<packing_config = [{packedSizes = [32, 32, 32], transposePackIndices = [1], unpackEmpty = [false], innerPerm = [[1, 0]], outerPerm = [[0, 1]]}, {packedSizes = [0, 0, 0, 4, 4, 8], transposePackIndices = [0, 1, 2], unpackEmpty = [false, false, true], innerPerm = [[0, 1], [1, 0], [0, 1]], outerPerm = [[0, 1, 3, 2], [0, 1, 3, 2], [0, 1, 3, 2]]}]>} {
  ^bb0(%in: i32, %in_16: i32, %out: i32):
    %79 = arith.muli %in, %in_16 : i32
    %80 = arith.addi %out, %79 : i32
    linalg.yield %80 : i32
  }
  %77 = amdaie.logicalobjectfifo.access(%50, Read) : !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>> -> memref<1x1x8x8x4x4xi32, 2 : i32>
  %78 = amdaie.logicalobjectfifo.access(%54, Write) : !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>> -> memref<1x1x8x8x4x4xi32, 2 : i32>
  linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>, affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>, affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>], iterator_types = ["parallel", "parallel", "parallel", "parallel", "parallel", "parallel"]} ins(%76, %77 : memref<1x1x8x8x4x4xi32, 2 : i32>, memref<1x1x8x8x4x4xi32, 2 : i32>) outs(%78 : memref<1x1x8x8x4x4xi32, 2 : i32>) {
  ^bb0(%in: i32, %in_16: i32, %out: i32):
    %79 = arith.addi %in, %in_16 : i32
    linalg.yield %79 : i32
  }
  amdaie.end
}
%66 = amdaie.dma_cpy_nd(%29[0, 0, 0, 0, 0, 0] [1, 1, 4, 8, 4, 8] [1024, 1024, 256, 32, 8, 1], %6[1, 0, 0, 0, 0, 0] [1, 1, 4, 8, 4, 8] [1024, 1024, 8, 128, 32, 1]) : (!amdaie.logicalobjectfifo<memref<1x1x4x8x4x8xi32, 2 : i32>>, !amdaie.logicalobjectfifo<memref<2x1x32x32xi32, 1 : i32>>)
%67 = amdaie.dma_cpy_nd(%51[0, 0, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [1024, 1024, 128, 16, 4, 1], %3[0, 0, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [2048, 1024, 4, 128, 32, 1]) : (!amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>>, !amdaie.logicalobjectfifo<memref<1x1x32x32xi32, 1 : i32>>)
%68 = amdaie.dma_cpy_nd(%0[1, 0, 0, 0] [1, 1, 32, 32] [2048, 1024, 32, 1], %55[0, 0, 0, 0] [8, 4, 8, 4] [16, 4, 128, 1]) : (!amdaie.logicalobjectfifo<memref<2x2x32x32xi32, 1 : i32>>, !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>>)
%69 = amdaie.core(%tile_14, in : [%66, %57, %67], out : [%68]) {
  %74 = amdaie.logicalobjectfifo.access(%29, Read) : !amdaie.logicalobjectfifo<memref<1x1x4x8x4x8xi32, 2 : i32>> -> memref<1x1x4x8x4x8xi32, 2 : i32>
  %75 = amdaie.logicalobjectfifo.access(%28, Read) : !amdaie.logicalobjectfifo<memref<1x1x8x4x8x4xi32, 2 : i32>> -> memref<1x1x8x4x8x4xi32, 2 : i32>
  %76 = amdaie.logicalobjectfifo.access(%39, None) : !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>> -> memref<1x1x8x8x4x4xi32, 2 : i32>
  linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d2, d5, d3, d6, d8)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d2, d1, d4, d5, d8, d7)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d1, d4, d3, d6, d7)>], iterator_types = ["parallel", "parallel", "reduction", "parallel", "parallel", "reduction", "parallel", "parallel", "reduction"]} ins(%74, %75 : memref<1x1x4x8x4x8xi32, 2 : i32>, memref<1x1x8x4x8x4xi32, 2 : i32>) outs(%76 : memref<1x1x8x8x4x4xi32, 2 : i32>) attrs =  {lowering_config = #iree_codegen.lowering_config<tile_sizes = [[64, 64], [0, 0, 1], [1, 1, 0, 0, 0, 0]]>, packing_config = #amdaie.packing_config<packing_config = [{packedSizes = [32, 32, 32], transposePackIndices = [1], unpackEmpty = [false], innerPerm = [[1, 0]], outerPerm = [[0, 1]]}, {packedSizes = [0, 0, 0, 4, 4, 8], transposePackIndices = [0, 1, 2], unpackEmpty = [false, false, true], innerPerm = [[0, 1], [1, 0], [0, 1]], outerPerm = [[0, 1, 3, 2], [0, 1, 3, 2], [0, 1, 3, 2]]}]>} {
  ^bb0(%in: i32, %in_16: i32, %out: i32):
    %79 = arith.muli %in, %in_16 : i32
    %80 = arith.addi %out, %79 : i32
    linalg.yield %80 : i32
  }
  %77 = amdaie.logicalobjectfifo.access(%51, Read) : !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>> -> memref<1x1x8x8x4x4xi32, 2 : i32>
  %78 = amdaie.logicalobjectfifo.access(%55, Write) : !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>> -> memref<1x1x8x8x4x4xi32, 2 : i32>
  linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>, affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>, affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>], iterator_types = ["parallel", "parallel", "parallel", "parallel", "parallel", "parallel"]} ins(%76, %77 : memref<1x1x8x8x4x4xi32, 2 : i32>, memref<1x1x8x8x4x4xi32, 2 : i32>) outs(%78 : memref<1x1x8x8x4x4xi32, 2 : i32>) {
  ^bb0(%in: i32, %in_16: i32, %out: i32):
    %79 = arith.addi %in, %in_16 : i32
    linalg.yield %79 : i32
  }
  amdaie.end
}
%70 = amdaie.dma_cpy_nd(%49[0, 0, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [1024, 1024, 128, 16, 4, 1], %4[0, 0, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [2048, 1024, 4, 128, 32, 1]) : (!amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>>, !amdaie.logicalobjectfifo<memref<1x1x32x32xi32, 1 : i32>>)
%71 = amdaie.dma_cpy_nd(%0[1, 1, 0, 0] [1, 1, 32, 32] [2048, 1024, 32, 1], %53[0, 0, 0, 0] [8, 4, 8, 4] [16, 4, 128, 1]) : (!amdaie.logicalobjectfifo<memref<2x2x32x32xi32, 1 : i32>>, !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>>)
%72 = amdaie.core(%tile_12, in : [%66, %58, %70], out : [%71]) {
  %74 = amdaie.logicalobjectfifo.access(%29, Read) : !amdaie.logicalobjectfifo<memref<1x1x4x8x4x8xi32, 2 : i32>> -> memref<1x1x4x8x4x8xi32, 2 : i32>
  %75 = amdaie.logicalobjectfifo.access(%27, Read) : !amdaie.logicalobjectfifo<memref<1x1x8x4x8x4xi32, 2 : i32>> -> memref<1x1x8x4x8x4xi32, 2 : i32>
  %76 = amdaie.logicalobjectfifo.access(%41, None) : !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>> -> memref<1x1x8x8x4x4xi32, 2 : i32>
  linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d2, d5, d3, d6, d8)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d2, d1, d4, d5, d8, d7)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d1, d4, d3, d6, d7)>], iterator_types = ["parallel", "parallel", "reduction", "parallel", "parallel", "reduction", "parallel", "parallel", "reduction"]} ins(%74, %75 : memref<1x1x4x8x4x8xi32, 2 : i32>, memref<1x1x8x4x8x4xi32, 2 : i32>) outs(%76 : memref<1x1x8x8x4x4xi32, 2 : i32>) attrs =  {lowering_config = #iree_codegen.lowering_config<tile_sizes = [[64, 64], [0, 0, 1], [1, 1, 0, 0, 0, 0]]>, packing_config = #amdaie.packing_config<packing_config = [{packedSizes = [32, 32, 32], transposePackIndices = [1], unpackEmpty = [false], innerPerm = [[1, 0]], outerPerm = [[0, 1]]}, {packedSizes = [0, 0, 0, 4, 4, 8], transposePackIndices = [0, 1, 2], unpackEmpty = [false, false, true], innerPerm = [[0, 1], [1, 0], [0, 1]], outerPerm = [[0, 1, 3, 2], [0, 1, 3, 2], [0, 1, 3, 2]]}]>} {
  ^bb0(%in: i32, %in_16: i32, %out: i32):
    %79 = arith.muli %in, %in_16 : i32
    %80 = arith.addi %out, %79 : i32
    linalg.yield %80 : i32
  }
  %77 = amdaie.logicalobjectfifo.access(%49, Read) : !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>> -> memref<1x1x8x8x4x4xi32, 2 : i32>
  %78 = amdaie.logicalobjectfifo.access(%53, Write) : !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>> -> memref<1x1x8x8x4x4xi32, 2 : i32>
  linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>, affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>, affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>], iterator_types = ["parallel", "parallel", "parallel", "parallel", "parallel", "parallel"]} ins(%76, %77 : memref<1x1x8x8x4x4xi32, 2 : i32>, memref<1x1x8x8x4x4xi32, 2 : i32>) outs(%78 : memref<1x1x8x8x4x4xi32, 2 : i32>) {
  ^bb0(%in: i32, %in_16: i32, %out: i32):
    %79 = arith.addi %in, %in_16 : i32
    linalg.yield %79 : i32
  }
  amdaie.end
}
%73 = amdaie.dma_cpy_nd(%14[%24, %19] [64, 64] [128, 1], %0[0, 0, 0, 0] [2, 32, 2, 32] [2048, 32, 1024, 1]) : (!amdaie.logicalobjectfifo<memref<128x128xi32>>, !amdaie.logicalobjectfifo<memref<2x2x32x32xi32, 1 : i32>>)
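
The input above contains four separate C DMAs (`%45` to `%48`) reading from the same source `%12`. A short sketch shows why pairs of them are candidates for merging: `%45` and `%46` differ only by 32 in the column offset, so a single pattern with size 2 in the dimension of stride 32 addresses the same elements in the same order. The sketch assumes that every offset is scaled by the stride of its dimension, keeps the base offsets of `%45` for the merged pattern, and uses hypothetical values for `%arg0` and `%arg1`; `expand` and `check` are illustrative helpers, not iree-amd-aie functions.

```python
from itertools import product

def expand(offsets, sizes, strides):
    """Element indices addressed by an (offsets, sizes, strides) pattern,
    assuming each offset is scaled by the stride of its dimension."""
    base = sum(o * s for o, s in zip(offsets, strides))
    return [base + sum(i * s for i, s in zip(idx, strides))
            for idx in product(*(range(n) for n in sizes))]

STRIDES = [4096, 32, 128, 1]  # strides used by %45..%48 on the 128x128 source

def check(arg0, arg1):
    """True if one merged pattern covers %45 followed by %46."""
    row = arg0 * 64  # %20 and %21
    col = arg1 * 64  # %15; %16 is col + 32
    dma_45 = expand([0, 0, row, col], [1, 1, 32, 32], STRIDES)
    dma_46 = expand([0, 0, row, col + 32], [1, 1, 32, 32], STRIDES)
    merged = expand([0, 0, row, col], [1, 2, 32, 32], STRIDES)
    return merged == dma_45 + dma_46

print(all(check(a0, a1) for a0 in range(2) for a1 in range(2)))  # True
```

The same argument applies to `%47` and `%48`, which share the row offset `%arg0 * 64 + 32` and again differ by 32 in the column offset.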

Expected output:

%15 = affine.apply affine_map<(d0) -> (d0 * 64)>(%arg1)
%16 = affine.apply affine_map<(d0) -> (d0 * 64 + 32)>(%arg1)
%17 = affine.apply affine_map<(d0) -> (d0 * 64)>(%arg1)
%18 = affine.apply affine_map<(d0) -> (d0 * 64 + 32)>(%arg1)
%19 = affine.apply affine_map<(d0) -> (d0 * 64)>(%arg1)
%20 = affine.apply affine_map<(d0) -> (d0 * 64)>(%arg0)
%21 = affine.apply affine_map<(d0) -> (d0 * 64)>(%arg0)
%22 = affine.apply affine_map<(d0) -> (d0 * 64 + 32)>(%arg0)
%23 = affine.apply affine_map<(d0) -> (d0 * 64 + 32)>(%arg0)
%24 = affine.apply affine_map<(d0) -> (d0 * 64)>(%arg0)
%43 = amdaie.dma_cpy_nd(%6[0, 0, 0, 0] [2, 1, 32, 32] [1024, 1024, 32, 1], %8[0, 0, %24, 224] [2, 1, 32, 32] [8192, 32, 256, 1]) : (!amdaie.logicalobjectfifo<memref<2x1x32x32xi32, 1 : i32>>, !amdaie.logicalobjectfifo<memref<128x256xi32>>)
%44 = amdaie.dma_cpy_nd(%5[0, 0, 0, 0] [1, 2, 32, 32] [2048, 1024, 32, 1], %10[0, 0, 224, %19] [1, 2, 32, 32] [4096, 32, 128, 1]) : (!amdaie.logicalobjectfifo<memref<1x2x32x32xi32, 1 : i32>>, !amdaie.logicalobjectfifo<memref<256x128xi32>>)
// [OLD] %45 = amdaie.dma_cpy_nd(%1[0, 0, 0, 0] [1, 1, 32, 32] [2048, 1024, 32, 1], %12[0, 0, %20, %15] [1, 1, 32, 32] [4096, 32, 128, 1]) : (!amdaie.logicalobjectfifo<memref<1x1x32x32xi32, 1 : i32>>, !amdaie.logicalobjectfifo<memref<128x128xi32>>)
// [OLD] %46 = amdaie.dma_cpy_nd(%2[0, 0, 0, 0] [1, 1, 32, 32] [2048, 1024, 32, 1], %12[0, 0, %21, %16] [1, 1, 32, 32] [4096, 32, 128, 1]) : (!amdaie.logicalobjectfifo<memref<1x1x32x32xi32, 1 : i32>>, !amdaie.logicalobjectfifo<memref<128x128xi32>>)
// [OLD] %47 = amdaie.dma_cpy_nd(%3[0, 0, 0, 0] [1, 1, 32, 32] [2048, 1024, 32, 1], %12[0, 0, %22, %17] [1, 1, 32, 32] [4096, 32, 128, 1]) : (!amdaie.logicalobjectfifo<memref<1x1x32x32xi32, 1 : i32>>, !amdaie.logicalobjectfifo<memref<128x128xi32>>)
// [OLD] %48 = amdaie.dma_cpy_nd(%4[0, 0, 0, 0] [1, 1, 32, 32] [2048, 1024, 32, 1], %12[0, 0, %23, %18] [1, 1, 32, 32] [4096, 32, 128, 1]) : (!amdaie.logicalobjectfifo<memref<1x1x32x32xi32, 1 : i32>>, !amdaie.logicalobjectfifo<memref<128x128xi32>>)
%45_46 = amdaie.dma_cpy_nd(%1[0, 0, 0, 0] [1, 1, 32, 32] [2048, 1024, 32, 1], %12[0, 0, %20, 0] [1, 2, 32, 32] [4096, 32, 128, 1]) : (!amdaie.logicalobjectfifo<memref<1x2x32x32xi32, 1 : i32>>, !amdaie.logicalobjectfifo<memref<128x128xi32>>)
%47_48 = amdaie.dma_cpy_nd(%1[0, 0, 0, 0] [1, 1, 32, 32] [2048, 1024, 32, 1], %12[0, 0, %22, 0] [1, 2, 32, 32] [4096, 32, 128, 1]) : (!amdaie.logicalobjectfifo<memref<1x2x32x32xi32, 1 : i32>>, !amdaie.logicalobjectfifo<memref<128x128xi32>>)
%57 = amdaie.dma_cpy_nd(%28[0, 0, 0, 0, 0, 0] [1, 1, 8, 4, 8, 4] [1024, 1024, 128, 32, 4, 1], %5[0, 0, 0, 0, 0, 0] [1, 1, 8, 4, 8, 4] [2048, 1024, 4, 256, 32, 1]) : (!amdaie.logicalobjectfifo<memref<1x1x8x4x8x4xi32, 2 : i32>>, !amdaie.logicalobjectfifo<memref<1x2x32x32xi32, 1 : i32>>)
%58 = amdaie.dma_cpy_nd(%27[0, 0, 0, 0, 0, 0] [1, 1, 8, 4, 8, 4] [1024, 1024, 128, 32, 4, 1], %5[0, 1, 0, 0, 0, 0] [1, 1, 8, 4, 8, 4] [2048, 1024, 4, 256, 32, 1]) : (!amdaie.logicalobjectfifo<memref<1x1x8x4x8x4xi32, 2 : i32>>, !amdaie.logicalobjectfifo<memref<1x2x32x32xi32, 1 : i32>>)
%59 = amdaie.dma_cpy_nd(%30[0, 0, 0, 0, 0, 0] [1, 1, 4, 8, 4, 8] [1024, 1024, 256, 32, 8, 1], %6[0, 0, 0, 0, 0, 0] [1, 1, 4, 8, 4, 8] [1024, 1024, 8, 128, 32, 1]) : (!amdaie.logicalobjectfifo<memref<1x1x4x8x4x8xi32, 2 : i32>>, !amdaie.logicalobjectfifo<memref<2x1x32x32xi32, 1 : i32>>)
// [OLD] %60 = amdaie.dma_cpy_nd(%52[0, 0, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [1024, 1024, 128, 16, 4, 1], %1[0, 0, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [2048, 1024, 4, 128, 32, 1]) : (!amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>>, !amdaie.logicalobjectfifo<memref<1x1x32x32xi32, 1 : i32>>)
%60 = amdaie.dma_cpy_nd(%28[0, 0, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [1024, 1024, 128, 16, 4, 1], %5[0, 0, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [2048, 1024, 4, 128, 32, 1]) : (!amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>>, !amdaie.logicalobjectfifo<memref<1x2x32x32xi32, 1 : i32>>)
%61 = amdaie.dma_cpy_nd(%0[0, 0, 0, 0] [1, 1, 32, 32] [2048, 1024, 32, 1], %56[0, 0, 0, 0] [8, 4, 8, 4] [16, 4, 128, 1]) : (!amdaie.logicalobjectfifo<memref<2x2x32x32xi32, 1 : i32>>, !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>>)
// core(0, 2)
%62 = amdaie.core(%tile_15, in : [%59, %57, %60], out : [%61]) {
  %74 = amdaie.logicalobjectfifo.access(%30, Read) : !amdaie.logicalobjectfifo<memref<1x1x4x8x4x8xi32, 2 : i32>> -> memref<1x1x4x8x4x8xi32, 2 : i32>
  %75 = amdaie.logicalobjectfifo.access(%28, Read) : !amdaie.logicalobjectfifo<memref<1x1x8x4x8x4xi32, 2 : i32>> -> memref<1x1x8x4x8x4xi32, 2 : i32>
  %76 = amdaie.logicalobjectfifo.access(%34, None) : !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>> -> memref<1x1x8x8x4x4xi32, 2 : i32>
  linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d2, d5, d3, d6, d8)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d2, d1, d4, d5, d8, d7)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d1, d4, d3, d6, d7)>], iterator_types = ["parallel", "parallel", "reduction", "parallel", "parallel", "reduction", "parallel", "parallel", "reduction"]} ins(%74, %75 : memref<1x1x4x8x4x8xi32, 2 : i32>, memref<1x1x8x4x8x4xi32, 2 : i32>) outs(%76 : memref<1x1x8x8x4x4xi32, 2 : i32>) attrs =  {lowering_config = #iree_codegen.lowering_config<tile_sizes = [[64, 64], [0, 0, 1], [1, 1, 0, 0, 0, 0]]>, packing_config = #amdaie.packing_config<packing_config = [{packedSizes = [32, 32, 32], transposePackIndices = [1], unpackEmpty = [false], innerPerm = [[1, 0]], outerPerm = [[0, 1]]}, {packedSizes = [0, 0, 0, 4, 4, 8], transposePackIndices = [0, 1, 2], unpackEmpty = [false, false, true], innerPerm = [[0, 1], [1, 0], [0, 1]], outerPerm = [[0, 1, 3, 2], [0, 1, 3, 2], [0, 1, 3, 2]]}]>} {
  ^bb0(%in: i32, %in_16: i32, %out: i32):
    %79 = arith.muli %in, %in_16 : i32
    %80 = arith.addi %out, %79 : i32
    linalg.yield %80 : i32
  }
  // Operate on the first read from `%28` (broadcasted to this core and core(0, 3))
  %77 = amdaie.logicalobjectfifo.access(%28, Read) : !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>> -> memref<1x1x8x8x4x4xi32, 2 : i32>
  %78 = amdaie.logicalobjectfifo.access(%56, Write) : !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>> -> memref<1x1x8x8x4x4xi32, 2 : i32>
  linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>, affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>, affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>], iterator_types = ["parallel", "parallel", "parallel", "parallel", "parallel", "parallel"]} ins(%76, %77 : memref<1x1x8x8x4x4xi32, 2 : i32>, memref<1x1x8x8x4x4xi32, 2 : i32>) outs(%78 : memref<1x1x8x8x4x4xi32, 2 : i32>) {
  ^bb0(%in: i32, %in_16: i32, %out: i32):
    %79 = arith.addi %in, %in_16 : i32
    linalg.yield %79 : i32
  }
  // Perform another read of `%28` because the data is broadcasted and core(0, 3) will operate on it
  %77_new = amdaie.logicalobjectfifo.access(%28, Read) : !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>> -> memref<1x1x8x8x4x4xi32, 2 : i32>
  amdaie.end
}
// [OLD] %63 = amdaie.dma_cpy_nd(%50[0, 0, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [1024, 1024, 128, 16, 4, 1], %2[0, 0, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [2048, 1024, 4, 128, 32, 1]) : (!amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>>, !amdaie.logicalobjectfifo<memref<1x1x32x32xi32, 1 : i32>>)
%63 = amdaie.dma_cpy_nd(%27[0, 0, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [1024, 1024, 128, 16, 4, 1], %5[0, 1, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [2048, 1024, 4, 128, 32, 1]) : (!amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>>, !amdaie.logicalobjectfifo<memref<1x2x32x32xi32, 1 : i32>>)
%64 = amdaie.dma_cpy_nd(%0[0, 1, 0, 0] [1, 1, 32, 32] [2048, 1024, 32, 1], %54[0, 0, 0, 0] [8, 4, 8, 4] [16, 4, 128, 1]) : (!amdaie.logicalobjectfifo<memref<2x2x32x32xi32, 1 : i32>>, !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>>)
// core(1, 2)
%65 = amdaie.core(%tile_13, in : [%59, %58, %63], out : [%64]) {
  %74 = amdaie.logicalobjectfifo.access(%30, Read) : !amdaie.logicalobjectfifo<memref<1x1x4x8x4x8xi32, 2 : i32>> -> memref<1x1x4x8x4x8xi32, 2 : i32>
  %75 = amdaie.logicalobjectfifo.access(%27, Read) : !amdaie.logicalobjectfifo<memref<1x1x8x4x8x4xi32, 2 : i32>> -> memref<1x1x8x4x8x4xi32, 2 : i32>
  %76 = amdaie.logicalobjectfifo.access(%36, None) : !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>> -> memref<1x1x8x8x4x4xi32, 2 : i32>
  linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d2, d5, d3, d6, d8)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d2, d1, d4, d5, d8, d7)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d1, d4, d3, d6, d7)>], iterator_types = ["parallel", "parallel", "reduction", "parallel", "parallel", "reduction", "parallel", "parallel", "reduction"]} ins(%74, %75 : memref<1x1x4x8x4x8xi32, 2 : i32>, memref<1x1x8x4x8x4xi32, 2 : i32>) outs(%76 : memref<1x1x8x8x4x4xi32, 2 : i32>) attrs =  {lowering_config = #iree_codegen.lowering_config<tile_sizes = [[64, 64], [0, 0, 1], [1, 1, 0, 0, 0, 0]]>, packing_config = #amdaie.packing_config<packing_config = [{packedSizes = [32, 32, 32], transposePackIndices = [1], unpackEmpty = [false], innerPerm = [[1, 0]], outerPerm = [[0, 1]]}, {packedSizes = [0, 0, 0, 4, 4, 8], transposePackIndices = [0, 1, 2], unpackEmpty = [false, false, true], innerPerm = [[0, 1], [1, 0], [0, 1]], outerPerm = [[0, 1, 3, 2], [0, 1, 3, 2], [0, 1, 3, 2]]}]>} {
  ^bb0(%in: i32, %in_16: i32, %out: i32):
    %79 = arith.muli %in, %in_16 : i32
    %80 = arith.addi %out, %79 : i32
    linalg.yield %80 : i32
  }
  // Operate on the first read from `%27` (broadcasted to this core and core(1, 3))
  %77 = amdaie.logicalobjectfifo.access(%27, Read) : !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>> -> memref<1x1x8x8x4x4xi32, 2 : i32>
  %78 = amdaie.logicalobjectfifo.access(%54, Write) : !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>> -> memref<1x1x8x8x4x4xi32, 2 : i32>
  linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>, affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>, affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>], iterator_types = ["parallel", "parallel", "parallel", "parallel", "parallel", "parallel"]} ins(%76, %77 : memref<1x1x8x8x4x4xi32, 2 : i32>, memref<1x1x8x8x4x4xi32, 2 : i32>) outs(%78 : memref<1x1x8x8x4x4xi32, 2 : i32>) {
  ^bb0(%in: i32, %in_16: i32, %out: i32):
    %79 = arith.addi %in, %in_16 : i32
    linalg.yield %79 : i32
  }
  // Perform another read of `%27` because the data is broadcasted and core(1, 3) will operate on it
  %77_new = amdaie.logicalobjectfifo.access(%27, Read) : !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>> -> memref<1x1x8x8x4x4xi32, 2 : i32>
  amdaie.end
}
%66 = amdaie.dma_cpy_nd(%29[0, 0, 0, 0, 0, 0] [1, 1, 4, 8, 4, 8] [1024, 1024, 256, 32, 8, 1], %6[1, 0, 0, 0, 0, 0] [1, 1, 4, 8, 4, 8] [1024, 1024, 8, 128, 32, 1]) : (!amdaie.logicalobjectfifo<memref<1x1x4x8x4x8xi32, 2 : i32>>, !amdaie.logicalobjectfifo<memref<2x1x32x32xi32, 1 : i32>>)
// [OLD] %67 = amdaie.dma_cpy_nd(%51[0, 0, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [1024, 1024, 128, 16, 4, 1], %3[0, 0, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [2048, 1024, 4, 128, 32, 1]) : (!amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>>, !amdaie.logicalobjectfifo<memref<1x1x32x32xi32, 1 : i32>>)
%67 = amdaie.dma_cpy_nd(%28[0, 0, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [1024, 1024, 128, 16, 4, 1], %5[0, 1, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [2048, 1024, 4, 128, 32, 1]) : (!amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>>, !amdaie.logicalobjectfifo<memref<1x2x32x32xi32, 1 : i32>>)
%68 = amdaie.dma_cpy_nd(%0[1, 0, 0, 0] [1, 1, 32, 32] [2048, 1024, 32, 1], %55[0, 0, 0, 0] [8, 4, 8, 4] [16, 4, 128, 1]) : (!amdaie.logicalobjectfifo<memref<2x2x32x32xi32, 1 : i32>>, !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>>)
// core(0, 3)
%69 = amdaie.core(%tile_14, in : [%66, %57, %67], out : [%68]) {
  %74 = amdaie.logicalobjectfifo.access(%29, Read) : !amdaie.logicalobjectfifo<memref<1x1x4x8x4x8xi32, 2 : i32>> -> memref<1x1x4x8x4x8xi32, 2 : i32>
  %75 = amdaie.logicalobjectfifo.access(%28, Read) : !amdaie.logicalobjectfifo<memref<1x1x8x4x8x4xi32, 2 : i32>> -> memref<1x1x8x4x8x4xi32, 2 : i32>
  %76 = amdaie.logicalobjectfifo.access(%39, None) : !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>> -> memref<1x1x8x8x4x4xi32, 2 : i32>
  linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d2, d5, d3, d6, d8)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d2, d1, d4, d5, d8, d7)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d1, d4, d3, d6, d7)>], iterator_types = ["parallel", "parallel", "reduction", "parallel", "parallel", "reduction", "parallel", "parallel", "reduction"]} ins(%74, %75 : memref<1x1x4x8x4x8xi32, 2 : i32>, memref<1x1x8x4x8x4xi32, 2 : i32>) outs(%76 : memref<1x1x8x8x4x4xi32, 2 : i32>) attrs =  {lowering_config = #iree_codegen.lowering_config<tile_sizes = [[64, 64], [0, 0, 1], [1, 1, 0, 0, 0, 0]]>, packing_config = #amdaie.packing_config<packing_config = [{packedSizes = [32, 32, 32], transposePackIndices = [1], unpackEmpty = [false], innerPerm = [[1, 0]], outerPerm = [[0, 1]]}, {packedSizes = [0, 0, 0, 4, 4, 8], transposePackIndices = [0, 1, 2], unpackEmpty = [false, false, true], innerPerm = [[0, 1], [1, 0], [0, 1]], outerPerm = [[0, 1, 3, 2], [0, 1, 3, 2], [0, 1, 3, 2]]}]>} {
  ^bb0(%in: i32, %in_16: i32, %out: i32):
    %79 = arith.muli %in, %in_16 : i32
    %80 = arith.addi %out, %79 : i32
    linalg.yield %80 : i32
  }
  // Perform a first read of `%28` because the data is broadcasted and core(0, 2) will operate on it
  %77 = amdaie.logicalobjectfifo.access(%28, Read) : !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>> -> memref<1x1x8x8x4x4xi32, 2 : i32>
  // Perform another read from `%28` because this core will operate on the second read
  %77_new = amdaie.logicalobjectfifo.access(%28, Read) : !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>> -> memref<1x1x8x8x4x4xi32, 2 : i32>
  %78 = amdaie.logicalobjectfifo.access(%55, Write) : !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>> -> memref<1x1x8x8x4x4xi32, 2 : i32>
  linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>, affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>, affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>], iterator_types = ["parallel", "parallel", "parallel", "parallel", "parallel", "parallel"]} ins(%76, %77_new : memref<1x1x8x8x4x4xi32, 2 : i32>, memref<1x1x8x8x4x4xi32, 2 : i32>) outs(%78 : memref<1x1x8x8x4x4xi32, 2 : i32>) {
  ^bb0(%in: i32, %in_16: i32, %out: i32):
    %79 = arith.addi %in, %in_16 : i32
    linalg.yield %79 : i32
  }
  amdaie.end
}
// [OLD] %70 = amdaie.dma_cpy_nd(%49[0, 0, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [1024, 1024, 128, 16, 4, 1], %4[0, 0, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [2048, 1024, 4, 128, 32, 1]) : (!amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>>, !amdaie.logicalobjectfifo<memref<1x1x32x32xi32, 1 : i32>>)
%70 = amdaie.dma_cpy_nd(%27[0, 0, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [1024, 1024, 128, 16, 4, 1], %5[0, 1, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [2048, 1024, 4, 128, 32, 1]) : (!amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>>, !amdaie.logicalobjectfifo<memref<1x2x32x32xi32, 1 : i32>>)
%71 = amdaie.dma_cpy_nd(%0[1, 1, 0, 0] [1, 1, 32, 32] [2048, 1024, 32, 1], %53[0, 0, 0, 0] [8, 4, 8, 4] [16, 4, 128, 1]) : (!amdaie.logicalobjectfifo<memref<2x2x32x32xi32, 1 : i32>>, !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>>)
// core(1, 3)
%72 = amdaie.core(%tile_12, in : [%66, %58, %70], out : [%71]) {
  %74 = amdaie.logicalobjectfifo.access(%29, Read) : !amdaie.logicalobjectfifo<memref<1x1x4x8x4x8xi32, 2 : i32>> -> memref<1x1x4x8x4x8xi32, 2 : i32>
  %75 = amdaie.logicalobjectfifo.access(%27, Read) : !amdaie.logicalobjectfifo<memref<1x1x8x4x8x4xi32, 2 : i32>> -> memref<1x1x8x4x8x4xi32, 2 : i32>
  %76 = amdaie.logicalobjectfifo.access(%41, None) : !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>> -> memref<1x1x8x8x4x4xi32, 2 : i32>
  linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d2, d5, d3, d6, d8)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d2, d1, d4, d5, d8, d7)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d1, d4, d3, d6, d7)>], iterator_types = ["parallel", "parallel", "reduction", "parallel", "parallel", "reduction", "parallel", "parallel", "reduction"]} ins(%74, %75 : memref<1x1x4x8x4x8xi32, 2 : i32>, memref<1x1x8x4x8x4xi32, 2 : i32>) outs(%76 : memref<1x1x8x8x4x4xi32, 2 : i32>) attrs =  {lowering_config = #iree_codegen.lowering_config<tile_sizes = [[64, 64], [0, 0, 1], [1, 1, 0, 0, 0, 0]]>, packing_config = #amdaie.packing_config<packing_config = [{packedSizes = [32, 32, 32], transposePackIndices = [1], unpackEmpty = [false], innerPerm = [[1, 0]], outerPerm = [[0, 1]]}, {packedSizes = [0, 0, 0, 4, 4, 8], transposePackIndices = [0, 1, 2], unpackEmpty = [false, false, true], innerPerm = [[0, 1], [1, 0], [0, 1]], outerPerm = [[0, 1, 3, 2], [0, 1, 3, 2], [0, 1, 3, 2]]}]>} {
  ^bb0(%in: i32, %in_16: i32, %out: i32):
    %79 = arith.muli %in, %in_16 : i32
    %80 = arith.addi %out, %79 : i32
    linalg.yield %80 : i32
  }
  // Perform a first read of `%27` because the data is broadcasted and core(1, 2) will operate on it
  %77 = amdaie.logicalobjectfifo.access(%27, Read) : !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>> -> memref<1x1x8x8x4x4xi32, 2 : i32>
  // Perform another read from `%27` because this core will operate on the second read
  %77_new = amdaie.logicalobjectfifo.access(%27, Read) : !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>> -> memref<1x1x8x8x4x4xi32, 2 : i32>
  %78 = amdaie.logicalobjectfifo.access(%53, Write) : !amdaie.logicalobjectfifo<memref<1x1x8x8x4x4xi32, 2 : i32>> -> memref<1x1x8x8x4x4xi32, 2 : i32>
  linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>, affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>, affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>], iterator_types = ["parallel", "parallel", "parallel", "parallel", "parallel", "parallel"]} ins(%76, %77_new : memref<1x1x8x8x4x4xi32, 2 : i32>, memref<1x1x8x8x4x4xi32, 2 : i32>) outs(%78 : memref<1x1x8x8x4x4xi32, 2 : i32>) {
  ^bb0(%in: i32, %in_16: i32, %out: i32):
    %79 = arith.addi %in, %in_16 : i32
    linalg.yield %79 : i32
  }
  amdaie.end
}
%73 = amdaie.dma_cpy_nd(%14[%24, %19] [64, 64] [128, 1], %0[0, 0, 0, 0] [2, 32, 2, 32] [2048, 32, 1024, 1]) : (!amdaie.logicalobjectfifo<memref<128x128xi32>>, !amdaie.logicalobjectfifo<memref<2x2x32x32xi32, 1 : i32>>)
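
The access ordering in the cores above can be illustrated with a toy model. This is only a sketch, assuming a reused connection behaves like an in-order queue from which every consuming core must take the same number of tiles; `run_core` and the tile labels are hypothetical names for illustration and are not part of the amdaie dialect. It shows why core(0, 2) performs a trailing read of `%28` that it does not use, and why core(0, 3) performs a leading one.

```python
from collections import deque


def run_core(stream, use_indices, num_reads):
    """Pop `num_reads` tiles in order and keep only those at `use_indices`.

    Each pop stands in for one `logicalobjectfifo.access(..., Read)`.
    Returns the tiles the core computes on and the number of tiles left over.
    """
    fifo = deque(stream)
    used = []
    for i in range(num_reads):
        tile = fifo.popleft()
        if i in use_indices:
            used.append(tile)
    return used, len(fifo)


# Hypothetical labels for what arrives on the connection shared by
# core(0, 2) and core(0, 3): the matmul operand first, then one
# elementwise tile per core.
stream = ["matmul_operand", "elem_core_0_2", "elem_core_0_3"]

# core(0, 2): uses reads 0 and 1, then performs one extra (unused) read.
used_02, left_02 = run_core(stream, use_indices={0, 1}, num_reads=3)
# core(0, 3): uses read 0, skips read 1, operates on read 2.
used_03, left_03 = run_core(stream, use_indices={0, 2}, num_reads=3)
# Without the extra read, core(0, 2) would leave one tile unconsumed.
_, left_without_extra = run_core(stream, use_indices={0, 1}, num_reads=2)
```

In this model both cores end with an empty queue only when they perform the same number of reads, which matches the "same number of accesses in both cores" requirement stated in the issue.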

-- This commit introduces a new pass `--iree-amdaie-split-buffers` to split L2 buffers for dealing with Matmul+Elementwise.
-- It addresses sub-action 2 as well from #644

Signed-off-by: Abhishek Varma <abhvarma@amd.com>
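
As a rough illustration of what splitting a buffer means at the shape level, the sketch below enumerates the pieces obtained by cutting a buffer along its leading dimensions. It assumes the split is along the outermost dimensions, which is consistent with the `2x2x32x32` and `1x1x32x32` memref types in the listing in this issue, but it is not the pass implementation; `split_leading_dims` is a hypothetical helper.

```python
from itertools import product


def split_leading_dims(shape, num_dims):
    """Cut `shape` into unit-sized pieces along its first `num_dims` dims.

    Returns a list of (offsets, sub_shape) pairs, one per piece.
    """
    lead, rest = shape[:num_dims], tuple(shape[num_dims:])
    sub_shape = (1,) * num_dims + rest
    pieces = []
    for idx in product(*(range(d) for d in lead)):
        offsets = tuple(idx) + (0,) * len(rest)
        pieces.append((offsets, sub_shape))
    return pieces


# One 2x2x32x32 buffer becomes four 1x1x32x32 buffers.
pieces = split_leading_dims((2, 2, 32, 32), num_dims=2)
for offsets, sub_shape in pieces:
    print(offsets, sub_shape)
```

The offsets produced here correspond to the `[0, 0, 0, 0]`, `[0, 1, 0, 0]`, `[1, 0, 0, 0]` and `[1, 1, 0, 0]` offsets that the output DMAs use on `%0`.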

-- This commit introduces a new pass `--iree-amdaie-split-logical-objectfifos-for-connection-reuse` to split logical objectFifos for dealing with Matmul+Elementwise.
-- Also contains a utility to check whether splitting can be performed.
-- It addresses sub-action 2 as well from #644

Signed-off-by: Abhishek Varma <abhvarma@amd.com>

-- This commit adds a new pass `--iree-amdaie-logical-objectfifos-for-connection-reuse`.
-- It follows on from the splitting of logical objectFifos and is aimed at addressing point 4 of #644.

Signed-off-by: Abhishek Varma <abhvarma@amd.com>
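
To make the benefit of connection reuse concrete, the sketch below counts distinct (source, target) pairs among the L2-to-L1 input DMAs from the listing in this issue, before and after the `[OLD]` elementwise DMAs are redirected onto `%5`, `%28` and `%27`. Treating one distinct pair as one connection is a simplifying assumption made here for illustration; the actual routing resources used by the hardware are not modelled, and `count_connections` is a hypothetical helper.

```python
def count_connections(dmas):
    """Count distinct (source, target) logical objectFifo pairs."""
    return len(set(dmas))


# (source, target) pairs for the matmul operand DMAs %57, %58, %59, %66.
matmul_inputs = [("%5", "%28"), ("%5", "%27"), ("%6", "%30"), ("%6", "%29")]

# Elementwise DMAs %60, %63, %67, %70 as in the [OLD] lines.
old_elementwise = [("%1", "%52"), ("%2", "%50"), ("%3", "%51"), ("%4", "%49")]

# The same DMAs after they reuse the matmul operand connections.
new_elementwise = [("%5", "%28"), ("%5", "%27"), ("%5", "%28"), ("%5", "%27")]

before = count_connections(matmul_inputs + old_elementwise)
after = count_connections(matmul_inputs + new_elementwise)
print(before, after)
```

Under this simplified counting the input side drops from eight distinct pairs to four, which is the kind of reduction approach 3 in the issue description is after.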

jtuyls assigned yzhang93, jtuyls and Abhishek-Varma Aug 5, 2024

yzhang93 mentioned this issue Aug 6, 2024

[ObjectFifo] Add pass to flatten the logical objectFifo #638

Merged

Abhishek-Varma mentioned this issue Aug 6, 2024

[AMD-AIE][ObjectFifo] Address temporary L1 buffers for Matmul+Elem #647

Merged

yzhang93 added a commit that referenced this issue Aug 6, 2024

[ObjectFifo] Add pass to flatten the logical objectFifo (#638)

ab9fabd

This PR is part to achieve #644

Abhishek-Varma mentioned this issue Aug 9, 2024

[ObjectFifo] Create a new pass to split L2 buffers #659

Merged

Abhishek-Varma mentioned this issue Sep 9, 2024

[ObjectFifo] Combine Logical ObjectFifos for reuse #755

Open

Abhishek-Varma mentioned this issue Sep 10, 2024

[ObjectFifo] Add a pass to combine logical objFifos for connection reuse #760

Draft

