
[Convolution] Packing + objectFifo, initial support #789

Merged

merged 5 commits into nod-ai:main on Oct 9, 2024

Conversation

newling
Contributor

@newling newling commented Sep 19, 2024

This PR switches all numerical convolution tests to use the objectFifo pipeline. With respect to the new tiling strategy:

  1. A single column is currently used. Targeting multiple columns results in `error: 'aie.memtile_dma' op could not find and assign a valid BD id`. This will be investigated as follow-up work: Multicore convolution #821

  2. There is no longer any interleaving of compute and L2->L1 data movement, which means that Failure running conv2d i32 with objectFifo #619 becomes low priority / obsolete

  3. L3->L2 and L2->L3 still use padding, but L2->L1 and L1->L2 use packing.

  4. Channel-first convolution is completely unsupported; we expect high-level transforms to convert to channel-last before reaching our backend.

  5. Vectorization is not currently enabled, due to issues with alignment. See follow-up task: Numerics issue with vectorized conv2d #820. This is functionally OK for now, as peano can scalarize code for all data types.
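Point 4 assumes that high-level transforms convert channel-first (NCHW) layouts to channel-last (NHWC) before the backend sees them. That layout permutation can be sketched in plain Python (a hypothetical illustration only, not code from this repo):

```python
# Hypothetical sketch of NCHW -> NHWC layout conversion, using nested
# lists so it runs with the standard library alone.

def nchw_to_nhwc(t):
    """Permute a 4-D nested list from [N][C][H][W] to [N][H][W][C]."""
    n, c = len(t), len(t[0])
    h, w = len(t[0][0]), len(t[0][0][0])
    return [[[[t[i][k][j][l] for k in range(c)]  # channels become innermost
              for l in range(w)]
             for j in range(h)]
            for i in range(n)]

# A 1x2x2x3 tensor: two channels, each a 2x3 image.
x = [[[[1, 2, 3], [4, 5, 6]],
      [[7, 8, 9], [10, 11, 12]]]]
y = nchw_to_nhwc(x)  # shape 1x2x3x2; y[0][0][0] pairs the two channels
```

The same index permutation is what a `linalg.transpose` (or an equivalent pack) would express at the IR level.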

@newling newling changed the title [WIP][Convolution] Packing + objectFifo [Convolution] Packing + objectFifo, initial support Sep 19, 2024
@yzhang93
Contributor

yzhang93 commented Sep 20, 2024

I'll have a thorough review tomorrow. Just a high-level question: does this new strategy also work with the AIR pipeline?

Thanks. No, it doesn't get through compilation with the AIR pipeline.

@newling
Contributor Author

newling commented Sep 23, 2024

@erwei-xilinx @yzhang93 I am making changes to the tiling strategy today, so please don't review this in detail yet. I will post IR once I have channel-last vectorizing correctly.

@newling newling changed the title [Convolution] Packing + objectFifo, initial support [WIP][Convolution] Packing + objectFifo, initial support Sep 23, 2024
@newling
Contributor Author

newling commented Sep 26, 2024

Reporting the current status of the checked-in code, mostly to keep track for myself.

The vectorization looks good and compilation to vmfb is successful: vector.contract lowers to aievec.matmul, with no discontiguous transfer_reads or transfer_writes. Good.

But the results are numerically incorrect. The values are all quite close to the expected ones (no zeros).

If I make the input tensor all 1s (with the kernel still containing random values), the results are numerically correct.

If I comment out the AMDAIEVectorization pass, the results are numerically correct (with both the input tensor and kernel containing random values).

The AIEVec passes look like they're doing the correct thing; I can't see any problem with aievec.matmul, or with the lowering to LLVM.

So what's going on? Is this a problem with peano? What experiments can I run to test this hypothesis?

Or is the problem somewhere else, and it's just interacting badly with AMDAIEVectorization?
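One concrete experiment is to compare the kernel's output against a scalar reference computed on the host. A minimal sketch in plain Python (the 4x8-by-8x4 tile shapes match the inner-loop contraction in this PR; everything else is a hypothetical illustration, not code from the repo):

```python
# Hypothetical host-side cross-check: perform the same accumulation the
# kernel's inner contraction does (C += A @ B), then compare element-wise
# against the device output to localize a miscompile.

def matmul_acc(a, b, c):
    """c += a @ b for nested-list matrices; returns c."""
    m, k, n = len(a), len(a[0]), len(b[0])
    for i in range(m):
        for j in range(n):
            acc = c[i][j]
            for p in range(k):
                acc += a[i][p] * b[p][j]
            c[i][j] = acc
    return c

# 4x8 times 8x4 accumulating into 4x4, as in the vector.contract tiles.
a = [[1.0] * 8 for _ in range(4)]
b = [[2.0] * 4 for _ in range(8)]
c = [[0.0] * 4 for _ in range(4)]
matmul_acc(a, b, c)  # each entry sums 1.0 * 2.0 over 8 terms -> 16.0
```

Running the same comparison with random inputs (rather than constants) is what distinguishes a layout/indexing bug from a precision effect.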

@makslevental
Collaborator

> So what's going on? Is this a problem with peano? What experiments can I run to test this hypothesis?

It's possible but I feel like it's unlikely - they run a large test suite (somewhere) nightly which (I'm assuming) checks NA.

@newling
Contributor Author

newling commented Sep 26, 2024

> So what's going on? Is this a problem with peano? What experiments can I run to test this hypothesis?

> It's possible but I feel like it's unlikely - they run a large test suite (somewhere) nightly which (I'm assuming) checks NA.

OK. The difference does appear to be at vectorization or below. Replace this:

```mlir
scf.for %arg4 = %c0 to %c3 step %c1 {
  scf.for %arg5 = %c0 to %c3 step %c1 {
    %collapse_shape = memref.collapse_shape %reinterpret_cast [[0, 1, 2, 3], [4]] : memref<1x1x4x1x4xf32, 2 : i32> into memref<4x4xf32, 2 : i32>
    %25 = vector.transfer_read %reinterpret_cast_8[%c0, %arg4, %c0, %arg5, %c0], %cst {in_bounds = [true, true]} : memref<1x3x1x6x8xbf16, 2 : i32>, vector<4x8xbf16>
    %26 = vector.transfer_read %reinterpret_cast_9[%arg4, %arg5, %c0, %c0, %c0, %c0], %cst {in_bounds = [true, true]} : memref<3x3x1x1x8x4xbf16, 2 : i32>, vector<8x4xbf16>
    %27 = vector.transfer_read %collapse_shape[%c0, %c0], %cst_0 {in_bounds = [true, true]} : memref<4x4xf32, 2 : i32>, vector<4x4xf32>
    %28 = arith.extf %25 : vector<4x8xbf16> to vector<4x8xf32>
    %29 = arith.extf %26 : vector<8x4xbf16> to vector<8x4xf32>
    %30 = vector.contract {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d2, d1)>, affine_map<(d0, d1, d2) -> (d0, d1)>], iterator_types = ["parallel", "parallel", "reduction"], kind = #vector.kind<add>} %28, %29, %27 : vector<4x8xf32>, vector<8x4xf32> into vector<4x4xf32>
    vector.transfer_write %30, %collapse_shape[%c0, %c0] {in_bounds = [true, true]} : vector<4x4xf32>, memref<4x4xf32, 2 : i32>
  }
}
```

with this:

```mlir
scf.for %arg4 = %c0 to %c3 step %c1 {
  scf.for %arg5 = %c0 to %c3 step %c1 {
    %subview = memref.subview %reinterpret_cast_7[0, %arg4, 0, %arg5, 0] [1, 1, 1, 4, 8] [1, 1, 1, 1, 1] : memref<1x3x1x6x8xbf16, 2 : i32> to memref<1x1x1x4x8xbf16, strided<[144, 48, 48, 8, 1], offset: ?>, 2 : i32>
    %collapse_shape = memref.collapse_shape %subview [[0, 1, 2, 3], [4]] : memref<1x1x1x4x8xbf16, strided<[144, 48, 48, 8, 1], offset: ?>, 2 : i32> into memref<4x8xbf16, strided<[8, 1], offset: ?>, 2 : i32>
    %subview_9 = memref.subview %reinterpret_cast_8[%arg4, %arg5, 0, 0, 0, 0] [1, 1, 1, 1, 8, 4] [1, 1, 1, 1, 1, 1] : memref<3x3x1x1x8x4xbf16, 2 : i32> to memref<1x1x1x1x8x4xbf16, strided<[96, 32, 32, 32, 4, 1], offset: ?>, 2 : i32>
    %collapse_shape_10 = memref.collapse_shape %subview_9 [[0, 1, 2, 3, 4], [5]] : memref<1x1x1x1x8x4xbf16, strided<[96, 32, 32, 32, 4, 1], offset: ?>, 2 : i32> into memref<8x4xbf16, strided<[4, 1], offset: ?>, 2 : i32>
    %collapse_shape_11 = memref.collapse_shape %reinterpret_cast [[0, 1, 2, 3], [4]] : memref<1x1x4x1x4xf32, 2 : i32> into memref<4x4xf32, 2 : i32>
    linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d2, d1)>, affine_map<(d0, d1, d2) -> (d0, d1)>], iterator_types = ["parallel", "parallel", "reduction"]} ins(%collapse_shape, %collapse_shape_10 : memref<4x8xbf16, strided<[8, 1], offset: ?>, 2 : i32>, memref<8x4xbf16, strided<[4, 1], offset: ?>, 2 : i32>) outs(%collapse_shape_11 : memref<4x4xf32, 2 : i32>) {
    ^bb0(%in: bf16, %in_12: bf16, %out: f32):
      %25 = arith.extf %in : bf16 to f32
      %26 = arith.extf %in_12 : bf16 to f32
      %27 = arith.mulf %25, %26 : f32
      %28 = arith.addf %out, %27 : f32
      linalg.yield %28 : f32
    }
  }
}
```

And the numerics are fixed. Maybe it's time for me to try the simulator you mentioned @makslevental.
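The earlier all-ones observation fits this picture: values like 1.0 are exactly representable in bf16, so rounding differences vanish with constant inputs. As a hypothesis-testing aid for separating precision effects from genuine miscompiles, bf16 rounding can be emulated on the host; a minimal truncation-based sketch in Python (not code from this repo, and real hardware may round-to-nearest rather than truncate):

```python
import struct

def to_bf16(x):
    """Round a float to bfloat16 precision by truncating the low 16 bits
    of its float32 encoding (bf16 keeps the sign, the 8 exponent bits,
    and the top 7 mantissa bits of float32)."""
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    return struct.unpack('<f', struct.pack('<I', bits & 0xFFFF0000))[0]

# 1.0 survives unchanged, so an all-ones input accumulates no rounding
# error; random-looking values lose mantissa bits.
assert to_bf16(1.0) == 1.0
assert to_bf16(1.2345) != 1.2345
```

Feeding `to_bf16`-rounded inputs to a host-side scalar reference gives an expected answer that accounts for input quantization; any remaining mismatch against the device points at the lowering rather than at bf16 itself.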

@yzhang93
Contributor

@newling Can we have a full review of the current state of convolution ops in the next AIE sync? CC @MaheshRavishankar

@makslevental
Collaborator

> And the numerics are fixed. Maybe it's time for me to try the simulator you mentioned @makslevental.

I made progress on this yesterday but I'm still blocked by some issue about MEMTAB in the ELF file. But I'm hoping to be able to land the PR (along with the new test suite) this week before I go back to HAL stuff next week.

@newling
Contributor Author

newling commented Sep 26, 2024

> And the numerics are fixed. Maybe it's time for me to try the simulator you mentioned @makslevental.

> I made progress on this yesterday but I'm still blocked by some issue about MEMTAB in the ELF file. But I'm hoping to be able to land the PR (along with the new test suite) this week before I go back to HAL stuff next week.

Ok, thanks, I'll keep an eye out for that.

@newling newling changed the title [WIP][Convolution] Packing + objectFifo, initial support [Convolution] Packing + objectFifo, initial support Oct 3, 2024
@newling newling requested a review from yzhang93 October 3, 2024 15:27
@newling
Contributor Author

newling commented Oct 4, 2024

@yzhang93 this is ready for review again if you'd like to take a look

@newling newling force-pushed the packing_for_convolution branch 2 times, most recently from 2bed2df to b4a86cf on October 7, 2024 at 17:55
@newling newling merged commit c84cca0 into nod-ai:main Oct 9, 2024
6 checks passed