Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multi-kernel tests (plus a few others) don't work with driver date 20240819 #700

Open
makslevental opened this issue Aug 23, 2024 · 9 comments
Assignees
Labels
bug Something isn't working

Comments

@makslevental
Copy link
Collaborator

The Windows E2E PR #689 is blocked because in order to get Windows to work I had to use a particular commit of XRT (required by XRT-MCDM). At that commit, the current linux driver (date 20240819) fails some tests while the older driver (such as the one on the sharkbox phoenix nuc) passes.

Possibly there's some connection to https://github.com/nod-ai/iree-amd-aie/blob/main/runtime/src/iree-amd-aie/driver/xrt/native_executable.cc#L270, which I had to ifndef _WIN32 away on Windows (which @jtuyls added when battling numerics issues) but it's not clear since it's not only the multi-kernel tests that fail.

@MaheshRavishankar
Copy link
Collaborator

Could you call out specific tests... They might not all get resolved in one go.

@makslevental
Copy link
Collaborator Author

Could you call out specific tests... They might not all get resolved in one go.

  for name in [
      "two_matmul_switching",
      "matmul_f32_8_8_4",
      "matmul_f32_8_4_8",
  ]:
      aie_vs_llvm_cpu(config, test_files_dir / f"{name}.mlir")

  aie_vs_llvm_cpu(
      config,
      test_files_dir / "three_matmuls.mlir",
      function_name="three_$mm$",
  )

  # Test(s) of the form matmul(A,B) where A:MxK, B:KxN
  test_name = output_dir / "test_from_template.mlir"
  template_name = matmul_template_dir / "matmul_MxK_KxN.mlir"
  generate_matmul_test(test_name, template_name, 32, 32, 64, "bf16", "f32")
  aie_vs_llvm_cpu(config, test_name)

  # Test(s) of the form matmul(A,B) + C where A:MxK, B:KxN, C:N
  test_name = output_dir / "test_from_template_bias_N.mlir"
  template_name = matmul_template_dir / "matmul_bias_MxK_KxN_N.mlir"
  generate_matmul_test(
      test_name, template_name, 1024, 1024, 512, "bf16", "f32"
  )
  if config.vitis_dir:
      aie_vs_llvm_cpu(
          config, test_name, tile_pipeline="pack-peel", use_ukernel=True
      )
  aie_vs_llvm_cpu(
      config, test_name, tile_pipeline="pack-peel", use_ukernel=False
  )

@makslevental
Copy link
Collaborator Author

Note, I believe I did do the whole stratification thing (some on, some off) and they indeed all do fail independtly of each other. But I'll do that again to be sure.

@makslevental makslevental added the bug Something isn't working label Aug 23, 2024
@makslevental
Copy link
Collaborator Author

Btw, locally I have debug turned on for my driver (linux) and I see this for every test:

28320403: PID(68358): Allocated KMQ BO (userptr=0x7f3966c9d000, size=32768, flags=0x20000000, type=1, drm_bo=8)
28354699: PID(68358): Freeing KMQ BO, type=AMDXDNA_BO_SHMEM, drm_bo=8, size=32768
28357514: PID(68358): Unmap BO, aligned 0x7f3966c9d000 parent (nil)
29546943: PID(68358): Freeing KMQ BO, type=AMDXDNA_BO_SHMEM, drm_bo=4, size=8192
29553385: PID(68358): Unmap BO, aligned 0x7f3966cb1000 parent (nil)
29612528: PID(68358): Freeing KMQ BO, type=AMDXDNA_BO_SHMEM, drm_bo=5, size=16384
29617438: PID(68358): Unmap BO, aligned 0x7f3966cad000 parent (nil)
29662174: PID(68358): Freeing KMQ BO, type=AMDXDNA_BO_SHMEM, drm_bo=6, size=32768
29664648: PID(68358): Unmap BO, aligned 0x7f3966ca5000 parent (nil)
29709454: PID(68358): Destroying KMQ pcidev
29712640: PID(68358): Device node fd leaked!! fd=3 <------------------- ???

@makslevental
Copy link
Collaborator Author

On the new driver, for two_matmul_switching test:

EXEC @matmul_small
134363871: PID(65459): Allocated KMQ BO (userptr=0x7ff8204c7000, size=128, flags=0x20000000, type=1, drm_bo=8)
134405692: PID(65459): Allocated KMQ BO (userptr=0x7ff8204c6000, size=384, flags=0x20000000, type=1, drm_bo=9)
134479183: PID(65459): Allocated KMQ BO (userptr=0x7ff8204c5000, size=4096, flags=0x80000000, type=4, drm_bo=10)
134490635: PID(65459): Added arg BO 3 to cmd BO 10
134494753: PID(65459): Added arg BO 6 to cmd BO 10
134497568: PID(65459): Added arg BO 7 to cmd BO 10
134499712: PID(65459): Added arg BO 9 to cmd BO 10
134525592: PID(65459): Submitted command (1)
134529940: PID(65459): Waiting for cmd (1)...
134921743: PID(65459): Added arg BO 5 to cmd BO 10
134924488: PID(65459): Added arg BO 9 to cmd BO 10
134926372: PID(65459): Added arg BO 6 to cmd BO 10
134928245: PID(65459): Added arg BO 9 to cmd BO 10
terminate called after throwing an instance of 'xrt_core::system_error'
  what():  DRM_IOCTL_AMDXDNA_EXEC_CMD IOCTL failed (err=114): Operation already in progress

@makslevental
Copy link
Collaborator Author

makslevental commented Aug 24, 2024

Inside iree_hal_xrt_direct_command_buffer_dispatch:

  std::cerr << "before start\n";
  run.start();
  std::cerr << "after start\n";
  run.wait();
  std::cerr << "after wait\n";
  for (xrt::bo& bo : bos) bo.sync(XCL_BO_SYNC_BO_FROM_DEVICE);

prints

139451362: PID(67134): Waiting for cmd (1)...
139820560: PID(67134): Added arg BO 5 to cmd BO 10
139824186: PID(67134): Added arg BO 9 to cmd BO 10
139827483: PID(67134): Added arg BO 6 to cmd BO 10
139830078: PID(67134): Added arg BO 9 to cmd BO 10
before start
after start
after wait
before start
terminate called after throwing an instance of 'xrt_core::system_error'
  what():  DRM_IOCTL_AMDXDNA_EXEC_CMD IOCTL failed (err=114): Operation already in progress

So seems iree_hal_xrt_direct_command_buffer_dispatch is getting called twice when XRT already has both dispatches or commands buffers or whatever and doesn't except it to get called twice? Maybe @nirvedhmeshram has some insight.

@makslevental
Copy link
Collaborator Author

makslevental commented Aug 24, 2024

static bool runOnce;

static iree_status_t iree_hal_xrt_direct_command_buffer_dispatch(...) {
  ...
  if (runOnce)
    return iree_ok_status();

  run.start();
  run.wait();
  runOnce = true;

  IREE_TRACE_ZONE_END(z0);
  return iree_ok_status();
}

gives

EXEC @matmul_small
22151753: PID(69421): Allocated KMQ BO (userptr=0x7fb06cfe8000, size=128, flags=0x20000000, type=1, drm_bo=8)
22190698: PID(69421): Allocated KMQ BO (userptr=0x7fb06cfe7000, size=384, flags=0x20000000, type=1, drm_bo=9)
22263948: PID(69421): Allocated KMQ BO (userptr=0x7fb06cfe6000, size=4096, flags=0x80000000, type=4, drm_bo=10)
22272795: PID(69421): Added arg BO 3 to cmd BO 10
22274789: PID(69421): Added arg BO 6 to cmd BO 10
22275730: PID(69421): Added arg BO 7 to cmd BO 10
22276542: PID(69421): Added arg BO 9 to cmd BO 10
22308413: PID(69421): Submitted command (1)
22312300: PID(69421): Waiting for cmd (1)...
22735479: PID(69421): Added arg BO 5 to cmd BO 10
22736411: PID(69421): Added arg BO 9 to cmd BO 10
22736962: PID(69421): Added arg BO 6 to cmd BO 10
22737594: PID(69421): Added arg BO 9 to cmd BO 10
22738686: PID(69421): Added arg BO 3 to cmd BO 10
22739307: PID(69421): Added arg BO 9 to cmd BO 10
22739868: PID(69421): Added arg BO 7 to cmd BO 10
22740449: PID(69421): Added arg BO 9 to cmd BO 10
22741311: PID(69421): Added arg BO 5 to cmd BO 10
22741872: PID(69421): Added arg BO 9 to cmd BO 10
22742423: PID(69421): Added arg BO 6 to cmd BO 10
22742974: PID(69421): Added arg BO 9 to cmd BO 10
22743966: PID(69421): Added arg BO 3 to cmd BO 10
22744547: PID(69421): Added arg BO 9 to cmd BO 10
22745088: PID(69421): Added arg BO 7 to cmd BO 10
22745629: PID(69421): Added arg BO 9 to cmd BO 10
22746450: PID(69421): Added arg BO 5 to cmd BO 10
22746961: PID(69421): Added arg BO 9 to cmd BO 10
22747502: PID(69421): Added arg BO 6 to cmd BO 10
22748074: PID(69421): Added arg BO 9 to cmd BO 10
22748885: PID(69421): Added arg BO 3 to cmd BO 10
22749396: PID(69421): Added arg BO 9 to cmd BO 10
22749917: PID(69421): Added arg BO 7 to cmd BO 10
22750448: PID(69421): Added arg BO 9 to cmd BO 10
22751280: PID(69421): Added arg BO 5 to cmd BO 10
22751811: PID(69421): Added arg BO 9 to cmd BO 10
22752332: PID(69421): Added arg BO 6 to cmd BO 10
22752883: PID(69421): Added arg BO 8 to cmd BO 10
22759014: PID(69421): Freeing KMQ BO, type=AMDXDNA_BO_SHMEM, drm_bo=9, size=384
22759606: PID(69421): Unmap BO, aligned 0x7fb06cfe7000 parent (nil)
22794923: PID(69421): Freeing KMQ BO, type=AMDXDNA_BO_SHMEM, drm_bo=6, size=128
22795354: PID(69421): Unmap BO, aligned 0x7fb06d023000 parent (nil)
22815743: PID(69421): Freeing KMQ BO, type=AMDXDNA_BO_SHMEM, drm_bo=7, size=128
22816094: PID(69421): Unmap BO, aligned 0x7fb06cfe9000 parent (nil)
22875086: PID(69421): Allocated KMQ BO (userptr=0x7fb06d023000, size=128, flags=0x20000000, type=1, drm_bo=6)
22882891: PID(69421): Freeing KMQ BO, type=AMDXDNA_BO_SHMEM, drm_bo=8, size=128
22883292: PID(69421): Unmap BO, aligned 0x7fb06cfe8000 parent (nil)
23083305: PID(69421): Freeing KMQ BO, type=AMDXDNA_BO_SHMEM, drm_bo=6, size=128
23084066: PID(69421): Unmap BO, aligned 0x7fb06d023000 parent (nil)
23119464: PID(69421): Destroying KMQ pcidev
23120456: PID(69421): Device node fd leaked!! fd=3

Time spent in running the model: 43.0 [ms]
Running 'compare' with rtol=1e-06 and atol=1e-06
Values are not all close. Here is a summary of the differences:
Number of positions where values are different is 32 out of 32
Maximum difference: 3109041995776.0
Discrepancies: 
At index: 0 0 
AIE value: 0.0
CPU value: 2693080547328.0
At index: 0 1 
AIE value: 0.0
CPU value: 3109041995776.0
At index: 0 2 
AIE value: 0.0
CPU value: 2508072943616.0
At index: 0 3 
AIE value: 0.0
CPU value: 2455742185472.0
At index: 1 0 
AIE value: 0.0
CPU value: 2198771990528.0
At index: 1 1 
AIE value: 0.0
CPU value: 2538383081472.0
At index: 1 2 
AIE value: 0.0
CPU value: 2047721865216.0
At index: 1 3 
AIE value: 0.0
CPU value: 2004997898240.0
At index: 2 0 
AIE value: 0.0
CPU value: 2410728390656.0
At index: 2 1 
AIE value: 0.0
CPU value: 2783077990400.0
And 22 more discrepancies...

So no crash but not numerically correct.

@makslevental
Copy link
Collaborator Author

I guess their implementation of multiple kernels has changed (I vaguely remember Sonal discussing this during our on-site).

@nirvedhmeshram
Copy link
Contributor

So seems iree_hal_xrt_direct_command_buffer_dispatch is getting called twice when XRT already has both dispatches or commands buffers or whatever and doesn't except it to get called twice? Maybe @nirvedhmeshram has some insight.

For the multi-kernel test this is expected for the way it is implemented, if you have N calls in you test to M dispatches, you will see N iree_hal_xrt_direct_command_buffer_dispatch

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants