[BUG] cuda async pool cannot be initialized with `access_flags::none` #1753

abellina · 2024-12-09T18:44:05Z

Describe the bug
We added support for shareable fabric handles recently in cuda_async_memory_resource. The enum, access_flags defines a value of none and read_write for memory protection. It turns out that calling cudaMemPoolSetAccess on the pool with none causes the error: cudaErrorInvalidDevice invalid device ordinal.

Another PR I have in cuDF repros this issue (rapidsai/cudf#17526). I am filing this bug to make sure folks are aware that passing anything other than access_flags::read_write in the second argument of cuda_async_memory_resource's constructor is not valid.

So to summarize, here is the state of things:

cuda_async_memory_resource(initial_size), and cuda_async_memory_resource(initial_size, release_threshold) work as before, with export handles of type 0x0 (none), so not shareable, and we don't call cudaMemPoolSetAccess so this is OK.
cuda_async_memory_resource(initial_size, release_threshold, export_handle_type) without specifying the access flag, works.
cuda_async_memory_resource(initial_size, release_threshold, export_handle_type, access_flags::none) doesn't work. But cuda_async_memory_resource(initial_size, release_threshold, export_handle_type, access_flags::read_write) is OK

I am trying to confirm what the expected usage is here. If you are using fabric/read_write, that is working (and that is my use case), but I'd like to clean this up so that we don't produce cryptic errors otherwise. I am leaning towards removing the none option from our enum, and leaving it as a single value enum with read_write as the only element.

Steps/Code to reproduce bug
Create a pool with:

cuda_async_mr mr{
  pool_init_size, 
  pool_release_threshold, 
  // I tested none, fabric, and posix_file_descriptor, all produce the same error.
  mr::cuda_async_memory_resource::allocation_handle_type::none,
  mr::cuda_async_memory_resource::access_flags::none};

Expected behavior
access_flags::none should either behave as a no-op, or it should not be allowed.

The text was updated successfully, but these errors were encountered:

bdice · 2024-12-09T18:57:42Z

Can you set the value to read_write and then set it to none?

abellina · 2024-12-09T20:40:40Z

Can you set the value to read_write and then set it to none?

This is not disallowed, after discussing with the async allocator team. I also confirmed using tests by creating the pool using the constructor, and setting access to read_write, and then calling cudaMemPoolSetAccess with none. This creates the same ordinal error.

I also learned a bit more about how cudaMemPoolSetAccess is meant to be used. You should call this from peers with the pool's handle (which libraries such as UCX export/serialize/and transmit). Once the driver in the peer deserializes and imports the handle you can ask it to open access in read+write. You can also set it to none from that peer device later. You cannot set access to none from the default (owning) device.

I tested that I do not need to call cudaMemPoolSetAccess at all from the owning device (RMMs domain). So I am going to remove it from the code. Sorry for the back and forth on this. I am first going to PR in cuDF a change to remove usage of that optional parameter (that way we don't break cuDF) and I'll put up an RMM pr concurrently, to be merged after the cuDF pr goes in.

abellina added ? - Needs Triage Need team to review and classify bug Something isn't working labels Dec 9, 2024

github-project-automation bot added this to RMM Project Board Dec 9, 2024

github-project-automation bot moved this to Todo in RMM Project Board Dec 9, 2024

This was referenced Dec 9, 2024

Adds fabric handle and memory protection flags to cuda_async_memory_resource #1743

Merged

[JNI] Enables fabric handles for CUDA async memory pools rapidsai/cudf#17526

Merged

abellina mentioned this issue Dec 9, 2024

Remove memory access flags from cuda_async_memory_resource #1754

Merged

3 tasks

rapids-bot bot closed this as completed in #1754 Dec 9, 2024

rapids-bot bot closed this as completed in 8d41610 Dec 9, 2024

github-project-automation bot moved this from Todo to Done in RMM Project Board Dec 9, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] cuda async pool cannot be initialized with `access_flags::none` #1753

[BUG] cuda async pool cannot be initialized with `access_flags::none` #1753

abellina commented Dec 9, 2024

bdice commented Dec 9, 2024

abellina commented Dec 9, 2024

[BUG] cuda async pool cannot be initialized with access_flags::none #1753

[BUG] cuda async pool cannot be initialized with access_flags::none #1753

Comments

abellina commented Dec 9, 2024

bdice commented Dec 9, 2024

abellina commented Dec 9, 2024

[BUG] cuda async pool cannot be initialized with `access_flags::none` #1753

[BUG] cuda async pool cannot be initialized with `access_flags::none` #1753