
[BUG] cuda async pool cannot be initialized with access_flags::none #1753

Closed
abellina opened this issue Dec 9, 2024 · 2 comments · Fixed by #1754
Labels: ? - Needs Triage, bug

Comments

@abellina (Contributor) commented Dec 9, 2024

Describe the bug
We recently added support for shareable fabric handles to cuda_async_memory_resource. The access_flags enum defines two values, none and read_write, for memory protection. It turns out that calling cudaMemPoolSetAccess on the pool with none fails with the error cudaErrorInvalidDevice: invalid device ordinal.
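
For context, the failing step reduces to the following CUDA runtime calls (a sketch; pool stands for the resource's underlying cudaMemPool_t on the owning device, here device 0):

// Requesting "no access" for the pool's own (owning) device fails.
cudaMemAccessDesc desc{};
desc.location = {cudaMemLocationTypeDevice, /*owning device*/ 0};
desc.flags    = cudaMemAccessFlagsProtNone;  // presumably what access_flags::none maps to
cudaError_t err = cudaMemPoolSetAccess(pool, &desc, 1);
// err == cudaErrorInvalidDevice ("invalid device ordinal")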

Another PR I have in cuDF reproduces this issue (rapidsai/cudf#17526). I am filing this bug to make sure folks are aware that passing anything other than access_flags::read_write for the access_flags argument of cuda_async_memory_resource's constructor is not valid.

So to summarize, here is the state of things:

  1. cuda_async_memory_resource(initial_size) and cuda_async_memory_resource(initial_size, release_threshold) work as before: the export handle type is 0x0 (none), so the pool is not shareable, and we never call cudaMemPoolSetAccess. This is OK.
  2. cuda_async_memory_resource(initial_size, release_threshold, export_handle_type) without the access flag also works.
  3. cuda_async_memory_resource(initial_size, release_threshold, export_handle_type, access_flags::none) does not work, but cuda_async_memory_resource(initial_size, release_threshold, export_handle_type, access_flags::read_write) is OK (see the sketch below).
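
Concretely, the three cases above look roughly like this (a sketch based on the constructor forms listed in this issue; the sizes are placeholders):

#include <rmm/mr/device/cuda_async_memory_resource.hpp>

using async_mr = rmm::mr::cuda_async_memory_resource;
std::size_t pool_init_size{1u << 20};          // placeholder sizes
std::size_t pool_release_threshold{1u << 30};

// Cases 1 and 2: fine.
async_mr a{pool_init_size};
async_mr b{pool_init_size, pool_release_threshold};
async_mr c{pool_init_size, pool_release_threshold,
           async_mr::allocation_handle_type::fabric};

// Case 3: read_write is OK, none errors out with cudaErrorInvalidDevice.
async_mr ok{pool_init_size, pool_release_threshold,
            async_mr::allocation_handle_type::fabric,
            async_mr::access_flags::read_write};
async_mr bad{pool_init_size, pool_release_threshold,
             async_mr::allocation_handle_type::fabric,
             async_mr::access_flags::none};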

I am trying to confirm what the expected usage is here. If you are using fabric with read_write, that works (and that is my use case), but I'd like to clean this up so that we don't produce cryptic errors otherwise. I am leaning towards removing the none option from the enum, leaving it as a single-value enum with read_write as its only element.

Steps/Code to reproduce bug
Create a pool with:

cuda_async_mr mr{
  pool_init_size, 
  pool_release_threshold, 
  // I tested none, fabric, and posix_file_descriptor, all produce the same error.
  mr::cuda_async_memory_resource::allocation_handle_type::none,
  mr::cuda_async_memory_resource::access_flags::none};

Expected behavior
access_flags::none should either behave as a no-op, or it should not be allowed.
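
If the no-op route were chosen, the constructor could simply skip the driver call when none is requested. A hypothetical sketch (access_flag_, pool_, and device_id_ are illustrative names, not the actual RMM members):

// Hypothetical: only touch cudaMemPoolSetAccess when read_write was requested.
if (access_flag_ == access_flags::read_write) {
  cudaMemAccessDesc desc{};
  desc.location = {cudaMemLocationTypeDevice, device_id_};
  desc.flags    = cudaMemAccessFlagsProtReadWrite;
  cudaMemPoolSetAccess(pool_, &desc, 1);
}
// access_flags::none: do nothing and keep the pool's default access.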

@bdice (Contributor) commented Dec 9, 2024

Can you set the value to read_write and then set it to none?

@abellina (Contributor, Author) commented Dec 9, 2024

> Can you set the value to read_write and then set it to none?

This is not allowed, per discussion with the async allocator team. I also confirmed it with tests: creating the pool with the constructor, setting access to read_write, and then calling cudaMemPoolSetAccess with none produces the same ordinal error.

I also learned a bit more about how cudaMemPoolSetAccess is meant to be used. You should call it from peer devices with the pool's handle (which libraries such as UCX export, serialize, and transmit). Once the driver in the peer process deserializes and imports the handle, you can ask it to open access in read+write; you can also set access to none from that peer device later. You cannot set access to none from the default (owning) device.
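
Roughly, the intended flow looks like this (a sketch assuming fabric handles and the CUDA runtime API; error checking and the transport of the exported handle, e.g. over UCX, are omitted):

#include <cuda_runtime.h>

// --- Owning process: create a shareable pool and export its handle. ---
cudaMemPool_t pool{};
cudaMemPoolProps props{};
props.allocType   = cudaMemAllocationTypePinned;
props.handleTypes = cudaMemHandleTypeFabric;
props.location    = {cudaMemLocationTypeDevice, /*device*/ 0};
cudaMemPoolCreate(&pool, &props);

cudaMemFabricHandle_t handle{};  // assumed runtime-API fabric handle type
cudaMemPoolExportToShareableHandle(&handle, pool, cudaMemHandleTypeFabric, 0);
// ... serialize `handle` and send it to the peer process (e.g. via UCX) ...

// --- Peer process: import the handle, then open read+write access. ---
cudaMemPool_t imported{};
cudaMemPoolImportFromShareableHandle(&imported, &handle,
                                     cudaMemHandleTypeFabric, 0);

cudaMemAccessDesc desc{};
desc.location = {cudaMemLocationTypeDevice, /*peer device*/ 0};
desc.flags    = cudaMemAccessFlagsProtReadWrite;
cudaMemPoolSetAccess(imported, &desc, 1);

// Later, the peer (and only the peer, not the owning device) may revoke:
// desc.flags = cudaMemAccessFlagsProtNone; cudaMemPoolSetAccess(imported, &desc, 1);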

I verified that cudaMemPoolSetAccess does not need to be called at all from the owning device (RMM's domain), so I am going to remove it from the code. Sorry for the back and forth on this. I will first open a cuDF PR that removes usage of the optional parameter (so we don't break cuDF), and I'll put up an RMM PR concurrently, to be merged after the cuDF PR goes in.
