You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Describe the bug
We added support for shareable fabric handles recently in cuda_async_memory_resource. The enum, access_flags defines a value of none and read_write for memory protection. It turns out that calling cudaMemPoolSetAccess on the pool with none causes the error: cudaErrorInvalidDevice invalid device ordinal.
Another PR I have in cuDF repros this issue (rapidsai/cudf#17526). I am filing this bug to make sure folks are aware that passing anything other than access_flags::read_write in the second argument of cuda_async_memory_resource's constructor is not valid.
So to summarize, here is the state of things:
cuda_async_memory_resource(initial_size), and cuda_async_memory_resource(initial_size, release_threshold) work as before, with export handles of type 0x0 (none), so not shareable, and we don't call cudaMemPoolSetAccess so this is OK.
cuda_async_memory_resource(initial_size, release_threshold, export_handle_type) without specifying the access flag, works.
cuda_async_memory_resource(initial_size, release_threshold, export_handle_type, access_flags::none)doesn't work. But cuda_async_memory_resource(initial_size, release_threshold, export_handle_type, access_flags::read_write) is OK
I am trying to confirm what the expected usage is here. If you are using fabric/read_write, that is working (and that is my use case), but I'd like to clean this up so that we don't produce cryptic errors otherwise. I am leaning towards removing the none option from our enum, and leaving it as a single value enum with read_write as the only element.
Steps/Code to reproduce bug
Create a pool with:
cuda_async_mr mr{
pool_init_size,
pool_release_threshold,
// I tested none, fabric, and posix_file_descriptor, all produce the same error.
mr::cuda_async_memory_resource::allocation_handle_type::none,
mr::cuda_async_memory_resource::access_flags::none};
Expected behavior access_flags::none should either behave as a no-op, or it should not be allowed.
The text was updated successfully, but these errors were encountered:
Can you set the value to read_write and then set it to none?
This is not disallowed, after discussing with the async allocator team. I also confirmed using tests by creating the pool using the constructor, and setting access to read_write, and then calling cudaMemPoolSetAccess with none. This creates the same ordinal error.
I also learned a bit more about how cudaMemPoolSetAccess is meant to be used. You should call this from peers with the pool's handle (which libraries such as UCX export/serialize/and transmit). Once the driver in the peer deserializes and imports the handle you can ask it to open access in read+write. You can also set it to none from that peer device later. You cannot set access to none from the default (owning) device.
I tested that I do not need to call cudaMemPoolSetAccess at all from the owning device (RMMs domain). So I am going to remove it from the code. Sorry for the back and forth on this. I am first going to PR in cuDF a change to remove usage of that optional parameter (that way we don't break cuDF) and I'll put up an RMM pr concurrently, to be merged after the cuDF pr goes in.
Describe the bug
We added support for shareable fabric handles recently in cuda_async_memory_resource. The enum,
access_flags
defines a value ofnone
andread_write
for memory protection. It turns out that callingcudaMemPoolSetAccess
on the pool withnone
causes the error:cudaErrorInvalidDevice invalid device ordinal
.Another PR I have in cuDF repros this issue (rapidsai/cudf#17526). I am filing this bug to make sure folks are aware that passing anything other than
access_flags::read_write
in the second argument ofcuda_async_memory_resource
's constructor is not valid.So to summarize, here is the state of things:
cuda_async_memory_resource(initial_size)
, andcuda_async_memory_resource(initial_size, release_threshold)
work as before, with export handles of type 0x0 (none), so not shareable, and we don't callcudaMemPoolSetAccess
so this is OK.cuda_async_memory_resource(initial_size, release_threshold, export_handle_type)
without specifying the access flag, works.cuda_async_memory_resource(initial_size, release_threshold, export_handle_type, access_flags::none)
doesn't work. Butcuda_async_memory_resource(initial_size, release_threshold, export_handle_type, access_flags::read_write)
is OKI am trying to confirm what the expected usage is here. If you are using fabric/read_write, that is working (and that is my use case), but I'd like to clean this up so that we don't produce cryptic errors otherwise. I am leaning towards removing the
none
option from our enum, and leaving it as a single value enum withread_write
as the only element.Steps/Code to reproduce bug
Create a pool with:
Expected behavior
access_flags::none
should either behave as a no-op, or it should not be allowed.The text was updated successfully, but these errors were encountered: