[JNI] Enables fabric handles for CUDA async memory pools #17526
Signed-off-by: Alessandro Bellina <abellina@nvidia.com>
For extra tests: |
copyrights otherwise lgtm
2024 copyrights
2024 copyrights
java/src/main/native/src/RmmJni.cpp
Outdated
auto [handle_type, prot_flag] = !fabric ?
  std::pair{
    rmm::mr::cuda_async_memory_resource::allocation_handle_type::none,
    rmm::mr::cuda_async_memory_resource::access_flags::none} :
  std::pair{
    rmm::mr::cuda_async_memory_resource::allocation_handle_type::fabric,
    rmm::mr::cuda_async_memory_resource::access_flags::read_write};
Nit: It's a little easier to read without the negation
- auto [handle_type, prot_flag] = !fabric ?
-   std::pair{
-     rmm::mr::cuda_async_memory_resource::allocation_handle_type::none,
-     rmm::mr::cuda_async_memory_resource::access_flags::none} :
-   std::pair{
-     rmm::mr::cuda_async_memory_resource::allocation_handle_type::fabric,
-     rmm::mr::cuda_async_memory_resource::access_flags::read_write};
+ auto [handle_type, prot_flag] = fabric ?
+   std::pair{
+     rmm::mr::cuda_async_memory_resource::allocation_handle_type::fabric,
+     rmm::mr::cuda_async_memory_resource::access_flags::read_write} :
+   std::pair{
+     rmm::mr::cuda_async_memory_resource::allocation_handle_type::none,
+     rmm::mr::cuda_async_memory_resource::access_flags::none};
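For readers without RMM at hand, the non-negated selection can be tried in isolation. This is only a minimal sketch: the two enums are local stand-ins that borrow the enumerator names from the diff, and `select_pool_options` is a hypothetical helper name, not something in RmmJni.cpp or RMM.

```cpp
#include <utility>

// Stand-ins (assumption): local copies of the enumerator names used in the diff.
// The real types are nested in rmm::mr::cuda_async_memory_resource.
enum class allocation_handle_type { none, fabric };
enum class access_flags { none, read_write };

// Hypothetical helper: maps the `fabric` flag to the (handle type, access flags)
// pair, written with the positive condition first as the review suggests.
constexpr std::pair<allocation_handle_type, access_flags> select_pool_options(bool fabric)
{
  return fabric ? std::pair{allocation_handle_type::fabric, access_flags::read_write}
                : std::pair{allocation_handle_type::none, access_flags::none};
}

// Compile-time checks of both branches (requires C++17).
static_assert(select_pool_options(true).first == allocation_handle_type::fabric);
static_assert(select_pool_options(true).second == access_flags::read_write);
static_assert(select_pool_options(false).first == allocation_handle_type::none);
static_assert(select_pool_options(false).second == access_flags::none);
```

A caller can unpack the result the same way the diff does, with `auto [handle_type, prot_flag] = select_pool_options(fabric);`.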
Signed-off-by: Alessandro Bellina <abellina@nvidia.com>
/ok to test
Signed-off-by: Alessandro Bellina <abellina@nvidia.com>
/ok to test
Signed-off-by: Alessandro Bellina <abellina@nvidia.com>
/ok to test
I saw a consistent failure in the existing async tests, reporting "invalid device ordinal". It happened wherever the async allocator was instantiated. I have changed the code so that, if fabric is not selected, it follows the exact path it used to follow (see 8d27a92). I'll try to repro this, but my guess is that the memory access protection APIs don't work on older GPUs (this was a V100 in CI).
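The shape of that change can be sketched without CUDA or RMM. Everything below is an assumption made for illustration: `async_pool_sketch`, `handle_type_sketch`, and `make_async_pool` are hypothetical names, and the stand-in type only records whether any option was passed to its constructor. The idea shown is that the non-fabric path builds the pool with no extra arguments, so it behaves exactly as it did before the fabric option existed.

```cpp
#include <memory>
#include <optional>

// Stand-in (assumption) for the handle type enum; not the RMM type.
enum class handle_type_sketch { none, fabric };

// Hypothetical stand-in for an async pool resource. It only remembers which
// constructor was used so the two code paths can be told apart.
struct async_pool_sketch {
  std::optional<handle_type_sketch> requested_handle;  // empty: no option was passed

  async_pool_sketch() = default;  // legacy path, no handle or access options
  explicit async_pool_sketch(handle_type_sketch h) : requested_handle{h} {}
};

// Only the fabric mode passes the new option; every other mode takes the
// unchanged default-construction path.
inline std::unique_ptr<async_pool_sketch> make_async_pool(bool fabric)
{
  if (fabric) { return std::make_unique<async_pool_sketch>(handle_type_sketch::fabric); }
  return std::make_unique<async_pool_sketch>();
}
```

Branching on the constructor, rather than passing a "none" option on the non-fabric path, keeps older GPUs away from any newly added API calls.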
It turns out that passing
@jlowe fyi
/merge
This is a follow up from #17526, where fabric handles can be enabled from RMM. That PR also sets the memory access protection flag (`cudaMemPoolSetAccess`), but I have learned that this second flag is not needed from the owner device. In fact, it causes confusion because the owning device fails to call this function with some of the flags (access none). `cudaMemPoolSetAccess` is meant to only be called from peer processes that have imported the pool's handle. In our case, UCX handles this from the peer's side and it does not need to be anywhere in RMM or cuDF.

Sorry for the noise. I'd like to get this fix in, and then I am going to fix RMM by removing that API.

Authors:
- Alessandro Bellina (https://github.com/abellina)

Approvers:
- Bradley Dice (https://github.com/bdice)
- Nghia Truong (https://github.com/ttnghia)
- Jason Lowe (https://github.com/jlowe)

URL: #17553
Closes #17525
Depends on rapidsai/rmm#1743
Description
This PR adds a `CUDA_ASYNC_FABRIC` allocation mode in `RmmAllocationMode` and pipes in the options to RMM's `cuda_async_memory_resource` of a `fabric` for the handle type, and `read_write` as the memory protection mode (as that's the only mode supported by the pools, and is required for IPC).

If `CUDA_ASYNC` is used, fabric handles are not requested, and the memory protection is `none`.

Checklist