CUDA runtime api #262
Hi Patrick, I'm interested in your proposal! I'm trying to add support for cuda 12.4 but I don't understand how to generate the source file like Thanks for your effort for porting CUDA runtime! |
Sorry there is some problem with my |
Yeah, this was the main problem that I ran into as well. I think with Rust we would always have to rely on the driver API to call kernels. This is why at this point I just chose to go with the driver API. Do we gain anything by adding the runtime API? At the very least we should document this shortcoming. |
My understanding is that the runtime API works at a higher level and is easier for developers to deal with. From the CUDA developer guide:
I believe an alternative way to do |
If you are trying to gauge community interest, then I at least would be super interested in this. I am working on a project that provides Rust support for a large-ish in-house C/C++ project that is built exclusively with the Runtime API. Since many of the types are interchangeable with the driver API this is doable already, but would of course be much easier with direct bindings to the Runtime API! We are still using CUDA 11, though... |
Hey @ahem, that's awesome to hear. As for using CUDA 11, I don't think the code will vastly differ. The only difference is that I don't think the primary context is initialized after |
CUDA versions from Note though, while my tests on |
I am very interested in the work here. I was trying to previously use I was looking at the diff here and I don't see support for pitched memory and page-locked memory. These are both going to be useful to have if more people are going to use this crate. |
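For readers unfamiliar with pitched memory: each row of a 2D allocation is padded to a pitch chosen by the allocator, and an element is addressed as base + row * pitch + column * element size. The sketch below only illustrates that arithmetic in plain Rust. The helper names and the 512-byte alignment used in the usage note are assumptions for illustration; on a real device the pitch is whatever the allocation call reports, and nothing here is part of this crate.

```rust
/// Rounds `width_bytes` up to the next multiple of `alignment`.
/// `alignment` is a stand-in value; a real allocator picks the pitch itself.
fn padded_pitch(width_bytes: usize, alignment: usize) -> usize {
    assert!(alignment > 0, "alignment must be non-zero");
    (width_bytes + alignment - 1) / alignment * alignment
}

/// Byte offset of element (row, col) inside a pitched 2D allocation.
/// `pitch_bytes` is the padded row size, `elem_size` the size of one element.
fn pitched_offset(row: usize, col: usize, pitch_bytes: usize, elem_size: usize) -> usize {
    row * pitch_bytes + col * elem_size
}
```

With 100 `f32` columns (400 bytes per row) and an assumed 512-byte alignment, `padded_pitch(400, 512)` gives a pitch of 512 bytes, and element (2, 3) sits at byte offset `pitched_offset(2, 3, 512, 4)`.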
I think my biggest concern here is just the amount of shared code between driver & runtime that would need to be maintained together. If we are maintaining the exact same api between driver::safe::CudaDevice and runtime::safe::CudaDevice, I'm wondering if there's a better way to do all of this without having copies of everything (Although it does make it less complex because we don't need extra abstractions). I guess I'm wondering: for downstream crates, what are the differences when using driver vs runtime? If there aren't really any differences, we could probably just stick to exposing the sys/result for runtime, but not expose a safe api for runtime at all. Thoughts? |
At least for memory I think I do agree with @coreylowman that the current work seems to be a lot like the unsafe API. |
Yeah, I agree there's a lot of code copying. I looked into a more integrated approach, but it would require a lot of abstraction through traits, like having CudaBlasLT accept CudaDevice from either runtime or driver. Personally, I'm not a fan of abstraction. My idea in keeping it separate is that even if you update the driver API, it won't cause integration issues with runtime. Managing the abstraction would be more cumbersome IMO (in the short term at least). From what I gather, users seem to prefer using this as a standalone module rather than integrating it with the rest. As for whether to include a safe API or not, I think it's more practical (and less time-consuming) to stick with the sys/result for now. We can always allow contributions toward a safe API later on.
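To make the trade-off concrete, here is a minimal sketch of what the trait-based approach could look like, with mock types and no CUDA calls. Every name in it (`DeviceBackend`, `DriverDevice`, `RuntimeDevice`, `BlasHandle`) is hypothetical and chosen only for illustration; it is not the crate's API and not what this PR implements.

```rust
/// Hypothetical trait listing what a shared consumer needs from a device.
trait DeviceBackend {
    /// Which API the device was created through.
    fn api_name(&self) -> &'static str;
    /// Device index.
    fn ordinal(&self) -> usize;
}

/// Mock device created through the driver API.
struct DriverDevice {
    ordinal: usize,
}

/// Mock device created through the runtime API.
struct RuntimeDevice {
    ordinal: usize,
}

impl DeviceBackend for DriverDevice {
    fn api_name(&self) -> &'static str {
        "driver"
    }
    fn ordinal(&self) -> usize {
        self.ordinal
    }
}

impl DeviceBackend for RuntimeDevice {
    fn api_name(&self) -> &'static str {
        "runtime"
    }
    fn ordinal(&self) -> usize {
        self.ordinal
    }
}

/// Stand-in for a consumer (for example a BLAS handle) that is generic
/// over the backend instead of being duplicated per API.
struct BlasHandle<D: DeviceBackend> {
    device: D,
}

impl<D: DeviceBackend> BlasHandle<D> {
    fn new(device: D) -> Self {
        Self { device }
    }

    fn describe(&self) -> String {
        format!("blas on {} device {}", self.device.api_name(), self.device.ordinal())
    }
}
```

The cost this sketch hints at is that every consumer type gains a type parameter and every shared operation has to be routed through the trait, which is the extra abstraction being argued against here.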
Yeah, I agree. While page-locked memory isn't exclusive to cudart, it does make things easier.
The result module is a thin wrapper around sys, which is why it has that similarity. |
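As a rough illustration of the sys/result layering, the sketch below uses a mock `sys` module in place of the generated bindings, so it runs without CUDA installed. The function `get_device_count`, the `Status` enum, and the `Error` type are made up for this example; only the pattern (call the raw function, turn its status code into a `Result`) is the point.

```rust
mod sys {
    /// Mock of a C-style status code. Real bindings are generated from headers.
    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
    pub enum Status {
        Success,
        InvalidValue,
    }

    /// Mock of a raw FFI-style call that reports through an out-pointer.
    ///
    /// # Safety
    /// `out` must be null or point to a writable `i32`.
    pub unsafe fn get_device_count(out: *mut i32) -> Status {
        if out.is_null() {
            return Status::InvalidValue;
        }
        // Pretend two devices are present.
        unsafe { *out = 2 };
        Status::Success
    }
}

mod result {
    use super::sys;

    /// Error carrying the raw status code returned by the `sys` layer.
    #[derive(Debug, PartialEq, Eq)]
    pub struct Error(pub sys::Status);

    /// Thin wrapper: same operation as the `sys` call, but the out-pointer
    /// and status code are folded into a `Result`.
    pub fn device_count() -> Result<i32, Error> {
        let mut count = 0i32;
        // SAFETY: `count` is a valid, writable i32 for the duration of the call.
        let status = unsafe { sys::get_device_count(&mut count) };
        match status {
            sys::Status::Success => Ok(count),
            other => Err(Error(other)),
        }
    }
}
```

Because the wrapper adds no state or lifetime tracking of its own, it stays close to the raw bindings by design, which is why it reads much like the unsafe API.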
Had to step away from this for a bit. I removed the safe API and made changes to pass all the checks. Should be good to go now. |
I'm interested in this PR. Some of the runtime support for memory, streams, etc. would be really interesting in terms of QoL. Edit: I can't see any page-locked host memory allocation. Is that on purpose or an oversight? |
Plus 1 to the ability to bind the runtime API. It would definitely simplify distribution. |
Hello,
I started working on the implementation for the cuda runtime api since I saw some interest in it (#200). I managed to translate the cuda runtime api equivalent for most of the functions in the driver api except for context and module management, which is handled automatically by cudart. Below are some of the issues/limitations:
cudaOccupancyMaxPotentialBlockSize.*
cudaLaunchKernel via FFI: calling cudaLaunchKernel via FFI bindings in Rust isn't possible AFAIK because the CUDA runtime expects a specific binary layout to find and launch compiled kernels. Rust's FFI mechanism doesn't natively conform to this layout, preventing the CUDA runtime from resolving and executing kernel functions properly.
I only have bindings for cuda-12050 and 12020 so far. I want to gauge community interest in this implementation before investing further time.
Edit: the issue with cudaLaunchKernel has been fixed. Please read below.