Refcount Error in Graph Code? #181
-
I'm running into a problem with creating a graph. A recreation of the problem is here, and the quick summary is that the code does something like:

```python
cuda_code = """
extern "C" __global__ void simple(char *str) {
    printf("this is a test\\n");
    printf("ptr: %p\\n", str);
    printf("passed argument was: %s\\n", str);
}
"""

def mkgraph():
    # initialize device
    # load PTX, get function
    # allocate memory
    # create memcpy node, copies UTF-8 encoded bytes to GPU
    # create kernel node
    # add dependency from kernel node to memcpy node
    # *** run graph first time ***
    return graph

g = mkgraph()
# *** run graph second time ***
```

The first graph execution works. I can also instantiate and execute the graph multiple times before the function returns, and it works fine and prints out the correct string. The second graph execution has the exact same memory address as the first instantiation. The …

My best guess is that there is a refcount that gets decremented when the function returns and the graph isn't hanging on to a copy of the memory, so it's freed or something? Is this a bug in the CUDA Python code or is there something I'm missing?
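To make the suspected hazard concrete, here is a CPU-only analogy using `ctypes` rather than CUDA; the names are illustrative, not from the actual reproduction:

```python
# CPU-only illustration of the lifetime hazard: a raw address escaping the
# Python object that owns the memory, the same way a graph node can hold a
# pointer into a host buffer that has since been garbage-collected.
import ctypes

def make_pointer():
    data = ctypes.create_string_buffer(b"hello")  # local; refcount hits 0 on return
    return ctypes.addressof(data)                 # the raw address escapes anyway

ptr = make_pointer()
# `ptr` now points at freed memory; reading it is undefined behavior and may
# print garbage, much like a second graph launch after the buffer is freed.
print(ctypes.string_at(ptr))
```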
-
As further evidence that this is a refcount bug, I've updated the example code to isolate the bug to the Buffer that is passed to the memcpy graph node as an argument. If you set …
-
@vzhurba01 could you take a look? My guess is it is our implicit requirement for the Python bindings that Python objects such as the host buffer passed to the memcpy node must be kept alive by the user for as long as the graph can be launched.
-
Yeah this would be one of our implicit requirements.

Since these bindings are 1-to-1 with CUDA in C, we inherit the same rules as C. So in this case the host memory space needs to be alive for all invocations of the graph.
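One way to satisfy that rule is sketched below; `build_and_instantiate_graph` is a hypothetical stand-in for the driver-API graph construction, and the point is only the keepalive reference:

```python
# Hedged sketch: pin the refcount of the host buffer to an object that
# outlives every launch of the graph.
class GraphHandle:
    def __init__(self, graph_exec, keepalive):
        self.graph_exec = graph_exec
        # Holding the host buffers here keeps the memory the memcpy node
        # points at valid for every later invocation.
        self._keepalive = list(keepalive)

def mkgraph():
    host_bytes = "passed argument".encode("utf-8")
    graph_exec = build_and_instantiate_graph(host_bytes)  # hypothetical helper
    return GraphHandle(graph_exec, keepalive=[host_bytes])
```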
-
Thanks, Vlad! I've created #175 to track the need to document our requirements. @apowers313 does this answer your question?
-
btw forgot to say, we are building a pythonic abstraction over the low-level bindings (#70), and CUDA graphs will be covered in a future beta release (#111). @apowers313 it'd be nice if you could also share your use cases with us so that we can better understand your needs, and @vzhurba01 we should probably take lifetime management into account in the API design 🙂
-
Thanks for the quick reply, and hopefully the documentation helps future users avoid this footgun. I'm building a system that strings together multiple feature extractors that depend on each other's inputs and outputs, and it's a lot more efficient to build a graph of feature extractors than to do memory transfers for each of them. Along the way I realized that there's a simple architecture for automatically handling inputs, outputs, and kernel dependencies, so I built an FFI library that automatically builds graphs for kernel calls and hides a lot of the details from users: https://github.com/atoms-org/cuda-ffi (still a work in progress, but ~80% complete)
-
@apowers313 Thanks a lot for sharing, I wish we could have learned about your needs sooner to help you save some time 😅 (cc @aterrel for vis)

I am pleased to share with you that an official solution for exactly your needs is being built, and it's called `cuda.core`. As mentioned earlier we'll cover CUDA graphs in a future release. If possible please give it a try and let us know if you have any feedback/questions!
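A rough sketch of what compiling and launching a kernel looks like with `cuda.core`, based on its early experimental examples; the exact signatures have shifted between beta releases, so treat this as the shape of the API rather than a recipe:

```python
# Assumes the cuda.core.experimental beta namespace; names and signatures
# may differ in the release you install.
from cuda.core.experimental import Device, LaunchConfig, Program, launch

dev = Device()
dev.set_current()
stream = dev.create_stream()

# A trivial no-op kernel, just to exercise the compile-and-launch path.
prog = Program('extern "C" __global__ void noop() {}', code_type="c++")
kernel = prog.compile("cubin").get_kernel("noop")

launch(stream, LaunchConfig(grid=1, block=1), kernel)
stream.sync()
```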
-
@apowers313 In the meantime, if you are OK with a field-tested, 3rd-party solution (but with NVIDIA support) for executing your C++ kernels in Python, I would encourage you to try out CuPy's `RawModule`.
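For example, a self-contained `RawModule` sketch (my own illustration, not lifted from CuPy's docs):

```python
import cupy as cp

code = r'''
extern "C" __global__ void scale(float* x, float a, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) x[i] *= a;
}
'''
mod = cp.RawModule(code=code)      # compiled with NVRTC on first use
scale = mod.get_function("scale")

x = cp.arange(8, dtype=cp.float32)
scale((1,), (int(x.size),), (x, cp.float32(2.0), cp.int32(x.size)))  # grid, block, args
print(x)  # [ 0.  2.  4.  6.  8. 10. 12. 14.]
```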
-
Thanks for sharing cuda.core, it looks really cool! It's not clear to me from the examples if … I think … One of the major considerations of … Some features I would advocate for: …

I know you guys have been doing this a lot longer than I have, but hopefully my thoughts are helpful. :)