Refcount Error in Graph Code? #181
-
I'm running into a problem with creating a graph. A recreation of the problem is here, and the quick summary is that the code does something like:

```python
cuda_code = """
extern "C" __global__ void simple(char *str) {
    printf("this is a test\\n");
    printf("ptr: %p\\n", str);
    printf("passed argument was: %s\\n", str);
}
"""

def mkgraph():
    # initialize device
    # load PTX, get function
    # allocate memory
    # create memcpy node, copies UTF-8 encoded bytes to GPU
    # create kernel node
    # add dependency from kernel node to memcpy node
    # *** run graph first time ***
    return graph

g = mkgraph()
# *** run graph second time ***
```

The first graph execution works. I can also instantiate and execute the graph multiple times before the function returns, and it works fine and prints out the correct string. The second graph execution has the exact same memory address as the first instantiation. The …

My best guess is that there is a refcount that gets decremented when the function returns and the graph isn't hanging on to a copy of the memory, so it's freed or something? Is this a bug in the CUDA Python code or is there something I'm missing?
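To make the suspected hazard concrete, here is a CPU-only analogy using `ctypes` rather than CUDA; the names are illustrative, not from the actual reproduction:

```python
# CPU-only illustration of the lifetime hazard: a raw address escaping the
# Python object that owns the memory, the same way a graph node can hold a
# pointer into a host buffer that has since been garbage-collected.
import ctypes

def make_pointer():
    data = ctypes.create_string_buffer(b"hello")  # local; refcount hits 0 on return
    return ctypes.addressof(data)                 # the raw address escapes anyway

ptr = make_pointer()
# `ptr` now points at freed memory; reading it is undefined behavior and may
# print garbage, much like a second graph launch after the buffer is freed.
print(ctypes.string_at(ptr))
```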
-
As further evidence that this is a refcount bug, I've updated the example code to isolate the bug to the Buffer that is passed to the memcpy graph node as an argument. If you set …
-
@vzhurba01 could you take a look? My guess is it is our implicit requirement for the Python bindings that Python objects such as the host buffer passed to the memcpy node must be kept alive by the user for as long as the graph can be launched.
-
Yeah this would be one of our implicit requirements.

Since these bindings are 1-to-1 with CUDA in C, we inherit the same rules as C. So in this case the host memory space needs to be alive for all invocations of the graph.
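One way to satisfy that rule is sketched below; `build_and_instantiate_graph` is a hypothetical stand-in for the driver-API graph construction, and the point is only the keepalive reference:

```python
# Hedged sketch: pin the refcount of the host buffer to an object that
# outlives every launch of the graph.
class GraphHandle:
    def __init__(self, graph_exec, keepalive):
        self.graph_exec = graph_exec
        # Holding the host buffers here keeps the memory the memcpy node
        # points at valid for every later invocation.
        self._keepalive = list(keepalive)

def mkgraph():
    host_bytes = "passed argument".encode("utf-8")
    graph_exec = build_and_instantiate_graph(host_bytes)  # hypothetical helper
    return GraphHandle(graph_exec, keepalive=[host_bytes])
```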
-
Thanks, Vlad! I've created #175 to track the need to document our requirements. @apowers313 does this answer your question?
-
btw forgot to say, we are building a pythonic abstraction over the low-level bindings (#70), and CUDA graphs will be covered in a future beta release (#111). @apowers313 it'd be nice if you could also share your use cases with us so that we can better understand your needs, and @vzhurba01 we should probably take lifetime management into account in the API design 🙂
-
Thanks for the quick reply, and hopefully the documentation helps future users avoid this footgun. I'm building a system that strings together multiple feature extractors that depend on each other's inputs and outputs, and it's a lot more efficient to build a graph of feature extractors than to do memory transfers for each of them. Along the way I realized that there's a simple architecture for automatically handling inputs, outputs, and kernel dependencies, so I built an FFI library that automatically builds graphs for kernel calls and hides a lot of the details from users: https://github.com/atoms-org/cuda-ffi (still a work in progress, but ~80% complete)
-
@apowers313 Thanks a lot for sharing, I wish we could have learned about your needs sooner to help you save some time 😅 (cc @aterrel for vis)

I am pleased to share with you that an official solution for exactly your needs is being built, and it's called `cuda.core`. As mentioned earlier we'll cover CUDA graphs in a future release. If possible please give it a try and let us know if you have any feedback/questions!
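A rough sketch of what compiling and launching a kernel looks like with `cuda.core`, based on its early experimental examples; the exact signatures have shifted between beta releases, so treat this as the shape of the API rather than a recipe:

```python
# Assumes the cuda.core.experimental beta namespace; names and signatures
# may differ in the release you install.
from cuda.core.experimental import Device, LaunchConfig, Program, launch

dev = Device()
dev.set_current()
stream = dev.create_stream()

# A trivial no-op kernel, just to exercise the compile-and-launch path.
prog = Program('extern "C" __global__ void noop() {}', code_type="c++")
kernel = prog.compile("cubin").get_kernel("noop")

launch(stream, LaunchConfig(grid=1, block=1), kernel)
stream.sync()
```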
-
@apowers313 In the meantime, if you are OK with a field-tested, 3rd-party solution (but with NVIDIA support) for executing your C++ kernels in Python, I would encourage you to try out CuPy's `RawModule`.
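For example, a self-contained `RawModule` sketch (my own illustration, not lifted from CuPy's docs):

```python
import cupy as cp

code = r'''
extern "C" __global__ void scale(float* x, float a, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) x[i] *= a;
}
'''
mod = cp.RawModule(code=code)      # compiled with NVRTC on first use
scale = mod.get_function("scale")

x = cp.arange(8, dtype=cp.float32)
scale((1,), (int(x.size),), (x, cp.float32(2.0), cp.int32(x.size)))  # grid, block, args
print(x)  # [ 0.  2.  4.  6.  8. 10. 12. 14.]
```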
-
Thanks for sharing cuda.core, it looks really cool! It's not clear to me from the examples if … I think … One of the major considerations of … Some features I would advocate for: …

I know you guys have been doing this a lot longer than I have, but hopefully my thoughts are helpful. :)