Simplify and improve CUDA graphs through use of indirect copy pointers #9017
base: master
Conversation
Previously there was complexity in the CUDA graphs implementation due to frequently changing parameters to the copy kernels associated with the K and V cache pointers. This patch simplifies the implementation by using indirection so that those parameters no longer change between tokens, avoiding the need for frequent graph updates.
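The core idea can be illustrated with a minimal sketch (the names `cpy_indirect`, `dest_ptrs`, and `MAX_COPY_NODES` below are illustrative, not the actual kernels or symbols in this PR): instead of baking the destination pointer into the copy kernel's launch parameters, which changes every token and forces a CUDA graph node update, the kernel dereferences a slot in a persistent device-side pointer table, so only that table needs refreshing with a small memcpy each token.

```cuda
#include <cuda_runtime.h>

#define MAX_COPY_NODES 64  // illustrative upper bound on copy nodes per graph

// Persistent device-side table of destination pointers, one slot per copy node.
// Its address is fixed, so the kernel arguments captured in the CUDA graph never
// change; only the table's contents are refreshed each token.
__device__ char * dest_ptrs[MAX_COPY_NODES];

// Copy kernel that resolves its destination through the indirection table
// instead of taking the destination pointer as a launch parameter.
__global__ void cpy_indirect(const char * src, int slot, size_t nbytes) {
    const size_t i = (size_t) blockIdx.x * blockDim.x + threadIdx.x;
    char * dst = dest_ptrs[slot];
    if (i < nbytes) {
        dst[i] = src[i];
    }
}

// Host side, once per token: refresh the table with a single small async copy
// rather than updating the parameters of every copy node in the CUDA graph.
void refresh_dest_ptrs(char * const * host_ptrs, int n_copies, cudaStream_t stream) {
    cudaMemcpyToSymbolAsync(dest_ptrs, host_ptrs, n_copies * sizeof(char *),
                            0, cudaMemcpyHostToDevice, stream);
}
```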
@slaren could you possibly review this whenever you get the bandwidth? Note that as well as simplifying the CUDA graphs code, this change also gives ~1-2% performance uplift by avoiding CUDA graph updates for each token.
Is this PR compatible with #8366, or does it supersede it?
The idea of keeping a list of pointers in device memory to avoid the update to the graphs is interesting, but the way this is implemented shifts some of the complexity from the CUDA backend to the application side. My view generally is that adding custom functions to the backends that require special handling from the application side should only be done as a last resort, and the priority should be to provide a simple and unified interface to the applications. I think it would be possible to implement this entirely on the CUDA backend side by scanning the graph to obtain and update the list of pointers. I suppose it may be worth it if updating the nodes in the CUDA graph is significantly slower than copying a list of pointers to device memory, but if the difference is small, it may be hard to justify the added complexity to the CUDA backend code.
Thanks @slaren. The current code involves repeated updates to the graph, and the proposed approach does give a significant performance advantage (even with the extra memcopies). E.g. on A100 for llama 7B Q4 I get (tokens/s):
This 1.6% speedup is not dramatic, but given the huge worldwide usage of llama.cpp I'd argue that it would accumulate to an enormous overall time, cost and energy saving. Plus it is a step in the right direction (IMO) of reducing the need to do a full rebuild of the GGML graph every step. But I acknowledge that it does add a few lines of extra complexity to the llama.cpp file - I'll have a think about how that can be better abstracted into GGML.
I've now fully abstracted this into the GGML CUDA backend, with just a single call from llama.cpp.
@agray3 I am sorry, I think there has been a misunderstanding. The problem is not the location of the few lines of code to build the list of pointers, the problem is skipping several layers of abstraction and going directly from llama.cpp to the CUDA backend code. Not only is this code going to be hard to maintain and certain to require exceptions for some architectures, but ggml is a separate library from llama.cpp and it is used in other applications, and the goal is to continue expanding the capabilities to use ggml in other projects. Simply put, it is not ok to add new functions to the CUDA backend interface to achieve this, and much less so to the ggml-backend interface. The only way I can see to implement this would be to build the list of pointers automatically and transparently by inspecting the graph within the CUDA backend.
OK, I understand, thanks for your patience (I'm still getting used to the ecosystem). If I now understand correctly, the problem is that GGML is now assuming that the application makes this new call, and will break if that call is not present. What if this call was made optional, with automatic fallback to the existing behavior if the call is not present? Note that we can't do this by "inspecting the graph within the CUDA backend" since this pointer array doesn't exist there; it is built up token-by-token.
The problem is that we cannot add new functions to the backend interface every time it is more convenient to implement some optimization by doing so, because it will pollute the application code and the backend interface, and will quickly become unmaintainable. Even if this is a small change now, there are currently 7 backends supported in ggml, and all of them would like to add similar functions to simplify their implementation. We cannot go this route unless it is absolutely necessary, and I don't think that this case qualifies.
Please correct me if I am wrong, but as far as I can tell, these pointers are the same ones that appear as the destination of the copy nodes in the graph.
Yes, you are right. I was getting mixed up between GGML and CUDA graphs. Currently we extract from the GGML graph and insert into the CUDA graph, but we could instead extract from the GGML graph and pass to the GPU via a memcpy. I'll experiment with that, thanks.
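A rough sketch of what that graph-scanning approach might look like inside the CUDA backend: walk the ggml graph, collect the destination pointers of the copy nodes, and upload them with one async memcpy. The exact field holding the destination (`node->src[1]->data` below) and the `dest_ptrs_d` argument are assumptions for illustration, not the code from this PR.

```cpp
#include <vector>
#include <cuda_runtime.h>
#include "ggml.h"

// Hypothetical helper: gather copy-node destinations from the ggml graph and
// refresh the persistent device-side pointer table in a single async memcpy.
static void sync_copy_dest_ptrs(char ** dest_ptrs_d,               // persistent device array
                                const struct ggml_cgraph * cgraph, // graph being evaluated
                                cudaStream_t stream) {
    std::vector<char *> host_ptrs;
    for (int i = 0; i < cgraph->n_nodes; ++i) {
        const struct ggml_tensor * node = cgraph->nodes[i];
        if (node->op == GGML_OP_CPY) {
            // assumed: the copy writes into its second source (e.g. the K/V cache view)
            host_ptrs.push_back((char *) node->src[1]->data);
        }
    }
    if (!host_ptrs.empty()) {
        cudaMemcpyAsync(dest_ptrs_d, host_ptrs.data(),
                        host_ptrs.size() * sizeof(char *),
                        cudaMemcpyHostToDevice, stream);
    }
}
```

Since the scan happens transparently inside the backend, no new backend-interface function or llama.cpp call is needed, which is the property being asked for above.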
ggerganov#9017 Co-Authored-By: agray3 <10851179+agray3@users.noreply.github.com>
…y pointers ggerganov#9017" This reverts commit 1dea402e4cb8f64737aa49ba98bc9647656e4d26.
Hey @agray3. Would you mind updating this PR as well, so I can merge it into my fork? Any performance boost, even a small one, is welcome! :D Thanks in any case!
This will require a bit more rebasing to be compatible with my other patch - I'm away for a few days so will take a look when I'm back.
@agray3: Thanks! Have a great time meanwhile!