
Simplify and improve CUDA graphs through use of indirect copy pointers #9017

Draft
wants to merge 2 commits into base: master

Conversation

@agray3 (Contributor) commented Aug 13, 2024

Previously there was complexity in the CUDA graphs implementation due to frequently changing parameters to the copy kernels associated with the K and V cache pointers. This patch simplifies the implementation by using indirection so that those kernel parameters no longer change between graph executions, avoiding the need for frequent graph updates.
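
A minimal sketch of the indirection idea (illustrative only; the kernel and variable names below are hypothetical, not the PR's actual symbols): the copy kernel reads its destination from a device-side pointer table whose address never changes, so the kernel arguments captured in the CUDA graph stay fixed and only the table contents need refreshing each token.

```cuda
#include <cuda_runtime.h>

// Sketch: dest_ptrs is a device-memory table of K/V destination pointers.
// Its address is baked into the graph once; per token the host only refreshes
// its contents with a small cudaMemcpyAsync instead of updating graph nodes.
__global__ void cpy_f32_indirect(const float * src, char ** dest_ptrs, int idx, int64_t ne) {
    float * dst = (float *) dest_ptrs[idx];  // resolve the destination via indirection
    const int64_t i = blockIdx.x * (int64_t) blockDim.x + threadIdx.x;
    if (i < ne) {
        dst[i] = src[i];
    }
}

// Host side, once per token, before replaying the captured graph:
//   cudaMemcpyAsync(dest_ptrs_dev, dest_ptrs_host, n_copies * sizeof(char *),
//                   cudaMemcpyHostToDevice, stream);
```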

github-actions bot added the labels Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) on Aug 13, 2024
@agray3 (Contributor, Author) commented Aug 13, 2024

@slaren could you possibly review this whenever you get the bandwidth? Note that as well as simplifying the CUDA graphs code, this change also gives ~1-2% performance uplift by avoiding CUDA Graph updates for each token.

@Nexesenex (Contributor) commented:

Is this PR compatible with #8366, or does it supersede it?

@agray3 (Contributor, Author) commented Aug 13, 2024

> Is this PR compatible with #8366, or does it supersede it?

Yes, it is compatible (it doesn't supersede #8366, since that PR provides further benefits). If/when this change is merged, I will rebase #8366 on it (which will actually also simplify #8366).

@slaren (Collaborator) commented Aug 13, 2024

The idea of keeping a list of pointers in device memory to avoid the update to the graphs is interesting, but the way this is implemented is shifting some of the complexity from the CUDA backend to the application side. My view generally is that adding custom functions to the backends that require special handling from the application side should only be done as a last resort, and the priority should be to provide a simple and unified interface to the applications.

I think it would be possible to implement this entirely in the CUDA backend side by scanning the graph to obtain and update the list of pointers. I suppose it may be worth it if updating the nodes in the CUDA graph is significantly slower than copying a list of pointers to device memory, but if the difference is small, it may be hard to justify the added complexity to the CUDA backend code.
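
For reference, a rough sketch (hypothetical names, not ggml's actual code) of the per-token update the current approach has to perform: the changed destination pointer is patched into the captured kernel node's parameters of the instantiated CUDA graph before each relaunch.

```cuda
#include <cuda_runtime.h>

// Sketch of patching one copy node's dst argument in an instantiated graph.
// In the current approach something like this runs for each changed copy node,
// every token, before the graph is relaunched.
static void patch_cpy_dst(cudaGraphExec_t graph_exec, cudaGraphNode_t cpy_node,
                          void ** new_dst_slot /* points at the new dst pointer value */) {
    cudaKernelNodeParams params = {};
    cudaGraphKernelNodeGetParams(cpy_node, &params);         // read the captured launch params
    params.kernelParams[1] = new_dst_slot;                   // assume arg 1 is the dst pointer
    cudaGraphExecKernelNodeSetParams(graph_exec, cpy_node, &params);
}
```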

@agray3 (Contributor, Author) commented Aug 13, 2024

Thanks @slaren. The current code involves repeated updates to the graph, and the proposed approach does give a significant performance advantage (even with the extra memcpys). E.g. on an A100 for llama 7B Q4 I get (tokens/s):

| Run | PR 9017 | Current |
| --- | --- | --- |
| 1 | 157.02 | 154.35 |
| 2 | 157.03 | 154.62 |
| 3 | 156.78 | 154.45 |
| 4 | 156.45 | 153.98 |
| 5 | 156.78 | 154.26 |
| Average | 156.82 | 154.33 |
| Speedup | 1.016 | 1 |

This 1.6% speedup is not dramatic, but given the huge worldwide usage of llama.cpp I'd argue that it would add up to an enormous overall saving in time, cost, and energy. Plus it is a step in the right direction (IMO) of reducing the need to do a full rebuild of the GGML graph every step.

But I acknowledge that it does add a few lines of extra complexity to the llama.cpp file - I'll have a think about how that can be better abstracted into GGML.

@agray3 (Contributor, Author) commented Aug 14, 2024

I've now fully abstracted this into the GGML CUDA backend, with just a single call from llama.cpp.

@slaren (Collaborator) commented Aug 14, 2024

@agray3 I am sorry, I think there has been a misunderstanding. The problem is not the location of the few lines of code that build the list of pointers; the problem is skipping several layers of abstraction and going directly from llama.cpp to the CUDA backend code. Not only is this code going to be hard to maintain and certain to require exceptions for some architectures, but ggml is a separate library from llama.cpp, it is used in more applications, and the goal is to continue expanding the capabilities to use ggml in other projects. Simply put, it is not OK to add new functions to the CUDA backend interface to achieve this, and much less so to the ggml-backend interface. The only way I can see to implement this would be to build the list of pointers automatically and transparently by inspecting the graph within the CUDA backend.

@agray3 (Contributor, Author) commented Aug 14, 2024

OK, I understand, thanks for your patience (I'm still getting used to the ecosystem). If I now understand correctly, the problem is that GGML is now assuming that the application makes this new call, and will break if that call is not present. What if this call was made optional, with automatic fallback to the existing behavior if the call is not present?

Note that we can't do this by "inspecting the graph within the CUDA backend", since this pointer array doesn't exist there; it is built up token by token.

@slaren (Collaborator) commented Aug 14, 2024

> OK, I understand, thanks for your patience (I'm still getting used to the ecosystem). If I now understand correctly, the problem is that GGML is now assuming that the application makes this new call, and will break if that call is not present. What if this call was made optional, with automatic fallback to the existing behavior if the call is not present?

The problem is that we cannot add new functions to the backend interface every time it is more convenient to implement some optimization by doing so, because it will pollute the application code and the backend interface, and will quickly become unmaintainable. Even if this is a small change now, there are currently 7 backends supported in ggml, and all of them would like to add similar functions to simplify their implementation. We cannot go this route unless it is absolutely necessary, and I don't think that this case qualifies.

> Note that we can't do this by "inspecting the graph within the CUDA backend", since this pointer array doesn't exist there; it is built up token by token.

Please correct me if I am wrong, but as far as I can tell, these pointers are the same ones that appear as the destination of the GGML_OP_CPY operations, and thus could be collected from the graph in the same way that they currently are when updating the graph nodes. The only difference is how the kernel receives the updated pointers: either as a kernel argument updated in the CUDA graph, or as a pointer obtained from device memory.

@agray3 (Contributor, Author) commented Aug 14, 2024

> Please correct me if I am wrong, but as far as I can tell, these pointers are the same ones that appear as the destination of the GGML_OP_CPY operations, and thus could be collected from the graph in the same way that they currently are when updating the graph nodes.

Yes, you are right. I was getting mixed up between GGML and CUDA graphs. Currently we extract from the GGML graph and insert into the CUDA graph, but we could instead extract from the GGML graph and pass to the GPU via a memcpy. I'll experiment with that, thanks.
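
A rough sketch of what that could look like inside the CUDA backend (a hypothetical helper; GGML_OP_CPY and ggml_cgraph are real ggml symbols, but the field access and destination layout below are simplifications, not the eventual implementation): walk the ggml graph, gather the copy destinations, and push them to the device-side table with a single memcpy.

```cuda
#include <cuda_runtime.h>
#include "ggml.h"

// Sketch: gather GGML_OP_CPY destination pointers from the ggml graph and
// upload them to a device-side table, replacing per-node CUDA graph updates.
static void update_cpy_dest_table(const ggml_cgraph * cgraph, char ** dest_ptrs_host,
                                  char ** dest_ptrs_dev, cudaStream_t stream) {
    int n = 0;
    for (int i = 0; i < cgraph->n_nodes; i++) {
        const ggml_tensor * node = cgraph->nodes[i];
        if (node->op == GGML_OP_CPY) {
            dest_ptrs_host[n++] = (char *) node->src[1]->data;  // copy destination (e.g. a K/V cache view)
        }
    }
    // One small host-to-device copy per token instead of many graph-node updates.
    cudaMemcpyAsync(dest_ptrs_dev, dest_ptrs_host, n * sizeof(char *),
                    cudaMemcpyHostToDevice, stream);
}
```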

@agray3 agray3 marked this pull request as draft August 14, 2024 15:47
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Aug 15, 2024
ggerganov#9017

Co-Authored-By: agray3 <10851179+agray3@users.noreply.github.com>
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Aug 15, 2024
…y pointers ggerganov#9017"

This reverts commit 1dea402e4cb8f64737aa49ba98bc9647656e4d26.
@Nexesenex (Contributor) commented:

Hey @agray3. Would you mind bringing this PR up to date as well, so I can merge it into my fork?

Any boost of performance, even small, is welcome! :D

Thanks in any case!

@agray3 (Contributor, Author) commented Oct 19, 2024

> Hey @agray3. Would you mind bringing this PR up to date as well, so I can merge it into my fork?
>
> Any boost of performance, even small, is welcome! :D
>
> Thanks in any case!

This will require a bit more rebasing to be compatible with my other patch - I'm away for a few days so will take a look when I'm back.

@Nexesenex (Contributor) commented:

@agray3: Thanks! Have a great time meanwhile!

Labels: ggml (changes relating to the ggml tensor library for machine learning), Nvidia GPU (Issues specific to Nvidia GPUs)
3 participants