ggml: avoid rebuild of GGML graph for each token (#7456) #8366

Draft · wants to merge 1 commit into master from ag_ggml_graph_caching
Conversation

agray3 (Contributor) commented on Jul 8, 2024:

Introduces caching of the GGML graph to avoid an unnecessary full rebuild between each token. KV cache parameters, which change with each token, are updated directly in the cached GGML graph. Can be disabled with the GGML_DISABLE_GRAPH_CACHING environment variable.
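To make the idea concrete, here is a minimal, self-contained sketch of the caching pattern described above. It is illustrative only, not the PR's actual code: Graph, GraphCache, build_full_graph() and patch_kv_offsets() are hypothetical stand-ins for the real ggml/llama.cpp structures and helpers; only the GGML_DISABLE_GRAPH_CACHING variable comes from the description.

```cpp
// Minimal sketch of graph caching (illustrative only, not the PR's code).
#include <cstdio>
#include <cstdlib>

struct Graph { /* stands in for ggml_cgraph */ };

struct GraphCache {
    Graph * graph = nullptr;   // graph built for a previous token, if any
};

static Graph * build_full_graph() {          // expensive: full graph rebuild
    static Graph g;
    std::puts("full graph rebuild");
    return &g;
}

static void patch_kv_offsets(Graph *) {      // cheap: per-token KV-cache update
    std::puts("patched KV-cache offsets only");
}

static Graph * graph_for_next_token(GraphCache & cache) {
    const bool caching = std::getenv("GGML_DISABLE_GRAPH_CACHING") == nullptr;
    if (caching && cache.graph != nullptr) {
        patch_kv_offsets(cache.graph);       // reuse the cached graph
        return cache.graph;
    }
    cache.graph = build_full_graph();        // first token, or caching disabled
    return cache.graph;
}

int main() {
    GraphCache cache;
    for (int token = 0; token < 3; ++token) {
        graph_for_next_token(cache);         // rebuilds once, then only patches
    }
}
```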

github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Jul 8, 2024
agray3 (Contributor, Author) commented on Jul 8, 2024:

Here are the batch size 1 (BS=1) inference performance improvements I have measured for this optimization (all on Linux):

| Hardware | Llama 7B Q4 | Llama 13B Q4 |
| --- | --- | --- |
| H100-SXM + AMD EPYC 7413 | 6% | 5% |
| H100-PCIe + AMD EPYC 7313 | 4% | 4% |
| A100-PCIe + Intel 4210R | 7% | 5% |
| L40S + AMD EPYC 7763 | 3% | 2% |
| RTX-4090 + AMD Ryzen 3700X | 4% | 3% |
| A40 + Intel 4210R | 5% | 3% |

@slaren @JohannesGaessler @ggerganov this is currently working OK for my local tests but I'd appreciate any further testing from your side to help determine if it needs to be made more robust.

JohannesGaessler (Collaborator) commented:

Did you check whether with this optimization the performance for large batch sizes becomes better with CUDA graphs enabled?

agray3 (Contributor, Author) commented on Jul 8, 2024:

> Did you check whether with this optimization the performance for large batch sizes becomes better with CUDA graphs enabled?

No, at the moment both this optimization and CUDA graphs are only activated with batch size 1 (investigating BS > 1 is future work).

agray3 (Contributor, Author) commented on Jul 8, 2024:

This is currently causing a segfault for llama-bench - working on it.

agray3 (Contributor, Author) commented on Jul 8, 2024:

> This is currently causing a segfault for llama-bench - working on it.

Now fixed.

Review comment on src/llama.cpp, lines 14628 to 14637 (outdated):
// Record the destination pointer into this layer's K cache: base data pointer
// plus the offset of the current head position (kv_head rows).
ggml_tensor * tmp_tensor = kv_self.k_l[il];
size_t tmp_offset = (ggml_row_size(kv_self.k_l[il]->type, n_embd_k_gqa))*kv_head;
kv_cache_ptrs.push_back(static_cast<char*>(tmp_tensor->data) + tmp_offset);

// Same for the V cache; with flash attention V is stored like K (one row per
// head position), otherwise it is stored transposed and the offset is per element.
tmp_tensor = kv_self.v_l[il];
if (cparams.flash_attn) {
    tmp_offset = (kv_head)*ggml_row_size(kv_self.v_l[il]->type, n_embd_v_gqa);
} else {
    tmp_offset = (kv_head)*ggml_element_size(kv_self.v_l[il]);
}
kv_cache_ptrs.push_back(static_cast<char*>(tmp_tensor->data) + tmp_offset);
Collaborator commented:

I feel like this is going to be hard to adapt to Jamba (#7531) if it assumes alternating ggml_cpy to k_l and v_l in the graph.

(Jamba uses more than only the KV cache; it also has a recurrent state cache, and no simple alternating rule can work with how the layers of Jamba are stacked, since some use the KV cache and some don't.)

Maybe graph caching would simply be disabled for that kind of model architecture to keep it simpler, but it would be nice to have a more general way to update the cached graph. (Or it could be a special case I can handle along with hybrid models, so no need to generalize this first if it's non-trivial.)

ggerganov (Owner) commented:

I would also like to see a more general-purpose solution. This is a good PoC to see what the possible gains are, but it is too hacky.

We should probably profile llama_build_graph() and see if we can brush off any overhead in order to improve the baseline. I think llm_build_cb could be adding some non-negligible overhead, so that's something to look into. Then maybe the ggml_ calls are adding too much overhead - are we sure those are properly inlined by the compiler? Are there any hotspots that we should try to optimize first?
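As an illustration of the kind of measurement being suggested here, a minimal, self-contained timing sketch (not code from the PR or from llama.cpp; build_graph_stub() merely stands in for the per-token llama_build_graph() work):

```cpp
// Ad-hoc timing sketch for per-token graph-build overhead (illustrative only).
#include <chrono>
#include <cstdio>

static void build_graph_stub() {
    // stands in for the per-token graph construction work being profiled
    volatile int sink = 0;
    for (int i = 0; i < 1000000; ++i) sink += i;
}

int main() {
    using clock = std::chrono::steady_clock;
    const int n_tokens = 100;

    const auto t0 = clock::now();
    for (int i = 0; i < n_tokens; ++i) {
        build_graph_stub();
    }
    const auto t1 = clock::now();

    const double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
    std::printf("graph build: %.1f us/token\n", us / n_tokens);
}
```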

agray3 (Contributor, Author) commented:

Thanks both - I agree it would be better to make this more generic, and I'll think about how that can be done - probably by adding more info to each node at the graph build stage, which can then be checked when updating the KV cache parameters. @ggerganov Note that it's not just llama_build_graph() that is avoided here, but also ggml_backend_sched_split_graph() and ggml_backend_sched_alloc_splits() - see the profile at #7456. Across all these CPU activities the profile is fairly flat, so I haven't found any easy overhead optimizations that would have a large effect.

agray3 (Contributor, Author) commented:

@compilade I have now improved identification of K and V nodes - it is now cleaner and no longer assumes that they alternate. Can you please check and let me know whether or not this would now work with your case? Thanks!

agray3 (Contributor, Author) commented:

> I have now improved identification of K and V nodes - it is now cleaner and no longer assumes that they alternate

I realised that I accidentally left in some stale code related to the previous interleaving, now removed.

Collaborator commented:

This seems better, but instead of using a new flag in ggml_tensor, would it be simpler to check whether node->src[1]->name of a GGML_OP_CPY node matches "k_cache_view-%d" or "v_cache_view-%d", or "cache_k_l%d (view)" or "cache_v_l%d (view)"? (The last two are the automatic names ggml_view uses when given k_l or v_l. The prefixes alone might be enough, to avoid matching the layer number. But maybe this would be too many string comparisons?)
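For illustration, a rough sketch of the name-based check being suggested here, written against the real ggml_tensor fields (op, name, src); is_kv_cache_cpy() is a hypothetical helper, not something from the PR, and the prefixes are the ones quoted above:

```cpp
// Identify KV-cache copy nodes by the name of their destination view
// (illustrative helper, not part of the PR).
#include <cstring>
#include "ggml.h"

static bool is_kv_cache_cpy(const struct ggml_tensor * node) {
    if (node->op != GGML_OP_CPY || node->src[1] == NULL) {
        return false;
    }
    const char * name = node->src[1]->name;
    // Match on the prefixes only, so the layer number does not matter.
    return std::strncmp(name, "k_cache_view", 12) == 0 ||
           std::strncmp(name, "v_cache_view", 12) == 0 ||
           std::strncmp(name, "cache_k_l",     9) == 0 ||
           std::strncmp(name, "cache_v_l",     9) == 0;
}
```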

> Can you please check and let me know whether or not this would now work with your case?

Since this currently requires explicitly adding a new flag where k_l and v_l are ggml_cpy-ed to, this could work with Mamba (which on master uses k_l and v_l to store the recurrent states) as long as the relevant flags are also added there (which isn't done at the time of writing, and would conflict with #8526 anyway).

But the approach used in abaf0c6 still assumes that a particular k_l or v_l tensor is never used more than once, which isn't true in #8526: there, the logic to auto-defrag the recurrent states separates the non-modified writes from the writes with modified states, so each of k_l and v_l is used twice, but for different ranges.

The other thing to keep in mind is that the approach should ideally be easily extensible to more than only k_l and v_l, since #7531 adds r_l and s_l (and uses all four, because they have different sizes, but skips some of them for some layers, which won't work with the current k_cache_ptrs and v_cache_ptrs approach, which assumes they are all used no less and no more than once [1]).

What I'd like is for this to:

- Allow multiple uses of ggml_cpy per layer per k_l or v_l tensor
- Allow omitting some per-layer states for some layers
- Make it possible to relatively easily add new types of per-layer states

Footnotes

1. Note that, unlike the multiple-uses case, zero uses will be easier to fix: when building k_cache_ptrs, for example, not including the layers where hparams.n_head_kv(il) is 0 would fix this, I think, at least for Jamba.

agray3 (Contributor, Author) commented:

> instead of using a new flag in ggml_tensor, would it be simpler to check if node->src[1]->name

Great idea - I didn't realize this info was encoded in the name. I've now actually used this approach to further simplify things by extracting the layer id from the name, which allows direct updates of the K and V parameters rather than the previous two-stage solution. I think this now addresses your other points:

> Allow multiple uses of ggml_cpy per layer per k_l or v_l tensor
> Allow omitting some per-layer states for some layers

I think these should both now work OK with the new approach.

> Make it possible to relatively easily add new types of per-layer states

This should now be straightforward by augmenting the part where the k and v parameters are updated.
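For what it's worth, a hypothetical sketch of pulling the layer index out of the view-tensor names discussed above (e.g. "k_cache_view-12" or "cache_k_l12 (view)"); extract_layer_id() is an illustrative helper, not the PR's actual code:

```cpp
// Extract the layer index encoded in a KV-cache view name (illustrative only).
#include <cstdio>

static int extract_layer_id(const char * name) {
    int il = -1;
    if (std::sscanf(name, "k_cache_view-%d", &il) == 1) return il;
    if (std::sscanf(name, "v_cache_view-%d", &il) == 1) return il;
    if (std::sscanf(name, "cache_k_l%d",     &il) == 1) return il;
    if (std::sscanf(name, "cache_v_l%d",     &il) == 1) return il;
    return -1; // not a KV-cache view
}

int main() {
    std::printf("%d\n", extract_layer_id("k_cache_view-7"));     // 7
    std::printf("%d\n", extract_layer_id("cache_v_l3 (view)"));  // 3
    std::printf("%d\n", extract_layer_id("ffn_out-2"));          // -1
}
```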

Collaborator commented:

@agray3 What I see in 3241b3d is better than I expected, especially with extracting the layer number from the name. Nice work! 👍

It definitely now seems like it should work both when there are multiple ggml_cpy per K or V tensor per layer and when not all layers use them. The only thing left regarding recurrent models is to add some cb in build_mamba so that the target views are called k_cache_view and v_cache_view over there too.

mofosyne added the Review Complexity : Medium label (generally requires more time to grok but manageable by beginner to medium expertise level) on Jul 13, 2024
Nexesenex (Contributor) commented:

@agray3: Any follow-up on this PR? I had to revert it in my own llama.cpp fork to keep up with the recent commits, with regret, given the performance boost it gave me back then.

agray3 (Contributor, Author) commented on Oct 10, 2024:

> @agray3: Any follow-up on this PR? I had to revert it in my own llama.cpp fork to keep up with the recent commits, with regret, given the performance boost it gave me back then.

See the comment by Georgi above - #8366 (comment). It's not obvious to me that there's any way to adapt this patch to make it acceptable for merging, so it remains a PoC for now.

Nexesenex (Contributor) commented:

@agray3: Well, your PoC is working. Would you consider maintaining it, so it can be merged without conflicts with the current master for those who want to use it? The perf boost is a no-brainer for me and, I suppose, for some other folks.

Commit: Introduces caching of GGML graph to avoid unnecessary full rebuild between each token. KV cache parameters, which change with each token, are updated directly in cached GGML graph. Can be disabled with GGML_DISABLE_GRAPH_CACHING environment variable.

agray3 force-pushed the ag_ggml_graph_caching branch from d9c7b61 to 23214c9 on October 11, 2024
agray3 (Contributor, Author) commented on Oct 11, 2024:

> @agray3: Well, your PoC is working. Would you consider maintaining it, so it can be merged without conflicts with the current master for those who want to use it? The perf boost is a no-brainer for me and, I suppose, for some other folks.

Sure, I have now rebased it on the latest master branch. Let me know if/when it has conflicts again. Glad it is useful.

Nexesenex (Contributor) commented:

Fantastic, @agray3. TYVM.
