ggml: avoid rebuild of GGML graph for each token (#7456) #8366
base: master
Conversation
Here are the BS 1 inference performance improvements I have measured for this optimization (all on Linux):
@slaren @JohannesGaessler @ggerganov this is currently working OK for my local tests but I'd appreciate any further testing from your side to help determine if it needs to be made more robust.
Did you check whether with this optimization the performance for large batch sizes becomes better with CUDA graphs enabled?
No, at the moment both this optimization and CUDA graphs are only activated with batch size 1 (investigating BS > 1 is future work).
This is currently causing a segfault for llama-bench - working on it.
Now fixed.
src/llama.cpp
Outdated
// K cache: offset into k_l[il] by kv_head rows of n_embd_k_gqa elements
ggml_tensor * tmp_tensor = kv_self.k_l[il];
size_t tmp_offset = ggml_row_size(kv_self.k_l[il]->type, n_embd_k_gqa)*kv_head;
kv_cache_ptrs.push_back(static_cast<char*>(tmp_tensor->data) + tmp_offset);

// V cache: with flash attention V is stored row-major like K,
// otherwise it is stored transposed and the offset is per element
tmp_tensor = kv_self.v_l[il];
if (cparams.flash_attn) {
    tmp_offset = kv_head*ggml_row_size(kv_self.v_l[il]->type, n_embd_v_gqa);
} else {
    tmp_offset = kv_head*ggml_element_size(kv_self.v_l[il]);
}
kv_cache_ptrs.push_back(static_cast<char*>(tmp_tensor->data) + tmp_offset);
I feel like this is going to be hard to adapt to Jamba (#7531) if this assumes alternating `ggml_cpy` to `k_l` and `v_l` in the graph.
(Jamba uses more than only the KV cache; it also has a recurrent state cache, and no simple alternating rule can work with how the layers of Jamba are stacked (some use the KV cache, some don't).)
Maybe graph caching would simply be disabled for that kind of model architecture to keep it simpler, but it would be nice to have a more general way to update the cached graph. (Or it could be a special case I can handle along with hybrid models, so no need to generalize this first if it's non-trivial.)
I would also like to see a more general-purpose solution. This is a good PoC to see what the possible gains are, but it is too hacky.
Probably we should profile `llama_build_graph()` and see if we can brush off any overhead in order to improve the baseline. I think the `llm_build_cb` could be adding some non-negligible overhead, so it's something to look into. Then maybe the `ggml_` calls are adding too much overhead - are we sure those are properly inlined by the compiler? Are there any hotspots that we should try to optimize first?
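A quick first-order way to answer that is to time the build path directly with `ggml_time_us()`. This is only a sketch, not code from the PR; the wrapper function and its call site are assumptions:

```cpp
// Sketch: accumulate the per-token cost of graph construction. The wrapper
// name and the commented-out call are placeholders for the real call site
// inside llama_decode(); only ggml_time_us() is taken from ggml itself.
#include "ggml.h"
#include <cstdio>

static int64_t t_build_us = 0; // accumulated graph-build time
static int64_t n_builds   = 0; // number of graph builds

static struct ggml_cgraph * build_graph_timed(/* ...build args... */) {
    const int64_t t_start = ggml_time_us();
    struct ggml_cgraph * gf = nullptr; // = llama_build_graph(...);
    t_build_us += ggml_time_us() - t_start;
    n_builds++;
    if (n_builds % 100 == 0) {
        fprintf(stderr, "graph build: %.1f us/token on average\n",
                (double) t_build_us / (double) n_builds);
    }
    return gf;
}
```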
Thanks both - I agree it would be better to make this more generic and I'll think about how that can be done, probably by adding more info to each node at the graph build stage, which can then be checked when updating the KV cache parameters. @ggerganov Note that it's not just `llama_build_graph()` that is avoided here, but also `ggml_backend_sched_split_graph()` and `ggml_backend_sched_alloc_splits()` - see the profile in #7456.
Across all these CPU activities the profile is fairly flat, so I haven't found any easy overhead optimizations that would have a large effect.
@compilade I have now improved identification of K and V nodes - it is now cleaner and no longer assumes that they alternate. Can you please check and let me know whether or not this would now work with your case? Thanks!
> I have now improved identification of K and V nodes - it is now cleaner and no longer assumes that they alternate
I realised that I accidentally left in some stale code related to the previous interleaving, now removed.
This seems better, but instead of using a new flag in `ggml_tensor`, would it be simpler to check if `node->src[1]->name` of a `GGML_OP_CPY` node matches `"k_cache_view-%d"` or `"v_cache_view-%d"`, or `"cache_k_l%d (view)"` or `"cache_v_l%d (view)"`? (The last two are the automatic names which `ggml_view` uses when given `k_l` or `v_l`.) (The prefixes might be enough, to avoid matching the layer number.) (But maybe this would be too many string comparisons?)
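For concreteness, a check along those lines could look roughly like this (a sketch only; the helper names are made up, and the prefixes are the ones listed above):

```cpp
// Sketch: classify a graph node as a K or V cache copy by looking at the name
// of its destination view (src[1] of GGML_OP_CPY), instead of adding a new
// flag to ggml_tensor. Prefix matching avoids comparing the layer suffix.
#include "ggml.h"
#include <cstring>

static bool name_has_prefix(const char * name, const char * prefix) {
    return strncmp(name, prefix, strlen(prefix)) == 0;
}

static bool is_kv_cache_cpy(const struct ggml_tensor * node, bool & is_k) {
    if (node->op != GGML_OP_CPY || node->src[1] == nullptr) {
        return false;
    }
    const char * name = node->src[1]->name;
    if (name_has_prefix(name, "k_cache_view") || name_has_prefix(name, "cache_k_l")) {
        is_k = true;
        return true;
    }
    if (name_has_prefix(name, "v_cache_view") || name_has_prefix(name, "cache_v_l")) {
        is_k = false;
        return true;
    }
    return false;
}
```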
> Can you please check and let me know whether or not this would now work with your case?
Since this currently requires explicitly adding a new flag where `k_l` and `v_l` are `ggml_cpy`-ed to, this could work with Mamba (which on `master` uses `k_l` and `v_l` to store the recurrent states), as long as the relevant flags are also added there (which isn't yet done at the time of writing, although it would conflict with #8526 anyway).
But the approach used in abaf0c6 still assumes that a particular `k_l` or `v_l` tensor is never used more than once, which isn't true in #8526, where the logic to auto-defrag the recurrent states separates the non-modified writes from the writes with modified states, and so uses each of `k_l` and `v_l` twice, but for different ranges.
The other thing to keep in mind is that the approach used should ideally be easily extensible to more than only `k_l` and `v_l`, since #7531 adds `r_l` and `s_l`. (It uses all four, because they have different sizes, but skips some of them for some layers, which won't work with the current `k_cache_ptrs` and `v_cache_ptrs` approach, which assumes they are all used no less and no more than once.[^1])
What I'd like is for this to:
- Allow multiple uses of `ggml_cpy` per layer per `k_l` or `v_l` tensor
  - Used by llama : simplify Mamba with advanced batch splits #8526 to simplify making the recurrent state slots always contiguous
- Make it possible to relatively easily add new types of per-layer states
  - No need to add anything up-front, but simply keep in mind this will need to be extended.
  - For hybrid model architectures, as in llama : support Jamba hybrid Transformer-Mamba models #7531
- Allow omitting some per-layer states for some layers
  - Jamba doesn't have Attention on all layers
Footnotes

[^1]: Note that unlike the multiple uses, zero uses will be easier to fix, because when building `k_cache_ptrs`, for example, not including the layers where `hparams.n_head_kv(il)` is `0` would fix this, I think, at least for Jamba.
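For illustration, the fix suggested in the footnote could look roughly like this inside the PR's pointer-collection loop (a hypothetical fragment; the real code also computes the per-layer offsets shown earlier):

```cpp
// Sketch: only collect K/V cache pointers for layers that actually have
// attention heads, so recurrent-only layers (as in Jamba) are skipped.
for (int il = 0; il < (int) hparams.n_layer; ++il) {
    if (hparams.n_head_kv(il) == 0) {
        continue; // no attention on this layer -> no KV cache copy to patch
    }
    k_cache_ptrs.push_back(static_cast<char *>(kv_self.k_l[il]->data));
    v_cache_ptrs.push_back(static_cast<char *>(kv_self.v_l[il]->data));
}
```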
> instead of using a new flag in `ggml_tensor`, would it be simpler to check if `node->src[1]->name`

Great idea - I didn't realize this info was encoded in the name. I've now actually used this approach to further simplify things by extracting the layer id from the name, which allows direct updates of the K and V parameters rather than the previous two-stage solution. I think this now addresses your other points:
> Allow multiple uses of `ggml_cpy` per layer per `k_l` or `v_l` tensor
>
> Allow omitting some per-layer states for some layers

I think these should both now work OK with the new approach.

> Make it possible to relatively easily add new types of per-layer states

This should now be straightforward by augmenting the part where the K and V parameters are updated.
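A rough sketch of that layer-id extraction, assuming the view names discussed above (the helper itself is illustrative, not the exact code in the PR):

```cpp
// Sketch: recover which layer (and whether K or V) a GGML_OP_CPY node writes
// to, by parsing the destination view's name. Returns -1 if it is not a
// K/V cache copy; otherwise the layer id, so the cached graph's copy for that
// layer can be updated in place with the new kv_head offset.
#include "ggml.h"
#include <cstdio>  // sscanf

static int kv_cache_cpy_layer(const struct ggml_tensor * node, bool & is_k) {
    if (node->op != GGML_OP_CPY || node->src[1] == nullptr) {
        return -1;
    }
    const char * name = node->src[1]->name;
    int il = -1;
    if (sscanf(name, "k_cache_view-%d", &il) == 1 || sscanf(name, "cache_k_l%d", &il) == 1) {
        is_k = true;
        return il;
    }
    if (sscanf(name, "v_cache_view-%d", &il) == 1 || sscanf(name, "cache_v_l%d", &il) == 1) {
        is_k = false;
        return il;
    }
    return -1;
}
```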
@agray3 What I see in 3241b3d is better than I expected, especially with extracting the layer number from the name. Nice work! 👍
It definitely now seems like it should work both when there are multiple `ggml_cpy` per K or V tensor per layer and when not all layers use them. The only thing left regarding recurrent models is to add some `cb` calls in `build_mamba` so that the target views are called `k_cache_view` and `v_cache_view` over there too.
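That would amount to passing the same names through the existing build callback, along these lines (a sketch with placeholder view names; the actual views in `build_mamba` are the ones copied into `kv_self.k_l[il]` and `v_l[il]`):

```cpp
// Sketch: name Mamba's per-layer state views with the attention-path names so
// the cached-graph update can find them. 'cb' is the llm_build_cb already
// passed to the graph builders; conv_view/ssm_view are placeholders.
static void name_mamba_state_views(const llm_build_cb & cb,
                                   struct ggml_tensor * conv_view,
                                   struct ggml_tensor * ssm_view,
                                   int il) {
    cb(conv_view, "k_cache_view", il);
    cb(ssm_view,  "v_cache_view", il);
}
```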
@agray3: Any follow-up on this PR? I had to revert it on my own llama.cpp build to keep up with the recent commits, with regret, given the performance boost it gave me back then.
See the comment by Georgi above - #8366 (comment). It's not obvious to me that there's any way to adapt this patch to make it acceptable for merging, so it remains a POC for now.
@agray3: Well, your POC is working. Would you consider maintaining it, so it can be merged without conflicts with current master for those who want to use it? The perf boost is a no-brainer for me and, I suppose, for some other folks.
Introduces caching of GGML graph to avoid unnecessary full rebuild between each token. KV cache parameters, which change with each token, are updated directly in cached GGML graph. Can be disabled with GGML_DISABLE_GRAPH_CACHING environment variable.
Force-pushed from d9c7b61 to 23214c9
Sure, I have now rebased it on the latest master branch. Let me know if/when it has conflicts again. Glad it is useful.
Fantastic, @agray3. TYVM.
Introduces caching of GGML graph to avoid unnecessary full rebuild between each token. KV cache parameters, which change with each token, are updated directly in cached GGML graph. Can be disabled with GGML_DISABLE_GRAPH_CACHING environment variable.
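For reference, the GGML_DISABLE_GRAPH_CACHING switch mentioned in the description presumably reduces to an environment-variable check along these lines (a sketch; only the variable name comes from the commit message, the helper is assumed):

```cpp
// Sketch: gate the graph-caching path on an environment variable. Here any
// non-empty value disables caching; the PR's exact semantics may differ.
#include <cstdlib>

static bool graph_caching_enabled() {
    const char * env = std::getenv("GGML_DISABLE_GRAPH_CACHING");
    return env == nullptr || env[0] == '\0';
}
```

Running with, e.g., `GGML_DISABLE_GRAPH_CACHING=1` in the environment should then fall back to rebuilding the graph for every token, which is useful when comparing against baseline behaviour.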