-
Notifications
You must be signed in to change notification settings - Fork 9.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
server : fix segfault on long system prompt #8987
Conversation
Hello @compilade, I was trying to fix it as well, looks like you beat me to it 😅. Won't my fix be a little faster since it avoids the nested loop? Aside from that I have a small question. If clearing the batch doesn't clear the kv cache then this pr looks good to me👍 |
@pritam-dey3 Nice to see you were also trying to fix the problem you found :)
The copies of the system prompt tokens into the batch are done anyway, so I don't see why the nested loop would be slower. It's simply chunking the copy (Also, calling |
|
||
for (int32_t i = 0; i < batch.n_tokens; i += n_batch) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess there is some outdated logic in update_slots
as well, because there is a similar loop there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only problem I see in update_slots
is when the batch size is smaller than the number of slots, in which case tokens are added to the batch anyway, which could be out-of-bounds.
llama.cpp/examples/server/server.cpp
Line 2002 in 06943a6
llama_batch_add(batch, slot.sampled, system_tokens.size() + slot_npast, { slot.id + 1 }, true); |
When the batch size is bigger than the number of slots, it does not seem to have the same problem as the system prompt, because there is already never more than n_batch
tokens added to a batch:
llama.cpp/examples/server/server.cpp
Lines 2243 to 2254 in 06943a6
// add prompt tokens for processing in the current batch | |
// TODO: the self-extend stuff here is a mess - simplify and/or abstract it somehow | |
for (; slot.n_past < slot.n_prompt_tokens && batch.n_tokens < n_batch; ++slot.n_past) { | |
if (slot.ga_n != 1) { | |
while (slot_npast >= ga_i + ga_w) { | |
const int bd = (ga_w/ga_n)*(ga_n - 1); | |
slot_npast -= bd; | |
ga_i += ga_w/ga_n; | |
} | |
} | |
llama_batch_add(batch, prompt_tokens[slot.n_past], system_tokens.size() + slot_npast, { slot.id + 1 }, false); |
So yes, this loop:
llama.cpp/examples/server/server.cpp
Line 2313 in 06943a6
for (int32_t i = 0; i < batch.n_tokens; i += n_batch) { |
might be a bit outdated, but I think it's still relevant for the error handling logic:
llama.cpp/examples/server/server.cpp
Lines 2375 to 2376 in 06943a6
// retry with half the batch size to try to find a free slot in the KV cache | |
n_batch /= 2; |
From the above I think what could be fixed is parallel generation with small batch sizes:
diff --git a/examples/server/server.cpp b/examples/server/server.cpp
index a1bbafbc..b2bf5bb5 100644
--- a/examples/server/server.cpp
+++ b/examples/server/server.cpp
@@ -753,13 +753,13 @@ struct server_context {
default_generation_settings_for_props = get_formated_generation(slots.front());
default_generation_settings_for_props["seed"] = -1;
- // the update_slots() logic will always submit a maximum of n_batch tokens
+ // the update_slots() logic will always submit a maximum of n_batch or n_parallel tokens
// note that n_batch can be > n_ctx (e.g. for non-causal attention models such as BERT where the KV cache is not used)
{
const int32_t n_batch = llama_n_batch(ctx);
// only a single seq_id per token is needed
- batch = llama_batch_init(n_batch, 0, 1);
+ batch = llama_batch_init(std::max(n_batch, params.n_parallel), 0, 1);
}
metrics.init();
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not very familiar with the server code, but the fix looks good to me.
* server : fix segfault on long system prompt * server : fix parallel generation with very small batch sizes * server : fix typo in comment
Should fix #8975 (reported by @pritam-dey3).
The problem was introduced in #6017 by limiting the size of the
llama_batch
used in the server ton_batch
instead ofn_ctx
, while not updating the logic managing the system prompt processing to not put all the system prompt in thellama_batch
.This fix limits how many tokens of the system prompt are put into the
llama_batch
of the server at a time so that it never exceedsn_batch
, while still allowing larger system prompts by splitting them across batches.cc @slaren