server : fix segfault on long system prompt #8987

compilade · 2024-08-11T18:37:45Z

Should fix #8975 (reported by @pritam-dey3).

The problem was introduced in #6017 by limiting the size of the llama_batch used in the server to n_batch instead of n_ctx, while not updating the logic managing the system prompt processing to not put all the system prompt in the llama_batch.

This fix limits how many tokens of the system prompt are put into the llama_batch of the server at a time so that it never exceeds n_batch, while still allowing larger system prompts by splitting them across batches.

cc @slaren

I have read the contributing guidelines
Self-reported review complexity:
- Low

pritam-dey3 · 2024-08-11T19:15:25Z

Hello @compilade, I was trying to fix it as well, looks like you beat me to it 😅. Won't my fix be a little faster since it avoids the nested loop?

Aside from that I have a small question. If clearing the batch doesn't clear the kv cache then this pr looks good to me👍

compilade · 2024-08-11T19:26:54Z

Hello @compilade, I was trying to fix it as well, looks like you beat me to it 😅.

@pritam-dey3 Nice to see you were also trying to fix the problem you found :)

Won't my fix be a little faster since it avoids the nested loop?

The copies of the system prompt tokens into the batch are done anyway, so I don't see why the nested loop would be slower. It's simply chunking the copy n_batch tokens at a time instead of doing it all up-front.

(Also, calling llama_batch_init each time the system prompt is updated without calling llama_batch_free on the previous llama_batch contained in batch will cause a memory leak. The approach I took here re-uses the same batch buffers every time.)

slaren · 2024-08-13T19:00:13Z

examples/server/server.cpp


-            for (int32_t i = 0; i < batch.n_tokens; i += n_batch) {


I guess there is some outdated logic in update_slots as well, because there is a similar loop there.

The only problem I see in update_slots is when the batch size is smaller than the number of slots, in which case tokens are added to the batch anyway, which could be out-of-bounds.

llama.cpp/examples/server/server.cpp

Line 2002 in 06943a6

llama_batch_add(batch, slot.sampled, system_tokens.size() + slot_npast, { slot.id + 1 }, true);

When the batch size is bigger than the number of slots, it does not seem to have the same problem as the system prompt, because there is already never more than n_batch tokens added to a batch:

llama.cpp/examples/server/server.cpp

Lines 2243 to 2254 in 06943a6

// add prompt tokens for processing in the current batch

// TODO: the self-extend stuff here is a mess - simplify and/or abstract it somehow

for (; slot.n_past < slot.n_prompt_tokens && batch.n_tokens < n_batch; ++slot.n_past) {

if (slot.ga_n != 1) {

while (slot_npast >= ga_i + ga_w) {

const int bd = (ga_w/ga_n)*(ga_n - 1);

slot_npast -= bd;

ga_i += ga_w/ga_n;

}

}

llama_batch_add(batch, prompt_tokens[slot.n_past], system_tokens.size() + slot_npast, { slot.id + 1 }, false);

So yes, this loop:

llama.cpp/examples/server/server.cpp

Line 2313 in 06943a6

for (int32_t i = 0; i < batch.n_tokens; i += n_batch) {

might be a bit outdated, but I think it's still relevant for the error handling logic:

llama.cpp/examples/server/server.cpp

Lines 2375 to 2376 in 06943a6

// retry with half the batch size to try to find a free slot in the KV cache

n_batch /= 2;

From the above I think what could be fixed is parallel generation with small batch sizes:

diff --git a/examples/server/server.cpp b/examples/server/server.cpp index a1bbafbc..b2bf5bb5 100644 --- a/examples/server/server.cpp +++ b/examples/server/server.cpp @@ -753,13 +753,13 @@ struct server_context { default_generation_settings_for_props = get_formated_generation(slots.front()); default_generation_settings_for_props["seed"] = -1; - // the update_slots() logic will always submit a maximum of n_batch tokens + // the update_slots() logic will always submit a maximum of n_batch or n_parallel tokens // note that n_batch can be > n_ctx (e.g. for non-causal attention models such as BERT where the KV cache is not used) { const int32_t n_batch = llama_n_batch(ctx); // only a single seq_id per token is needed - batch = llama_batch_init(n_batch, 0, 1); + batch = llama_batch_init(std::max(n_batch, params.n_parallel), 0, 1); } metrics.init();

slaren

I am not very familiar with the server code, but the fix looks good to me.

* server : fix segfault on long system prompt * server : fix parallel generation with very small batch sizes * server : fix typo in comment

server : fix segfault on long system prompt

7eda558

compilade added bugfix fixes an issue or bug Review Complexity : Low Trivial changes to code that most beginner devs (or those who want a break) can tackle. e.g. UI fix server labels Aug 11, 2024

github-actions bot added the examples label Aug 11, 2024

slaren reviewed Aug 13, 2024

View reviewed changes

slaren approved these changes Aug 13, 2024

View reviewed changes

compilade added 3 commits August 13, 2024 22:03

server : fix parallel generation with very small batch sizes

c1b738e

Merge branch 'master' into compilade/fix-server-long-system-prompt

af2f84c

server : fix typo in comment

93ec58b

ggerganov approved these changes Aug 14, 2024

View reviewed changes

ggerganov merged commit 98a532d into master Aug 14, 2024
53 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

server : fix segfault on long system prompt #8987

server : fix segfault on long system prompt #8987

compilade commented Aug 11, 2024

pritam-dey3 commented Aug 11, 2024 •

edited

Loading

compilade commented Aug 11, 2024

slaren Aug 13, 2024

compilade Aug 14, 2024 •

edited

Loading

slaren left a comment

	// add prompt tokens for processing in the current batch
	// TODO: the self-extend stuff here is a mess - simplify and/or abstract it somehow
	for (; slot.n_past < slot.n_prompt_tokens && batch.n_tokens < n_batch; ++slot.n_past) {
	if (slot.ga_n != 1) {
	while (slot_npast >= ga_i + ga_w) {
	const int bd = (ga_w/ga_n)*(ga_n - 1);
	slot_npast -= bd;
	ga_i += ga_w/ga_n;
	}
	}

	llama_batch_add(batch, prompt_tokens[slot.n_past], system_tokens.size() + slot_npast, { slot.id + 1 }, false);

	// retry with half the batch size to try to find a free slot in the KV cache
	n_batch /= 2;

server : fix segfault on long system prompt #8987

server : fix segfault on long system prompt #8987

Conversation

compilade commented Aug 11, 2024

pritam-dey3 commented Aug 11, 2024 • edited Loading

compilade commented Aug 11, 2024

slaren Aug 13, 2024

Choose a reason for hiding this comment

compilade Aug 14, 2024 • edited Loading

Choose a reason for hiding this comment

slaren left a comment

Choose a reason for hiding this comment

pritam-dey3 commented Aug 11, 2024 •

edited

Loading

compilade Aug 14, 2024 •

edited

Loading