Regression in prompt processing speed using a batch size of 1024 #6075

Dampfinchen · 2024-03-15T07:53:06Z

Hello,

I've noticed a significant speed reduction in prompt processing when comparing the latest llama.cpp builds to slightly older ones.

I think it has something to do with the batch size. The speed at a batch size of 512 is the same as it always has been, but if I'm using -b 1024 it's significantly slower.

Comparison latest llama.cpp: -n 180 -c 4096 -t 6 --gpu-layers 5 --ignore-eos -b 1024, Mixtral IQ4_XS, Core i7 9750H, 32 GB RAM, RTX 2060

version: 2431 (4755afd)

llama_print_timings:        load time =    2339,43 ms
llama_print_timings:      sample time =      67,74 ms /   180 runs   (    0,38 ms per token,  2657,10 tokens per second)
llama_print_timings: prompt eval time =   72387,34 ms /  3602 tokens (   20,10 ms per token,    49,76 tokens per second)
llama_print_timings:        eval time =   44119,33 ms /   179 runs   (  246,48 ms per token,     4,06 tokens per second)
llama_print_timings:       total time =  116631,73 ms /  3781 tokens

version: 2405 (5cdb371)

llama_print_timings:        load time =    2482,92 ms
llama_print_timings:      sample time =      69,55 ms /   180 runs   (    0,39 ms per token,  2587,99 tokens per second)
llama_print_timings: prompt eval time =   51669,64 ms /  3602 tokens (   14,34 ms per token,    69,71 tokens per second)
llama_print_timings:        eval time =   42287,08 ms /   179 runs   (  236,24 ms per token,     4,23 tokens per second)
llama_print_timings:       total time =   94085,31 ms /  3781 tokens

@slaren Do you think there is a commit that could have caused this? Listening to the coil whine of my laptop while processing the prompt, there's a very noticeable different in the sound. With the recent commit, it sounds like it's processing two 512 batches instead of one 1024 batch (there's a noticeable pause in the coil whine at some point) even though in the terminal it looks like the usual 1024 batch size. With the older commit, there is no such pause and the sound is continuous for the whole 1024 tokens.

The speed difference is quite stark (20 ms/t vs 14 ms/t). I hope you can take a look at this! Thank you

The text was updated successfully, but these errors were encountered:

LostRuins · 2024-03-15T08:21:13Z

Probably happened after #6017

From that PR

Automatic batch splitting in llama_decode
llama_decode automatically splits the batches into multiple smaller batches if it is too big for the configured compute batch size
The largest batch size that can be submitted to llama_decode is still limited by n_batch to reduce the size of the logits and embeddings buffers
Adds n_ubatch (-ub in the command line) to llama_context_params parameter
n_batch sets the size of the logits and embeddings buffer, which limits the maximum batch size passed to llama_decode
n_ubatch sets the maximum batch size for computation
By default n_batch is 4096, n_ubatch is 512
This allows current applications to take advantage of pipeline parallelism by setting a larger n_batch without having to update their logic

What happens if you run -ub 1024

Dampfinchen · 2024-03-15T08:35:10Z

Probably happened after #6017

From that PR

Automatic batch splitting in llama_decode
llama_decode automatically splits the batches into multiple smaller batches if it is too big for the configured compute batch size
The largest batch size that can be submitted to llama_decode is still limited by n_batch to reduce the size of the logits and embeddings buffers
Adds n_ubatch (-ub in the command line) to llama_context_params parameter
n_batch sets the size of the logits and embeddings buffer, which limits the maximum batch size passed to llama_decode
n_ubatch sets the maximum batch size for computation
By default n_batch is 4096, n_ubatch is 512
This allows current applications to take advantage of pipeline parallelism by setting a larger n_batch without having to update their logic

What happens if you run -ub 1024

Yes, I can confirm this fixes it.

Although I have the feeling it uses more VRAM than before. Needs more testing.

Edit: Nope, my testing shows no increase of VRAM. All is good.

slaren · 2024-03-15T11:07:34Z

Looks like you already figured it out, the parameter to change the physical batch size is now -ub. I will open a PR later today that should improve batch performance with partial offloading significantly.

Dampfinchen added the bug-unconfirmed label Mar 15, 2024

slaren closed this as completed Mar 15, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Regression in prompt processing speed using a batch size of 1024 #6075

Regression in prompt processing speed using a batch size of 1024 #6075

Dampfinchen commented Mar 15, 2024 •

edited

Loading

LostRuins commented Mar 15, 2024 •

edited

Loading

Dampfinchen commented Mar 15, 2024 •

edited

Loading

slaren commented Mar 15, 2024

Regression in prompt processing speed using a batch size of 1024 #6075

Regression in prompt processing speed using a batch size of 1024 #6075

Comments

Dampfinchen commented Mar 15, 2024 • edited Loading

LostRuins commented Mar 15, 2024 • edited Loading

Dampfinchen commented Mar 15, 2024 • edited Loading

slaren commented Mar 15, 2024

Dampfinchen commented Mar 15, 2024 •

edited

Loading

LostRuins commented Mar 15, 2024 •

edited

Loading

Dampfinchen commented Mar 15, 2024 •

edited

Loading