parallel.cpp exits when encountering a long prompt. #4086
Your cache actually got cleared correctly (probably, anyway!). The problem is that llama_get_kv_cache_token_count doesn't report what you'd expect: the value it returns is not the number of tokens actually still in use.
Thank you for your response. I am using the low-level API of llama-cpp-python (which matches llama.h exactly) to implement dynamic batch processing in Python, following the example in parallel.cpp. However, I am seeing a strange phenomenon: when running with parallel > 1, GPU memory grows by an extra 30-200 MB at the llama_decode() call. This growth accumulates over time until the program crashes.

Yesterday I suspected it might be a KV cache issue. At the moment I am reviewing the differences between my implementation and parallel.cpp. Do you have any suggestions, for example which parts are most likely to cause a memory leak?

Today I also observed that after the abnormal increase in GPU memory, llama_get_state_size reports a significantly larger value. I'm not sure whether the two are correlated; it seems this might be related to some variable in the context.
Well, I can't tell you what the problem is but I can basically rule out the KV cache for you. You can pretty much trust that clearing the KV cache works correctly, but even if it didn't, it wouldn't matter for the purposes of memory leaks. As far as I know, all the KV cache memory gets allocated up front based on the context size you set. So for the most part, it just doesn't matter what's in it.
Are you saying that the memory increases while running decode on a large parallel sequence, or that the memory continues increasing in between calls to decode with the parallel sequence? In other words, when the call to decode ends and you get your result, does the memory usage go back down?
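As a rough sanity check of that point, here is a back-of-the-envelope estimate of the up-front KV cache allocation. All model shapes below are assumptions for a 7B LLaMA-style model (32 layers, 4096-wide K/V per token, no GQA, f16 cache), not numbers taken from this thread; the reserved size depends only on the context size you request.

// Back-of-the-envelope estimate of the KV cache reserved at context creation.
// Every value below is an assumption for a 7B-class model.
#include <cstdio>

int main() {
    const long long n_layer       = 32;     // transformer layers (assumed)
    const long long n_embd_kv     = 4096;   // K (and V) width per token (assumed, no GQA)
    const long long n_ctx         = 16384;  // context size requested with -c
    const long long bytes_per_elt = 2;      // f16 K/V

    // one K tensor and one V tensor per layer, one slot per context position
    const long long kv_bytes = 2 * n_layer * n_ctx * n_embd_kv * bytes_per_elt;
    std::printf("estimated KV cache: %.2f GiB\n", kv_bytes / (1024.0 * 1024.0 * 1024.0));
    // roughly 8 GiB for -c 16384, reserved whether the cache is empty or full,
    // which is why clearing sequences does not shrink GPU memory usage
    return 0;
}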
I'd say you'd probably have better luck asking in their repo. These other projects that build on llama.cpp aren't necessarily using the latest version, they may have their own patches applied, they may be tweaking settings, etc. Debugging problems is basically a process of elimination, and there are too many unknowns for someone who just knows about llama.cpp to deal with in this case. Or, you can try reproducing the issue using the latest version of llama.cpp directly.
Diagnosing the KV cache state is awkward at the moment. Regarding llama_get_kv_cache_token_count, it would help if it returned the actual number of tokens currently held in the cache.
Definitely would be easy. Slightly harder is answering the question "what's a token?" Should it return the number of populated cells or the sum of the cells' sequence lengths? Like, if a cell belongs to 10 sequences, is that 10 tokens?
I guess we can have 2 counters - "number of tokens" and "number of occupied cells" - and add an API for the cells.
If we're changing the API, how about something that basically just exports the whole KV cache state so people can extract whatever information is useful? Maybe even add the token id to it. Even for a 200,000 context size, that's only 800 KB if it's a 32-bit type. Something like the batch API: you create/destroy the structure once, and another function copies the state into it (since you probably wouldn't want to allocate on every fetch). #4035 wanted that functionality, I think (and also having the count function fixed as well).
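One hypothetical shape for such an export, shown only to make the idea concrete: every name and field below is illustrative, none of it is an actual llama.h declaration, and it simply follows the same create/update/free pattern as the batch API.

// Illustrative sketch only: these types and functions are hypothetical, not part of llama.h.
#include <cstdint>

struct llama_context; // opaque, as in llama.h

struct kv_cell_info {
    int32_t pos;       // position stored in this cell, -1 if the cell is empty
    int32_t n_seq_id;  // how many sequences reference this cell
    int32_t token;     // token id, if we also export it as suggested above
};

struct kv_cache_view {
    int32_t        n_cells;     // total cells (== context size)
    int32_t        used_cells;  // number of occupied cells
    int32_t        token_count; // sum of per-cell sequence counts
    kv_cell_info * cells;       // buffer of n_cells entries, owned by the view
};

// allocate once, refresh as often as needed, free at the end
kv_cache_view kv_cache_view_init(int32_t n_cells);
void          kv_cache_view_update(const llama_context * ctx, kv_cache_view * view);
void          kv_cache_view_free(kv_cache_view * view);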
The memory usage increases during large-scale parallel decoding, but it doesn't increase with every decode call. Once the memory usage has increased, it doesn't decrease even as more sequences complete; it keeps accumulating.
Additionally, could you provide more detailed comments for the parallel.cpp example? For instance, explaining which parts of the code have potential pitfalls would make it easier to understand. I'm not proficient in C++, so clearer explanations would be helpful.
My apologies for disturbing you again. @KerfuffleV2 @ggerganov
I think I have found the cause. After changing the example prompt in parallel.cpp to a longer one of about 1500-2000 tokens, I can consistently reproduce the issue with the following command:

./parallel -m ./CodeLlama-7B/ggml-model-q8_0.gguf -n -1 -c 16324 -b 4096 --cont_batching --parallel 2 --sequences 600 --n-gpu-layers 1000

Failure Information

GPU memory suddenly increases, then:

cuBLAS error 13 at ggml-cuda.cu:6464: the function failed to launch on the GPU
current device: 0

I am running only this one parallel.cpp task on the GPU, so I clearly have sufficient resources, yet the program still throws this error. What configuration should I adjust, or what should I do, to avoid this problem? Or is this a bug?

Steps to Reproduce

Using the latest llama.cpp repository code (commit_id: dae06c0):

python convert.py ./CodeLlama-7B/ --outtype q8_0
Sorry, you're not disturbing me, but I don't really have much to add at this point. I don't know enough about your specific problem to say something helpful. 4096 (or even 1024) sounds like a very, very high batch size, though. Maybe it's normal for those cards. Hopefully GG will be able to help you; he wrote the parallel example.
I found that the comments on this pull request are similar to the issue I encountered, but verifying this problem is beyond my ability.
Does it work with …?
@littlebai3618 I was able to reproduce the issue and find the root cause. It is a pathological case of the issue with the CUDA memory pool described here: #3903 (comment)

Combined with non-optimal KV cache utilization and fragmentation in this specific case, it leads to extra memory allocations at runtime after several sequences are processed. A quick workaround is the following:

diff --git a/llama.cpp b/llama.cpp
index c2ad0486..5851a2ee 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -5469,6 +5469,8 @@ static int llama_decode_internal(
         batch.seq_id = seq_id_arr.data();
     }
 
+    kv_self.head = 0;
+
     if (!llama_kv_cache_find_slot(kv_self, batch)) {
         return 1;
     }

Can you confirm that this fixes the issue on your side?
Also, try the following branch: #4170. It should resolve the issue in a better way.
I tested the code from the 'kv-cache-opts' branch on a V100S-PCIE-32G. I ran the test using my modified parallel.cpp (with the longer prompt described above).
I also compiled this branch and tested my Python continuous-batching code with llama-cpp-python 0.2.19, using -n 4096 -c 8162 with two concurrent requests for one hour without any errors occurring. However, I have a new question, which may sound silly, since I'm not familiar with this field: what is the relationship between the batch size, the context size, and the number of parallel sequences?
Here are the test results:
Test details

Test 1: -c 4096 -b 4096 --cont_batching --parallel 2
./parallel -m ./CodeLlama-7B/ggml-model-q8_0.gguf -n -1 -c 4096 -b 4096 --cont_batching --parallel 2 --sequences 60 --n-gpu-layers 1000

output:
Test 2: -c 4096 -b 4096 --cont_batching --parallel 4
./parallel -m ./CodeLlama-7B/ggml-model-q8_0.gguf -n -1 -c 4096 -b 4096 --cont_batching --parallel 4 --sequences 60 --n-gpu-layers 1000

output:
main: Simulating parallel requests from clients:
main: n_parallel = 4, n_sequences = 60, cont_batching = 1, system tokens = 1
main: Evaluating the system prompt ...
Processing requests ...
main: clearing the KV cache
Client 0, seq 0, started decoding ...
Client 1, seq 1, started decoding ...
Segmentation fault (core dumped)

Test 3: -c 8162 -b 4096 --cont_batching --parallel 2
./parallel -m ./CodeLlama-7B/ggml-model-q8_0.gguf -n -1 -c 8162 -b 4096 --cont_batching --parallel 2 --sequences 60 --n-gpu-layers 1000

output:
run parameters as at 2023-11-23 03:22:20
main: n_parallel = 2, n_sequences = 60, cont_batching = 1, system tokens = 1
External prompt file: used built-in defaults
Model and path used: ./CodeLlama-7B/ggml-model-q8_0.gguf
Total prompt tokens: 106860, speed: 1182.29 t/s
Total gen tokens: 433, speed: 4.79 t/s
Total speed (AVG): speed: 1187.08 t/s
Cache misses: 0
llama_print_timings: load time = 6436.07 ms
llama_print_timings: sample time = 218.26 ms / 493 runs ( 0.44 ms per token, 2258.82 tokens per second)
llama_print_timings: prompt eval time = 89482.04 ms / 107290 tokens ( 0.83 ms per token, 1199.01 tokens per second)
llama_print_timings: eval time = 72.03 ms / 4 runs ( 18.01 ms per token, 55.53 tokens per second)
llama_print_timings: total time = 90384.80 ms

Test 4: -c 8162 -b 4096 --cont_batching --parallel 4
./parallel -m ./CodeLlama-7B/ggml-model-q8_0.gguf -n -1 -c 8162 -b 4096 --cont_batching --parallel 4 --sequences 60 --n-gpu-layers 1000

output:
main: Simulating parallel requests from clients:
main: n_parallel = 4, n_sequences = 60, cont_batching = 1, system tokens = 1
main: Evaluating the system prompt ...
Processing requests ...
main: clearing the KV cache
Client 0, seq 0, started decoding ...
Client 1, seq 1, started decoding ...
Client 2, seq 2, started decoding ...
Client 3, seq 3, started decoding ...
CUDA error 700 at ggml-cuda.cu:6951: an illegal memory access was encountered
current device: 0

Test 5: -c 8162 -b 3072 --cont_batching --parallel 4
./parallel -m ./CodeLlama-7B/ggml-model-q8_0.gguf -n -1 -c 8162 -b 3072 --cont_batching --parallel 4 --sequences 60 --n-gpu-layers 1000

output:
main: n_parallel = 4, n_sequences = 60, cont_batching = 1, system tokens = 1
main: Evaluating the system prompt ...
Processing requests ...
main: clearing the KV cache
Client 0, seq 0, started decoding ...
Client 1, seq 1, started decoding ...
Client 2, seq 2, started decoding ...
Client 3, seq 3, started decoding ...
CUDA error 700 at ggml-cuda.cu:6951: an illegal memory access was encountered
current device: 0

Test 6: -c 8162 -b 2048 --cont_batching --parallel 4
./parallel -m ./CodeLlama-7B/ggml-model-q8_0.gguf -n -1 -c 8162 -b 2048 --cont_batching --parallel 4 --sequences 60 --n-gpu-layers 1000

output:
run parameters as at 2023-11-23 03:35:29
main: n_parallel = 4, n_sequences = 60, cont_batching = 1, system tokens = 1
External prompt file: used built-in defaults
Model and path used: ./CodeLlama-7B/ggml-model-q8_0.gguf
Total prompt tokens: 106860, speed: 875.11 t/s
Total gen tokens: 301, speed: 2.46 t/s
Total speed (AVG): speed: 877.58 t/s
Cache misses: 65
llama_print_timings: load time = 5547.56 ms
llama_print_timings: sample time = 152.90 ms / 361 runs ( 0.42 ms per token, 2361.10 tokens per second)
llama_print_timings: prompt eval time = 121359.56 ms / 107159 tokens ( 1.13 ms per token, 882.99 tokens per second)
llama_print_timings: eval time = 48.57 ms / 3 runs ( 16.19 ms per token, 61.77 tokens per second)
llama_print_timings: total time = 122110.98 ms

Test 7: -c 16384 -b 4096 --cont_batching --parallel 2
./parallel -m ./CodeLlama-7B/ggml-model-q8_0.gguf -n -1 -c 16384 -b 4096 --cont_batching --parallel 2 --sequences 60 --n-gpu-layers 1000

output:
run parameters as at 2023-11-23 03:30:15
main: n_parallel = 2, n_sequences = 60, cont_batching = 1, system tokens = 1
External prompt file: used built-in defaults
Model and path used: ./CodeLlama-7B/ggml-model-q8_0.gguf
Total prompt tokens: 106860, speed: 1105.28 t/s
Total gen tokens: 422, speed: 4.36 t/s
Total speed (AVG): speed: 1109.64 t/s
Cache misses: 0
llama_print_timings: load time = 11462.06 ms
llama_print_timings: sample time = 218.06 ms / 482 runs ( 0.45 ms per token, 2210.38 tokens per second)
llama_print_timings: prompt eval time = 95798.40 ms / 107280 tokens ( 0.89 ms per token, 1119.85 tokens per second)
llama_print_timings: eval time = 49.38 ms / 3 runs ( 16.46 ms per token, 60.76 tokens per second)
llama_print_timings: total time = 96682.93 ms

Test 8: -c 16384 -b 4096 --cont_batching --parallel 4
./parallel -m ./CodeLlama-7B/ggml-model-q8_0.gguf -n -1 -c 16384 -b 4096 --cont_batching --parallel 4 --sequences 60 --n-gpu-layers 1000

output:
main: Simulating parallel requests from clients:
main: n_parallel = 4, n_sequences = 60, cont_batching = 1, system tokens = 1
main: Evaluating the system prompt ...
Processing requests ...
main: clearing the KV cache
Client 0, seq 0, started decoding ...
Client 1, seq 1, started decoding ...
Client 2, seq 2, started decoding ...
Client 3, seq 3, started decoding ...
CUDA error 700 at ggml-cuda.cu:6951: an illegal memory access was encountered
current device: 0

Test 9: -c 8162 -b 4096 --cont_batching --parallel 2
./parallel -m /aistudio/workspace/system-default/models/CodeLlama-7B/ggml-model-q8_0.gguf -n -1 -c 8162 -b 4096 --cont_batching --parallel 2 --sequences 600 --n-gpu-layers 1000

output:
run parameters as at 2023-11-23 04:15:22
main: n_parallel = 2, n_sequences = 600, cont_batching = 1, system tokens = 1
External prompt file: used built-in defaults
Model and path used: /aistudio/workspace/system-default/models/CodeLlama-7B/ggml-model-q8_0.gguf
Total prompt tokens: 1068600, speed: 1175.07 t/s
Total gen tokens: 4699, speed: 5.17 t/s
Total speed (AVG): speed: 1180.23 t/s
Cache misses: 0
llama_print_timings: load time = 6629.32 ms
llama_print_timings: sample time = 2562.92 ms / 5299 runs ( 0.48 ms per token, 2067.57 tokens per second)
llama_print_timings: prompt eval time = 900202.27 ms / 1073290 tokens ( 0.84 ms per token, 1192.28 tokens per second)
llama_print_timings: eval time = 166.83 ms / 10 runs ( 16.68 ms per token, 59.94 tokens per second)
llama_print_timings: total time = 909396.98 ms
I wouldn't say it's a silly question. For adding parallel sequences to the mix there are two scenarios: the sequences all share a prompt, or (at least some of them) have their own prompts.

In the first scenario, suppose you set a batch size of about 4,000 with 64 parallel sequences sharing a 500-token prompt. Those 500 prompt tokens only need to be evaluated once, and after that each decode call only submits one new token per sequence, so your batches are only 64 tokens and the large batch size never really comes into play.

On the other hand, if each sequence has a unique prompt and you use those same settings and prompt size (64 sequences, each with 500 tokens of unique prompt), then you have 32,000 tokens to evaluate at the start (500 * 64). Now you actually will be submitting batches of 4,000 initially. Once the prompt tokens have all been evaluated, you'll be back to submitting batches equal to the number of sequences: so 64.

Hope this helps explain it. I think a large number of sequences combined with unique prompts is the main way you'd run into a case where setting a very high batch size matters. The other scenario is, of course, if your prompt is greater than or equal to the batch size.
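To make the token accounting above concrete, here is a minimal sketch of how unique prompts for several clients end up in one big batch. It assumes llama.h plus the llama_batch_init / llama_batch_add / llama_batch_clear helpers used by the examples around this version of llama.cpp; exact signatures may differ in other versions.

// Sketch: pack the unique prompts of several clients into one llama_batch,
// one sequence id per client. With 64 clients x 500 unique prompt tokens the
// batch ends up holding 32,000 tokens; with a fully shared prompt it would
// only hold 500.
#include <vector>
#include "llama.h"
#include "common.h"

llama_batch build_prompt_batch(const std::vector<std::vector<llama_token>> & prompts) {
    size_t n_total = 0;
    for (const auto & p : prompts) {
        n_total += p.size();
    }

    // capacity = all prompt tokens; decoding later happens in n_batch-sized chunks
    llama_batch batch = llama_batch_init((int32_t) n_total, 0, (int32_t) prompts.size());
    llama_batch_clear(batch);

    for (int32_t seq = 0; seq < (int32_t) prompts.size(); ++seq) {
        for (size_t i = 0; i < prompts[seq].size(); ++i) {
            // logits are only needed for the last prompt token of each sequence
            const bool need_logits = (i == prompts[seq].size() - 1);
            llama_batch_add(batch, prompts[seq][i], (llama_pos) i, { seq }, need_logits);
        }
    }
    return batch;
}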
In my setup, I set -n 4096. If I input two sequences, one with 2500 tokens and the other with 2000 tokens, then based on my understanding I need to call llama_decode twice: once with 4096 tokens and once with 404 tokens. Is that correct? However, when I input it this way, it results in an "illegal memory access" error. I am unsure whether this is normal, whether it's a bug, or whether the value of -n is too large for my hardware.
If you meant you set the batch size (-b) to 4096, then yes, that's basically how it would work. I don't really know what a reasonable value for -b is on that hardware, though. It's also something that's pretty much only going to have an effect during prompt processing, since you probably aren't going to actually be doing generation with 4,000+ parallel sequences.
@littlebai3618 Currently, there is no reason to use a batch size (-b) larger than the default of 512. Using the logic from the parallel example here:

llama.cpp/examples/parallel/parallel.cpp Lines 276 to 318 in 8e672ef
Thanks for looking into the …
Do you need me to provide the results obtained by running with -n 512, or are these results sufficient? I didn't understand the meaning of "batch" correctly before, and I need to do some further validation before closing this issue. By the way, will this branch be merged soon?
I think it would be useful to see if you can still reproduce your original problem with a more normal -b value, like the default of 512.
You are confusing the two parameters: -n (the number of tokens to generate) and -b (the maximum number of tokens passed to a single llama_decode call).
You just have to set -b to a reasonable value and split larger inputs into chunks of at most that many tokens, as the parallel example does here:
llama.cpp/examples/parallel/parallel.cpp Lines 346 to 351 in ff8238f
You can adapt it to your needs.
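For reference, the chunking those lines perform looks roughly like the sketch below (modeled on parallel.cpp; the llama_batch field order follows the version discussed in this thread and may differ elsewhere). Together with the batch-building sketch above, this is also why a 4,500-token input with a 4,096 batch size turns into one decode call of 4,096 tokens followed by one of 404.

// Sketch of decoding a large batch in n_batch-sized chunks, as parallel.cpp does.
#include <algorithm>
#include "llama.h"

int decode_in_chunks(llama_context * ctx, llama_batch & batch, int32_t n_batch) {
    for (int32_t i = 0; i < batch.n_tokens; i += n_batch) {
        const int32_t n_tokens = std::min(n_batch, batch.n_tokens - i);

        // a "view" into the big batch: no copies, just offset pointers
        llama_batch batch_view = {
            n_tokens,
            batch.token    + i,
            nullptr,           // embd is unused when feeding tokens
            batch.pos      + i,
            batch.n_seq_id + i,
            batch.seq_id   + i,
            batch.logits   + i,
            0, 0, 0,           // legacy all_pos_0 / all_pos_1 / all_seq_id fields, unused here
        };

        // each llama_decode call sees at most n_batch tokens
        if (llama_decode(ctx, batch_view) != 0) {
            return 1; // decode failed (e.g. no free KV cache slot)
        }
    }
    return 0;
}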
I tested the code from the 'kv-cache-opts' branch on a V100S-PCIE-32G, again using my modified parallel.cpp. Here are the test results:

kv-cache-opts:
master: d103d93
When using -c 16384, the memory usage gradually increases from 52.4% to 59.6%, then to 63.9%, and finally to 65.6%, but it does not appear to crash. I have only one sequence in my use case, so it seems I cannot trigger the KV cache fragmentation issue. I believe the problems I encountered before were due to my incorrect usage of the -n parameter.
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Expected Behavior
In ./examples/parallel/parallel.cpp, I added the following two lines to the final output:
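A hypothetical reconstruction of that kind of check, placed near the end of main(); the variable name cache_count and the logging macro are assumptions inferred from the text below, not the exact lines that were added.

// Hypothetical reconstruction: report how many tokens the context still
// counts in the KV cache after all sequences have finished.
const int cache_count = llama_get_kv_cache_token_count(ctx);
LOG_TEE("cache_count = %d\n", cache_count);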
I believe that the logic at line 221 of parallel.cpp should release all of the occupied cache once the entire task is completed. However, in reality, it does not release the cache.
Current Behavior
I expected the value of cache_count to be 0, but in reality, it is 1153.
Environment and Context
I am using the "CodeLlama-7B-HF" model from https://huggingface.co/codellama/CodeLlama-7b-hf, converted with the convert.py script included in this repository.
$ lscpu
$ uname -a
Linux studio-0 4.19.96 #1 SMP Tue Mar 10 10:34:01 CST 2020 x86_64 x86_64 x86_64 GNU/Linux
Failure Information (for bugs)
I expected the value of cache_count to be 0, but in reality, it is 1153.
Steps to Reproduce
Please download the model from https://huggingface.co/codellama/CodeLlama-7b-hf
Failure Logs
Example run with the Linux command perf