Using llama.cpp, the entire context gets reprocessed each generation #866
Comments
I agree that this optimization would be beneficial, but it's not obvious to me how to implement it. Maybe someone can shed some light in a PR. |
llama.cpp is very slow. |
I'm actually trying to look into this, but it'll take some work to figure out how to implement it here as I'm more of a C programmer. What llama.cpp interactive mode does, as far as I know, is compute only the new tokens the user provides rather than the entire prompt. Note that they don't have the option to edit prompts like we do, so we will have to force a full recomputation if an edit occurs. As a start it may be simpler to first target saving the initial computation for our character prompt, so at the very least those several hundred tokens won't be reprocessed every time. As an aside, OpenBLAS works with the latest llama-cpp-python abetlen/llama-cpp-python#32 (comment) and in my tests I get a nice 2x improvement in ingestion speed. It's very helpful for long prompts. |
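A minimal sketch of that prefix "fast-forward" idea (illustrative only, not llama.cpp or webui code; the helper name is made up): keep the token list that was already evaluated, reuse the shared prefix that is still in the KV cache, and feed the model only the suffix. An edit earlier in the prompt shortens the shared prefix and forces a larger recomputation.

```python
# Sketch only: decide which tokens still need to be evaluated, assuming we
# remember the token list the model has already ingested (helper name made up).
def tokens_to_evaluate(prev_tokens, new_tokens):
    """Return (n_reused, suffix): how many leading tokens are unchanged and
    which tokens still have to be fed to the model."""
    n_shared = 0
    for a, b in zip(prev_tokens, new_tokens):
        if a != b:
            break
        n_shared += 1
    return n_shared, new_tokens[n_shared:]

# Appending a new user message reuses everything evaluated so far...
print(tokens_to_evaluate([1, 2, 3, 4], [1, 2, 3, 4, 5, 6]))  # (4, [5, 6])
# ...while editing earlier text forces re-evaluation from the edit point onward.
print(tokens_to_evaluate([1, 2, 3, 4], [1, 9, 3, 4, 5, 6]))  # (1, [9, 3, 4, 5, 6])
```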
Not a python guy either, but I might have it baseline working with the new interactive mode inclusion and the speeds are great. I essentially gutted the [...]. Problem is, I don't know my way around the ins and outs of the webui, so I am only managing to dump the output into the console as-is. The basics are this:
```python
if shared.is_llamacpp:
    for k in ['temperature', 'top_p', 'top_k', 'repetition_penalty']:
        generate_params[k] = generate_state[k]
    generate_params['token_count'] = generate_state['max_new_tokens']
    try:
        output = ""
        if shared.args.no_stream:
            # Non-streaming path: feed the prompt, read back the full reply.
            shared.model.input(question)
            reply = shared.model.output()
            print(reply)
            if not shared.is_chat():
                reply = original_question + apply_extensions(reply, 'output')
            yield formatted_outputs(reply, shared.model_name)
        else:
            if not shared.is_chat():
                yield formatted_outputs(question, shared.model_name)
            # RWKV has proper streaming, which is very nice.
            # No need to generate 8 tokens at a time.
            shared.model.input(question)
            # Streaming path: yield partial replies as they come in.
            for reply in shared.model.output():
                output = original_question + output + reply
                print(output)
                if not shared.is_chat():
                    reply = original_question + apply_extensions(reply, 'output')
                yield formatted_outputs(reply, shared.model_name)
    except Exception:
        traceback.print_exc()
    finally:
        t1 = time.time()
        # original_tokens = len(encode(original_question)[0])
        # new_tokens = len(encode(output)[0]) - original_tokens
        # print(f'Output generated in {(t1-t0):.2f} seconds ({new_tokens/(t1-t0):.2f} tokens/s, {new_tokens} tokens, context {original_tokens})')
    return
```
That's basically it, I think. That should be outputting things into your console. I don't know the code structure or what it expects well enough to get past that at the moment, sorry for not helping more. Keep in mind this is using 0.1.26 or 0.1.27 or whatever the latest wheel is from their repo (post-merge of abetlen/llama-cpp-python#15). Hopefully that is enough to give people with more time a start on digging into the code. Sorry for not formatting this as a pull request; I don't know the code well enough to beat it into shape, but I still wanted to contribute. |
Progress is currently being made on this issue at abetlen/llama-cpp-python#68, where they are making llama-cpp-python only compute the new tokens if the previous inputs have not been altered. |
It feels like it uses the old CPU mode even when you're trying to run GGML models. Token speed is like 0.3s and it takes 3 minutes to get a reply that's like 10 words. |
What hardware are you running it on? This is what I get with a 16GB i5-6500, with no AVX-512 like you. I have BLAS, which helps, but this speed for the first generation is similar to what I get with the original llama.cpp.
Now if you're talking about continued generations, then yes, it will take several minutes each time as it has to recompute all the old tokens. |
I think that the speed issues should be fixed after this: d2ea925. See the discussion here: abetlen/llama-cpp-python#44 |
I've managed to make the cache work 3 times by typing in input, clicking "send dummy message" then "continue", and then it stopped working. It's not consistent; it can just decide to reprocess the context out of the blue without any changes (apparent to me) in what I'm doing. |
I've been playing with the updated UI today and I see these inconsistencies too. Most of the time the cache works well, but there are occasional times where it does a full regeneration even though the previous text remains unaltered. This happens when I'm only using the Generate button and nothing else. Below is a particularly bad example, with no editing taking place, where I would expect a cache hit every time. I think I need to add some debug code and figure out what is actually being sent over to llama-cpp-python. |
You should tell @abetlen about it in this thread abetlen/llama-cpp-python#44 |
@oobabooga Does the UI modify newlines seen in the responses returned by the model? Looks like that's tripping up the cache system. Context sent to model:
Text inside the llamacpp-python cache:
Notice the double newline
When the model returns a multiline response the above happens and I get a guaranteed cache miss. |
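For anyone else chasing this, a small standalone helper (not part of the webui code; the strings below are made-up examples) that reports the first character where the cached text and the text actually sent to the model diverge makes issues like the stray double newline easy to spot:

```python
# Standalone debug helper: find the first character where two prompt strings
# diverge, e.g. the llama-cpp-python cached text vs. the context sent by the UI.
def first_divergence(cached: str, sent: str) -> int:
    limit = min(len(cached), len(sent))
    for i in range(limit):
        if cached[i] != sent[i]:
            return i
    return -1 if len(cached) == len(sent) else limit  # -1 means identical

# Made-up example where a double newline sneaks into one of the two strings:
cached = "AI: Sure, here you go.\nYou:"
sent = "AI: Sure, here you go.\n\nYou:"
i = first_divergence(cached, sent)
print(i, repr(cached[i:i + 6]), repr(sent[i:i + 6]))  # shows where they differ
```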
The newline issue has been fixed after picking up your change 😄. FYI there are two more things I saw which will cause a cache miss. I guess you notice these things when you print out debug data for every run.
The UI automatically cuts off the second "AI:" generated by the misbehaving model, but that text remains in the cache, therefore we get a cache miss.
These can't be resolved on our end, and hopefully llama-cpp-python will soon have a real caching system which can help with this. |
Do you think that it's worth it to limit the context size to less than 2048 tokens for llama.cpp? I see that the original llama.cpp repository uses 512 tokens by default. |
No, that would be terrible, as a prompt can easily be 300+ tokens, and it's probably better to have it run slowly than have the AI forget things after a couple of lines. For llama.cpp the ctx size (and therefore the rotating buffer) honestly should be a user-configurable option, along with n_batch. Optimization-wise, one interesting idea, assuming there is proper caching support, is to run two llama.cpp instances and have the second instance continually cache the results of a 1-message rotation, 2-message rotation, and so forth. Then when the user finally hits the token limit, the UI will start using the cached results to avoid doing a full 2k-token ingestion. @Answer-is-not-42 What happens during continue is that "\n{{user}}" ends up in the cache, as that's what appears at the end of the model's response. The UI chops that off before sending it back to the model, thus we get a cache miss. |
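On making those user-configurable, a rough sketch of what that could look like with llama-cpp-python's Llama constructor (the flag names are hypothetical, not existing webui options, and the model path is just an example):

```python
# Sketch: expose ctx size and prompt-ingestion batch size as command-line
# options (hypothetical flag names) and pass them through to llama-cpp-python.
import argparse

from llama_cpp import Llama

parser = argparse.ArgumentParser()
parser.add_argument('--n_ctx', type=int, default=2048,
                    help='context window size in tokens')
parser.add_argument('--n_batch', type=int, default=512,
                    help='batch size used when ingesting the prompt')
args = parser.parse_args()

llm = Llama(model_path='models/ggml-model-q4_0.bin',  # example path
            n_ctx=args.n_ctx,
            n_batch=args.n_batch)
```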
That's an interesting idea. |
@eiery It may partially be because of that, but the same thing (no cache hit) occurs in the default interface mode too! Even without adding anything and just hitting "continue", it recomputes the context. I expect the "raw" output tab to be raw and unchanged, but apparently that isn't the case, or there's some other thing at work. |
This has been mostly solved on the llama-cpp-python side. It caches recent prompt ingestions automatically, and there is an additional --cache-capacity flag now to extend the cache further. |
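For reference, enabling that cache directly through the llama-cpp-python API looks roughly like this (the capacity and model path are just examples, and the exact cache classes available depend on the library version you have installed):

```python
# Rough sketch of turning on llama-cpp-python's prompt cache so that repeated
# requests sharing a prefix can skip re-ingesting the whole context.
from llama_cpp import Llama, LlamaCache

llm = Llama(model_path='models/ggml-model-q4_0.bin', n_ctx=2048)  # example path
llm.set_cache(LlamaCache(capacity_bytes=2 << 30))  # roughly 2 GiB of cached state

# The first call ingests the full prompt; a follow-up call that extends the
# same chat history should reuse the cached prefix instead of reprocessing it.
out = llm("You are a helpful assistant.\nUser: Hello!\nAssistant:", max_tokens=32)
print(out['choices'][0]['text'])
```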
When using llama.cpp, the entire prompt gets processed at each generation, making things like chat mode unbearably slow. The problem compounds as the context gets larger and larger as well.
Perhaps using interactive mode in the binding might work? Or, maybe more likely, implementing something similar to the prompt fast-forwarding seen here.