
Using llama.cpp, the entire context gets reprocessed each generation #866

Closed
ClassicDirt opened this issue Apr 7, 2023 · 19 comments

@ClassicDirt

When using llama.cpp, the entire prompt gets processed at each generation, making things like chat mode unbearably slow. The problem compounds as the context grows larger.

Perhaps using interactive mode in the binding might work? Or, maybe more likely, implementing something similar to the prompt fast-forwarding seen here.

@ClassicDirt ClassicDirt changed the title Using llama.cpp reprocesses the entire prompt each generation Using llama.cpp, the entire context gets reprocessed each generation Apr 7, 2023
@oobabooga
Owner

I agree that this optimization would be beneficial, but it's not obvious to me how to implement it. Maybe someone can shed some light in a PR.

@djaffer

djaffer commented Apr 8, 2023

llama.cpp is very slow.

@ghost

ghost commented Apr 9, 2023

I'm actually trying to look into this, but it'll take some work to figure out how to implement it here, as I'm more of a C programmer. As far as I know, what llama.cpp's interactive mode does is compute only the new tokens the user provides rather than the entire prompt. Note that it doesn't have the option to edit prompts like we do, so we will have to force a full recomputation if an edit occurs. As a start it may be simpler to first target saving the initial computation for our character prompt, so at the very least those several hundred tokens won't be reprocessed every time.
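
Roughly, the idea looks something like the sketch below (a minimal illustration only; tokenize and eval_tokens are hypothetical stand-ins for whatever the binding exposes, not its actual API):

def common_prefix_len(a, b):
    # Number of leading tokens shared by the cached prompt and the new prompt.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

class PromptCache:
    # Sketch of prompt "fast-forwarding": reuse the tokens that were already
    # evaluated and only feed the model the new suffix when the old prompt is
    # a prefix of the new one; otherwise fall back to a full recomputation.
    def __init__(self, tokenize, eval_tokens):
        self.tokenize = tokenize        # str -> list[int] (hypothetical callback)
        self.eval_tokens = eval_tokens  # feeds tokens to the model (hypothetical callback)
        self.evaluated = []             # tokens already ingested by the model

    def ingest(self, prompt):
        tokens = self.tokenize(prompt)
        shared = common_prefix_len(self.evaluated, tokens)
        if shared < len(self.evaluated):
            # The prompt was edited inside the cached part: the state is stale,
            # so everything has to be recomputed from scratch.
            self.eval_tokens(tokens, reset=True)
        else:
            # Only the new tokens need to be evaluated.
            self.eval_tokens(tokens[shared:], reset=False)
        self.evaluated = tokens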

As an aside, OpenBLAS works with the latest llama-cpp-python (abetlen/llama-cpp-python#32 (comment)), and in my tests I get a nice 2x improvement in ingestion speed. It's very helpful for long prompts.

@digiwombat

digiwombat commented Apr 9, 2023

Not a Python guy either, but I might have it working at a basic level with the newly added interactive mode, and the speeds are great.

I essentially gutted the llamacpp_model_alternative.py script and used the newly merged interactive mode example.

Problem is, I don't know my way around the ins and outs of the webui, so for now I am only managing to dump the output into the console.

The basics are this:

  1. Copy the high-level chat example into llamacpp_model_alternative.py, and the common.py classes that it uses into a new file named llamacpp_common.py. Here's a gist with the edited llama alt file.
  2. Turn the functions in the chat example into class methods so they run.
  3. Add a llama.cpp-specific branch to the generate_reply function in text_generation.py (basically just a copy of the current RWKV/llamacpp version). Mine looks like this:
if shared.is_llamacpp:
    for k in ['temperature', 'top_p', 'top_k', 'repetition_penalty']:
        generate_params[k] = generate_state[k]
    generate_params['token_count'] = generate_state['max_new_tokens']
    try:
        output = ''
        if shared.args.no_stream:
            # Non-streaming: feed the prompt once and read back the whole reply.
            shared.model.input(question)
            reply = shared.model.output()
            print(reply)
            if not shared.is_chat():
                reply = original_question + apply_extensions(reply, 'output')
            yield formatted_outputs(reply, shared.model_name)
        else:
            if not shared.is_chat():
                yield formatted_outputs(question, shared.model_name)

            # Interactive mode has proper streaming (like RWKV), so there is
            # no need to generate 8 tokens at a time.
            shared.model.input(question)
            for token in shared.model.output():
                output += token
                print(output)
                reply = output
                if not shared.is_chat():
                    reply = original_question + apply_extensions(output, 'output')
                yield formatted_outputs(reply, shared.model_name)

    except Exception:
        traceback.print_exc()
    finally:
        t1 = time.time()
        #original_tokens = len(encode(original_question)[0])
        #new_tokens = len(encode(output)[0]) - original_tokens
        #print(f'Output generated in {(t1-t0):.2f} seconds ({new_tokens/(t1-t0):.2f} tokens/s, {new_tokens} tokens, context {original_tokens})')
        return
  4. Add `if not shared.is_llamacpp:` around line 39 of chat.py (and fix the indentation), since we don't need to encode the prompt for llama.cpp in this mode.

That's basically it, I think. That should be outputting things into your console. I don't know the code structure or what it expects well enough to get past that at the moment; sorry for not helping more.
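
For reference, the wrapper implied by steps 1-2 presumably ends up looking something like this rough sketch (the input()/output() names match the snippet above; the session object and its feed()/stream() methods are hypothetical placeholders for whatever the llama-cpp-python chat example actually exposes):

class LlamaCppModel:
    # Hypothetical wrapper around the interactive-mode session, so the webui
    # can push new text with input() and stream tokens back with output().
    def __init__(self, session):
        self.session = session   # interactive-mode session (placeholder)
        self.pending = None      # text queued up to be fed to the model

    def input(self, text):
        # Queue only the new text; the previously ingested context stays in
        # the session's state, which is what avoids the full reprocessing.
        self.pending = text

    def output(self):
        # Generator that feeds the pending text and yields tokens as they arrive.
        self.session.feed(self.pending)
        for token in self.session.stream():
            yield token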

Keep in mind this is using 0.1.26 or 0.1.27 or whatever the latest wheel is from their repo (post-merge of abetlen/llama-cpp-python#15). Hopefully that is enough to give people with more time a head start on the code. Sorry for not formatting this as a pull request; I don't know the code well enough to beat it into shape, but I still wanted to contribute.

@ghost

ghost commented Apr 12, 2023

Progress is currently being made on this issue at abetlen/llama-cpp-python#68, where they are making llama-cpp-python compute only the new tokens if the previous inputs have not been altered.

@Enferlain

It feels like it uses the old CPU mode even when you're trying to run ggml models. Token speed is like 0.3s, and it takes 3 minutes to get a reply that's about 10 words.

Loading llama-13b-ggml-q4_0...
llama.cpp weights detected: models\llama-13b-ggml-q4_0\ggml-model-q4_0.bin

llama.cpp: loading model from models\llama-13b-ggml-q4_0\ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =  73.73 KB
llama_model_load_internal: mem required  = 9807.47 MB (+ 3216.00 MB per state)
llama_init_from_file: kv self size  = 3200.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |

@ghost

ghost commented Apr 15, 2023

What hardware are you running it on? This is what I get on an i5-6500 with 16GB of RAM, with no AVX512 like you. I have BLAS, which helps, but this speed for the first generation is similar to what I get with the original llama.cpp.

llama.cpp: loading model from models/llama-13B-ggml/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =  73.73 KB
llama_model_load_internal: mem required  = 9807.47 MB (+ 3216.00 MB per state)
llama_init_from_file: kv self size  = 3200.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
Loading the extension "gallery"... Ok.
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Output generated in 81.47 seconds (0.49 tokens/s, 40 tokens, context 346, seed 1943500325)

Now, if you're talking about continued generations, then yes, it will take several minutes each time, as it has to recompute all the old tokens.

@oobabooga
Owner

I think that the speed issues should be fixed after this commit: d2ea925

See the discussion here abetlen/llama-cpp-python#44

@Answer-is-not-42

I've managed to make the cache work three times by typing input and clicking "send dummy message" then "continue", after which it stopped working. It's not consistent; it can decide to reprocess the context out of the blue without any change (apparent to me) in what I'm doing.

@ghost

ghost commented Apr 16, 2023

I've been playing with the updated UI today and I see these inconsistencies too. Most of the time the cache works well but there are occasional times where it does a full regeneration even though the previous text remains unaltered. This happens when I'm only using the Generate button and nothing else.

Below is a particularly bad example with no editing taking place where I would expect a cache hit every time. I think I need to add some debug code and figure out what is actually being sent over to llama-cpp-python.

Output generated in 118.49 seconds (0.47 tokens/s, 56 tokens, context 384, seed 1483172622)
generate cache hit
Output generated in 33.98 seconds (0.97 tokens/s, 33 tokens, context 477, seed 1813301482)
generate cache hit
Output generated in 70.04 seconds (0.99 tokens/s, 69 tokens, context 570, seed 179168234)
Output generated in 130.12 seconds (0.20 tokens/s, 26 tokens, context 657, seed 1809474820)
generate cache hit
Output generated in 54.90 seconds (0.84 tokens/s, 46 tokens, context 742, seed 1525464511)
generate cache hit
Output generated in 50.84 seconds (0.75 tokens/s, 38 tokens, context 840, seed 1646691917)
generate cache hit
Output generated in 70.03 seconds (0.99 tokens/s, 69 tokens, context 917, seed 2034897901)
Output generated in 185.25 seconds (0.30 tokens/s, 56 tokens, context 1024, seed 1520712527)
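
One way to do that kind of debugging (a sketch only, assuming it gets hooked in wherever the webui hands the prompt to llama-cpp-python; log_prompt is a hypothetical helper, and the prefix check is a simplification of the real cache logic) is to dump the exact prompt and flag where it diverges from the previous one:

_last_prompt = None

def log_prompt(prompt: str) -> None:
    # Print the exact text handed to the model and warn when the previous
    # prompt is no longer a prefix of the new one, which is roughly the
    # condition under which the eval cache cannot be reused.
    global _last_prompt
    print(f'--- prompt sent to model ({len(prompt)} chars) ---')
    print(repr(prompt.encode('utf-8')))
    if _last_prompt is not None and not prompt.startswith(_last_prompt):
        # Show where the two prompts first differ to spot stray edits/newlines.
        i = next((i for i, (a, b) in enumerate(zip(_last_prompt, prompt)) if a != b),
                 min(len(_last_prompt), len(prompt)))
        print(f'cache-unfriendly change at char {i}: '
              f'{_last_prompt[i:i+20]!r} vs {prompt[i:i+20]!r}')
    _last_prompt = prompt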

@oobabooga
Owner

You should tell @abetlen about it in this thread abetlen/llama-cpp-python#44

@ghost

ghost commented Apr 16, 2023

@oobabooga Does the UI modify newlines seen in the responses returned by the model? Looks like that's tripping up the cache system.

Context sent to model:

b"Chiharu Yamada's Persona: Chiharu Yamada is a young, computer engineer-nerd with a knack for problem solving and a passion for technology.\n\nYou: So how did you get into computer engineering?\nChiharu Yamada: I've always loved tinkering with technology since I was a kid.
...
You: Studying in school mostly, and I used them in my internship. *You pause.* God, it's so stuffy in here. Let's chat outside.\nChiharu Yamada: Oh okay sure!\n\n*The two of you exit the building to get some fresh air.*\nYou: I like this campus, it's nice and airy. Shame the buildings are so old.\nChiharu Yamada:"

Text inside the llamacpp-python cache:

b"Chiharu Yamada's Persona: Chiharu Yamada is a young, computer engineer-nerd with a knack for problem solving and a passion for technology.\n\nYou: So how did you get into computer engineering?\nChiharu Yamada: I've always loved tinkering with technology since I was a kid.
...
You: Studying in school mostly, and I used them in my internship. *You pause.* God, it's so stuffy in here. Let's chat outside.\nChiharu Yamada: Oh okay sure!\n*The two of you exit the building to get some fresh air.*\nYou:"

Notice the double newline \n\n after "Oh okay sure!". That only appears in the context sent to generate and not in the cache, so we get a miss. In the chat log I also see the double newline, as shown below.

"Oh okay sure!\n\n*The two of you exit the building to get some fresh air.*"

When the model returns a multiline response, the above happens and I get a guaranteed cache miss.
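
To make the failure concrete, a prefix check of the kind the cache effectively relies on fails as soon as that extra newline appears (illustrative snippet only, using shortened strings; the real comparison is done on token sequences):

# The cached text is no longer a prefix of the newly submitted context, so
# everything gets re-evaluated.
cached = "Chiharu Yamada: Oh okay sure!\n*The two of you exit the building to get some fresh air.*\nYou:"
sent = "Chiharu Yamada: Oh okay sure!\n\n*The two of you exit the building to get some fresh air.*\nYou: I like this campus"

print(sent.startswith(cached))                            # False -> cache miss, full reprocessing
print(sent.startswith(cached.replace('!\n*', '!\n\n*')))  # True once the newlines agree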

@oobabooga
Owner

@eiery can you check if the behavior becomes more consistent after this change? 6a03ad0

I think that those function calls shouldn't be in chat.py anyway, since they are only relevant for turning the text into markdown, which happens at html_generator.py and not chat.py.

@ghost

ghost commented Apr 17, 2023

The newline issue has been fixed after picking up your change 😄.

FYI there are two more things I saw which will cause a cache miss. I guess you notice these things when you print out debug data for every run.

  1. Take this example in chat mode where the model ends the sentence with {{user}} instead of {{char}}:
You: How are you?
AI: Fine, thank you. AI:

The UI automatically cuts off the second "AI:" generated by the misbehaving model, but that remains in the cache, so we get a cache miss.

  2. If we approach the 2048-token limit, the UI starts rotating the text (keeping the prompt and discarding the earliest messages). Obviously that causes a cache miss every time, and with the UI having to process 2k tokens the generations suddenly become super slow.

These can't be resolved on our end, and hopefully llama-cpp-python will soon have a real caching system that can help with this.

@oobabooga
Copy link
Owner

Do you think that it's worth it to limit the context size to less than 2048 tokens for llama.cpp? I see that the original llama.cpp repository uses 512 tokens by default.

@ghost

ghost commented Apr 17, 2023

No, that would be terrible, as a prompt can easily be 300+ tokens, and it's probably better to have it run slowly than to have the AI forget things after a couple of lines. For llama.cpp, the ctx size (and therefore the rotating buffer) should honestly be a user-configurable option, along with n_batch.

Optimization-wise, one interesting idea (assuming there is proper caching support) is to run two llama.cpp instances and have the second instance continually cache the results of a 1-message rotation, 2-message rotation, and so forth. Then, when the user finally hits the token limit, the UI can start using the cached results to avoid doing a full 2k-token ingestion.
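
A rough sketch of that scheme, purely hypothetical (worker.ingest() and the simple string concatenation stand in for a second llama.cpp instance with a warm-up API and for the webui's actual prompt building):

import threading

def precache_rotations(worker, character_prompt, history, max_drops=4):
    # While the main instance serves the chat, a background worker pre-ingests
    # the contexts that would result from dropping the 1, 2, ... oldest
    # messages, so hitting the 2k limit can start from an already-warm cache
    # instead of a full re-ingestion.
    def run():
        for drop in range(1, min(max_drops, len(history)) + 1):
            rotated = character_prompt + ''.join(history[drop:])
            worker.ingest(rotated)   # hypothetical: evaluate and store this state
    threading.Thread(target=run, daemon=True).start()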


@Answer-is-not-42 What happens during continue is that "\n{{user}}" ends up in the cache, as that's what appears at the end of the model's response. The UI chops that off before sending it back to the model, so we get a cache miss.

@oobabooga
Owner

Optimization-wise, one interesting idea (assuming there is proper caching support) is to run two llama.cpp instances and have the second instance continually cache the results of a 1-message rotation, 2-message rotation, and so forth. Then, when the user finally hits the token limit, the UI can start using the cached results to avoid doing a full 2k-token ingestion.

That's an interesting idea.

@Answer-is-not-42

@eiery It may partially be because of that, but the same thing (no cache hit) occurs in the default interface mode too! Even without adding anything and just hitting "continue", it recomputes the context. I expect the "raw" output tab to be raw and unchanged, but apparently that isn't the case, or there's something else at work.

@oobabooga oobabooga unpinned this issue Apr 20, 2023
@oobabooga
Owner

This has been mostly solved on the llama-cpp-python side. It caches recent prompt ingestions automatically, and there is now an additional --cache-capacity flag to extend the cache further.
