
Make the permanent prompt permanent #1019

Closed
sergedc opened this issue Apr 17, 2023 · 3 comments

sergedc commented Apr 17, 2023

Expected Behavior

The first n_keep tokens (the params.prompt, e.g. alpaca.txt) are always part of the context and do not need to be recalculated.

Current Behavior

auto embd_inp = ::llama_tokenize(ctx, params.prompt, true);

embd_inp is the tokenized params.prompt (e.g. alpaca.txt)

params.n_keep = (int)embd_inp.size();

n_keep is the size of the permanent prompt (e.g. alpaca.txt)

n_past = params.n_keep;
embd.insert(embd.begin(), last_n_tokens.begin() + n_ctx - n_left/2 - embd.size(), last_n_tokens.end() - embd.size());
n_past += embd.size();

embd now has a certain number of tokens from last_n_tokens plus the original embd, but it no longer contains the permanent prompt (e.g. alpaca.txt).
n_past = size of embd + n_keep (the size of the permanent prompt, e.g. alpaca.txt). But in the context, the n_keep tokens before embd are NOT the permanent prompt (e.g. alpaca.txt); the permanent prompt is all the way at the beginning of last_n_tokens.
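
For reference, the quoted lines sit inside the context-swap branch of examples/main/main.cpp. Reconstructed from the lines above (so treat the exact shape as approximate), the block looks roughly like this:

```cpp
// context swap in examples/main/main.cpp (approximate reconstruction)
if (n_past + (int) embd.size() > n_ctx) {
    const int n_left = n_past - params.n_keep;

    // pretend that only the first n_keep tokens have been evaluated
    n_past = params.n_keep;

    // prepend roughly half of the recent context (taken from last_n_tokens)
    // to the pending tokens, so they get re-evaluated right after n_keep
    embd.insert(embd.begin(),
                last_n_tokens.begin() + n_ctx - n_left/2 - embd.size(),
                last_n_tokens.end() - embd.size());
}
```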

Are my statements correct?

Suggestions:

To solve this we could do:
embd.insert(embd.begin(), last_n_tokens.begin() + n_ctx - n_left/2 - embd.size() - n_keep, last_n_tokens.end() - embd.size());
embd.insert(embd.begin(), last_n_tokens.begin(), last_n_tokens.begin() + n_keep);

Now we have: permanent prompt (e.g. alpaca.txt) + the old context we kept + the original embd.

Is this right?

Problem: this would basically recompute the permanent prompt (e.g. alpaca.txt) every time the context reaches the max size.
Why is this a problem? I run a model where the permanent prompt is 1000 tokens (a multi-shot prompt) and the questions are 250 tokens, so recomputing the permanent prompt every time is painful.
Question: how do we save the computation of the permanent prompt and then bring it back when the context is full?
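
One possible direction is sketched below. It assumes the llama_get_state_size / llama_copy_state_data / llama_set_state_data functions from the llama.cpp C API (they may not have existed when this issue was filed, so treat this as a hypothetical sketch rather than the project's answer): snapshot the context right after the permanent prompt has been evaluated, then restore the snapshot instead of re-evaluating. As the replies below point out, within a single run this is not actually necessary, because the KV cache already retains the evaluated prompt; a snapshot would mainly matter across runs.

```cpp
#include <vector>
#include "llama.h"

// Hypothetical helper: snapshot the full context state (KV cache, RNG, logits)
// right after the permanent prompt has been evaluated.
static std::vector<uint8_t> save_prompt_state(llama_context * ctx) {
    std::vector<uint8_t> state(llama_get_state_size(ctx));
    llama_copy_state_data(ctx, state.data());
    return state;
}

// Hypothetical helper: restore that snapshot later, then continue evaluating
// new tokens with n_past = params.n_keep.
static void restore_prompt_state(llama_context * ctx, std::vector<uint8_t> & state) {
    llama_set_state_data(ctx, state.data());
}
```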

SlyEcho (Collaborator) commented Apr 17, 2023

No.

embd only contains new tokens to be evaluated; the kept tokens from the beginning do not need to be evaluated again. That is the whole idea of this performance feature.

The LLaMA model doesn't need to see the past tokens themselves; the only necessary parameter is n_past, which, as you can see, always includes n_keep. The model gets the past token data from the KV cache.
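
To illustrate (a simplified sketch using the llama_eval call from the C API of that era, not a quote from main.cpp): the tokens in embd are evaluated at positions starting from n_past, and everything before n_past is served from the KV cache.

```cpp
// evaluate the pending tokens in embd; their positions start at n_past,
// everything before n_past is already in the KV cache
llama_eval(ctx, embd.data(), (int) embd.size(), n_past, params.n_threads);
n_past += embd.size();
```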

If there is one thing I would improve in the code, it is to keep around a representation of the exact context that the model currently has. That way n_keep could be derived simply by taking the length of the initial common substring (of tokens) between the new text and the old.
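
A minimal sketch of that idea (a hypothetical helper, not something in the codebase): keep the tokens the model has actually seen and compare them against the freshly tokenized input.

```cpp
#include <algorithm>
#include <vector>
#include "llama.h"

// Hypothetical helper: length of the shared initial token prefix between the
// context the model currently holds and a newly tokenized prompt.
static size_t common_token_prefix(const std::vector<llama_token> & model_ctx,
                                  const std::vector<llama_token> & new_input) {
    const auto m = std::mismatch(model_ctx.begin(), model_ctx.end(),
                                 new_input.begin(), new_input.end());
    return (size_t) std::distance(model_ctx.begin(), m.first);
}
```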

EDIT: I should also mention that last_n_tokens is kind of special in that it remembers all tokens, even if the context is truncated, but it is not used for evaluation, only for sampling.

sergedc (Author) commented Apr 17, 2023

last_n_tokens is not the actual context. I understand that. Is there a way to see the actual context? Is that what you would like to be able to see?

n_past is the number of tokens reused from the past tokens (i.e. the context). Is it n_past tokens starting from the end or from the beginning of the context?

I don't understand where the context is being truncated following the line if (n_past + (int) embd.size() > n_ctx).
The only line of code there is the embd.insert(), which ultimately adds more to the context. Where is the line that truncates the context?

Thanks a ton for your help.

SlyEcho (Collaborator) commented Apr 17, 2023

I don't understand where the context is being truncated following the line if (n_past + (int) embd.size() > n_ctx)

It is the line:

n_past = params.n_keep;

That is it. That is all the model needs to know. The model will now calculate as if only n_keep tokens have been evaluated. You can see that n_past is a parameter to the evaluation function; it doesn't need the actual tokens. The state is actually stored in the KV cache.

embd contains the new tokens to be evaluated. The complicated-looking insert() adds some of the last seen tokens in front of the new tokens from the user. Note that tokens are always appended to the end of the last_n_tokens array, which is why the offsets are calculated the way they are (counting back from the end).
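
As an illustration with made-up numbers (not from the issue): suppose n_ctx = 2048, params.n_keep = 1000, n_past = 2000 and embd holds 48 new user tokens when the check triggers. Then n_left = 2000 - 1000 = 1000, n_past is reset to 1000, and the insert copies the n_left/2 = 500 tokens seen just before the new input (last_n_tokens indices 1500..1999) to the front of embd. The resulting 548 tokens are then evaluated at positions 1000..1547, immediately after the cached permanent prompt.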

sergedc closed this as completed Apr 19, 2023