Make the permanent prompt permanent #1019
Comments
No.
The LLaMA model doesn't need to see the tokens themselves; the only parameter it actually needs is n_past. If there is something I would improve in the code, it is to keep a representation of the exact context the model has at the moment around. This way it would be easy to inspect what the model is actually conditioning on. EDIT: I should also mention that last_n_tokens is not the actual context.
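To illustrate, here is a minimal sketch of what keeping such a representation around could look like. The ctx_tokens vector and the helper names are made up for illustration; llama_token_to_str is the token-to-text call in llama.h at the time (later versions renamed it llama_token_to_piece):

```cpp
// Sketch only: mirror the tokens whose KV-cache entries are currently valid
// (i.e. the first n_past evaluated tokens) so the "real" context can be printed.
#include <cstdio>
#include <vector>
#include "llama.h"

static std::vector<llama_token> ctx_tokens; // hypothetical mirror of the model's context

// call after each successful llama_eval(ctx, embd.data(), embd.size(), n_past, n_threads)
void on_eval(const std::vector<llama_token> & embd) {
    ctx_tokens.insert(ctx_tokens.end(), embd.begin(), embd.end());
}

// call whenever n_past is reset, e.g. n_past = params.n_keep during the context swap
void on_truncate(int n_past) {
    ctx_tokens.resize(n_past);
}

// dump exactly what the model is currently conditioning on
void dump_context(llama_context * lctx) {
    for (llama_token tok : ctx_tokens) {
        printf("%s", llama_token_to_str(lctx, tok));
    }
    printf("\n");
}
```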
last_n_tokens is not the actual context. I understand that. Is there a way to see the actual context? Is that what you would like to be able to see? n_past is the number of tokens reused from the past tokens (i.e. the context). Is it n_past tokens starting from the end or from the beginning of the context? I don't understand where the context is being truncated following the line if (n_past + (int) embd.size() > n_ctx). Thanks a ton for your help.
It is the line: n_past = params.n_keep; That is it. That is all the model needs to know. The model will now calculate as if only the first n_keep tokens of the context had been evaluated so far.
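For reference, the truncation in question happens in the context-swap block of examples/main/main.cpp; roughly the following (paraphrased, the exact code may differ between versions):

```cpp
if (n_past + (int) embd.size() > n_ctx) {
    const int n_left = n_past - params.n_keep;

    // the first n_keep tokens keep their KV-cache entries; nothing is recomputed for them
    n_past = params.n_keep;

    // the last n_left/2 generated tokens (taken from last_n_tokens, excluding the
    // currently pending embd) are re-inserted so they get re-evaluated at positions
    // n_keep, n_keep + 1, ... on the next llama_eval call
    embd.insert(embd.begin(),
                last_n_tokens.begin() + n_ctx - n_left/2 - embd.size(),
                last_n_tokens.end() - embd.size());
}
```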
Expected Behavior
The n_keep tokens (i.e. the tokenized params.prompt, e.g. alpaca.txt) are always part of the context and do not need to be recalculated.
Current Behavior
```cpp
// examples/main/main.cpp: the prompt is tokenized and n_keep is set to its length
auto embd_inp = ::llama_tokenize(ctx, params.prompt, true);
params.n_keep = (int) embd_inp.size();

// when the context is full, n_past is reset to n_keep and part of the recent
// context is re-inserted into embd for re-evaluation
n_past = params.n_keep;
embd.insert(embd.begin(), last_n_tokens.begin() + n_ctx - n_left/2 - embd.size(), last_n_tokens.end() - embd.size());

// after llama_eval, n_past advances by the number of tokens just evaluated
n_past += embd.size();
```
Are my statements correct?
Suggestions:
To solve this, we could change the re-insert to also carry over the first n_keep tokens:

```cpp
embd.insert(embd.begin(), last_n_tokens.begin() + n_ctx - n_left/2 - embd.size() - params.n_keep, last_n_tokens.end() - embd.size());
// also re-insert the first n_keep prompt tokens at the front
embd.insert(embd.begin(), last_n_tokens.begin(), last_n_tokens.begin() + params.n_keep);
```
Is this right?
Problem: this would basically recompute the permanent prompt (e.g. alpaca.txt) every time the context reaches the max size.
Why is this a problem? I run a model where the permanent prompt is 1000 tokens (a multi-shot prompt) and the questions are 250 tokens, so recomputing the permanent prompt every time is painful.
Question: How do we save the computation of the permanent prompt and then bring it back when the context is full?
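One possible direction (a sketch under assumptions, not something the current example necessarily does): snapshot the model state right after the permanent prompt has been evaluated and restore it when the context fills, instead of re-evaluating the prompt. This assumes the llama_get_state_size / llama_copy_state_data / llama_set_state_data functions from llama.h are available in the build being used; the helper names below are made up.

```cpp
#include <cstdint>
#include <vector>
#include "llama.h"

static std::vector<uint8_t> prompt_state; // hypothetical buffer holding the post-prompt state

// call once, right after the first llama_eval over the n_keep prompt tokens
void save_prompt_state(llama_context * ctx) {
    prompt_state.resize(llama_get_state_size(ctx));
    llama_copy_state_data(ctx, prompt_state.data());
}

// call when n_past + embd.size() > n_ctx, instead of re-evaluating the prompt
void restore_prompt_state(llama_context * ctx, int & n_past, int n_keep) {
    llama_set_state_data(ctx, prompt_state.data());
    n_past = n_keep; // the KV cache now holds exactly the n_keep prompt tokens again
}
```

After restoring, any recent tokens that should stay in the window still have to be re-evaluated, just like the re-inserted half in the current swap; note also that restoring the state resets the RNG and logits to their post-prompt values.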