Bug when prompt stored in --prompt-cache is longer than the new one #1585
Comments
Yes, I think that's how it works. As for it being "invalid", it depends on what you mean. I didn't write the code or anything, but the way saving sessions appears to work is that the current state of the random number generator is saved with the session. This means if your prompt is a prefix of the saved one, the cached one is used. However, you'll be starting generation from the RNG state after all the tokens in the saved session had been generated. In other words, if the saved session contained "Now is the time for all good men to come to the aid of their" and you loaded it with just the prompt "Now is the time", you wouldn't necessarily get the same continuation back.
If your definition of "valid" is "exactly reproducible, as if this had been the initial prompt with that specific seed", then the answer is no. You'd only have that guarantee if your prompt exactly matches the one that was saved. My changes can actually help you here a bit: if you load the session with a prefix and specify a certain seed, you will get the same tokens every time (assuming the seed wasn't negative, which implies choosing a random one).
The current behavior is useful for anyone who doesn't require exactly reproducible results. If you're writing a program that's interfacing with llama.cpp, one thing you could do is store an additional file with some metadata like the full initial prompt. Then your application can ensure that the prompts match and regenerate if necessary.
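A minimal sketch of that metadata idea in Python, assuming a hypothetical wrapper and a sidecar `.meta.json` file stored next to the cache (neither the helper names nor the file layout are part of llama.cpp):

```python
# Hypothetical wrapper helpers (not part of llama.cpp): keep a small JSON file
# next to the prompt cache recording the prompt that produced it, so the
# wrapper can verify the cache before reusing it.
import json
import os

def save_cache_metadata(cache_path: str, prompt: str) -> None:
    """Record the full prompt that produced the cache file."""
    with open(cache_path + ".meta.json", "w", encoding="utf-8") as f:
        json.dump({"prompt": prompt}, f)

def cache_is_reusable(cache_path: str, new_prompt: str) -> bool:
    """True only if the saved prompt is a prefix of the new one."""
    meta_path = cache_path + ".meta.json"
    if not os.path.exists(meta_path):
        return False
    with open(meta_path, encoding="utf-8") as f:
        saved_prompt = json.load(f)["prompt"]
    return new_prompt.startswith(saved_prompt)
```

If `cache_is_reusable("cache1", prompt)` returns False, the wrapper can delete the cache (or point at a fresh file) and regenerate instead of resuming from a mismatched state.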
This sounds like a very useful feature. I personally don't like the current behavior of overwriting the saved state without warning.
Related to the previous question, it's not really a space, it just gets rendered that way. Models have some special tokens that control their behavior. One is the SOD (start of document) token which LLaMA models expect at the beginning of the prompt. Generating an EOD token is also how models indicate their response is completed. llama.cpp ensures that token always exists as the first token because it's required to get valid output. I think there have been previous issues about changing the behavior of actually showing those tokens.
Thank you @KerfuffleV2 for your answer!
Is it only the RNG that's saved, and not the whole context? I mean, if the model outputs "D" after "A B C" and the session is saved at "C", then I restart it at "A" and expect "B" but it gives "E" – does this mean that it remembers "B C" (which was stored in the session), or does it just randomly restart from "A"? I didn't understand whether the older session is messing things up behind the scenes. I concluded that it is, because at zero temperature the RNG should not randomize anything, should it?
My first idea was to just track the prompt completely in my own wrapper, but when I saw that llama.cpp can continue arbitrary tails from the current cache, I assumed it would reasonably handle any prompt – longer, shorter, different – since it tracks the length and prints debug messages about the cache state. If it only supported prompts that exactly match the cache 1:1, I wouldn't have asked my question, because caching just wouldn't work for any changed prompt.
I'm fine with not reproducible, but I'm not fine with broken!
I don't like that I cannot swap the input with the output and re-run as-is to continue generation.
@aleksusklim The prompt cache stuff is new, so what's happening here isn't the intended behavior. This isn't related to the RNG being restored. The RNG bug was just another oversight.
Probably should throw in a disclaimer: I'm just a random person that made a couple small contributions to this project. I'm answering to the best of my knowledge (which is far from complete) but what I say isn't official at all.
The way most random number generators work is that after you generate a number, the state of the RNG is permuted in an unpredictable way. It's kind of like having a pot with a whole bunch of marbles, each marked with a number: you take a marble out, shake the pot, take a marble out, shake the pot, and so on. The state of the pot at the end of generation is what gets saved to the session file. So in your example, you're starting generation from the state left over after everything in the saved session. My pull makes it so that the saved RNG state is ignored when loading the session if you specify a seed. Note, however, that this doesn't mean results will match if you make the prompt you specify longer or shorter than the saved one.
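To illustrate just the RNG part with plain Python (this uses Python's `random` module, not llama.cpp's sampler, so it's only an analogy):

```python
# The state saved *after* generating the whole sequence is not the state that
# existed right after the prefix, so "resuming from a prefix" with that state
# produces a different continuation than the original run did.
import random

rng = random.Random(42)
full_run = [rng.randint(0, 9) for _ in range(8)]  # the original generation
saved_state = rng.getstate()                      # what a session save keeps

rng.setstate(saved_state)                         # load session, prompt = first 3 "tokens"
resumed = [rng.randint(0, 9) for _ in range(5)]

print(full_run[3:])  # what originally followed the prefix
print(resumed)       # generally different, even though the seed was the same
```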
The examples you provided didn't show that occurring. The first question would be: Are you absolutely, 100% positive that it wasn't user error? Something like accidentally leaving that space you mentioned could cause that kind of effect. Accidentally specifying a blank prompt. Accidentally overwriting your saved session with a blank prompt, etc. If you're positive it's not something you did wrong, then please show an (ideally reproducible) example of it happening.
It sounds like you may be better off using llama.cpp as a library, or making your own copy of the main example.
I showed that at temperature 0 it should not be affected by the RNG in any way. Or am I wrong?
I'm not 100% sure about that; it may depend on other sampling settings. I was talking about where you said it sometimes skipped your prompt. The RNG still having an effect when temperature is 0 may be an issue, but I wouldn't expect that to have anything to do with session restoration.
For me, the clear evidence that it is not skipping anything is that it outputs the expected strings at zero temperature. I did not save my previous results (because I was just testing around and not thinking about filing a bug yet). I can try to come up with steps to show that too, if it can be triggered at zero temperature, which I'll test later on a more recent commit.
Thanks to @DannyDaemonic, here's a reproducible example of some really weird prompt restoration behavior: #1550 (comment) However, I don't think it has anything to do with RNG (or at least not with restoring the RNG state) because my pull allows ignoring that state and the issue still occurs.
I put some thoughts in the other thread. As suggested, I believe applying the final logits (the sampling input) and RNG state to a prefix of the input is what leads to the unexpected sampling results.
@ejones When saving the session, the amount of data for the memory k/v, etc. is based on the number of tokens that had been generated. When loading a cached prompt where the supplied prompt is only a prefix, would it be possible to truncate things like the memory k/v to the lower number of tokens in the supplied prompt? If so, it seems like that might fix @DannyDaemonic's example.
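A rough sketch of that truncation idea, purely conceptual: the structures and the `evaluate` callback below are made up for illustration and are not llama.cpp's real API or session format.

```python
# Conceptual only: truncate the restored session to the matched prefix and
# refresh the logits so sampling starts from state matching the shorter prompt.
from dataclasses import dataclass, field

@dataclass
class FakeSession:
    tokens: list                  # tokens the session was saved with
    kv_per_token: list            # stand-in for per-token k/v memory
    logits: list = field(default_factory=list)  # logits after the last saved token

def resume_from_prefix(session: FakeSession, prompt_tokens: list, evaluate):
    n = 0
    while (n < len(prompt_tokens) and n < len(session.tokens)
           and prompt_tokens[n] == session.tokens[n]):
        n += 1
    session.tokens = session.tokens[:n]
    session.kv_per_token = session.kv_per_token[:n]
    # The saved logits belong to the end of the *original* prompt, so they are
    # stale for a shorter prefix; re-evaluate the last matched token to get
    # logits that correspond to position n.
    if n > 0:
        session.logits = evaluate(session.tokens[n - 1], n_past=n - 1)
    return session
```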
All right, I failed to replicate the "broken roleplay" at zero temperature as long as my prompt was never an exact prefix of the last one.
I used in.txt:
outputs
Cache size: 48 MB.
Next one:
outputs
Cache size: 61 MB.
Third one:
outputs
Cache size: 82 MB.
Finally, repeating the second prompt:
outputs
Cache size drops back to 61 MB.
Does this mean that we can always provide the same cache file? This actually renders my wrapper useless, because I assumed that partially restoring a session was technically impossible! That's why I started experimenting with cache-swapping. But now it is enough to always provide the same cache file and not care about it anymore, while greatly improving execution speed. This is very good! Here are timings of the third prompt with the cache; it took 20 seconds, including mmap-loading the 7B model into 8 GB of RAM:
But here are timings of the same prompt (leading to the same answer) without specifying the prompt cache; it took 40 seconds (and it would obviously get worse as the history gets longer):
This is perfect for conversations that fit into the model context (since otherwise there will always be a brand-new generation with a heavily trimmed prompt after each line). (I'm closing this issue in the hope that the PR will be merged; we can continue the discussion here if it feels more appropriate than in the PR; I will reopen it later if I find another replicable inconsistency.)
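A minimal sketch of such a wrapper loop (illustrative only; it shells out to main.exe with the same flags as the commands above, and the file names are assumptions):

```python
# Illustrative wrapper loop: always pass the same --prompt-cache file so each
# turn only has to evaluate the newly appended tail of the conversation.
import subprocess

MODEL = "WizardLM-7B-uncensored.ggml.q5_1.bin"
CACHE = "cache1"

def run_turn(prompt: str) -> str:
    with open("in.txt", "w", encoding="utf-8") as f:
        f.write(prompt)
    result = subprocess.run(
        ["main.exe", "-t", "6", "-m", MODEL, "-c", "512",
         "--temp", "0", "--repeat_penalty", "1.2",
         "--prompt-cache", CACHE, "-f", "in.txt"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout  # cache/timing messages go to stderr

# Each turn's prompt must extend the previous one for the cache to be reused.
history = run_turn("Hello!")
```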
@KerfuffleV2 It seems the fix to this is just to recalculate the logits. @aleksusklim Let's leave this open until a PR that fixes this is merged, so that others who notice the issue can find an open issue. |
Am I understanding correctly that the ZZZ-bug is not fixed yet? Only the seed randomization was? I tried the new commit 66874d4.
So it should be very useful. Contents of in.txt:
First invocation: stderr:
stdout:
Next invocation: stderr:
stdout:
Perfect! But suppose that I'm not happy with the result, and I want to regenerate my entire request again.
stderr:
stdout:
This is exactly the "broken roleplay" that I saw earlier – it seems it was triggered by this exact use case, "regenerate after continuing iteratively" (which I previously did by updating the cache manually via extra runs; now it is much simpler!). The cache dropped to 50 MB. But let's restore the previous copy of it, this time changing the seed. stderr:
stdout:
This is indeed a different seed (the model just kinda overfitted and generally has very high confidence levels: https://huggingface.co/ehartford/WizardLM-7B-Uncensored/discussions/10)
So, everything seems to be working as intended, except for the case with an EXACT prefix of the cached prompt. BTW, can we have an option to not print the initial prompt (neither the user's, nor the cached one when the user's is empty)?
Yes, that's correct. My changes only make it so you don't have to repeatedly include the prompt when it's already saved in the session, and fix it so that specifying a seed actually takes effect. I saw DD's post, but I'm not really sure exactly what is required to recalculate the logits. I think something like part of the actual model evaluation might be required?
It would probably be better to put those things in a separate issue (or issues). As for the pesky space, I'd guess you're probably pretty safe just stripping leading whitespace as a temporary workaround. I can't really imagine a situation where someone would deliberately want to put spaces at the start of a prompt. If you're taking user input to generate your prompts, you could possibly also strip leading whitespace for consistent behavior.
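A tiny sketch of that workaround for the swap-the-files workflow (assuming the wrapper reads out.txt and writes the next in.txt itself):

```python
# Strip the leading space before feeding the previous output back in as the
# next prompt, so spaces don't accumulate across turns.
with open("out.txt", encoding="utf-8") as f:
    text = f.read()

with open("in.txt", "w", encoding="utf-8") as f:
    f.write(text.lstrip(" "))  # drop leading spaces only; keep the rest intact
```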
Edit: @DannyDaemonic here. (Sorry for the edit.) Here's a much simpler example. First build the cache like this:
Then try this:
The joke will start with a "Z" every time. Perhaps the logits are not being reevaluated for some reason. Changing one token, even the very last one, seems to work around the bug. The fix is to recalculate the logits.

What happens when a prompt stored in --prompt-cache is "longer" than the current one? I want to make a wrapper around main.exe, but this behavior looks strange and buggy. I don't know whether it is a bug or not, but it looks very, very confusing. I may post it as a separate issue if that's really a bug that you cannot solve in this PR.
So, here are my steps (on version ee96541 and this model):
main.exe -t 6 -m WizardLM-7B-uncensored.ggml.q5_1.bin -c 512 --temp 0 --repeat_penalty 1.2 --prompt-cache cache1 -f in.txt >out.txt
(note the cache file and zero temperature)
My prompt inside in.txt is this:
The model outputs this text inside out.txt (with an extra space before the first line, but let's assume I stripped it manually):
It also creates the prompt-cache file cache1 with a size of around 13 MB, and writes to stderr:
If I repeat the same command as-is, it recreates the same text and does not update the cache file, with stderr being:
Then I copy "cache1" file to
cache2
file. I also put the resulting text back intoin.txt
, this time cutting it after the last comma, so it becomes:Then I run, pointing to
cache2
: (which is the same as cache1 for now)main.exe -t 6 -m WizardLM-7B-uncensored.ggml.q5_1.bin -c 512 --temp 0 --repeat_penalty 1.2 --prompt-cache cache2 -f in.txt >out.txt
It gives the exact same line, continuing with "but I eventually found safety inside a church." and stopping as before.
But this time, cache2 is updated to 21 MB. So far so good! Its stderr said:
Finally, I copy cache2 to cache3 and cut the prompt back to the original "My…" text, just as in my very first in.txt contents.
I run, pointing to cache3, and get the following:
The file cache3 stays binary-equal to cache2, and the program outputs these lines to stderr, among others:
What just happened? To me, it looks like the program compared only the head of the prompt, discarding its cached tail; then it assumed that the cache was valid (despite it not being valid) and continued the generation from the cache, ending up in a broken state.
My questions:
And a few minor things:
– Why not allow running the evaluation with --n-predict 0 if I only want to cache the prompt? I have to specify --n-predict 1 just to cache the prompt, but that prevents me from using the (probably re-parsed?) output file, since it will contain an extra word at the end. I have to use the initial file.
– Why print an extra space at the beginning of stdout? If I later swap the files (feeding out.txt back as in.txt), it will grow each time (two spaces, three spaces…) and most likely destroy the cache. I had to strip it manually, but this feels odd.