
Some improvements to loading the session with --prompt-cache #1550

Merged

Conversation

KerfuffleV2
Collaborator

--seed is ignored when loading session

Currently the --seed parameter is ignored when loading the prompt. However, a very common use case would be to save a prompt and then try several attempts at generation with different seeds.

This pull includes a simple change that just sets the RNG seed if one was specified (a rough sketch follows the notes below). Two small notes:

  1. There isn't a way to tell if --seed was actually specified as far as I know, only that it's not the default -1 value. So --seed -1 is the same as not including it: it won't override the cached seed.
  2. The RNG won't be in the same state as if the seed had been specified originally. I.e. if you generate the cached prompt using --seed 123 and then load it with --seed 123, the subsequent tokens will not match. I don't think there's an easy way around this. It's not 100% ideal, but still a lot better than just completely ignoring the parameter with no warning.
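
Very roughly, the change amounts to something like this (an illustrative sketch, not the exact diff; the message text is made up and this is a fragment meant to live in examples/main/main.cpp after the session has been loaded):

    // After the session/prompt cache has been loaded into ctx:
    if (params.seed != -1) {
        // A seed other than the default -1 is in effect, so override the
        // RNG state that was just restored from the session file.
        fprintf(stderr, "warning: overriding session RNG state with seed %d\n", params.seed);
        llama_set_rng_seed(ctx, params.seed);
    }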

Blank prompt overwrites cached one

When loading a cached prompt from a session, you have to specify the prompt again. Even worse, if you forget to enter a prompt, your cached prompt gets overwritten by the blank one, since the blank prompt naturally has low similarity to the cached one.

This pull changes that behavior to simply use the tokens from the saved session if params.prompt is empty (in other words not set via --prompt or --file).
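
Conceptually, the fallback looks something like this (an illustrative sketch only; session_tokens and embd_inp follow the naming used in examples/main/main.cpp and the llama_tokenize helper comes from the examples' common code, but the real code differs in details):

    // Sketch: if no prompt was given but a session was loaded, reuse the
    // tokens stored in the session instead of treating the empty prompt
    // as a new (and very dissimilar) prompt.
    std::vector<llama_token> embd_inp;
    if (params.prompt.empty() && !session_tokens.empty()) {
        embd_inp = session_tokens;                             // reuse the cached prompt
    } else {
        embd_inp = ::llama_tokenize(ctx, params.prompt, true); // normal tokenization path
    }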

Closes #1439

Contributor

@github-actions github-actions bot left a comment


clang-tidy made some suggestions

examples/main/main.cpp (outdated review comments, resolved)
@KerfuffleV2 KerfuffleV2 force-pushed the feat-session_loading_improvements branch from cfdfc2f to f07993f on May 23, 2023 00:54
ejones previously approved these changes May 23, 2023
@ejones
Collaborator

ejones commented May 23, 2023

LGTM! Tested both features, works great!

@DannyDaemonic
Contributor

I like the seed fix. Personally, I would prefer it always overrides the seed with a random one if it's set to -1. If we want to keep the seed the same, we could pass it in each time. I don't see a need to keep the seed, but if you can think of a use case, perhaps we could check if seed was set to something like -2 and only then keep the original seed?

Also, currently, prompt cache treats text given during --interactive-first as the prompt when one isn't specified. I've found this to be very useful because that's how I usually feed my prompt in. This PR seems to break that use case.

@KerfuffleV2
Collaborator Author

@DannyDaemonic

Personally, I would prefer it always overrides the seed with a random one if it's set to -1.

I agree. The issue is that -1 is the default value for seed in the gpt_params struct, so there isn't currently a way to differentiate between --seed being left unspecified and --seed -1 being supplied explicitly.

The only way to fix that would be to store something like a seed_was_set bool in the params as well (or use some kind of container for values that could indicate whether each one was default or user-supplied). All of that sounds too complicated for the benefit, when you can just use --seed -2 or any other negative value if you want to make sure the seed is random.

Also, currently, prompt cache treats text given during --interactive-first as the prompt when one isn't specified. [...] This PR seems to break that use case.

I will check that and get back to you. Breaking any current behavior definitely is not intended.

@DannyDaemonic
Contributor

I meant, just always, even if they don't specify the seed explicitly. I was thinking since it's just "prompt" cache, the seed shouldn't be cached at all.

Still, if you think the seed should only be overridden when explicitly set, you could set it to -2 before the gpt_params_parse call. It might need a comment to explain why it's being set but then when you saw -1, you'd know the user set it explicitly.

@KerfuffleV2
Collaborator Author

@DannyDaemonic

Still, if you think the seed should only be overridden when explicitly set

I don't want to change the existing behavior too much. I'm a little confused, though; I think it already works the way you want without any further changes.

With this pull, if you specify any negative seed except for -1, that will make it use a random seed even when loading a session. So you can just use --seed -2 when loading a session and get new results each time.
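
For example (placeholder model path, cache file name, and prompt; assuming this pull's changes):

./main -m /path/to/llama/bin --prompt-cache story.cache -p "Once upon a time"
./main -m /path/to/llama/bin --prompt-cache story.cache --seed -2

The first run evaluates and caches the prompt; the second reuses the cached prompt (no -p needed with this pull) and generates with a fresh random seed each time you run it.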


Also, can you please check and see if the change I just pushed fixes your interactive mode issue? (The behavior should now be identical to the master branch behavior.)

Contributor

@DannyDaemonic DannyDaemonic left a comment


I just tested the PR and it does preserve the use with --interactive-first. However, I found a bug related to your random seed code.


I don't want to change the existing behavior too much. I'm a little confused though, I think it already works the way you want without any further changes.

With this pull, if you specify any negative seed except for -1, that will make it use a random seed even when loading a session.

Sorry, I'm talking about two separate solutions; maybe that's where the confusion is coming from.

The first possible solution is to just always override the seed. Although your PR says otherwise, this is actually what your code currently does. Since lines 156 to 158 never do anything, they should just be removed. Simple solution.

This will mean even without --seed specified, we update the seed. I think the seed not being updated was simply an oversight, not a design decision. To me, this is the expected behavior. Again, the argument being, it's only a prompt cache, not a full session cache.

The other suggestion is that -2 is the seed you ignore instead of -1. This would maintain the existing behavior of --seed -1. To accomplish this, you could set params.seed = -2; before gpt_params_parse is called on line 54. This is a bit messier to me, and I've seen people ask why the seed isn't being randomized anymore (which is probably not being done with --seed -1). I also think it's a bit early to worry that people are depending on the seed not changing.

But either way, both of these solutions work as expected when the user specifies --seed -1.

examples/main/main.cpp (review comment, resolved)
@aleksusklim

This comment was marked as off-topic.

@KerfuffleV2
Collaborator Author

@aleksusklim

If I'm reading your post correctly, it sounds like you haven't even tried this pull request? I'd like to stay on topic here since it's confusing to talk about other stuff unrelated to my changes. I initially thought you were saying my pull introduced new issues.

If you're just asking general questions about prompt caching behavior then creating a discussion in the Q&A section is probably most appropriate: https://github.com/ggerganov/llama.cpp/discussions/categories/q-a

You could possibly create an issue instead but Q&A discussion makes the most sense to me.

If you want to @ me when you do that, I can probably answer some of your questions.

@KerfuffleV2
Collaborator Author

@DannyDaemonic Thanks for taking the time to test and catching that problem! Since llama_load_from_file already has that logic I wasn't looking for it in other places as well.

To me, this is the expected behavior. Again, the argument being, it's only a prompt cache, not a full session cache.

The thing is, it actually saves and loads the RNG state there. So apparently someone thought it was important to preserve that state. I'm reluctant to change that behavior in something that's mainly supposed to be a bug fix.


I've made some further changes. Can you please test again? Hopefully I actually got it right this time - so much for a simple fix!

I don't love the approach of having to save the seed parameter before it gets overwritten but I'm not sure there's really a better way without making significant changes.

This is how the RNG/session loading interaction is supposed to work with the latest changes:

  • If --seed isn't specified (or --seed -1 is used), then the saved RNG state in the session file will be used. I also added a message to indicate this is happening (because the application prints out the seed no matter what, even when it isn't actually used).
  • If --seed is specified with any positive value, then the session RNG state is ignored and generation starts with an RNG initialized to the specified seed. It will also print out a message informing the user this occurred.
  • If --seed is specified with any negative value other than -1, the behavior is the same as above except that a new random seed is generated. (A rough sketch of this logic follows the list.)
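
In code, the intent is roughly the following (an illustrative sketch only; initial_seed stands for the --seed value saved off before anything overwrites it, and seeding from the clock is just an example):

    // Sketch of the intended seed handling once a session file has been loaded.
    if (initial_seed == -1) {
        // No explicit seed: keep the RNG state restored from the session file.
        fprintf(stderr, "info: using RNG state from the session file\n");
    } else {
        int32_t seed = initial_seed;
        if (seed < 0) {
            seed = (int32_t) time(NULL); // any other negative value: pick a fresh random seed
        }
        fprintf(stderr, "info: ignoring session RNG state, seeding RNG with %d\n", seed);
        llama_set_rng_seed(ctx, seed);
    }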

@KerfuffleV2 KerfuffleV2 requested a review from DannyDaemonic May 24, 2023 09:04
@aleksusklim

aleksusklim commented May 24, 2023

it sounds like you haven't even tried this pull request

No, I did not. I don't see any point in "fixing things" in the seed handling with --prompt-cache if the prompt cache is broken in other ways that prevent using it altogether.

@KerfuffleV2
Collaborator Author

@aleksusklim

No, I did not. I don't see any point to "fix things" in seed

Okay, well, hopefully you can appreciate how it would be confusing to start talking about different issues in a pull request aiming to fix some other problem.

Also, there's obviously a benefit to fixing existing problems even if it doesn't result in something that works perfectly for every use case. At the least, you're closer to a state where it will do what you want. Fixes and improvements are an incremental process; it's not all or nothing.

if this prompt cache is broken in other ways that prevent to use it altogether.

Maybe it's broken for your specific use case, but that doesn't mean it's broken/useless altogether. From your description, it seems like you're doing something unusual and pretty specific.

Anyway, my advice is to create a discussion. Like I said, if you @ me I will try to help you out by answering some of your questions there.

@aleksusklim

aleksusklim commented May 24, 2023

I just created a separate issue: #1585

BTW, your PR is titled Some improvements to loading the session with --prompt-cache (and not "randomizing seed"), and improvements to the prompt cache are what I proposed.

@DannyDaemonic
Contributor

DannyDaemonic commented May 24, 2023

The thing is, it actually saves and loads the RNG state there. So apparently someone thought it was important to preserve that state. I'm reluctant to change that behavior in something that's mainly supposed to be a bug fix.

This is using a feature in the llama.cpp API that stores and loads the entire state. This is so you can do things such as save states for a web server where you're having conversations with many people at once. Everything is ready to go from the stored state so you can just swap in whatever state you need and evaluate the next response.

In this case, however, the state is only being used as a --prompt-cache. Restoring things blindly is also causing issue #1585, and it's probably causing other settings besides --seed to be ignored as well. Seed is just the most obvious one, because every time you load it, you get the same response.

@KerfuffleV2
Collaborator Author

Restoring things blindly is also causing issue #1585.

I'm not fully convinced that's an issue. At least, that person hasn't shown an actual problem there yet.

But one of their problems seemed to be that you don't get the same sequence of tokens when restoring from the cache with a prompt that's a prefix of the saved one compared to when the prompt cache was initially saved. Not restoring the RNG state would still cause that "problem".

I think the only way around that would be to store the RNG state after every token had been generated. Even then, it probably wouldn't be reliable because stuff like different sampling settings could result in a different number of random numbers generated per token (maybe?).

I'm also a lot less sure about making changes to the llama.cpp library itself, because that's going to affect all API users, all the other examples, etc. I don't really have the familiarity with this project to predict all the effects, so I wouldn't really want to mess with it there. That couldn't fix the problem by itself anyway, since the main example is overwriting the seed param on its own.

@aleksusklim

aleksusklim commented May 24, 2023

At least, that person hasn't shown an actual problem there yet.

CharacterAI chat: https://github.com/Cohee1207/SillyTavern/blob/999a94718df39227fc5ea2bc80006351f43c5a88/public/instruct/WizardLM.json#L3

Write AI's next reply in a fictional roleplay chat between User and AI.

### Instruction:

User: Hello, I …
AI: Great! I think…
User: What if I just…

### Response:

AI: 

Then it will print something like:
In that case, we will…

And I need to craft this one:

Write AI's next reply in a fictional roleplay chat between User and AI.

### Instruction:

User: Hello, I …
AI: Great! I think…
User: What if I just…
AI: In that case, we will…
User: <NEXT USER INPUT GOES HERE>

### Response:

AI: 

This is not "adding to the end": it should be pasted after "User:" but before "### Response:", which means I need to restore the session from that point somehow. For example, by cutting after the last user input and doing a fake run just to save the session to a separate file; then appending the Response-tail template and saving another session derived from it; then regenerating with different seeds (if that were possible, of course!) or different settings until I'm satisfied with the answer; then adding it to the prompt and updating the initial session the same way.
(Otherwise I will have to feed it the same chat history over and over again, which quickly becomes the bottleneck.)

That requires my wrapper to have FULL control over user prompts, and to track which one belongs to each session copy.
But I see that main.exe tries to recover from incomplete sessions on its own. So I'm not sure which behavior I should rely on.

The best behavior would be if I didn't have to care about sessions at all: just give it a file and then work with the prompt ONLY, so that main.exe decides whether to use the cache, update it, or discard it – without any effect on the final result, just as if it were generated from scratch.

Currently, when given a shorter prompt than the one in the session, I get irrelevant replies, not just randomized ones! The roleplay just breaks if I change the prompt arbitrarily while using the same session file without copy-swapping it manually.

I think the only way around that would be to store the RNG state after every token had been generated.

Isn't the "state" (context) stored? The cache files are large; I presume they contain all of the model's internal state.

@DannyDaemonic
Contributor

DannyDaemonic commented May 24, 2023

Restoring things blindly is also causing issue #1585.

I'm not fully convinced that's in issue. At least, that person hasn't show an actual problem there yet.

I think I know what he's talking about. I've seen similar things myself. Try this:

./main -m /path/to/llama/bin --prompt-cache Z.cache -p "Here's a funny joke. No, wait, here's a bunch of Zs: Z Z Z Z Z Z Z Z Z Z"

Then try this:

./main -m /path/to/llama/bin --prompt-cache Z.cache -p "Here's a funny joke."

The joke will start with a Z every time. Something is wrong, but I don't know what. I think something is being restored that shouldn't be. Again, that's because we're restoring the entire state when we're just trying to use it as a prompt cache. Perhaps the logits are not being reevaluated for some reason. Changing even one token, even the very last, seems to work around the bug.

That couldn't fix the problem by itself anyway, since the main example is overwriting the seed param on its own.

The main example doesn't call llama_set_rng_seed until your patch. It's set by the code in the function I posted earlier. Upon looking more closely, it is probably restored intentionally. I just thought it might be a bug because it's initialized and then overwritten when you'd typically have a different path.

I do think setting it here in main is a good idea.

@KerfuffleV2
Collaborator Author

The joke will start with a Z every time.

I can confirm that. Wow, really weird. I don't think my changes make that part worse but it definitely seems like an actual bug that should be looked at.

I don't really think it's related to saving the RNG state though. With my patch and overriding the RNG and reseeding it, I still see that behavior.

I do think setting it here in main is a good idea.

Are you saying you think the current approach I'm using is okay after all?

What's your opinion on the current state of this pull request? Do you think anything needs to change before merging?

@DannyDaemonic
Contributor

DannyDaemonic commented May 24, 2023

I don't really think it's related to saving the RNG state though. With my patch and overriding the RNG and reseeding it, I still see that behavior.

Absolutely not. My point was that just because something is restored by the state doesn't mean we need to use it for our prompt cache.

Are you saying you think the current approach I'm using is okay after all?

What's your opinion of the current state of this pull request. Do you think anything needs to change before merging?

I'm still leaning towards always replacing the seed, no matter what. (Again, it's only a cache.)

At the very least, we should honor --seed -1 since that's what's used in the --help, in the examples, and in the docs. To always honor -1, you can just set params.seed to -2 before gpt_params_parse on line 54 and test for -2 instead of -1 as the exception.
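
Something along these lines (an untested sketch just to illustrate the idea; seeding from the clock is only an example):

    // Sketch: use -2 as a sentinel for "--seed was not given on the command line",
    // so an explicit --seed -1 can still be honored as "randomize".
    params.seed = -2;                          // set before parsing the arguments
    if (gpt_params_parse(argc, argv, params) == false) {
        return 1;
    }
    // ... later, after the session file has been loaded:
    if (params.seed == -2) {
        // --seed was never passed: keep whatever RNG state the session restored.
    } else {
        int32_t seed = params.seed;
        if (seed < 0) {
            seed = (int32_t) time(NULL);       // -1 (or any negative): randomize
        }
        llama_set_rng_seed(ctx, seed);
    }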

@KerfuffleV2
Collaborator Author

@DannyDaemonic

My point was just because something is restored by the state doesn't mean we need to use it for our prompt cache.

That's reasonable. I don't have any strong opinion here; I'm just reluctant to make changes to the behavior because I'm not familiar with the decisions that led to stuff being there or not.

If anyone with authority (which might be you?) tells me it should be a certain way then I'm happy to accept it and do it that way.

I'm still leaning towards always replacing the seed, no matter what. (Again, it's only a cache.)

Probably the only benefit of the current behavior is that if you restart from the cache with the whole prompt (not a prefix), then you actually will get the same sequence (assuming the same sampler settings and no other code changes that would affect generation).

At the very least, we should honor --seed -1 since that's what's used in the --help, in the examples, and in the docs.

That makes sense. The only thing I'd change is using something a user would be less likely to select than -2. Maybe -2147483648 (the lowest value an int32_t can hold)? It's a bit weird/unintuitive for there to be a magic value, but at least with this, if someone knows the lowest value for a 32-bit int, presumably they'll be better equipped to deal with the problem than someone who happens to choose -2.

So the change would involve making the default value for seed in the params structure be -2147483648 instead of -1 in examples/common.h and then testing initial_seed against that.

Sound okay?

@ejones ejones dismissed their stale review May 24, 2023 13:29

I'll let @DannyDaemonic take this

@ejones
Collaborator

ejones commented May 24, 2023

I wrote the session/prompt cache feature, building on the state saving APIs as mentioned. Overall, I went for the simplest approach (both in implementation and CLI usage) to speed up prompt evaluation. I absolutely agree it needs to be iterated on, and my impression is that nothing about prompt (or even state) restoration is particularly sacred at this point.

Here's my take on some of the points being discussed:

  • regarding what is restored, the prompt cache indeed restores the logits (input to sampling) and RNG from the original save point, by way of the state APIs. I believe this is what leads to the oddities like the original next token (the "Z" example) showing up on a reloaded prefix of the prompt, and the inconsistent sampling between original and cached runs.
  • part of llama, main : save state incrementally #1310 included forcing one token to be evaluated if you load a prefix of the saved state, thus regenerating logits, which I believe could address the "odd first token" issue
  • regarding "only a prompt cache", I've also thought about being more surgical about what is saved and potentially diverging from/decomposing state APIs. For example, my understanding is logits are purely an output of evaluation, so can be omitted and regenerated as long as one evaluation occurs after restoration
  • for RNG restoration, I had a similar thought to @KerfuffleV2 where I think you'd need to checkpoint the RNG state periodically (or somehow replay the RNG?) to approach something like perfect reproducibility
  • as for checkpointing intermediate state, I prototyped that in llama, main : save state incrementally #1310. As suggested, I believe with an approach like that, you can get more predictable/reproducible results by starting generation from a complete checkpoint
  • as for additional options for controlling session/cache files, I fully agree - I was reluctant to complicate the CLI, but I think in its current state it can be hard to reason about the exact state of the cache before and after

@DannyDaemonic
Contributor

@KerfuffleV2

If anyone with authority (which might be you?) tells me it should be a certain way then I'm happy to accept it and do it that way.

I don't know if there's really any kind of hierarchy around here, other than ggerganov has the ultimate say since he's the mastermind behind it all. His philosophy seems to be: have fun, don't be afraid of breaking changes, and iterate. I think he generally relies on everyone to discuss changes and come up with reasonable solutions (while avoiding drama).

I wouldn't be too worried about changing the behavior yet. All of this is very new, and as ejones said, "nothing about prompt (or even state) restoration is particularly sacred at this point."

I think it's also the simpler solution: just overwrite it every time and treat the --prompt-cache as strictly a prompt cache. This means that if ejones does come up with a more efficient ("surgical") state save format, he doesn't have to hold onto the mersenne_twister_engine data from the RNG.

That said, if you are depending on this feature as is, then let's go with option B and leave the seed as unchanged by default.

The only thing I'd change is using something a user would be less likely to select than -2.

At the start of this I almost suggested using MIN_INT myself, but then I was thinking -2 would be an easy number to remember. Either is fine really, but if we go this route I'll also probably have to update the documents under the --prompt-cache write-up in the README with the number as a "Here be dragons" type warning. (My thinking was, if it's a hard number to remember then the warning doesn't do as much good.)

@ggerganov
Owner

His philosophy seems to be: have fun, don't be afraid of breaking changes, and iterate. I think he generally relies on everyone to discuss changes and come up with reasonable solutions

Yes, that's well said

I am paying less attention to the changes in the examples and trust that you will figure out what's right.
As long as the examples demonstrate the basic functionality of the model, everything else is a bonus.
Breaking stuff in the examples is fine, since that part of the code is not directly used in 3rd party projects, so we can afford to experiment

@x4080

x4080 commented May 24, 2023

waiting for this

@KerfuffleV2
Collaborator Author

Thanks everyone that replied, appreciate the information!

@DannyDaemonic

I think it's also the simpler solution; just overwrite it every time

I'm fully convinced! I made that change and also added a little note to the README mentioning that restoring a cached prompt doesn't mean you get the same sequence of tokens as the original generation, even when specifying the same seed.
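
So the end result behaves roughly like this (an illustrative sketch, not the literal merged diff; seeding from the clock is just an example):

    // The prompt cache is treated strictly as a prompt cache: the RNG is
    // always (re)seeded from --seed, using a random seed when the value is
    // negative or left unspecified, regardless of the RNG state in the session.
    if (params.seed < 0) {
        params.seed = (int32_t) time(NULL);
    }
    fprintf(stderr, "seed = %d\n", params.seed);
    llama_set_rng_seed(ctx, params.seed);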

That said, if you are depending on this feature as is

Nope, I only cared about being able to try multiple seeds from a cached prompt without having to repeatedly specify it.

have fun, don't be afraid of breaking changes

As a pessimistic, risk-averse type of person I fear that would require brain surgery...

1. Currently the --seed parameter is ignored when loading the prompt. However, a very common use case would be to save a prompt and then try several attempts at generation with different seeds.
2. When loading a cached prompt from a session, you have to specify the prompt again. Even worse, if you forget to enter a prompt you'll get your cached prompt overwritten by the blank one.
Display some helpful information to the user when loading a session to make it clear when the seed applies or not.
Add a note in the main example README about how restoring a prompt doesn't imply restoring the exact session state.
@KerfuffleV2 KerfuffleV2 force-pushed the feat-session_loading_improvements branch from 3e32104 to 156d70b on May 25, 2023 06:40
@KerfuffleV2 KerfuffleV2 merged commit 66874d4 into ggerganov:master May 26, 2023
@KerfuffleV2 KerfuffleV2 deleted the feat-session_loading_improvements branch May 28, 2023 08:45
Successfully merging this pull request may close these issues.

[enhancement] reseed random number when loading from cache and --seed provided