
main : add option to save full output to session #1338

Merged (6 commits) on May 10, 2023

Conversation

@ejones (Collaborator) commented May 6, 2023

EDITED after updates

This is a much scaled-back change in place of #1310. Renames --session to --prompt-cache and adds a new option, --prompt-cache-all, that causes user input and generations to be saved to the session/cache as well. This new option allows for fast continuation of generations (with additional input).

Testing

  • --prompt-cache just saves the initial prompt:
% ./main -m ~/llama-models/30B/ggml-model-q4_0.bin --prompt-cache cache/meaning-life.30.bin -n 5 -p 'The meaning of life is 4'     
...

 The meaning of life is 42
Posted on
llama_print_timings:        load time =  1785.67 ms
llama_print_timings:      sample time =     3.42 ms /     5 runs   (    0.68 ms per run)
llama_print_timings: prompt eval time =  1770.00 ms /     8 tokens (  221.25 ms per token)
llama_print_timings:        eval time =   702.82 ms /     4 runs   (  175.70 ms per run)
llama_print_timings:       total time =  2574.34 ms
% du -hs cache/meaning-life.30.bin                                                                                       
 12M	cache/meaning-life.30.bin
% ./main -m ~/llama-models/30B/ggml-model-q4_0.bin --prompt-cache cache/meaning-life.30.bin -n 5 -p 'The meaning of life is 4'
...
 The meaning of life is 42
Posted on
llama_print_timings:        load time =   741.91 ms
llama_print_timings:      sample time =     3.38 ms /     5 runs   (    0.68 ms per run)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token)
llama_print_timings:        eval time =  1037.25 ms /     4 runs   (  259.31 ms per run)
llama_print_timings:       total time =  1270.93 ms
  • --prompt-cache-all saves prompt + generations, allowing ~constant generation time for continuing generation on successive calls:
% ./main -m ~/llama-models/30B/ggml-model-q4_0.bin --prompt-cache cache/meaning-life.30.bin --prompt-cache-all --seed 1 -n 15 -p 'The meaning of life is 4'
...
 The meaning of life is 42: Douglas Adams
The Hitchhiker's Guide to the
main: saving final output to session file 'sessions/meaning-life.30.bin'

llama_print_timings:        load time =  1323.96 ms
llama_print_timings:      sample time =    10.38 ms /    15 runs   (    0.69 ms per run)
llama_print_timings: prompt eval time =  1303.54 ms /     8 tokens (  162.94 ms per token)
llama_print_timings:        eval time =  2447.48 ms /    14 runs   (  174.82 ms per run)
llama_print_timings:       total time =  3959.82 ms

% ./main -m ~/llama-models/30B/ggml-model-q4_0.bin --prompt-cache cache/meaning-life.30.bin --prompt-cache-all --seed 1 -n 15 -p 'The meaning of life is 42: Douglas Adams 
quote> The Hitchhiker'\''s Guide to the'
...

 The meaning of life is 42: Douglas Adams
The Hitchhiker's Guide to the Galaxy, as it appears in the computer game adaptation of the series.
...
llama_print_timings:        load time =   692.69 ms
llama_print_timings:      sample time =    10.34 ms /    15 runs   (    0.69 ms per run)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token)
llama_print_timings:        eval time =  2969.64 ms /    15 runs   (  197.98 ms per run)
llama_print_timings:       total time =  3349.01 ms

% ./main -m ~/llama-models/30B/ggml-model-q4_0.bin --prompt-cache cache/meaning-life.30.bin --prompt-cache-all --seed 1 -n 15 -p 'The meaning of life is 42: Douglas Adams
The Hitchhiker'\''s Guide to the Galaxy, as it appears in the computer game adaptation of the series.'
...

 The meaning of life is 42: Douglas Adams
The Hitchhiker's Guide to the Galaxy, as it appears in the computer game adaptation of the series.
It’s been a few years since I last read Douglas Adams’
...
llama_print_timings:        load time =   686.97 ms
llama_print_timings:      sample time =    10.31 ms /    15 runs   (    0.69 ms per run)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token)
llama_print_timings:        eval time =  2979.24 ms /    15 runs   (  198.62 ms per run)
llama_print_timings:       total time =  3382.80 ms

% ./main -m ~/llama-models/30B/ggml-model-q4_0.bin --prompt-cache cache/meaning-life.30.bin --prompt-cache-all --seed 1 -n 15 -p 'The meaning of life is 42: Douglas Adams
The Hitchhiker'\''s Guide to the Galaxy, as it appears in the computer game adaptation of the series.
quote> It’s been a few years since I last read Douglas Adams’'
...
 The meaning of life is 42: Douglas Adams
The Hitchhiker's Guide to the Galaxy, as it appears in the computer game adaptation of the series.
It’s been a few years since I last read Douglas Adams’ “Hitchhikers” novels. They are really funny
...
llama_print_timings:        load time =   693.08 ms
llama_print_timings:      sample time =    10.30 ms /    15 runs   (    0.69 ms per run)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token)
llama_print_timings:        eval time =  2971.47 ms /    15 runs   (  198.10 ms per run)
llama_print_timings:       total time =  3405.97 ms
  • Also tested non-session usage and chat-13B.sh with the prompt cache.

@DannyDaemonic (Contributor) commented May 6, 2023

I like what you're doing here. I think there are two different needs being met by session files, and I wonder if, for clarity, we need two separate options: a prompt cache (which does what --session already does) and a session option (which has been your goal, I believe) that saves and loads your state and lets you continue your generation where it left off. I believe they could have the exact same file structure.

Here's what I would propose

And these are just my thoughts. I'm not saying it has to be this way by any means. Let me know what you think.

Prompt Cache: --prompt-cache perhaps?

This will do what's currently done by --session: it will use the restored session data to reuse as much of the context history as it can to speed up evaluation of the prompt.

Saved Sessions: --session as you originally intended

This implements full session saving and resumes the state.

Implementation
  1. --prompt-cache and --session both set path_session, but the second also sets resume = true (a rough sketch follows this list).

  2. If the session file doesn't exist, we save the session after evaluating the initial prompt, as is currently done.

  3. If resume is true, it will occasionally save the session after prompt evaluation. This could also happen when there's a context reset, which I still think might not be necessary as long as you save on exit. In this case, since we use Ctrl-C to exit, you'll also want to save the session in the signal handler before the _exit call.

  4. If a saved session is found on start:

    • If resume is false: Everything is handled as it currently is. We try to speed up the prompt by reusing as much of the evaluation as possible.
    • If resume is true: It will ignore the prompt and simply print out all the context that's been restored, using the tokens in the context to print the original text back to the terminal, thus allowing one to resume generation from the point they left off.
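
A rough sketch of how that option split might look (illustrative names such as SessionOpts and parse_session_opts; this is not code from the PR):

    #include <string>
    #include <vector>

    // Both options name the same file (same format), but only --session asks
    // to restore the saved state and continue generating from it.
    struct SessionOpts {
        std::string path_session;   // shared cache/session file
        bool        resume = false; // set only by --session
    };

    SessionOpts parse_session_opts(const std::vector<std::string> & args) {
        SessionOpts opts;
        for (size_t i = 0; i + 1 < args.size(); i++) {
            if (args[i] == "--prompt-cache") {
                opts.path_session = args[i + 1]; // reuse cached prompt evaluation only
            } else if (args[i] == "--session") {
                opts.path_session = args[i + 1]; // same file format ...
                opts.resume       = true;        // ... but also resume generation
            }
        }
        return opts;
    }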

@DannyDaemonic (Contributor) commented May 6, 2023

I just tested this out, and it looks like uninitialized tokens are 0, so we can resume by doing something like this:

    // Print the restored context back out; uninitialized (zero) tokens mark
    // where the saved session ends, so stop there and resume from that point.
    int n_resume_tokens = session_tokens.size();
    for (int i = 0; i < (int) session_tokens.size(); i++) {
        if (session_tokens[i] == 0) {
            n_resume_tokens = i;
            break;
        }
        printf("%s", llama_token_to_str(ctx, session_tokens[i]));
    }

@ivanstepanovftw (Collaborator) commented:

I like the idea of renaming --session to --prompt-cache.

@ejones (Collaborator, Author) commented May 6, 2023

Yeah, sounds good, I can make those adjustments. As for using --prompt and --session in conjunction, I'm thinking of appending the prompt to the session.

@DannyDaemonic (Contributor) commented May 6, 2023

Hmm, I guess I'm fine with it either tossing the prompt or appending it.

I don't see the use case for a prompt when you've already got context ready to resume, but it is weird to sort of just ignore the prompt someone passed in. The only downside to appending is that people can't just hit Up and Enter to resume; or, if there's a script that uses --session, it will always append the same initial prompt to that resumed conversation, which would effectively break the script. Although such a script could always just check for the existence of the session file and then start main with no prompt.

So, yeah. I can see both sides of it. This is the internet, so it might be something you set up one way and find it outrages people, in which case it'd be easy enough to change.

@ejones (Collaborator, Author) commented May 6, 2023

Yeah, it's so that we can resume with additional input, e.g. feedback or user messages. The current PR does this as it preserves the prefix matching on prompt.

Where this is going: I think we can now accomplish interactivity, instruct, etc. with repeated invocations of main rather than in-process. Then we can retire all that code and refocus main on just single-turn generation. I have a POC of chat working this way locally.
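
(For reference, the prefix matching mentioned above boils down to something like this; a simplified sketch using plain int tokens rather than main.cpp's actual token type.)

    #include <cstddef>
    #include <vector>

    // Count how many leading tokens of the new prompt already match the cached
    // session, so evaluation can skip them and only process the new tail.
    size_t count_matching_prefix(const std::vector<int> & session_tokens,
                                 const std::vector<int> & prompt_tokens) {
        size_t n = 0;
        while (n < session_tokens.size() && n < prompt_tokens.size() &&
               session_tokens[n] == prompt_tokens[n]) {
            n++;
        }
        return n; // n_past can start here; new evaluation begins at prompt_tokens[n]
    }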

@DannyDaemonic (Contributor) commented May 6, 2023

I can see that use case. I agree appending makes the most sense. I just think we should be consistent about it. If there's a session file from a previous --session, it seems to make sense to output all that context and then append the new prompt/input resuming from the end of the session.

An end user could keep a running text of the conversation and feed that in as the prompt each time, but it will "randomly" stop working. The reason is that once you hit your context length, the reset will drop half of your context. Two things will happen: your prompt will be longer than the context, so it will be refused; and if you wanted to use just the second half of your prompt so far, it'd be tricky to know what part of the prompt to trim (it's not exactly half). It would be a lot more practical if it just printed out the tokens from the context and used the prompt as additional input after that.
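
(To illustrate the reset being described, here is a rough sketch modeled on main.cpp's n_keep-based context swap, with details simplified.)

    #include <utility>
    #include <vector>

    // When the window fills up, keep the first n_keep tokens and re-feed only
    // the most recent half of what remains. The cut point is measured in
    // tokens, not text, and it is not exactly half of the conversation.
    void swap_context(std::vector<int> & tokens, int n_ctx, int n_keep) {
        if ((int) tokens.size() < n_ctx) {
            return; // still room, nothing to do
        }
        const int n_left = (int) tokens.size() - n_keep;
        std::vector<int> kept(tokens.begin(), tokens.begin() + n_keep);
        kept.insert(kept.end(), tokens.end() - n_left / 2, tokens.end());
        tokens = std::move(kept);
    }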

I'm not sure if I'm explaining it clearly. If you have a proof of concept, you could try setting your context to something low, like -n 32, to see what I mean. If you are just feeding the full history back in each time, you could never go past the length of the context.

Edit: Typo: should be -c 32

@ejones (Collaborator, Author) commented May 6, 2023

Yeah, I'm envisioning that context management can be externalized along with interactivity. This matches how the hosted LLMs work anyway. This can be provided in scripts; it just doesn't need to be hard-coded in main.

@DannyDaemonic (Contributor) commented:

Interesting. I would be hesitant to force the externalization of the context window. It seems to me that a script tracking the output would have a much easier time just appending the new output to its log of the conversation than tracking the actual context so that it can properly resume.

@DannyDaemonic (Contributor) commented:

Not to dissuade you from this approach, but I think what #1345 is inquiring about is a .so/.dll version of main. It sounds like they may have experience in that area. That might be a better approach if you're looking for more control over the context outside of main.

@ejones (Collaborator, Author) commented May 7, 2023

@DannyDaemonic I implemented the renames you suggested, but I decided to keep the original semantics of passing in the full prompt. In testing the appending behavior, I felt it was too hard to reason about the state of the session file, as you alluded to. And I do believe the ability to pass in new inputs (however it is done) is what makes this worthwhile; if you're just resuming without inputs, I feel like you could've just passed a larger n_predict in the first place?

(As a side benefit, I think this reduces the surprise of the new behavior of --session to folks already trying it out, as it should roughly still function if you (ab)use it as a prompt cache going forward).

Regarding #1345, I believe that actually hits the nail on the head for what I'm saying about retiring interactivity in main. This new --session behavior gives you fast, non-interactive generation so that a main-style process could suffice for cases like that (if otherwise convenient).

@DannyDaemonic (Contributor) commented:

Oh, sorry. I misspoke earlier. I meant try it with a low context, like -c 32. These models run with 512 by default, but resuming effectively breaks when your context is full, and you can't just keep pushing the context size up to get more context, as things become wildly unpredictable once you pass the training boundaries for the LLaMA models.

@ejones (Collaborator, Author) commented May 7, 2023

I mean, I understood what you meant. Is your point that this only works until you fill up the context size? If so I think that doesn't diminish its value?

@DannyDaemonic (Contributor) commented May 7, 2023

Of the two approaches (passing in the full history each time vs. resuming the session purely from the saved state), one keeps working. Passing in the full history only works until you hit your first context reset; after that, you can't resume anymore.

My thought is that it kind of defeats the point of being able to save and resume a session if sometimes when you save, you can't resume.

@ejones (Collaborator, Author) commented May 7, 2023

I see. Yes, with a growing prompt, the caller (of main) is responsible for not exceeding the context. I think that's fine? Again, I think this is something that is generally understood when building on LLM completions directly. And as far as examples using sessions go, e.g. chat, we can provide scripts that do the context swap just as main currently hard-codes it.

@DannyDaemonic (Contributor) commented May 7, 2023

The nice thing about this project is that, while it lets you control a lot of low-level options, you don't have to understand anything about Transformers, context, or even inference to use it. A session could work that way as well. I originally thought that your motive was to bring a session function into main; I think perhaps it was just the original name of the option that led me to think this.

I apologize for the confusion earlier. I actually feel a bit guilty, given that your original --session-full pull request is much more correct in this regard. Again, I think it was just the name "session" that confused me. If you want to roll back your changes and rename the options from --session and --session-full to --prompt-cache and --full-prompt-cache, this approach is actually fine.

As a side note, I don't know if the project is necessarily looking to retire the interactivity of main. A lot of people like it for its simplicity, even if it is through a terminal. Your end goal of retiring all interactivity from main is actually much more in line with the idea of just having a .so/.dll made from main. But these two approaches don't have to be at odds; I think main can offer full prompt caching as you desire for your uses.

Edit: Somehow I hit send while editing my response. I've fixed it up.

@DannyDaemonic (Contributor) commented:

I hit send early by accident somehow. I've cleaned up the comment on GitHub. Just noting here in case you're replying via email instead of on GitHub.

@ejones (Collaborator, Author) commented May 7, 2023

Thanks, yeah, it occurred to me that the nomenclature was part of the problem. I think initially I envisioned just a straight-up restoration of state but over the past few days I've shifted to the "full prompt cache" behavior as being more valuable. I'll make those renames.

As for main, my vision is more that examples/ would provide interactivity, instruct, context swapping, etc., just not hard-coded in main.cpp. Like, simplifying main and separating out the concerns of generating text vs. chatting and swapping context.

@ejones (Collaborator, Author) commented May 8, 2023

@DannyDaemonic updated.

@DannyDaemonic (Contributor) left a review comment

The code looks fine; the only real issue I see with this PR is that you only write the session file on one exit path. Nothing else noted is really a problem.

There are two points where the program exits: at the end of main, and in sigint_handler. If you're in interactive mode, the only way to exit is with Ctrl-C. So, as it stands, most people will end up quitting without their cache being written.

Now that I think about it though, it's probably not safe to call llama_save_session_file in sigint_handler because you could be in the middle of an evaluation.

That makes it tricky to just throw in there. Off the top of my head, we'd have to add another variable like is_interacting but named is_evaluating (both should be volatile sig_atomic_t, by the way) that we set to true and false around our llama evals. If someone tries to Ctrl-C while is_evaluating is set, it would warn them that saving may take a second, or that they can hit Ctrl-C again to exit without saving, and we'd set params.interactive to false, n_remain to 0, and params.n_predict to 0, which would trigger this exit path. A rough sketch of that flag handling follows below.
Edit: You'll still get stuck on input.
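
(To make that concrete, a rough sketch of the flag handling; want_exit and the messages are illustrative names and wording, and this is not code in the PR.)

    #include <csignal>
    #include <cstdio>
    #include <cstdlib>

    // volatile sig_atomic_t is what can safely be shared between a signal
    // handler and the main loop, hence the note about is_interacting /
    // is_evaluating above.
    static volatile sig_atomic_t is_evaluating = 0;
    static volatile sig_atomic_t want_exit     = 0;

    static void sigint_handler(int /*signo*/) {
        if (is_evaluating) {
            if (want_exit) {
                _exit(130); // second Ctrl-C: give up without saving
            }
            want_exit = 1;  // first Ctrl-C: let the main loop wind down and save
            fprintf(stderr, "\nfinishing the current eval, then saving the session...\n");
        } else {
            want_exit = 1;  // idle: the main loop can save and exit normally
        }
    }

    // In the main loop (schematically): set is_evaluating around llama_eval(),
    // and after each eval check want_exit to zero out n_remain and fall through
    // to the normal exit path, where the session file is written.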

If you don't have experience with interrupts, threading, or race conditions, I'd skip that approach.

The next best solution would be to write the session out during interactive mode every time before we prompt the user for input, around line 530, which, unfortunately, could be quite often. Optionally, to make this less painful, we could watch the token count (n_past) and only save when interactive mode is about to block (line 530) and we've passed n_last_save + 128 (or some number that makes sense), as in the sketch below.
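
(Schematically, that periodic save could look like the following; should_save_session and the 128-token threshold are illustrative, not code from this PR.)

    #include <string>

    // Before interactive mode blocks for input, decide whether to save: only
    // when enough new tokens (128 here, arbitrarily) have been evaluated since
    // the last save. In main.cpp the actual write would go through
    // llama_save_session_file().
    bool should_save_session(int n_past, int & n_last_save, const std::string & path_session) {
        const int save_every = 128; // tune to taste
        if (path_session.empty() || n_past < n_last_save + save_every) {
            return false;
        }
        n_last_save = n_past;
        return true;
    }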

Once again you're stung by the whole interactivity part of main. I guess you could just state that it doesn't work in interactive or instructional mode in gpt_print_usage and in README.md, but that makes it harder to accept this pull request as is.

Review threads (outdated, resolved): examples/common.h, examples/main/main.cpp
@ggerganov (Owner) commented:

> I guess you could just state that it doesn't work in interactive or instructional mode in gpt_print_usage and in README.md, but that makes it harder to accept this pull request as is.

To me it seems this functionality is mostly needed for non-interactive mode anyway, so if it is difficult to come up with a proper solution for interactive mode, then merge the PR as it is and we can figure it out later.

@ejones (Collaborator, Author) commented May 9, 2023

@DannyDaemonic addressed comments.

I agree with punting on interactive mode for now. I was indeed reluctant to grapple with the complexities of signal handling on that. Added usage text and an error to that effect.

@DannyDaemonic (Contributor) left a review comment

Looks good. Just need to move the check for incompatible parameters to gpt_params_parse and it's ready to merge.
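
(Roughly the kind of check being asked for; the field names follow the new --prompt-cache-all flag and the existing gpt_params members, and the exact wording in the merged code may differ.)

    // Inside gpt_params_parse, after the argument loop (sketch):
    if (params.prompt_cache_all &&
            (params.interactive || params.interactive_first || params.instruct)) {
        fprintf(stderr, "error: --prompt-cache-all is not supported in interactive mode yet\n");
        return false; // or print usage and exit, matching other parse errors
    }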

Review thread (outdated, resolved): examples/main/main.cpp
@ejones (Collaborator, Author) commented May 10, 2023

@DannyDaemonic updated

@ejones merged commit cf348a6 into ggerganov:master on May 10, 2023.
Co-authored-by: DannyDaemonic <DannyDaemonic@gmail.com>
@ejones (Collaborator, Author) commented May 10, 2023

Thanks @DannyDaemonic !

@nova706 commented Aug 14, 2023

@ejones, is there any possibility this could be used in server.exe as well? It could be very useful to start a server session by loading from a file, then allow saving the session back to the prompt-cache file at any point in time via an endpoint. This would give the user the flexibility to start a session, chat, reset, chat more, save, and pick it up later. I have started to take a look at this but am still getting familiar with the code differences between main and server.

Also, I'd appreciate any advice on the following use of prompt caching:

(TL;DR: Caching the static LLM instructions so they don't have to be computed every time a new session starts.)

I feel like I am using the prompt cache a little differently, in that I prefer to use it as a way to cache the initial instructions for the LLM before sending it a prompt. I start with a large prompt containing instructions (assistant characteristics, desired output formats, etc.). Because these are static between sessions, I don't want to load them every time. So I start with the instructions and --prompt-cache with --n-predict 1 (unsure if this is correct, but I can't get it to save the file unless it tries to generate something). The program starts, loads the instructions, interprets the next char, saves, and closes. I then open another session with the same instructions, --prompt-cache, --prompt-cache-ro, and --interactive-first. This starts, loads the model, loads the session (instructions), and then waits for the prompt from the user. For all subsequent sessions, I only make the second call (I only make the first call again if the instructions change). Overall, caching the bulk of the static prompt information and reusing it for each session is a huge decrease in startup time.

Is there a better way to do this?
