main : add option to save full output to session #1338

ejones · 2023-05-06T03:28:18Z

EDITED after updates

This is a much scaled-back change in place of #1310. Renames --session to --prompt-cache and adds a new option, --prompt-cache-all, that causes user input and generations to be saved to the session/cache as well. This new option allows for fast continuation of generations (with additional input).

Testing

--prompt-cache just saves the initial prompt:

% ./main -m ~/llama-models/30B/ggml-model-q4_0.bin --prompt-cache cache/meaning-life.30.bin -n 5 -p 'The meaning of life is 4'     
...

 The meaning of life is 42
Posted on
llama_print_timings:        load time =  1785.67 ms
llama_print_timings:      sample time =     3.42 ms /     5 runs   (    0.68 ms per run)
llama_print_timings: prompt eval time =  1770.00 ms /     8 tokens (  221.25 ms per token)
llama_print_timings:        eval time =   702.82 ms /     4 runs   (  175.70 ms per run)
llama_print_timings:       total time =  2574.34 ms
% du -hs cache/meaning-life.30.bin                                                                                       
 12M	cache/meaning-life.30.bin
% ./main -m ~/llama-models/30B/ggml-model-q4_0.bin --prompt-cache cache/meaning-life.30.bin -n 5 -p 'The meaning of life is 4'
...
 The meaning of life is 42
Posted on
llama_print_timings:        load time =   741.91 ms
llama_print_timings:      sample time =     3.38 ms /     5 runs   (    0.68 ms per run)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token)
llama_print_timings:        eval time =  1037.25 ms /     4 runs   (  259.31 ms per run)
llama_print_timings:       total time =  1270.93 ms

--prompt-cache-all saves prompt + generations, allowing ~constant generation time for continuing generation on successive calls:

% ./main -m ~/llama-models/30B/ggml-model-q4_0.bin --prompt-cache cache/meaning-life.30.bin --prompt-cache-all --seed 1 -n 15 -p 'The meaning of life is 4'
...
 The meaning of life is 42: Douglas Adams
The Hitchhiker's Guide to the
main: saving final output to session file 'sessions/meaning-life.30.bin'

llama_print_timings:        load time =  1323.96 ms
llama_print_timings:      sample time =    10.38 ms /    15 runs   (    0.69 ms per run)
llama_print_timings: prompt eval time =  1303.54 ms /     8 tokens (  162.94 ms per token)
llama_print_timings:        eval time =  2447.48 ms /    14 runs   (  174.82 ms per run)
llama_print_timings:       total time =  3959.82 ms

% ./main -m ~/llama-models/30B/ggml-model-q4_0.bin --prompt-cache cache/meaning-life.30.bin --prompt-cache-all --seed 1 -n 15 -p 'The meaning of life is 42: Douglas Adams 
quote> The Hitchhiker'\''s Guide to the'
...

 The meaning of life is 42: Douglas Adams
The Hitchhiker's Guide to the Galaxy, as it appears in the computer game adaptation of the series.
...
llama_print_timings:        load time =   692.69 ms
llama_print_timings:      sample time =    10.34 ms /    15 runs   (    0.69 ms per run)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token)
llama_print_timings:        eval time =  2969.64 ms /    15 runs   (  197.98 ms per run)
llama_print_timings:       total time =  3349.01 ms

% ./main -m ~/llama-models/30B/ggml-model-q4_0.bin --prompt-cache cache/meaning-life.30.bin --prompt-cache-all --seed 1 -n 15 -p 'The meaning of life is 42: Douglas Adams
The Hitchhiker'\''s Guide to the Galaxy, as it appears in the computer game adaptation of the series.'
...

 The meaning of life is 42: Douglas Adams
The Hitchhiker's Guide to the Galaxy, as it appears in the computer game adaptation of the series.
It’s been a few years since I last read Douglas Adams’
...
llama_print_timings:        load time =   686.97 ms
llama_print_timings:      sample time =    10.31 ms /    15 runs   (    0.69 ms per run)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token)
llama_print_timings:        eval time =  2979.24 ms /    15 runs   (  198.62 ms per run)
llama_print_timings:       total time =  3382.80 ms

% ./main -m ~/llama-models/30B/ggml-model-q4_0.bin --prompt-cache cache/meaning-life.30.bin --prompt-cache-all --seed 1 -n 15 -p 'The meaning of life is 42: Douglas Adams
The Hitchhiker'\''s Guide to the Galaxy, as it appears in the computer game adaptation of the series.
quote> It’s been a few years since I last read Douglas Adams’'
...
 The meaning of life is 42: Douglas Adams
The Hitchhiker's Guide to the Galaxy, as it appears in the computer game adaptation of the series.
It’s been a few years since I last read Douglas Adams’ “Hitchhikers” novels. They are really funny
...
llama_print_timings:        load time =   693.08 ms
llama_print_timings:      sample time =    10.30 ms /    15 runs   (    0.69 ms per run)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token)
llama_print_timings:        eval time =  2971.47 ms /    15 runs   (  198.10 ms per run)
llama_print_timings:       total time =  3405.97 ms

also tested non-session usage and chat-13B.sh with prompt cache

DannyDaemonic · 2023-05-06T08:36:34Z

I like what you're doing here. I think there are two different needs being met with session files. I wonder if, for clarity, we don't need two separate options: a prompt cache (which does what --session already does) and a session option (that's been your goal I believe?) that will save and load your state and allow you to continue your generation where it left off. I believe they could have the exact same file structure.

Here's what I would propose

And these are just my thoughts. I'm not saying it has to be this way by any means. Let me know what you think.

Prompt Cache: `--prompt-cache` perhaps?

This will do what's currently done by session. It will use the session data that's been restored to reuse as much of the context history as it can to speed up the evaluation of the prompt.

Saved Sessions: `--session` as you originally intended

This implements full session saving and resumes the states.

Implementation

--prompt-cache and --session both set the path-session, but the second also sets resume = true.
If the session file doesn't exist, we save the session after evaluating the initial prompt, as is done currently done.
If resume is true, it will occasionally save the session post prompt evaluation. This can also be when there's a context reset which I still think might not be necessary as long as you save on exit. In this case, since we use Ctrl-C to exit, you'll also want to save the session in the signal handler before the _exit call.
If a saved session is found on start:
- If resume is false: Everything is handled as it currently is. We try to speed up the prompt by reusing as much of the evaluation as possible.
- If resume is true: It will ignore the prompt and simply print out all the context that's been restored by using the tokens in the context to print the original text back to the terminal. Thus allowing one to resume generation from the point they left off.

DannyDaemonic · 2023-05-06T08:49:53Z

I just tested this out, and it looks like uninitialized tokens are 0, so we can resume by doing something like this:

            int n_resume_tokens = session_tokens.size();
            for (int i = 0; i < (int) session_tokens.size(); i++) {
                if (session_tokens[i] == 0) {
                    n_resume_tokens = i;
                    break;
                }
                printf("%s", llama_token_to_str(ctx, session_tokens[i]));
            }

ivanstepanovftw · 2023-05-06T12:44:10Z

I like idea to rename --session to --prompt-cache

ejones · 2023-05-06T13:47:39Z

Yeah sounds good, I can make those adjustments. As for --prompt --session in conjunction, I'm thinking to append the prompt to the session.

DannyDaemonic · 2023-05-06T14:01:44Z

Hmm, I guess I'm fine with it either tossing the prompt or appending it.

I don't see the use case for a prompt when you've already got context ready to resume, but it is weird to sort of just ignore the prompt someone passed in. The only downside to appending it would mean be people can't just hit Up and Enter to resume, or if there's a script that uses --session it will always append the same initial prompt to that resumed conversation which would effectively break the script. Although, such a script could always just check for the existence of the session file and then start main with no prompt.

So, yeah. I can see both sides to it. This is the internet, so it might be something you set up one way and find it outrages people - in which case, it'd be easy enough to change.

ejones · 2023-05-06T14:35:45Z

Yeah, it's so that we can resume with additional input, e.g. feedback or user messages. The current PR does this as it preserves the prefix matching on prompt.

Where this is going is I think we can accomplish interactivity, instruct etc now with repeat invocations of main rather than in process. Then we can retire all that code and refocus main on just single turn generation. I have a POC of chat working this way locally.

DannyDaemonic · 2023-05-06T15:07:29Z

I can see that use case. I agree appending makes the most sense. I just think we should be consistent about it. If there's a session file from a previous --session, it seems to make sense to output all that context and then append the new prompt/input resuming from the end of the session.

An end user could keep a running text of the conversation and feed that in as the prompt each time, but it will "randomly" stop working. The reason being that once you hit your context length, the reset will drop half of your context. Two things will happen, your prompt will be greater than the context so it will be refused, and if you wanted to use just the second half of your prompt so far, it'd be tricky to know what part of the prompt to trim. (It's not exactly half.) It would be a lot more practical if it just printed out the tokens from the context and used the prompt as additional input after.

I'm not sure if I'm explaining it clearly. If you have a proof of concept, you could try setting your context to something low, like -n 32, to see what I mean. If you are just feeding the full history back in each time, you could never go past the length of the context.

Edit: Typo: should be -c 32

ejones · 2023-05-06T15:10:31Z

Yeah, I'm envisioning that context management can be externalized along with interactivity. This matches how the hosted LLMs work anyway. This can be provided in scripts, just doesn't need to be hard-coded in main.

DannyDaemonic · 2023-05-06T15:21:12Z

Interesting. I would be hesitant to force the externalization of the context window. It would seem to me, a script tracking the output would have a lot easier of a time just appending the new output to it's log of the conversation than tracking the actual context so that it can properly resume.

DannyDaemonic · 2023-05-06T15:41:07Z

Not to dissuade you from this approach, but I think what #1345 is inquiring about is an so/dll version of main. It sounds like they may have experience in that area. That might be a better approach if you're looking for more control over the context outside of main.

ejones · 2023-05-07T02:57:43Z

@DannyDaemonic I implemented the renames you suggested, but I decided to keep the original semantics of passing in the full prompt. In testing the appending behavior, I felt it was too hard to reason about the state of the session file, as you alluded to. And I do believe the ability to pass in new inputs (however it is done) is what makes this worthwhile; if you're just resuming without inputs, I feel like you could've just passed a larger n_predict in the first place?

(As a side benefit, I think this reduces the surprise of the new behavior of --session to folks already trying it out, as it should roughly still function if you (ab)use it as a prompt cache going forward).

Regarding #1345, I believe that actually hits the nail on the head for what I'm saying about retiring interactivity in main. This new --session behavior gives you fast, non-interactive generation so that a main-style process could suffice for cases like that (if otherwise convenient).

DannyDaemonic · 2023-05-07T03:02:25Z

Oh, sorry. I misspoke earlier. I mean try it with a low context, like -c 32. These models run with 512 by default but it effectively breaks resumes when your context is full and you can't just keep pushing the context up to get more context as things become wildly unpredictable once you pass the training boundaries for the LLaMA models.

ejones · 2023-05-07T03:33:12Z

I mean, I understood what you meant. Is your point that this only works until you fill up the context size? If so I think that doesn't diminish its value?

DannyDaemonic · 2023-05-07T03:37:27Z

Of the two approaches, passing in the full history each time vs resuming the session purely from the state, one keeps working. Passing in the full history only works until you hit your first context reset, then you can't resume anymore.

My thought is that it kind of defeats the point of being able to save and resume a session if sometimes when you save, you can't resume.

ejones · 2023-05-07T03:56:53Z

I see. Yes, with a growing prompt, the caller (of main) is responsible for not exceeding the context. I think that's fine? Again, I think this is something that is generally understood when building on LLM completions directly. And as far as examples using session, e.g., chat, we can provide scripts that do the context swap just as much as main hard codes it.

DannyDaemonic · 2023-05-07T04:41:25Z

The nice thing about this project is while it lets you control a lot of low level options, you don't have to understand anything about Transformers, context, or even inference to use it. A session could do that as well. I originally thought that your motive was to bring a session function into main. I think perhaps it was just the original name of the option that led me to think this.

I apologize for the confusion earlier. I actually feel a bit guilty being that your original --session-full pull request is much more correct in this regard. Again, I think it was just the name "session" that confused me. If you want to roll back your changes and rename it from --session and --session full to --prompt-cache and --full-prompt-cache, this approach is actually fine.

As a side note, I don't know if the project is necessarily looking to retire the interactivity of main. A lot of people like it for the simplicity even if it is through a terminal. Your end goal of retiring all interactivity from main is actually much more in line with the idea of just having an so/dll made from main. But these two approaches don't have to be at odds. I think main can offer full prompt caching as you desire for your uses.

Edit: Somehow I hit send while editing my response. I've fixed it up.

DannyDaemonic · 2023-05-07T04:46:15Z

I hit send early on accident somehow. I've cleaned up the comment on Github. Just noting here in case you're replying via email instead of on Github.

ejones · 2023-05-07T04:57:39Z

Thanks, yeah, it occurred to me that the nomenclature was part of the problem. I think initially I envisioned just a straight-up restoration of state but over the past few days I've shifted to the "full prompt cache" behavior as being more valuable. I'll make those renames.

As for main, my vision is more that examples/ would provide interactivity, instruct, context swapping, etc., just not hard coded in main.cpp. Like, simplifying main and separating out the concerns of generating text vs chatting and swapping context.

ejones · 2023-05-08T02:49:59Z

@DannyDaemonic updated.

DannyDaemonic

The code looks fine, the only real issue I see with this PR is you only write the session file on one exit path. Nothing else noted is really a problem.

There are two points where the program exits. At the end of main, and in sigint_handler. If you're in interactive mode, the only way to exit is with Ctrl-C. So as is, for most people, they will end up quitting without their cache being written.

Now that I think about it though, it's probably not safe to call llama_save_session_file in sigint_handler because you could be in the middle of an evaluation.

That makes it tricky to just throw in there. Off the top of my head, we'd have to add another variable like is_interacting but named is_evaluating (which should both be volatile sig_atomic_t btw) that we set to true and false around our llama evals and and if someone tries to Ctrl-C while is_evaluating it would warn them that it may take a second to save or that they can hit Ctrl-C again to exit without saving and we'd set params.interactive to false, n_remain to 0, and params.n_predict to 0, which would trigger this exit path.
Edit: You'll still get stuck on input.

If you don't have experience with interrupts, threading, or race conditions, I'd skip that approach.

The next best solution would be write the session out during interactive mode every time before we prompt the user for input. Around line 530, which unfortunately, could be quite often. Optionally, to make this less painful, we could watch the token count (n_past) and only save when interactive mode is about to block (530) at n_last_save + 128 (or some number that makes sense).

Once again you're stung by the whole interactivity part of main. I guess you could just state that it doesn't work in interactive or instructional mode in gpt_print_usage and in README.md, but that makes it harder to accept this pull request as is.

examples/common.h

examples/main/main.cpp

ggerganov · 2023-05-08T16:57:29Z

I guess you could just state that it doesn't work in interactive or instructional mode in gpt_print_usage and in README.md, but that makes it harder to accept this pull request as is.

To me it seems this functionality is mostly needed for non-interactive mode anyway, so if it is difficult to come up with a proper solution for interactive mode, then merge the PR as it is and we can figure it out later.

ejones · 2023-05-09T03:16:19Z

@DannyDaemonic addressed comments.

I agree with punting on interactive mode for now. I was indeed reluctant to grapple with the complexities of signal handling on that. Added usage text and an error to that effect.

DannyDaemonic

Looks good. Just need to move the check for incompatible parameters to gpt_params_parse and it's ready to merge.

examples/main/main.cpp

ejones · 2023-05-10T02:31:21Z

@DannyDaemonic updated

examples/main/main.cpp

Co-authored-by: DannyDaemonic <DannyDaemonic@gmail.com>

ejones · 2023-05-10T15:37:38Z

Thanks @DannyDaemonic !

nova706 · 2023-08-14T18:15:59Z

@ejones, Is there any possibility this could be used in server.exe as well? It could be very useful to start a server session by loading from a file, then allow saving the session back to the prompt-cache file at any given point in time via an endpoint. This would give the user the flexibility to start a session, chat, reset, chat more, save and pick up later. I have started to take a look at this but am still getting familiar with the code differences between main and server.

Also, any advice with the following concerning prompt caching:

(TL;DR: Caching the static LLM instructions so they don't have to be computed every time a new session starts.)

I feel like I am using the prompt-cache a little different in that I prefer to use it as a way to cache the initial instructions for the LLM before sending it a prompt. I start with a large prompt with instructions (Assistant characteristics, desired output formats, etc.). Because these are static between sessions, I don't want to load this every time. So I start with instructions and --prompt-cache with --n-predict 1 (unsure if this is correct, but I can't get it to save the file unless it tries to generate something). The program starts, loads instructions, interprets the next char, saves and closes. I then open another session with the same instructions, --prompt-cache, --prompt-cache-ro and --interactive-first. This starts, loads the model, loads the session (instructions) and then waits for the prompt from the user. All subsequent sessions, I only make the second call (I only make the first call again if the instructions change). Overall, this is a huge decrease in startup time by caching the bulk of the static prompt information and reusing it for each session.

Is there a better way to do this?

ejones mentioned this pull request May 6, 2023

llama, main : save state incrementally #1310

Closed

ejones requested a review from DannyDaemonic May 6, 2023 04:06

ejones force-pushed the save-all branch from cbe462e to 8fed512 Compare May 7, 2023 02:39

ejones force-pushed the save-all branch from 8fed512 to 87d6f8c Compare May 8, 2023 02:47

DannyDaemonic suggested changes May 8, 2023

View reviewed changes

examples/common.h Outdated Show resolved Hide resolved

examples/main/main.cpp Outdated Show resolved Hide resolved

examples/main/main.cpp Outdated Show resolved Hide resolved

examples/main/main.cpp Outdated Show resolved Hide resolved

ggerganov reviewed May 8, 2023

View reviewed changes

examples/main/main.cpp Outdated Show resolved Hide resolved

ejones force-pushed the save-all branch from 87d6f8c to a7298bd Compare May 9, 2023 03:11

DannyDaemonic suggested changes May 9, 2023

View reviewed changes

examples/main/main.cpp Outdated Show resolved Hide resolved

main : add option to save full output to session

4c76d52

ejones added 4 commits May 9, 2023 22:14

split behavior into --session and --prompt-cache

56758f0

restore original implementation with new names

e4429e9

PR comments

8c88b17

move the check for incompatible parameters to gpt_params_parse

8826fb8

ejones force-pushed the save-all branch from a7298bd to 8826fb8 Compare May 10, 2023 02:29

DannyDaemonic reviewed May 10, 2023

View reviewed changes

examples/main/main.cpp Outdated Show resolved Hide resolved

Fix whitespace

ac5584b

Co-authored-by: DannyDaemonic <DannyDaemonic@gmail.com>

DannyDaemonic approved these changes May 10, 2023

View reviewed changes

ejones merged commit cf348a6 into ggerganov:master May 10, 2023

ejones mentioned this pull request May 12, 2023

[Feature Request] --prompt-cache-all + user input #1398

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

main : add option to save full output to session #1338

main : add option to save full output to session #1338

ejones commented May 6, 2023 •

edited

Loading

DannyDaemonic commented May 6, 2023 •

edited

Loading

DannyDaemonic commented May 6, 2023 •

edited

Loading

ivanstepanovftw commented May 6, 2023

ejones commented May 6, 2023

DannyDaemonic commented May 6, 2023 •

edited

Loading

ejones commented May 6, 2023 •

edited

Loading

DannyDaemonic commented May 6, 2023 •

edited

Loading

ejones commented May 6, 2023

DannyDaemonic commented May 6, 2023

DannyDaemonic commented May 6, 2023

ejones commented May 7, 2023

DannyDaemonic commented May 7, 2023

ejones commented May 7, 2023

DannyDaemonic commented May 7, 2023 •

edited

Loading

ejones commented May 7, 2023

DannyDaemonic commented May 7, 2023 •

edited

Loading

DannyDaemonic commented May 7, 2023

ejones commented May 7, 2023

ejones commented May 8, 2023

DannyDaemonic left a comment

ggerganov commented May 8, 2023

ejones commented May 9, 2023

DannyDaemonic left a comment

ejones commented May 10, 2023

ejones commented May 10, 2023

nova706 commented Aug 14, 2023

main : add option to save full output to session #1338

main : add option to save full output to session #1338

Conversation

ejones commented May 6, 2023 • edited Loading

DannyDaemonic commented May 6, 2023 • edited Loading

Here's what I would propose

Prompt Cache: --prompt-cache perhaps?

Saved Sessions: --session as you originally intended

Implementation

DannyDaemonic commented May 6, 2023 • edited Loading

ivanstepanovftw commented May 6, 2023

ejones commented May 6, 2023

DannyDaemonic commented May 6, 2023 • edited Loading

ejones commented May 6, 2023 • edited Loading

DannyDaemonic commented May 6, 2023 • edited Loading

ejones commented May 6, 2023

DannyDaemonic commented May 6, 2023

DannyDaemonic commented May 6, 2023

ejones commented May 7, 2023

DannyDaemonic commented May 7, 2023

ejones commented May 7, 2023

DannyDaemonic commented May 7, 2023 • edited Loading

ejones commented May 7, 2023

DannyDaemonic commented May 7, 2023 • edited Loading

DannyDaemonic commented May 7, 2023

ejones commented May 7, 2023

ejones commented May 8, 2023

DannyDaemonic left a comment

Choose a reason for hiding this comment

ggerganov commented May 8, 2023

ejones commented May 9, 2023

DannyDaemonic left a comment

Choose a reason for hiding this comment

ejones commented May 10, 2023

ejones commented May 10, 2023

nova706 commented Aug 14, 2023

ejones commented May 6, 2023 •

edited

Loading

DannyDaemonic commented May 6, 2023 •

edited

Loading

Prompt Cache: `--prompt-cache` perhaps?

Saved Sessions: `--session` as you originally intended

DannyDaemonic commented May 6, 2023 •

edited

Loading

DannyDaemonic commented May 6, 2023 •

edited

Loading

ejones commented May 6, 2023 •

edited

Loading

DannyDaemonic commented May 6, 2023 •

edited

Loading

DannyDaemonic commented May 7, 2023 •

edited

Loading

DannyDaemonic commented May 7, 2023 •

edited

Loading