-
Notifications
You must be signed in to change notification settings - Fork 10.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix mirostat state when using multiple sequences #3543
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, dedicated sampling state with grammar and mirostat would be better. Maybe implemented in common/sampling.h/.cpp
. It should probably inherit all sampling-related parameters gpt_params
, such as temperature, top_p, top_k, etc, so that llama_sample_token
accepts struct llama_sampling_state
instead of struct gpt_params
.
For now we can have this workaround
Do you actually prefer doing it this way for now? I don't mind changing this to do it the other way I suggested as long as you agree that approach is okay.
This actually raises another question that I'm actually dealing with in my seqrep sampler. Right now it's really awkward to have multiple source files in common. This is how I dealt with it: COMMON_DEPS = common/common.cpp common/common.h build-info.h common/log.h
COMMON_OBJS = common.o
ifndef LLAMA_DISABLE_SEQREP_SAMPLER
COMMON_DEPS += common/seqrep-sampler.cpp common/seqrep-sampler.h
COMMON_OBJS += seqrep-sampler.o
endif
common.o: $(COMMON_DEPS)
$(CXX) $(CXXFLAGS) -c $< -o $@
simple: examples/simple/simple.cpp build-info.h ggml.o llama.o $(COMMON_OBJS) $(OBJS)
$(CXX) $(CXXFLAGS) $(filter-out %.h,$^) -o $@ $(LDFLAGS) and so on. It's probably less of a pain in cmake. |
We should try separating the sampling state - would be better than the current fix, so let's give it a try if you are up to. The proposed Makefile looks OK to me. |
When this commit is merged into the master branch, we need to add the parameter |
Well, this went from small and self-contained to huge and complicated. I sure hope I'm on the right track after all this. Pretty much all the sampling stuff got moved into The sampling state ( If the per-sequence state doesn't exist when If you want to use separate grammar, or separate grammar states per sequence then you'd probably have to manually manage the grammar part yourself. (Not sure this is so easy right now, so I might need to add an interface.) In the case of parallel generation, when a sequence ends and you will want to reuse it (or just free up memory) you need to call I also randomly threw in support for I think this currently doesn't break stuff, but there were some tricky parts like |
b598b76
to
e46edaf
Compare
@FSSRepo You're going to hate me when you see the next step in this pull. Calling sampling is going to look like: llama_sample_token(ctx, NULL, sampling_state, slot.tokens_prev, candidates, slot.i_batch, slot.id); It looks like #3490 doesn't support grammar currently? So that's going to make your life easier. Pretty much the only other thing to worry about is calling |
e46edaf
to
52def09
Compare
Code formatting cleanups and add some comments Silence a warning about id not being used when logging is disabled
I exported the function to fetch/create default instances of sampler state. This should fix the problem I mentioned earlier about how it would be hard to do something like parallel generation where each sequence used its own grammar. By the way, since the
Not sure if that's anything to worry about. The |
This isn't really approved, right? Even without a full review, any changes I can/should start working on? |
const std::vector<llama_token> & last_tokens, | ||
std::vector<llama_token_data> & candidates, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we absorb last_tokens
and candidates
into llama_sampling_state
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Last tokens can be specific to the sequence, right? So this would kind of mean stuff using last tokens would have to be aware of sequences. I kind of feel like this also might limit how people can manipulate last tokens and if there are currently examples that do that kind of thing it might be difficult to adapt them (for me anyway, since I'm not really deeply familiar with most of them).
candidates
I'm less sure about, it's basically just a scratch area for the logits in a form samplers can work with (right?) so I think moving it in there is less of a big deal. It's a pretty large structure though, I don't know if that's a consideration. Right now the current stuff in those structs is pretty lightweight.
I don't have very strong feelings about this. I'd like to say "These changes are already complicated enough, let's come back to that" but... I probably never would. :)
common/sampling.cpp
Outdated
if (seq_state.grammar != NULL) { | ||
llama_grammar_accept_token(ctx, seq_state.grammar, id); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably we should add llama_sampling_accept_token()
and move this call in there together with update of last_tokens
member (if we decide it should become part of llama_sampling_state
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be a bit of a pain since it depends on a number of other static functions like decode_utf8
, llama_grammar_accept
, llama_token_to_str
. It looks like the grammar stuff is the only thing that uses them, so maybe they could be moved too. I'm not sure what other parts of the grammar code depends on them though, so it might not be that simple.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great change. I actually think that we should merge llama_sampling
directly into llama.cpp
. But let's do this after this PR is merged and tested for some time
Not sure if that's anything to worry about.
These errors look benign, but we will look in ways to fix them anyway
Fix comments that were out of sync with the pull.
Current status: I took the suggestions to rename the functions/types but I didn't do stuff like moving |
common/sampling.cpp
Outdated
llama_token llama_sampling_sample( | ||
struct llama_context * ctx, | ||
struct llama_context * ctx_guidance, | ||
struct llama_sampling_context & sampling_ctx, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
struct llama_sampling_context & sampling_ctx, | |
struct llama_sampling_context & ctx_sampling, |
There are a few other places that need similar change for consistency sake
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By "a few" you mean 20 or so? :) Hopefully I caught them all. Everything seems to compile/work still.
Thanks for this - I accidentally merged this too quickly with the old title. Should have updated to the more relevant change of introducing the |
|
||
// map of sequence ids to sampler contexts | ||
std::unordered_map<llama_seq_id, llama_sampler_sequence_context> sequence_contexts; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason not to have single sequence data in llama_sampling_context
?
When we want to sample multiple sequences, we will create on llama_sampling_context
for each.
This way, each sequence can also have a separate llama_grammar
instance which seems to make sense.
I.e. have it like this:
// general sampler context
typedef struct llama_sampling_context {
~llama_sampling_context();
// parameters that will be used for sampling
llama_sampling_params params;
float mirostat_mu; // mirostat sampler state
llama_grammar * grammar;
} llama_sampling_context;
…example * 'master' of github.com:ggerganov/llama.cpp: (34 commits) examples: support LLaVA v1.5 (multimodal model) (ggerganov#3436) docs : fix typo GOMP_CPU_AFFINITY (ggerganov#3597) cmake : fix add_compile_options on macOS typo : it is `--n-gpu-layers` not `--gpu-layers` (ggerganov#3592) ci : check if there is enough VRAM (ggerganov#3596) server : add completion mode (no chat) (ggerganov#3582) prompts : add mnemonics.txt server : fix kv cache management (ggerganov#3588) main : fix session loading bug (ggerganov#3400) server : add parameter -tb N, --threads-batch N (ggerganov#3584) common : fix mirostat state when using multiple sequences (ggerganov#3543) batched : add bench tool (ggerganov#3545) examples : add batched.swift + improve CI for swift (ggerganov#3562) Add MPT model to supported models in README.md (ggerganov#3574) Minor improvements in GPT2 tokenizer (ggerganov#3567) readme : add bloom (ggerganov#3570) llm : add bloom models (ggerganov#3553) swift : improvements and fixes (ggerganov#3564) llm : add MPT support (ggerganov#3417) infill. : fix tokenization (ggerganov#3508) ...
As mentioned in #3537, mirostat currently isn't compatible with using multiple sequences.
The main selling point of the way this pull is implemented is such a way that it's pretty simple and uninvasive.
However, I really don't like storing mutable sampler state in
gpt_params
(even though it's only in common and not the mainllama.cpp
API). This also required removing theconst
from theparams
argument tollama_sample_token
. As far as I can see, the existing examples don't care about that.I feel like the right way to do this is probably to move the sampler state out of
gpt_params
and have it passed separately. In that case, this is probably also where grammar should be since it's a type of sampler state. So we wouldn't add a new argument tollama_sample_token
, we'd replace the current grammar one with sampler state. This of course would require changing a lot more stuff, including all the examples that usellama_sample_token
(I don't think it would be too bad though).Thoughts?
Closes #3537