llama : require first token to be BOS #1303
Conversation
Hi! I'm one of the creators of OpenLLaMA. We have realized that our model is indeed sensitive to the BOS token and many existing implementations do not prepend the BOS token at generation time. To make our model more compatible with existing implementations, we have released our new 300b token checkpoint that is less sensitive to BOS tokens.
examples/perplexity/perplexity.cpp
Outdated
const int batch_size = std::min(end - batch_start, params.n_batch);

// TODO: not perfect since this can be in the middle of a word, but it is better than nothing
tokens[batch_start] = llama_token_bos();
I think this will break for small batch sizes - as we'll never predict the BOS token? And down below, in the perplexity calculation, we use this same tokens vector as the target for prediction.
Maybe we should pad out the batch by 1, i.e. include one less token within each one. We'd then need to replace all uses of params.n_batch with bos_batch, or something like that, where bos_batch = params.n_batch - 1.
@young-geng Thanks for the info

@glinscott Good point - will fix this

The new perplexity results are out:
I think the decision of whether to always have a BOS token comes down to how the models were trained. My guess is that during training, all sequences do have a BOS token at the start.
For OpenLLaMA, we always add BOS to the sequence during training. We believe that they also did this for the official LLaMA, as adding BOS improves a few percent on NLP benchmarks. I think always adding BOS at the beginning is the better choice.
I have done some small testing on where to use the BOS and EOS tokens.
Perplexity measurement tests: if all text is of the same topic, use …
Generation tests: the prompt … The prompt …
This needs to be tested and verified on a larger scale. Maybe using a dataset where different topics are separated from other topics using BOS / EOS would be the best overall solution for doing perplexity testing properly. Something like …
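To make the separation idea concrete, here is a minimal sketch of building one evaluation stream where each document or topic is wrapped in BOS ... EOS. The token ids and the function name are placeholders for illustration only (in llama.cpp the real ids would come from llama_token_bos() / llama_token_eos()); this is one possible reading of the suggestion, not code from the PR.

```cpp
#include <vector>

using llama_token = int;

// Placeholder ids; in llama.cpp these would come from
// llama_token_bos() / llama_token_eos().
static const llama_token TOKEN_BOS = 1;
static const llama_token TOKEN_EOS = 2;

// Sketch: concatenate already-tokenized documents into one stream,
// wrapping each document in BOS ... EOS so different topics are
// explicitly separated for perplexity testing.
std::vector<llama_token> join_documents(const std::vector<std::vector<llama_token>> & docs) {
    std::vector<llama_token> stream;
    for (const auto & doc : docs) {
        stream.push_back(TOKEN_BOS);
        stream.insert(stream.end(), doc.begin(), doc.end());
        stream.push_back(TOKEN_EOS);
    }
    return stream;
}
```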
Did a run, WITHOUT this pr, on the new 300b token checkpoint with q5_1 and the perplexity is WAY better: 11.3132
The tokenizer always inserts one BOS at the start of the sequence.

OpenLLaMA 300bt q5_1 perplexity on wiki.test.raw (ctx 512 / batchsize 512 = 1 batch per ctx): …

Tests using ctx 2048 and batchsize 256, that is 8 batches per ctx:
OpenLLaMA 300bt q5_1 perplexity on wiki.test.raw (ctx 2048 / batchsize 256 = 8 batches per ctx): …
LLaMA 7B q5_1 perplexity on wiki.test.raw (ctx 2048 / batchsize 256 = 8 batches per ctx): …

This suggests that the BOS token should only be prepended to the first batch of each ctx chunk.

Insert BOS token at start of each ctx chunk
I modified the loading of tokens in perplexity.cpp (without this pr) to insert BOS tokens instead of overwriting the tokens:
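The code block from the original comment is not preserved in this copy of the thread. As a rough, self-contained sketch of the described idea (insert an extra BOS at the start of every ctx-sized chunk while loading tokens, instead of overwriting the token already there), with a placeholder BOS id rather than the real llama.cpp API:

```cpp
#include <cstddef>
#include <vector>

using llama_token = int;

static const llama_token TOKEN_BOS = 1;   // placeholder for llama_token_bos()

// Sketch: after tokenizing the whole text, build a new token stream where
// every n_ctx-sized chunk starts with an inserted BOS, so no original token
// is overwritten or lost.
std::vector<llama_token> insert_bos_per_chunk(const std::vector<llama_token> & tokens,
                                              int n_ctx) {
    std::vector<llama_token> out;
    out.reserve(tokens.size() + tokens.size() / (n_ctx - 1) + 1);

    for (std::size_t i = 0; i < tokens.size(); ++i) {
        if (out.size() % static_cast<std::size_t>(n_ctx) == 0) {
            out.push_back(TOKEN_BOS);     // start every chunk with BOS
        }
        out.push_back(tokens[i]);
    }
    return out;
}
```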
Perplexity results on wiki.test.raw with the modified token loading:

ctx 512 / batchsize 512:
OpenLLaMA 300bt q5_1: 11.3117 (w/o this pr 11.3132, w/ this pr 11.2568)
LLaMA 7B q5_1: 5.9379 (w/o this pr 5.9934, w/ this pr 5.9481)
LLaMA 13B q5_1: 5.2467 (w/o this pr 5.2582, w/ this pr 5.2706)

ctx 2048 / batchsize 256:
OpenLLaMA 300bt q5_1: 9.7981 (w/o this pr 9.8347, w/ this pr 10.7192)
LLaMA 7B q5_1: 5.2927 (w/o this pr 5.309, w/ this pr 9.3651)

This seems to be a way to do perplexity measurements properly.
I think this should handle smaller batch sizes correctly now
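For illustration, a self-contained sketch of what handling smaller batch sizes correctly could look like: BOS is only forced on the first batch of each ctx chunk, and the original token is saved and restored so the same tokens vector stays valid as the prediction target. Placeholder types and a placeholder eval call stand in for the real llama.cpp API; this is a sketch, not the exact PR code.

```cpp
#include <algorithm>
#include <vector>

using llama_token = int;

static const llama_token TOKEN_BOS = 1;                           // placeholder BOS id
static void eval_batch(const llama_token *, int) { /* ... */ }    // placeholder for llama_eval

// Sketch: evaluate one ctx chunk [start, end) in batches of n_batch tokens.
// Only the first batch of the chunk gets BOS forced at its first position,
// and the original token is restored afterwards.
void eval_chunk(std::vector<llama_token> & tokens, int start, int end, int n_batch) {
    for (int batch_start = start; batch_start < end; batch_start += n_batch) {
        const int batch_size = std::min(end - batch_start, n_batch);

        const llama_token token_org = tokens[batch_start];         // save original token
        if (batch_start == start) {
            tokens[batch_start] = TOKEN_BOS;                       // BOS only for the first batch
        }

        eval_batch(tokens.data() + batch_start, batch_size);

        tokens[batch_start] = token_org;                           // restore original token
    }
}
```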
I tested this and the save/restore of the original token does not seem to make any difference in ppl.
Looks good, thanks! And interesting that it doesn't have any impact. @klosax did you test with e.g. batch size 8? That should have had a significant impact with the previous implementation.
Yes indeed, the BOS token should only be added to the first batch of each ctx chunk, not to every batch. Saving/restoring the original token does not make any difference to the ppl, as the softmax does not use the first half of the ctx chunk tokens. My testing also indicates that the ppl goes down even further if the BOS token is properly inserted instead of simply overwriting the original tokens.
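A short sketch of why the save/restore cannot change the measured ppl, under the assumption stated above that only the second half of each ctx chunk is scored: the first position of the chunk is never used as a prediction target, so whether it holds BOS or the original token does not enter the negative log-likelihood. prob_of() is a placeholder, not the real softmax over logits.

```cpp
#include <cmath>
#include <vector>

using llama_token = int;

// Placeholder for "probability assigned to the target token by the softmax
// over the logits at position j".
static double prob_of(llama_token /*target*/, int /*j*/) { return 0.5; }

// Sketch: only positions in the second half of the chunk contribute targets,
// so tokens[start] (the position that may have been overwritten with BOS)
// never appears as a target.
double chunk_avg_nll(const std::vector<llama_token> & tokens, int start, int n_ctx) {
    double nll = 0.0;
    int count = 0;
    for (int j = n_ctx / 2; j < n_ctx - 1; ++j) {
        nll += -std::log(prob_of(tokens[start + j + 1], j));
        ++count;
    }
    return nll / count;   // perplexity would be exp() of the running average
}
```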
This is likely necessary to make the generation more accurate.
We first noticed this with running the new OpenLLaMA models. The generation completely fails if the first token is not BOS. #1291 (comment)
Setting the first token in each chunk of the perplexity computation to be BOS drives down the ppl values slightly (~0.05 for 7B), which indicates that this is the right thing to do. Still, will be happy if somebody with better understanding chimes in and clarifies if we do need to enforce the first token to be BOS.
Another interesting observation is that the vanilla LLaMA models seem "resilient" to not having a BOS. This seems to not be the case for OpenLLaMA. What is the difference that is causing this?
After merging this (or before), will recompute all perplexity values for 7B and 13B LLaMA.
Another effect from this change is that generation after context swap should be better, since before this change, we were "losing" the BOS token when n_keep == 0 (i.e. default value).
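As an illustration of the context-swap point, a minimal sketch (not the actual main.cpp code) of the usual swap: keep the first n_keep tokens of the history and re-feed half of the rest. With n_keep == 0, position 0 (the BOS token) is among the tokens that get dropped, which is the "losing" effect described here.

```cpp
#include <vector>

using llama_token = int;

// Sketch of a context swap: keep the first n_keep tokens of the full history
// and re-feed the last half of whatever remains. With n_keep == 0 nothing is
// kept, so the BOS token at position 0 is dropped from the new context.
std::vector<llama_token> swap_context(const std::vector<llama_token> & history, int n_keep) {
    std::vector<llama_token> next(history.begin(), history.begin() + n_keep);

    const int n_left = static_cast<int>(history.size()) - n_keep;
    next.insert(next.end(), history.end() - n_left / 2, history.end());
    return next;
}
```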
Perplexity after the change
For reference - before the change