
llama : require first token to be BOS #1303

Merged — ggerganov merged 5 commits into master from fix-eval-bos on May 8, 2023
Conversation

@ggerganov (Owner) commented May 3, 2023

This is likely necessary to make the generation more accurate.
We first noticed this when running the new OpenLLaMA models: generation completely fails if the first token is not BOS.

#1291 (comment)

Setting the first token in each chunk of the perplexity computation to be BOS drives the ppl values down slightly (~0.05 for 7B), which indicates that this is the right thing to do. Still, I will be happy if somebody with a better understanding chimes in and clarifies whether we really need to enforce the first token to be BOS.

Another interesting observation is that the vanilla LLaMA models seem "resilient" to not having a BOS.
This does not seem to be the case for OpenLLaMA. What is the difference that is causing this?

After merging this (or before), I will recompute all perplexity values for 7B and 13B LLaMA.

Another effect from this change is that generation after context swap should be better, since before this change, we were "losing" the BOS token when n_keep == 0 (i.e. default value).
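To illustrate the mechanism, here is a toy sketch (not the actual main.cpp code; `swap_context` is a hypothetical helper): on a context swap only the first n_keep tokens of the old context survive, plus roughly the last half of the remainder, so with n_keep == 0 the BOS at position 0 is discarded.

```cpp
#include <vector>

// Toy sketch of the context-swap behaviour described above. `swap_context`
// is a hypothetical helper, not the actual main.cpp logic: it keeps the
// first n_keep tokens and roughly the last half of the rest.
static std::vector<int> swap_context(const std::vector<int> & ctx_tokens, int n_keep) {
    const int n_past = (int) ctx_tokens.size();
    const int n_left = n_past - n_keep;           // tokens eligible to be dropped

    std::vector<int> kept(ctx_tokens.begin(), ctx_tokens.begin() + n_keep);
    kept.insert(kept.end(), ctx_tokens.end() - n_left / 2, ctx_tokens.end());
    return kept;                                  // with n_keep == 0, token 0 (the BOS) is gone
}
```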


Perplexity after the change

| Model | Measure      | F16    | Q4_0   | Q4_1   | Q4_2   | Q5_0   | Q5_1   | Q8_0   |
|-------|--------------|--------|--------|--------|--------|--------|--------|--------|
| 7B    | perplexity   | 5.9066 | 6.1620 | 6.0910 | 6.1466 | 5.9862 | 5.9481 | 5.9069 |
| 7B    | file size    | 13.0G  | 4.0G   | 4.8G   | 4.0G   | 4.4G   | 4.8G   | 7.1G   |
| 7B    | ms/tok @ 4th | 128    | 56     | 61     | 84     | 91     | 95     | 75     |
| 7B    | ms/tok @ 8th | 128    | 47     | 55     | 48     | 53     | 59     | 75     |
| 7B    | bits/weight  | 16.0   | 5.0    | 6.0    | 5.0    | 5.5    | 6.0    | 9.0    |
| 13B   | perplexity   | 5.2543 | 5.3863 | 5.3607 | 5.3513 | 5.2856 | 5.2706 | 5.2548 |
| 13B   | file size    | 25.0G  | 7.6G   | 9.1G   | 7.6G   | 8.4G   | 9.1G   | 14G    |
| 13B   | ms/tok @ 4th | 239    | 104    | 113    | 160    | 176    | 185    | 141    |
| 13B   | ms/tok @ 8th | 240    | 85     | 99     | 97     | 108    | 117    | 147    |
| 13B   | bits/weight  | 16.0   | 5.0    | 6.0    | 5.0    | 5.5    | 6.0    | 9.0    |

For reference - before the change

| Model | Measure      | F16    | Q4_0   | Q4_1   | Q4_2   | Q5_0   | Q5_1   | Q8_0   |
|-------|--------------|--------|--------|--------|--------|--------|--------|--------|
| 7B    | perplexity   | 5.9565 | 6.2103 | 6.1286 | 6.1698 | 6.0139 | 5.9934 | 5.9571 |
| 7B    | file size    | 13.0G  | 4.0G   | 4.8G   | 4.0G   | 4.4G   | 4.8G   | 7.1G   |
| 7B    | ms/tok @ 4th | 128    | 56     | 61     | 84     | 91     | 95     | 75     |
| 7B    | ms/tok @ 8th | 128    | 47     | 55     | 48     | 53     | 59     | 75     |
| 7B    | bits/weight  | 16.0   | 5.0    | 6.0    | 5.0    | 5.5    | 6.0    | 9.0    |
| 13B   | perplexity   | 5.2455 | 5.3748 | 5.3471 | 5.3433 | 5.2768 | 5.2582 | 5.2458 |
| 13B   | file size    | 25.0G  | 7.6G   | 9.1G   | 7.6G   | 8.4G   | 9.1G   | 14G    |
| 13B   | ms/tok @ 4th | 239    | 104    | 113    | 160    | 176    | 185    | 141    |
| 13B   | ms/tok @ 8th | 240    | 85     | 99     | 97     | 108    | 117    | 147    |
| 13B   | bits/weight  | 16.0   | 5.0    | 6.0    | 5.0    | 5.5    | 6.0    | 9.0    |

@ggerganov ggerganov requested a review from glinscott May 3, 2023 17:35
@ggerganov ggerganov added the "high priority" and "generation quality" labels May 3, 2023
@young-geng commented

Hi! I'm one of the creators of OpenLLaMA. We have realized that our model is indeed sensitive to the BOS token, and that many existing implementations do not prepend the BOS token at generation time. To make our model more compatible with existing implementations, we have released a new 300B-token checkpoint that is less sensitive to BOS tokens.

The review comment below refers to these changed lines in the perplexity computation:

```cpp
const int batch_size = std::min(end - batch_start, params.n_batch);

// TODO: not perfect since this can be in the middle of a word, but it is better than nothing
tokens[batch_start] = llama_token_bos();
```
@glinscott (Collaborator) commented:

I think this will break for small batch sizes, as we'll never predict the BOS token? And down below, in the perplexity calculation, we use this same tokens vector as the target for prediction.

Maybe we should pad out the batch by 1, i.e. include one less real token in each one? We'd then need to replace all uses of params.n_batch with something like bos_batch = params.n_batch - 1.
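A rough sketch of that idea, for illustration only (`make_bos_batches` and `bos_batch` are hypothetical names; the merged fix ended up doing something different, see below):

```cpp
#include <algorithm>
#include <vector>

#include "llama.h"

// Illustrative sketch only: split the tokenized text into batches of
// n_batch - 1 real tokens and prepend a BOS to each, so the BOS never
// overwrites a token that is later used as a prediction target.
static std::vector<std::vector<llama_token>> make_bos_batches(
        const std::vector<llama_token> & tokens, int n_batch) {
    const int bos_batch = std::max(1, n_batch - 1);  // leave room for the prepended BOS

    std::vector<std::vector<llama_token>> batches;
    for (size_t start = 0; start < tokens.size(); start += bos_batch) {
        const size_t n = std::min<size_t>(bos_batch, tokens.size() - start);

        std::vector<llama_token> batch;
        batch.reserve(n + 1);
        batch.push_back(llama_token_bos());  // BOS fills the slot freed by bos_batch
        batch.insert(batch.end(), tokens.begin() + start, tokens.begin() + start + n);

        batches.push_back(std::move(batch));
    }
    return batches;
}
```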

@ggerganov (Owner, Author) commented

@young-geng Thanks for the info

@glinscott Good point - will fix this

The new perplexity results are out:

  • for 7B the ppl goes down by ~0.04 across all quantization modes
  • for 13B the ppl increases slightly, by ~0.01

I think the decision of whether to always have a BOS token at the start of the context depends on how the training was done. If all sequences during training had a BOS token, then this is what the model has "seen" and we have to always make sure it is there. If not, then we can keep the original implementation as it is on master.

My guess is that during training, all sequences do have a BOS token at the start - just need someone to confirm this

@young-geng commented May 4, 2023

For OpenLLaMA, we always add BOS to the sequence during training. We believe this was also done for the official LLaMA, as adding BOS improves NLP benchmark scores by a few percent. I think always adding BOS at the beginning is the better choice.

@klosax (Contributor) commented May 4, 2023

I have done some small tests on where to use the BOS and EOS tokens.
The tests seem to indicate that they are used to separate text of different topics, maybe even different languages and writing styles:

Perplexity measurement tests:

  • If all the text is on the same topic, use [ BOS <text> ]
  • If the text contains two different topics, A and B, use [ BOS <text topic A> EOS BOS <text topic B> ]

Generation tests:

  • The prompt [ BOS <text about language models> ] will start generating text on the same topic.
  • The prompt [ BOS <text about language models> EOS BOS ] will start generating text on a different topic.
  • The prompt [ BOS <text about language models> BOS ] will start generating text on a different topic with llama-7b, but on the same topic with open_llama-7b.

This needs to be tested and verified on a large scale.

Maybe using a dataset where different topics are separated from each other with BOS/EOS would be the best overall solution for doing perplexity testing properly. Something like [ BOS <some text within the same topic> EOS ] for each batch / context window?
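For anyone who wants to try this, a minimal sketch of building such a [ BOS <topic A> EOS BOS <topic B> ] sequence (`build_topic_sequence` is a hypothetical helper; it assumes the ::llama_tokenize() wrapper from examples/common and the no-argument llama_token_bos()/llama_token_eos() API used elsewhere in this thread):

```cpp
#include <string>
#include <vector>

#include "common.h"
#include "llama.h"

// Hypothetical helper (not part of this PR): builds the kind of sequence
// described above, i.e. [ BOS <topic A> EOS BOS <topic B> ... ].
static std::vector<llama_token> build_topic_sequence(
        llama_context * ctx, const std::vector<std::string> & topics) {
    std::vector<llama_token> out;
    for (size_t i = 0; i < topics.size(); ++i) {
        if (i > 0) {
            out.push_back(llama_token_eos());  // close the previous topic
        }
        out.push_back(llama_token_bos());      // open a new topic
        // tokenize without letting the helper add its own BOS
        const auto toks = ::llama_tokenize(ctx, topics[i], false);
        out.insert(out.end(), toks.begin(), toks.end());
    }
    return out;
}
```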

@Green-Sky (Collaborator) commented May 4, 2023

> Hi! I'm one of the creators of OpenLLaMA. We have realized that our model is indeed sensitive to the BOS token, and that many existing implementations do not prepend the BOS token at generation time. To make our model more compatible with existing implementations, we have released a new 300B-token checkpoint that is less sensitive to BOS tokens.

Did a run, WITHOUT this PR, on the new 300bt checkpoint with q5_1 and the perplexity is WAY better: 11.3132.
Edit: f16 gives 11.2321.

@klosax (Contributor) commented May 5, 2023

The tokenizer always inserts one BOS token at the start of the sequence.
This PR overwrites the first token in each batch with a BOS token.

OpenLLaMA 300bt q5_1 perplexity on wiki.test.raw (ctx 512 / batchsize 512 = 1 batch per ctx)
without this pr (1 BOS token): 11.31323402
with this pr (616 ctx x 1 batches = 616 BOS tokens): 11.25677669

Tests using ctx 2048 and batchsize 256, that is 8 batches per ctx:
I also tested with modified code (one BOS per ctx) so that only the first batch in each context window was altered.

OpenLLaMA 300bt q5_1 perplexity on wiki.test.raw (ctx 2048 / batchsize 256 = 8 batches per ctx)
without this pr (1 BOS token): 9.83469015
with this pr (154 ctx x 8 batches = 1232 BOS tokens): 10.71918048
one BOS per ctx (154 ctx x 1 batches = 154 BOS tokens): 9.80628216

LLaMA 7B q5_1 perplexity on wiki.test.raw (ctx 2048 / batchsize 256 = 8 batches per ctx)
without this pr (1 BOS token): 5.30899093
with this pr (163 ctx x 8 batches = 1304 BOS tokens): 9.36514984
one BOS per ctx (163 x 1 batches = 163 BOS tokens): 5.31047705

This suggests that the BOS token should only be prepended to the first batch of each ctx chunk.

Insert BOS token at start of each ctx chunk

I modified the loading of tokens in perplexity.cpp (without this PR) to insert BOS tokens instead of overwriting existing tokens:

```cpp
// Tokenize input without prepending BOS token
auto tokens_in = ::llama_tokenize(ctx, params.prompt, false);

// Insert BOS token at start of each ctx chunk
std::vector<llama_token> tokens;
size_t k = 0;
while (k < tokens_in.size()) {
    tokens.push_back(llama_token_bos());

    size_t j = params.n_ctx - 1;

    if (k + j > tokens_in.size()) {
        j = tokens_in.size() - k;
    }

    for (size_t i = 0; i < j; i++) {
        tokens.push_back(tokens_in[k++]);
    }
}
```

Perplexity results on wiki.test.raw with the modified token loading:

ctx 512 / batchsize 512:

OpenLLaMA 300bt q5_1 : 11.3117 (w/o this pr 11.3132, w/ this pr 11.2568)

LLaMA 7B q5_1: 5.9379 (w/o this pr 5.9934, w/ this pr 5.9481)
LLaMA 7B q8_0 : 5.8977 (w/o this pr 5.9571, w/ this pr 5.9069)

LLaMA 13B q5_1 : 5.2467 (w/o this pr 5.2582, w/ this pr 5.2706)

ctx 2048 / batchsize 256:

OpenLLaMA 300bt q5_1 : 9.7981 (w/o this pr 9.8347, w/ this pr 10.7192)

LLaMA 7B q5_1: 5.2927 (w/o this pr 5.309, w/ this pr 9.3651)

This seems to be a way to do perplexity measurements properly.

@ggerganov (Owner, Author) commented

I think this should handle smaller batch sizes correctly now

@klosax (Contributor) commented May 7, 2023

> I think this should handle smaller batch sizes correctly now

I tested this and the save/restore of the original token does not seem to make any difference in ppl.
On LLaMA 7B q5_1 using ctx 512 / batch 512 the ppl is the same with or without saving the token: 5.9479

@glinscott (Collaborator) commented

Looks good, thanks! Interesting that it doesn't have any impact, though. @klosax, did you test with e.g. batch size 8? That should have had a significant impact with the previous implementation.

@klosax (Contributor) commented May 8, 2023

> Looks good, thanks! Interesting that it doesn't have any impact, though. @klosax, did you test with e.g. batch size 8? That should have had a significant impact with the previous implementation.

Yes indeed, the BOS token should only be added to the first batch of each ctx chunk, not to every batch. Saving/restoring the original token does not make any difference to the ppl, because perplexity is only computed over the second half of each ctx chunk, so a token in the first half is never used as a prediction target.

My testing also indicates that the ppl goes down even further if the BOS token is properly inserted instead of simply overwriting the original token.

@ggerganov (Owner, Author) commented

The initial implementation in this branch was indeed broken for smaller batch sizes because I was putting the BOS token in every batch of the chunk instead of only in the first batch. After the fix in 7f33230 it should be good, as reported by @klosax.
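For reference, a paraphrased sketch of the per-chunk loop after 7f33230 (not the literal diff; the variable names assume the surrounding perplexity.cpp code): the original token is saved, BOS is written only for the first batch of the chunk, and the token is restored after evaluation so it can still serve as a prediction target.

```cpp
for (int j = 0; j < num_batches; ++j) {
    const int batch_start = start + j * params.n_batch;
    const int batch_size  = std::min(end - batch_start, params.n_batch);

    // save the original token so it can be restored after evaluation
    const auto token_org = tokens[batch_start];

    // add the BOS token only for the first batch of each chunk
    if (j == 0) {
        tokens[batch_start] = llama_token_bos();
    }

    if (llama_eval(ctx, tokens.data() + batch_start, batch_size, j * params.n_batch, params.n_threads)) {
        fprintf(stderr, "failed to eval\n");
        return;
    }

    // restore the original token so it remains a valid prediction target
    tokens[batch_start] = token_org;
}
```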

@ggerganov ggerganov merged commit f9a6364 into master May 8, 2023
@ggerganov ggerganov deleted the fix-eval-bos branch May 8, 2023 14:41
@klosax klosax mentioned this pull request Aug 22, 2023