
llama : require first token to be BOS #1303

Merged — ggerganov merged 5 commits into master from fix-eval-bos on May 8, 2023
Conversation

@ggerganov (Owner) commented May 3, 2023

This is likely necessary to make the generation more accurate.
We first noticed this when running the new OpenLLaMA models: generation completely fails if the first token is not BOS.

#1291 (comment)

Setting the first token in each chunk of the perplexity computation to be BOS drives the ppl values down slightly (~0.05 for 7B), which indicates that this is the right thing to do. Still, I will be happy if somebody with a better understanding chimes in and clarifies whether we really need to enforce the first token to be BOS.

Another interesting observation is that the vanilla LLaMA models seem "resilient" to not having a BOS.
This does not seem to be the case for OpenLLaMA. What is the difference that is causing this?

After merging this (or before), I will recompute all perplexity values for 7B and 13B LLaMA.

Another effect from this change is that generation after context swap should be better, since before this change, we were "losing" the BOS token when n_keep == 0 (i.e. default value).
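To illustrate the mechanism, here is a toy sketch (not the actual main.cpp code; `swap_context` is a hypothetical helper): on a context swap only the first n_keep tokens of the old context survive, plus roughly the last half of the remainder, so with n_keep == 0 the BOS at position 0 is discarded.

```cpp
#include <vector>

// Toy sketch of the context-swap behaviour described above. `swap_context`
// is a hypothetical helper, not the actual main.cpp logic: it keeps the
// first n_keep tokens and roughly the last half of the rest.
static std::vector<int> swap_context(const std::vector<int> & ctx_tokens, int n_keep) {
    const int n_past = (int) ctx_tokens.size();
    const int n_left = n_past - n_keep;           // tokens eligible to be dropped

    std::vector<int> kept(ctx_tokens.begin(), ctx_tokens.begin() + n_keep);
    kept.insert(kept.end(), ctx_tokens.end() - n_left / 2, ctx_tokens.end());
    return kept;                                  // with n_keep == 0, token 0 (the BOS) is gone
}
```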


Perplexity after the change

| Model | Measure      | F16    | Q4_0   | Q4_1   | Q4_2   | Q5_0   | Q5_1   | Q8_0   |
|-------|--------------|--------|--------|--------|--------|--------|--------|--------|
| 7B    | perplexity   | 5.9066 | 6.1620 | 6.0910 | 6.1466 | 5.9862 | 5.9481 | 5.9069 |
| 7B    | file size    | 13.0G  | 4.0G   | 4.8G   | 4.0G   | 4.4G   | 4.8G   | 7.1G   |
| 7B    | ms/tok @ 4th | 128    | 56     | 61     | 84     | 91     | 95     | 75     |
| 7B    | ms/tok @ 8th | 128    | 47     | 55     | 48     | 53     | 59     | 75     |
| 7B    | bits/weight  | 16.0   | 5.0    | 6.0    | 5.0    | 5.5    | 6.0    | 9.0    |
| 13B   | perplexity   | 5.2543 | 5.3863 | 5.3607 | 5.3513 | 5.2856 | 5.2706 | 5.2548 |
| 13B   | file size    | 25.0G  | 7.6G   | 9.1G   | 7.6G   | 8.4G   | 9.1G   | 14G    |
| 13B   | ms/tok @ 4th | 239    | 104    | 113    | 160    | 176    | 185    | 141    |
| 13B   | ms/tok @ 8th | 240    | 85     | 99     | 97     | 108    | 117    | 147    |
| 13B   | bits/weight  | 16.0   | 5.0    | 6.0    | 5.0    | 5.5    | 6.0    | 9.0    |

For reference - before the change

| Model | Measure      | F16    | Q4_0   | Q4_1   | Q4_2   | Q5_0   | Q5_1   | Q8_0   |
|-------|--------------|--------|--------|--------|--------|--------|--------|--------|
| 7B    | perplexity   | 5.9565 | 6.2103 | 6.1286 | 6.1698 | 6.0139 | 5.9934 | 5.9571 |
| 7B    | file size    | 13.0G  | 4.0G   | 4.8G   | 4.0G   | 4.4G   | 4.8G   | 7.1G   |
| 7B    | ms/tok @ 4th | 128    | 56     | 61     | 84     | 91     | 95     | 75     |
| 7B    | ms/tok @ 8th | 128    | 47     | 55     | 48     | 53     | 59     | 75     |
| 7B    | bits/weight  | 16.0   | 5.0    | 6.0    | 5.0    | 5.5    | 6.0    | 9.0    |
| 13B   | perplexity   | 5.2455 | 5.3748 | 5.3471 | 5.3433 | 5.2768 | 5.2582 | 5.2458 |
| 13B   | file size    | 25.0G  | 7.6G   | 9.1G   | 7.6G   | 8.4G   | 9.1G   | 14G    |
| 13B   | ms/tok @ 4th | 239    | 104    | 113    | 160    | 176    | 185    | 141    |
| 13B   | ms/tok @ 8th | 240    | 85     | 99     | 97     | 108    | 117    | 147    |
| 13B   | bits/weight  | 16.0   | 5.0    | 6.0    | 5.0    | 5.5    | 6.0    | 9.0    |

@ggerganov ggerganov requested a review from glinscott May 3, 2023 17:35
@ggerganov ggerganov added the "high priority" and "generation quality" labels May 3, 2023
@young-geng commented

Hi! I'm one of the creators of OpenLLaMA. We have realized that our model is indeed sensitive to the BOS token, and that many existing implementations do not prepend the BOS token at generation time. To make our model more compatible with existing implementations, we have released a new 300B-token checkpoint that is less sensitive to BOS tokens.

The review comment below refers to these changed lines in the perplexity computation:

```cpp
const int batch_size = std::min(end - batch_start, params.n_batch);

// TODO: not perfect since this can be in the middle of a word, but it is better than nothing
tokens[batch_start] = llama_token_bos();
```
@glinscott (Collaborator) commented:

I think this will break for small batch sizes, as we'll never predict the BOS token? And down below, in the perplexity calculation, we use this same tokens vector as the target for prediction.

Maybe we should pad out the batch by 1, i.e. include one less real token in each one? We'd then need to replace all uses of params.n_batch with something like bos_batch = params.n_batch - 1.
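A rough sketch of that idea, for illustration only (`make_bos_batches` and `bos_batch` are hypothetical names; the merged fix ended up doing something different, see below):

```cpp
#include <algorithm>
#include <vector>

#include "llama.h"

// Illustrative sketch only: split the tokenized text into batches of
// n_batch - 1 real tokens and prepend a BOS to each, so the BOS never
// overwrites a token that is later used as a prediction target.
static std::vector<std::vector<llama_token>> make_bos_batches(
        const std::vector<llama_token> & tokens, int n_batch) {
    const int bos_batch = std::max(1, n_batch - 1);  // leave room for the prepended BOS

    std::vector<std::vector<llama_token>> batches;
    for (size_t start = 0; start < tokens.size(); start += bos_batch) {
        const size_t n = std::min<size_t>(bos_batch, tokens.size() - start);

        std::vector<llama_token> batch;
        batch.reserve(n + 1);
        batch.push_back(llama_token_bos());  // BOS fills the slot freed by bos_batch
        batch.insert(batch.end(), tokens.begin() + start, tokens.begin() + start + n);

        batches.push_back(std::move(batch));
    }
    return batches;
}
```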

@ggerganov (Owner, Author) commented

@young-geng Thanks for the info

@glinscott Good point - will fix this

The new perplexity results are out:

  • for 7B the ppl goes down by ~0.04 across all quantization modes
  • for 13B the ppl increases slightly, by ~0.01

I think the decision of whether to always have a BOS token at the start of the context depends on how the training was done. If all sequences during training had a BOS token, then this is what the model has "seen" and we have to always make sure it is there. If not, then we can keep the original implementation as it is on master.

My guess is that during training, all sequences do have a BOS token at the start - just need someone to confirm this

@young-geng commented May 4, 2023

For OpenLLaMA, we always add BOS to the sequence during training. We believe this was also done for the official LLaMA, as adding BOS improves NLP benchmark scores by a few percent. I think always adding BOS at the beginning is the better choice.

@klosax (Contributor) commented May 4, 2023

I have done some small tests on where to use the BOS and EOS tokens.
The tests seem to indicate that they are used to separate text of different topics, maybe even different languages and writing styles:

Perplexity measurement tests:

  • If all the text is on the same topic, use [ BOS <text> ]
  • If the text contains two different topics, A and B, use [ BOS <text topic A> EOS BOS <text topic B> ]

Generation tests:

  • The prompt [ BOS <text about language models> ] will start generating text on the same topic.
  • The prompt [ BOS <text about language models> EOS BOS ] will start generating text on a different topic.
  • The prompt [ BOS <text about language models> BOS ] will start generating text on a different topic with llama-7b, but on the same topic with open_llama-7b.

This needs to be tested and verified on a large scale.

Maybe using a dataset where different topics are separated from each other with BOS/EOS would be the best overall solution for doing perplexity testing properly. Something like [ BOS <some text within the same topic> EOS ] for each batch / context window?
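For anyone who wants to try this, a minimal sketch of building such a [ BOS <topic A> EOS BOS <topic B> ] sequence (`build_topic_sequence` is a hypothetical helper; it assumes the ::llama_tokenize() wrapper from examples/common and the no-argument llama_token_bos()/llama_token_eos() API used elsewhere in this thread):

```cpp
#include <string>
#include <vector>

#include "common.h"
#include "llama.h"

// Hypothetical helper (not part of this PR): builds the kind of sequence
// described above, i.e. [ BOS <topic A> EOS BOS <topic B> ... ].
static std::vector<llama_token> build_topic_sequence(
        llama_context * ctx, const std::vector<std::string> & topics) {
    std::vector<llama_token> out;
    for (size_t i = 0; i < topics.size(); ++i) {
        if (i > 0) {
            out.push_back(llama_token_eos());  // close the previous topic
        }
        out.push_back(llama_token_bos());      // open a new topic
        // tokenize without letting the helper add its own BOS
        const auto toks = ::llama_tokenize(ctx, topics[i], false);
        out.insert(out.end(), toks.begin(), toks.end());
    }
    return out;
}
```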

@Green-Sky (Collaborator) commented May 4, 2023

> Hi! I'm one of the creators of OpenLLaMA. We have realized that our model is indeed sensitive to the BOS token, and that many existing implementations do not prepend the BOS token at generation time. To make our model more compatible with existing implementations, we have released a new 300B-token checkpoint that is less sensitive to BOS tokens.

Did a run, WITHOUT this PR, on the new 300bt checkpoint with q5_1 and the perplexity is WAY better: 11.3132.
Edit: f16 gives 11.2321.

@klosax (Contributor) commented May 5, 2023

The tokenizer always inserts one BOS token at the start of the sequence.
This PR overwrites the first token in each batch with a BOS token.

OpenLLaMA 300bt q5_1 perplexity on wiki.test.raw (ctx 512 / batchsize 512 = 1 batch per ctx)
without this pr (1 BOS token): 11.31323402
with this pr (616 ctx x 1 batches = 616 BOS tokens): 11.25677669

Tests using ctx 2048 and batchsize 256, that is 8 batches per ctx:
I also tested with modified code (one BOS per ctx) so that only the first batch in each context window was altered.

OpenLLaMA 300bt q5_1 perplexity on wiki.test.raw (ctx 2048 / batchsize 256 = 8 batches per ctx)
without this pr (1 BOS token): 9.83469015
with this pr (154 ctx x 8 batches = 1232 BOS tokens): 10.71918048
one BOS per ctx (154 ctx x 1 batches = 154 BOS tokens): 9.80628216

LLaMA 7B q5_1 perplexity on wiki.test.raw (ctx 2048 / batchsize 256 = 8 batches per ctx)
without this pr (1 BOS token): 5.30899093
with this pr (163 ctx x 8 batches = 1304 BOS tokens): 9.36514984
one BOS per ctx (163 x 1 batches = 163 BOS tokens): 5.31047705

This suggests that the BOS token should only be prepended to the first batch of each ctx chunk.

Insert BOS token at start of each ctx chunk

I modified the loading of tokens in perplexity.cpp (without this PR) to insert BOS tokens instead of overwriting existing tokens:

```cpp
// Tokenize input without prepending BOS token
auto tokens_in = ::llama_tokenize(ctx, params.prompt, false);

// Insert BOS token at start of each ctx chunk
std::vector<llama_token> tokens;
size_t k = 0;
while (k < tokens_in.size()) {
    tokens.push_back(llama_token_bos());

    size_t j = params.n_ctx - 1;

    if (k + j > tokens_in.size()) {
        j = tokens_in.size() - k;
    }

    for (size_t i = 0; i < j; i++) {
        tokens.push_back(tokens_in[k++]);
    }
}
```

Perplexity results on wiki.test.raw with the modified token loading:

ctx 512 / batchsize 512:

OpenLLaMA 300bt q5_1 : 11.3117 (w/o this pr 11.3132, w/ this pr 11.2568)

LLaMA 7B q5_1: 5.9379 (w/o this pr 5.9934, w/ this pr 5.9481)
LLaMA 7B q8_0 : 5.8977 (w/o this pr 5.9571, w/ this pr 5.9069)

LLaMA 13B q5_1 : 5.2467 (w/o this pr 5.2582, w/ this pr 5.2706)

ctx 2048 / batchsize 256:

OpenLLaMA 300bt q5_1 : 9.7981 (w/o this pr 9.8347, w/ this pr 10.7192)

LLaMA 7B q5_1: 5.2927 (w/o this pr 5.309, w/ this pr 9.3651)

This seems to be a way to do perplexity measurements properly.

@ggerganov (Owner, Author) commented

I think this should handle smaller batch sizes correctly now

@klosax (Contributor) commented May 7, 2023

> I think this should handle smaller batch sizes correctly now

I tested this and the save/restore of the original token does not seem to make any difference in ppl.
On LLaMA 7B q5_1 using ctx 512 / batch 512 the ppl is the same with or without saving the token: 5.9479

@glinscott (Collaborator) commented

Looks good, thanks! Interesting that it doesn't have any impact, though. @klosax, did you test with e.g. batch size 8? That should have had a significant impact with the previous implementation.

@klosax (Contributor) commented May 8, 2023

> Looks good, thanks! Interesting that it doesn't have any impact, though. @klosax, did you test with e.g. batch size 8? That should have had a significant impact with the previous implementation.

Yes indeed, the BOS token should only be added to the first batch of each ctx chunk, not to every batch. Saving/restoring the original token does not make any difference to the ppl, because perplexity is only computed over the second half of each ctx chunk, so a token in the first half is never used as a prediction target.

My testing also indicates that the ppl goes down even further if the BOS token is properly inserted instead of simply overwriting the original token.

@ggerganov (Owner, Author) commented

The initial implementation in this branch was indeed broken for smaller batch sizes because I was putting the BOS token in every batch of the chunk instead of only in the first batch. After the fix in 7f33230 it should be good, as reported by @klosax.
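For reference, a paraphrased sketch of the per-chunk loop after 7f33230 (not the literal diff; the variable names assume the surrounding perplexity.cpp code): the original token is saved, BOS is written only for the first batch of the chunk, and the token is restored after evaluation so it can still serve as a prediction target.

```cpp
for (int j = 0; j < num_batches; ++j) {
    const int batch_start = start + j * params.n_batch;
    const int batch_size  = std::min(end - batch_start, params.n_batch);

    // save the original token so it can be restored after evaluation
    const auto token_org = tokens[batch_start];

    // add the BOS token only for the first batch of each chunk
    if (j == 0) {
        tokens[batch_start] = llama_token_bos();
    }

    if (llama_eval(ctx, tokens.data() + batch_start, batch_size, j * params.n_batch, params.n_threads)) {
        fprintf(stderr, "failed to eval\n");
        return;
    }

    // restore the original token so it remains a valid prediction target
    tokens[batch_start] = token_org;
}
```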

@ggerganov ggerganov merged commit f9a6364 into master May 8, 2023
@ggerganov ggerganov deleted the fix-eval-bos branch May 8, 2023 14:41
@klosax klosax mentioned this pull request Aug 22, 2023