🚨🚨🚨 [SPM] Finish fix spm models 🚨🚨🚨 #25224
Conversation
The previous test values were not very good; with this update they make more sense.
```diff
@@ -534,15 +555,19 @@ def test_remove_extra_whitespaces(self):
     input_ids = self.tokenizer.encode(" . Hello")
     self.assertEqual(input_ids, [7, 4, 156, 86, 20])
     sp_encode = self.tokenizer.sp_model.encode(" . Hello")
-    self.assertEqual(input_ids, sp_encode)
+    self.assertEqual(input_ids, [7] + sp_encode)
```
Manually add the `_` (spiece underline).
```python
text = self.unk_token + text
tokens = self.sp_model.encode(text, out_type=str)
return tokens[self.unk_token_length :]
```
That's the Hack:

- all spm models have an `unk_token`; whether or not it is in the sentencepiece vocab does not matter.
- we need to do this because, since `add_dummy_prefix = False`, the sentencepiece model always ALWAYS strips any `SPIECE_UNDERLINE`. So `sp_model.encode(SPIECE_UNDERLINE + "Hello", out_type=str)` will give `[Hel, llo]` instead of `[_Hel, llo]`.
- previously, we removed the added extra space. This is okay, but fails for words that should be split, like `inform`. What happened before was that we would tokenize as `_inform`, then remove `_`, and we would have `inform`. But the actual tokenization of `inform` is `in`, `form`, and `inform` is not part of the vocab!
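A minimal sketch of that stripping behaviour (the model path, the `<unk>` string and the printed pieces are assumptions for illustration, not taken from this PR):

```python
import sentencepiece as spm

SPIECE_UNDERLINE = "▁"

# Placeholder path: a Llama/T5-style sentencepiece model whose normalizer has
# add_dummy_prefix=False, as this PR configures it.
sp_model = spm.SentencePieceProcessor(model_file="tokenizer.model")

# The leading SPIECE_UNDERLINE is stripped, so the prefix-space information is lost.
print(sp_model.encode(SPIECE_UNDERLINE + "Hello", out_type=str))

# Prepending a token that is in the vocab protects the underline; the pieces
# produced for that prefix token are then dropped, as in the diff above.
unk = "<unk>"
unk_len = len(sp_model.encode(unk, out_type=str))
print(sp_model.encode(unk + SPIECE_UNDERLINE + "Hello", out_type=str)[unk_len:])
```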
The documentation is not available anymore as the PR was closed or merged.
This makes sense to me. As usual with breaking changes, can you put three 🚨 in the title and show in the description how to enable the past behavior for users who want it?
Changed the title from [SPM] Finish fix spm models to 🚨🚨🚨 [SPM] Finish fix spm models 🚨🚨🚨
LGTM! Thanks for working on this v. tricky problem 🤗
Will fix the prefixing of special tokens!
@ArthurZucker any update on this PR?
Hey @faaany, I am updating it right now!
Force-pushed the branch from bf55915 to a4ed16f.
Reverted the changes as adding proper support for
Pinging @sgugger for a final review!
LGTM! Would be nice to make the two skipped tests smaller so that they pass.
Will do so in a follow-up PR!
@zhacmsra the issue is in loading the vocabulary file; not 100% sure it's related to this. Can you open a new issue with a reproducer, please?
Hi Arthur, you are correct. I figured out that it is not related to this PR. The networking problem broke the input model and resulted in erroneous inputs. Sorry for the trouble. Thank you for the kind and timely response.
* fix EVERYTHING
* more fixes
* ⚗️⚗️ Tokenizer magic ⚗️⚗️
* wrong value but test passes for the TODO
* update
* updat
* safe protobuf import?
* style
* non gated repo
* update
* fixup
* Update src/transformers/models/llama/tokenization_llama.py (Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>)
* Update src/transformers/models/llama/tokenization_llama.py (Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>)
* Update tests/models/t5/test_tokenization_t5.py (Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>)
* nits
* fix t5 too
* use assert equal
* fix llama decoding
* nits on t5
* fixup
* only remove the prefix space, not other spaces
* more deconding tests and more todos
* fix CI as well
* fixup
* skip failing test on CI (its tf its ok)
* skip test_subword_regularization_tokenizer that is also crashing on the CI for TF
* update llama
* revert good fixes
* fixup
* empty
* explain why we need to encode with an additional token
* better warning?
* nits

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
What does this PR do?
Modifies `Llama` and `T5`; other sentencepiece-based tokenizers will follow. The previous behaviour is always possible with:

`tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b", legacy=True)`
To clarify the goal of `transformers`'s wrapping around `sentencepiece`, we want to:

- properly handle the stripping, encoding and decoding of added tokens
- support `tokenizer.add_tokens(...)` instead of having to load the protobuf file, modify the vocab, save it and reload the sentencepiece processor

The current and past problems with our wrappers
Let's use both T5 and Llama as reference models. Currently, we do not mimic the behaviour of adding words to the actual `sentencepiece` vocabulary. This is an issue for anyone expecting (and rightfully so) that adding tokens does not modify the behaviour of the model.

Adding a word to sentencepiece's vocab
This can be done using: (source), then loading the `sp_model`. Then, try the following:
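The original snippets are not preserved in this excerpt; below is a hedged sketch of the kind of protobuf edit being referred to, following the sentencepiece add-new-vocab recipe (the file paths and the token name are placeholders):

```python
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

# Parse the serialized sentencepiece model (placeholder path).
m = sp_pb2.ModelProto()
with open("tokenizer.model", "rb") as f:
    m.ParseFromString(f.read())

# Append the new piece to the vocab and write the patched model back out.
new_piece = sp_pb2.ModelProto.SentencePiece()
new_piece.piece = "your_token"
new_piece.score = 0.0
m.pieces.append(new_piece)
with open("tokenizer_patched.model", "wb") as f:
    f.write(m.SerializeToString())

# Load the sp_model and see how it now tokenizes text around the new word.
sp_model = spm.SentencePieceProcessor(model_file="tokenizer_patched.model")
print(sp_model.encode("your_tokenHello", out_type=str))
```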
Adding a word to a `PreTrainedTokenizer`

This can be done using `tokenizer.add_tokens(["your_token"])`. It is a lot simpler indeed. But the output you will get is:
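The exact output is not reproduced in this excerpt; a sketch of the setup (checkpoint name reused from above; the commented result only reflects the pre-fix behaviour described in the next paragraph, not a verified run):

```python
from transformers import AutoTokenizer

# legacy=True and the slow tokenizer reproduce the previous behaviour discussed here.
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b", legacy=True, use_fast=False)
tokenizer.add_tokens(["your_token"])

print(tokenizer.tokenize("your_tokenHello"))
# With the old behaviour this comes out as something like ['your_token', '▁Hello']:
# "Hello" picks up a prefix space even though there was none in the input.
```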
This is because we always split the text on the added tokens and give the text on the left and right to the `sentencepiece` model. But most sentencepiece models add a prefix space `_` (the `SPIECE_UNDERLINE` character). Thus, when the `transformers` tokenizer splits `"your_tokenHello"`, it encodes `your_token` with the `tokenizer.added_tokens_encoder`, which does not add a prefix space, and then encodes `Hello` with the sentencepiece model, which adds a prefix space and thus outputs `_Hello`.

Other mismatches:
TL;DR: this shows the only way we can actually and properly handle added tokens and sentencepiece. We have to disable automatic prefix addition, and always encode with a token that is part of the vocab at the beginning in order to properly encode the first token, whether or not it has a prefix space. Yes, this is dirty and sad, but the previous fix was removing the extra space, which was cleaner but had corner cases (#25176).
The same issue happens with fast tokenizers:
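A quick way to check the fast tokenizer side (same placeholder checkpoint as above; the exact output is not reproduced here):

```python
from transformers import AutoTokenizer

fast_tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b", use_fast=True)
fast_tokenizer.add_tokens(["your_token"])

# The added token is split off first and the remaining text is passed to the
# underlying model, so the same spurious prefix space can show up here too.
print(fast_tokenizer.tokenize("your_tokenHello"))
```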
Another issue 😈
So, here, the issue is that before the special token, even if there is no `rstrip` or `lstrip` (both are set to `False`), we have very strange behaviours. Note that `tokenizer.convert_tokens_to_ids("▁▁")` is `259` while `tokenizer.convert_tokens_to_ids("▁")` is `29871`.
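A hedged sketch for inspecting this (the two ids are the ones quoted above; everything else is an assumed setup, not output taken from this PR):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b", use_fast=False)

# Ids quoted in the note above (Llama vocab): "▁▁" -> 259, "▁" -> 29871.
print(tokenizer.convert_tokens_to_ids("▁▁"))
print(tokenizer.convert_tokens_to_ids("▁"))

# Compare how the text right before a special token is handled with and
# without a space, even though rstrip/lstrip are both False.
print(tokenizer.tokenize("Hello </s>"))
print(tokenizer.tokenize("Hello</s>"))
```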
Also, if we add a prefix space to special tokens at the beginning, we are probably going to break a lot of things.