🚨🚨🚨 [SPM] Finish fix spm models 🚨🚨🚨 #25224
Conversation
The previous test values were not very good; with this update they make more sense.
```diff
@@ -534,15 +555,19 @@ def test_remove_extra_whitespaces(self):
     input_ids = self.tokenizer.encode(" . Hello")
     self.assertEqual(input_ids, [7, 4, 156, 86, 20])
     sp_encode = self.tokenizer.sp_model.encode(" . Hello")
-    self.assertEqual(input_ids, sp_encode)
+    self.assertEqual(input_ids, [7] + sp_encode)
```
Manually add the `_` (spiece underline).
```python
text = self.unk_token + text
tokens = self.sp_model.encode(text, out_type=str)
return tokens[self.unk_token_length :]
```
That's the Hack:

- all spm models have an `unk_token`; whether or not it is in the sentencepiece vocab does not matter.
- we need to do this because, since `add_dummy_prefix = False`, the sentencepiece model always ALWAYS strips any `SPIECE_UNDERLINE`. So `sp_model.encode(SPIECE_UNDERLINE + "Hello", out_type=str)` will give `[Hel, llo]` instead of `[_Hel, llo]`.
- previously, we removed the added extra space. This is okay, but fails for words that should be split, like `inform`. What happened before was that we would tokenize as `_inform`, then remove `_`, and we would have `inform`. But the actual tokenization of `inform` is `in`, `form`, and `inform` is not part of the vocab!
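A minimal sketch of that stripping behaviour (the model path, the `<unk>` string and the printed pieces are assumptions for illustration, not taken from this PR):

```python
import sentencepiece as spm

SPIECE_UNDERLINE = "▁"

# Placeholder path: a Llama/T5-style sentencepiece model whose normalizer has
# add_dummy_prefix=False, as this PR configures it.
sp_model = spm.SentencePieceProcessor(model_file="tokenizer.model")

# The leading SPIECE_UNDERLINE is stripped, so the prefix-space information is lost.
print(sp_model.encode(SPIECE_UNDERLINE + "Hello", out_type=str))

# Prepending a token that is in the vocab protects the underline; the pieces
# produced for that prefix token are then dropped, as in the diff above.
unk = "<unk>"
unk_len = len(sp_model.encode(unk, out_type=str))
print(sp_model.encode(unk + SPIECE_UNDERLINE + "Hello", out_type=str)[unk_len:])
```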
The documentation is not available anymore as the PR was closed or merged.
This makes sense to me. As usual with breaking changes, can you put three 🚨 in the title and show in the description how to enable the past behavior for users who want it?
Changed the title from [SPM] Finish fix spm models to 🚨🚨🚨 [SPM] Finish fix spm models 🚨🚨🚨
LGTM! Thanks for working on this v. tricky problem 🤗
Will fix the prefixing of special tokens!
@ArthurZucker any update on this PR?
Hey @faaany, I am updating it right now!
Force-pushed the branch from bf55915 to a4ed16f.
Reverted the changes as adding proper support for
Pinging @sgugger for a final review!
LGTM! Would be nice to make the two skipped tests smaller so that they pass.
Will do so in a follow-up PR!
@zhacmsra the issue is in loading the vocabulary file; not 100% sure it's related to this. Can you open a new issue with a reproducer, please?
Hi Arthur, you are correct. I figured out that it is not related to this PR. The networking problem broke the input model and resulted in erroneous inputs. Sorry for the trouble. Thank you for the kind and timely response.
* fix EVERYTHING
* more fixes
* ⚗️⚗️ Tokenizer magic ⚗️⚗️
* wrong value but test passes for the TODO
* update
* updat
* safe protobuf import?
* style
* non gated repo
* update
* fixup
* Update src/transformers/models/llama/tokenization_llama.py (Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>)
* Update src/transformers/models/llama/tokenization_llama.py (Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>)
* Update tests/models/t5/test_tokenization_t5.py (Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>)
* nits
* fix t5 too
* use assert equal
* fix llama decoding
* nits on t5
* fixup
* only remove the prefix space, not other spaces
* more deconding tests and more todos
* fix CI as well
* fixup
* skip failing test on CI (its tf its ok)
* skip test_subword_regularization_tokenizer that is also crashing on the CI for TF
* update llama
* revert good fixes
* fixup
* empty
* explain why we need to encode with an additional token
* better warning?
* nits

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
What does this PR do?
Modifies `Llama` and `T5`; other sentencepiece-based tokenizers will follow. The previous behaviour is always possible with:

`tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b", legacy=True)`
To clarify the goal of `transformers`'s wrapping around `sentencepiece`, we want to:

- properly handle the stripping, encoding and decoding of added tokens
- support `tokenizer.add_tokens(...)` instead of having to load the protobuf file, modify the vocab, save it and reload the sentencepiece processor

The current and past problems with our wrappers
Let's use both T5 and Llama as reference models. Currently, we do not mimic the behaviour of adding words to the actual `sentencepiece` vocabulary. This is an issue for anyone expecting (and rightfully so) that adding tokens does not modify the behaviour of the model.

Adding a word to sentencepiece's vocab
This can be done using: (source), then loading the `sp_model`. Then, try the following:
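The original snippets are not preserved in this excerpt; below is a hedged sketch of the kind of protobuf edit being referred to, following the sentencepiece add-new-vocab recipe (the file paths and the token name are placeholders):

```python
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

# Parse the serialized sentencepiece model (placeholder path).
m = sp_pb2.ModelProto()
with open("tokenizer.model", "rb") as f:
    m.ParseFromString(f.read())

# Append the new piece to the vocab and write the patched model back out.
new_piece = sp_pb2.ModelProto.SentencePiece()
new_piece.piece = "your_token"
new_piece.score = 0.0
m.pieces.append(new_piece)
with open("tokenizer_patched.model", "wb") as f:
    f.write(m.SerializeToString())

# Load the sp_model and see how it now tokenizes text around the new word.
sp_model = spm.SentencePieceProcessor(model_file="tokenizer_patched.model")
print(sp_model.encode("your_tokenHello", out_type=str))
```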
Adding a word to a `PreTrainedTokenizer`

This can be done using `tokenizer.add_tokens(["your_token"])`. It is a lot simpler indeed. But the output you will get is:
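The exact output is not reproduced in this excerpt; a sketch of the setup (checkpoint name reused from above; the commented result only reflects the pre-fix behaviour described in the next paragraph, not a verified run):

```python
from transformers import AutoTokenizer

# legacy=True and the slow tokenizer reproduce the previous behaviour discussed here.
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b", legacy=True, use_fast=False)
tokenizer.add_tokens(["your_token"])

print(tokenizer.tokenize("your_tokenHello"))
# With the old behaviour this comes out as something like ['your_token', '▁Hello']:
# "Hello" picks up a prefix space even though there was none in the input.
```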
This is because we always split the text on the added tokens and give the text on the left and right to the `sentencepiece` model. But most sentencepiece models add a prefix space `_` (the `SPIECE_UNDERLINE` character). Thus, when the `transformers` tokenizer splits `"your_tokenHello"`, it encodes `your_token` with the `tokenizer.added_tokens_encoder`, which does not add a prefix space, and then encodes `Hello` with the sentencepiece model, which adds a prefix space and thus outputs `_Hello`.

Other mismatches:
TL;DR: this shows the only way we can actually and properly handle added tokens and sentencepiece. We have to disable automatic prefix addition, and always encode with a token that is part of the vocab at the beginning in order to properly encode the first token, whether or not it has a prefix space. Yes, this is dirty and sad, but the previous fix was removing the extra space, which was cleaner but had corner cases (#25176).
The same issue happens with fast tokenizers:
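A quick way to check the fast tokenizer side (same placeholder checkpoint as above; the exact output is not reproduced here):

```python
from transformers import AutoTokenizer

fast_tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b", use_fast=True)
fast_tokenizer.add_tokens(["your_token"])

# The added token is split off first and the remaining text is passed to the
# underlying model, so the same spurious prefix space can show up here too.
print(fast_tokenizer.tokenize("your_tokenHello"))
```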
Another issue 😈
So, here, the issue is that before the special token, even if there is no `rstrip` or `lstrip` (both are set to `False`), we have very strange behaviours. Note that `tokenizer.convert_tokens_to_ids("▁▁")` is `259` while `tokenizer.convert_tokens_to_ids("▁")` is `29871`.
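A hedged sketch for inspecting this (the two ids are the ones quoted above; everything else is an assumed setup, not output taken from this PR):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b", use_fast=False)

# Ids quoted in the note above (Llama vocab): "▁▁" -> 259, "▁" -> 29871.
print(tokenizer.convert_tokens_to_ids("▁▁"))
print(tokenizer.convert_tokens_to_ids("▁"))

# Compare how the text right before a special token is handled with and
# without a space, even though rstrip/lstrip are both False.
print(tokenizer.tokenize("Hello </s>"))
print(tokenizer.tokenize("Hello</s>"))
```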
Also, if we add a prefix space to special tokens at the beginning, we are probably going to break a lot of things.