Fix llama tokenizer #22402
Conversation
The documentation is not available anymore as the PR was closed or merged.
cc @Narsil for visibility!
This will need to wait for #22341
Let's put some tests before merging. (And a PR description)
Probably focusing on the breaking changes we're making here.
Yes, on it!
Will finish this tomorrow!
Hi! Does this PR fix the decoding part of the tokenizer? It seems like it always prefixes the output with a space. For instance,
Waiting for huggingface/transformers#22402 to fix llama tokenizer
Yes, it does:
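A minimal sketch of the reported behavior, assuming a locally converted LLaMA checkpoint (the path here is hypothetical):

```python
from transformers import AutoTokenizer

# Hypothetical path to a converted LLaMA checkpoint.
tokenizer = AutoTokenizer.from_pretrained("path/to/llama")

ids = tokenizer.encode("Hello world", add_special_tokens=False)
print(repr(tokenizer.decode(ids)))
# Reported behavior before this fix: ' Hello world' -- the SentencePiece
# prefix space was kept, so every decoded string started with a space.
```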
assert tokenizer_fast.clean_up_tokenization_spaces is False
assert tokenizer.clean_up_tokenization_spaces is False
this is such a small nit that I included it 😅
After rebasing, this test fails for me :( just reproduced on main:
> assert decoded == "[CLS] this shouldn ' t be! he ' ll go. [SEP]"
E assert "[CLS] this s...'ll go. [SEP]" == "[CLS] this s... ll go. [SEP]"
E - [CLS] this shouldn ' t be! he ' ll go. [SEP]
E ? - - - -
E + [CLS] this shouldn't be! he'll go. [SEP]
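For context, a minimal sketch of what the flag changes, using a BERT tokenizer as a stand-in (the failing test's exact tokenizer may differ):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
ids = tokenizer.encode("this shouldn't be! he'll go.")

# With cleanup enabled, the spaces that wordpiece decoding inserts
# around punctuation and contractions are collapsed back.
print(tokenizer.decode(ids, clean_up_tokenization_spaces=True))
# [CLS] this shouldn't be! he'll go. [SEP]

# With cleanup disabled, the raw detokenized spacing is kept.
print(tokenizer.decode(ids, clean_up_tokenization_spaces=False))
# [CLS] this shouldn ' t be! he ' ll go. [SEP]
```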
This is not pointing to the correct part of the test. If clean_up_tokenization_spaces
is indeed False, the failure can happen for cache reasons or something else (it also failed for me at some point).
Will check again.
Nice, thanks for all the fixes and for adding the tests!
LGTM
* draft
* update tokenization llama and conversion script
* more updates
* initial commit
* style
* default pad to None
* draft tokenization tests
* update test
* update tokenization tests
* nits
* update
* versioning test
* major fix
* fix more tests
* finish fixing special masks
* last nit
* more nits
* add encode decode tests
* add more
* fix token type ids
* style
What does this PR do?
Draft but: