[Lllama] Update tokenization code to ensure parsing of the special tokens [core] #24042
Conversation
Ok, narrowed it down to this line: `# Check all our special tokens are registered as "no split" tokens (we don't cut them) and are in the vocab`, followed by `added_tokens = tokenizer.sanitize_special_tokens()`. When converting the model from a slow one, the tokenizer correctly processes the inputs up until this point. Meaning that before, the special tokens were already registered as special tokens, but adding them once more most probably breaks the internal regex. Still checking, but it should be this.
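For context, a minimal sketch of how this step can be reproduced; the checkpoint path is a placeholder, not a real model id:

```python
# Sketch: reproduce the conversion step where re-adding already-registered
# special tokens appears to break the fast tokenizer (placeholder path).
from transformers import LlamaTokenizerFast

tokenizer = LlamaTokenizerFast.from_pretrained("path/to/llama-checkpoint", from_slow=True)

# The special tokens are already in the vocab at this point;
# sanitize_special_tokens() re-registers them as "no split" added tokens
# and returns how many tokens were added.
num_added = tokenizer.sanitize_special_tokens()
print(num_added)

# Check whether an inline special token is still being split after this call.
print(tokenizer.tokenize("this is not<s>"))
```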
The documentation is not available anymore as the PR was closed or merged.
After debugging with @Narsil, it seems that the special tokens must not be normalized; otherwise the normalizer prepends a space when adding them, which is why the token is not recognized. I suspect that there is another bug, as I tried with special tokens set to `normalized = True` (when calling …). A big discrepancy is that the default …
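As a rough illustration of the normalization point (not the exact fix in this PR; the checkpoint path is a placeholder), the special tokens can be re-registered on the Rust backend with `normalized=False`:

```python
# Sketch: register the Llama special tokens as non-normalized AddedTokens so the
# normalizer does not prepend a space ("▁") to their content, which is what
# prevents "<s>" from being matched as a single token. Placeholder path.
from tokenizers import AddedToken
from transformers import LlamaTokenizerFast

tokenizer = LlamaTokenizerFast.from_pretrained("path/to/llama-checkpoint")

tokenizer.backend_tokenizer.add_special_tokens(
    [AddedToken(tok, normalized=False) for tok in ["<s>", "</s>", "<unk>"]]
)
```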
We have to update the online models to change the …
Thanks for the fix!
…kens [core] (huggingface#24042)
* prevent llama fast from returning token type ids
* remove type hints
* normalised False
What does this PR do?
Addresses the issues with the fast tokenizer of Llama. Namely:
There seems to be an issue with the conversion: even before the Python layer, just loading the tokenizer_config.json file with the Rust backend still produced:
tokenizer.encode("this is not<s>").tokens
['<s>', '▁This', '▁is', '▁not', '</', 's', '>']
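After the fix, the behaviour can be checked against the Rust backend directly; the checkpoint path below is a placeholder:

```python
# Sketch: verify that an inline special token is no longer split by the fast
# tokenizer after conversion (placeholder checkpoint path).
from transformers import LlamaTokenizerFast

tokenizer = LlamaTokenizerFast.from_pretrained("path/to/llama-checkpoint")

encoding = tokenizer.backend_tokenizer.encode("this is not<s>")
print(encoding.tokens)
# The special token should now appear as a single piece instead of being
# split into sub-pieces by the normalizer.
```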