Refactor TransformersTokenizer and change fallback behavior for byte_tokens #973
Conversation
…et byte_tokens from it..?
```python
# except ImportError:
#     # HuggingFace needs us to install something (sentencepiece, protobuf, etc. for some non-fast tokenizers)
#     # TODO: decide whether to warn and fallback to fast tokenizer or to raise (should at least warn...)
#     raise
```
Uncommenting this would cause huggingface to raise an exception about sentencepiece early. With it commented out, we'll try a few approaches to getting the byte_tokens and then eventually raise our own exception if we couldn't make it work.
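To make that fallback behavior concrete, here is a minimal sketch of the idea; the class, function, and strategy names are placeholders, not guidance's actual internals:

```python
# Sketch only: names are placeholders, not guidance's real API.
class ByteDecoderError(Exception):
    pass

def get_byte_tokens(tokenizer, strategies):
    """Try each strategy for building byte_tokens; raise only if all of them fail.

    `strategies` is an ordered list of callables, e.g. byte_decoder-based,
    sp_model-based, round-trip-encoding-based, and a gpt-2 byte-decoder fallback.
    """
    errors = []
    for strategy in strategies:
        try:
            return strategy(tokenizer)
        except (ByteDecoderError, ValueError) as e:
            errors.append(e)  # remember why this approach failed, then try the next one
    raise ValueError(f"Could not build byte_tokens; attempts failed with: {errors}")
```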
```python
try:
    tokenizer = transformers_package.AutoTokenizer.from_pretrained(
        model, use_fast=False, **kwargs
    )
```
Previously, we asserted `_byte_decoder_has_all_bytes` here, potentially causing a fallback to `fast`. I instead moved that check into `_byte_tokens`.

Note: we could call `_byte_tokens` inside of this try/except instead of further up the stack if we want to fall back to `fast` on failure... I'm not sure which semantics are better...
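For reference, a minimal sketch of what a completeness check of that sort can look like (assuming a GPT-2-style decoder that maps single characters to byte values; the helper name is invented):

```python
# Hypothetical helper: does the byte decoder cover every character used by the token strings?
def byte_decoder_has_all_bytes(byte_decoder: dict, token_strs) -> bool:
    needed = {ch for tok in token_strs for ch in tok}
    return needed <= set(byte_decoder.keys())
```

Called from inside `_byte_tokens`, a `False` result can simply route us to the next strategy rather than forcing the fast fallback.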
```python
byte_tokens[i] = byte_coded.replace(space_prefix, b" ")
return byte_tokens

def _byte_tokens_from_vocab(
```
This path used to call `transformers_tokenizer.get_vocab` and then do nothing with it. It was also only happening if the tokenizer actually had that method. As far as I can tell, it always has that method (again, could be a version thing?)
I think originally I may have used that method to get the tokenizer key/value pairs before replacing it with just iterating through a range the same length as the tokenizer. I guess I didn't realize that we just weren't using the vocabulary at all anymore, so I think removing it makes sense if the functionality stays the same.
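Purely for illustration, the two approaches being contrasted (gpt2 used only as an example model):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

# Older idea: fetch the vocabulary explicitly...
vocab = tok.get_vocab()  # {token_str: token_id}

# ...but the code ended up just walking ids directly, leaving get_vocab unused:
token_strs = tok.convert_ids_to_tokens(list(range(len(tok))))
```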
Got it, thanks!
Codecov Report

Attention: Patch coverage is …

```
@@            Coverage Diff             @@
##             main     #973      +/-   ##
==========================================
+ Coverage   58.00%   62.45%   +4.45%
==========================================
  Files          63       63
  Lines        4848     4898      +50
==========================================
+ Hits         2812     3059     +247
+ Misses       2036     1839     -197
```

☔ View full report in Codecov by Sentry.
```python
encoded_str = transformers_tokenizer.encode(token_str)
if len(encoded_str) != 1:
    raise ValueError(f"Round-trip encoding of tokens [{token}] failed! Got {encoded_str}")
roundtrip_id = encoded_str[0]
```
FYI, here's where that `IndexError` was happening before.
Refactoring makes sense, especially if we might want to change the order of the methods we attempt. A bit hard to test, though :-/
Definitely hard to test... On my end, I'm getting the sentencepiece exception in the fallback method when I don't have sentencepiece installed, which is a good signal. But I just experimentally merged this with the rust PR (which drops our protobuf dependency) because I knew that the … So I think (unless there are any objections), I'd like to uncomment the …
```python
except ImportError as e:
    # HuggingFace needs us to install something (sentencepiece, protobuf, etc. for some non-fast tokenizers)
    raise RuntimeError(
        f"Could not load tokenizer for model {model}. Please install the necessary dependencies for the tokenizer (see traceback for info)."
    ) from e
```
We'll now fail early rather than falling back to the fast tokenizer if the cause of the exception here is a missing dependency
Interesting -- was guidance just eating a silent `ImportFailure` when e.g. attempting to load phi-3 without sentencepiece? I really like this change if we can catch those reasonably reliably.
Yes, we were trying to use the `fast=False` tokenizer and had a catch-all except, which would fall back to the fast tokenizer.

Raising the `ImportError` seems like a good idea to me, as it gives the user something really actionable and specific to do to fix the issue. I'm still falling back to the fast tokenizer for any other exception, but maybe that's a bad idea if we know they are generally unreliable...

Best case for me would be to only catch a specific exception class, e.g. `NotImplementedError` if that's what's thrown when there is no non-fast implementation or something... Unbounded `except` statements that silently fall back are scary (hence me adding a warning...)
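A hedged sketch of that narrower version; whether slow-tokenizer failures actually surface as `NotImplementedError` is an open assumption, and `model` is just a placeholder:

```python
import warnings
import transformers

model = "some/model-id"  # placeholder

try:
    tokenizer = transformers.AutoTokenizer.from_pretrained(model, use_fast=False)
except ImportError:
    # Missing sentencepiece/protobuf/etc. -- actionable, so surface it.
    raise
except NotImplementedError as e:  # assumed failure type; may differ in practice
    warnings.warn(f"No usable slow tokenizer for {model}; falling back to fast ({e})")
    tokenizer = transformers.AutoTokenizer.from_pretrained(model, use_fast=True)
```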
Now I dislike the runtime error. We should try to be as consistent as possible in this repo if we are raising for "you are missing a dependency" reasons.
Agreed on dropping the runtime error
```python
    ) from e
except Exception as e:
    # Fall back for other exceptions
    warnings.warn(f"Falling back to fast tokenizer. Could not load tokenizer for model {model} due to exception {e.__class__.__name__}: {e}")
```
Added a warning. 100% unsure of whether this branch is ever taken now...
Yeah, good question. I agree -- it feels like we shouldn't hit this fallback anymore, particularly on a pathway dependent on import issues.
As Richard said, this is sadly just really hard to test. At least without making the community do it for us...
... or blowing up our test matrix even more
```python
except Exception as e:
    msg = textwrap.dedent(
        f"""
        The tokenizer being used is unable to convert a special character in {s}.
        For models with sentencepiece based tokenizers (e.g. llama, phi-3-mini),
        installing sentencepiece often fixes this issue (pip install sentencepiece).
        """
    )
    raise ValueError(msg) from e
```
I'm unsure of whether this method will ever be called now that we "fail early" with respect to the fast tokenizer. If we hit this exception, the message may be wrong now, as missing an import probably isn't the cause of the exception anymore...
```python
# run a quick spot check to verify we can rebuild complex multi-token unicode symbols
s = "’•¶∂ƒ˙∆£Ħ爨ൠᅘ∰፨"
reconstructed = b""
try:
    input_ids = transformers_tokenizer(s)["input_ids"]
    for i in input_ids:
        nxt_bytes = []
        token_str = transformers_tokenizer.convert_ids_to_tokens(i)
        for c in token_str:
            nxt_bytes.append(byte_decoder[c])
        reconstructed += bytes(nxt_bytes)
    # Check if the tokenizer has a bos_token attribute, and if it does, check
    # if it's at the start of the reconstructed bytes
    # Some tokenizers add this automatically as part of the call function, so
    # we need to remove it to compare
    if hasattr(transformers_tokenizer, "bos_token") and reconstructed.startswith(
        transformers_tokenizer.bos_token.encode()
    ):
        reconstructed = reconstructed[len(transformers_tokenizer.bos_token) :]
```
IIRC, these reconstitution checks should happen more generally than just in the fast-fallback path. Very possible that guidance wasn't doing this before, but I think it should happen after all the paths are exhausted and a single method is picked
I don't believe it was doing it in any case but the fallback, though I could be wrong. But agreed with you that this is reasonable to do always... at least to give a warning, if not an exception.
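As a sketch, here is a warning-only variant of the spot check above that could run after whichever path produced the byte decoder (the function name is invented):

```python
import warnings

def warn_if_round_trip_fails(tokenizer, byte_decoder, s="’•¶∂ƒ˙∆£Ħ爨ൠᅘ∰፨"):
    """Spot-check that byte_decoder can rebuild complex unicode, whichever path built it."""
    reconstructed = b""
    for i in tokenizer(s)["input_ids"]:
        token_str = tokenizer.convert_ids_to_tokens(i)
        reconstructed += bytes(byte_decoder[c] for c in token_str)
    bos = getattr(tokenizer, "bos_token", None)
    if bos and reconstructed.startswith(bos.encode()):
        reconstructed = reconstructed[len(bos):]
    if reconstructed.decode("utf-8", errors="replace") != s:
        warnings.warn("Byte decoder failed the complex-unicode round-trip check")
```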
```python
raise ByteDecoderError(
    f"Byte decoder is missing bytes: {all_bytes - set(byte_decoder.keys())}"
)
```
What's the recommended action for users here if this does get triggered? I think we previously had fallbacks if we failed this assertion, but do we just throw a hard error here?
This will only get raised if the last-ditch fallback (gpt2 tokenizer) doesn't have all the bytes or fails the complex unicode round-trip (it is raised on line 138 as a `ValueError`). We didn't have any kind of fallback to this in the original code. Really not sure what can be done besides opening an issue here on GH.
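For context, one way such a gpt2 fallback decoder can be obtained is below; this is a sketch of the idea, not necessarily the exact call the code makes (the slow `GPT2Tokenizer` exposes a `byte_decoder` dict mapping characters to byte values):

```python
import transformers

# Borrow gpt-2's byte decoder as a last resort when the model's own tokenizer
# doesn't provide a usable one.
fallback_byte_decoder = transformers.AutoTokenizer.from_pretrained(
    "gpt2", use_fast=False
).byte_decoder
```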
"Please ask the model creator to provide a byte decoder"
Almost but not quite. There may have been one on the slow branch that was missing bytes or something, causing the fallback to gpt2.
@hudson-ai @Harsha-Nori @riedgar-ms Phi 3 vision is hitting this code path and it's blocking me from progressing. I'll leave a more detailed write up in our teams channel.
@nking-1 I believe that the check on the `byte_decoder` here is a sufficient (but not necessary) condition for creating a well-formed set of `byte_tokens`.

I think you could either proceed by modifying the `byte_decoder` such that it handles spaces/underscores correctly and generates the right `byte_tokens` for you (then calling this check just to verify), OR you could figure out how to translate the check on `byte_decoder` to one on the `byte_tokens` themselves... @riedgar-ms any thoughts?
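A rough sketch of what a check on the `byte_tokens` themselves could look like; the exact invariant needed may differ, and the function name is invented:

```python
def byte_tokens_look_valid(byte_tokens, tokenizer, s="’•¶∂ƒ˙∆£Ħ爨ൠᅘ∰፨") -> bool:
    """Every token id has a byte form, and concatenating them round-trips a test string."""
    if any(bt is None for bt in byte_tokens):
        return False
    ids = tokenizer(s, add_special_tokens=False)["input_ids"]
    reconstructed = b"".join(byte_tokens[i] for i in ids)
    return reconstructed.decode("utf-8", errors="replace") == s
```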
```python
    tokenizer = transformers_package.AutoTokenizer.from_pretrained(
        model, use_fast=False, **kwargs
    )
except ImportError:
```
Is there any point to this `except` if all it does is re-raise?
Ahh sorry, it's because of the generic `Exception` catch which is next. Perhaps add a comment that you're filtering out the `ImportError`s first?
Added a comment!
```python
    )
    byte_tokens[i] = byte_coded
except ByteDecoderError:
    pass
```
Should we log a warning here?
Added. Let me know if the language looks good.
```python
try:
    return self._byte_tokens_by_encoding_token_strings(transformers_tokenizer)
except ValueError:
    pass
```
Again, should we issue a warning?
Added, ditto on the language
A few minor suggestions for extra comments/warnings, but otherwise this LGTM.
@riedgar-ms I just added an extra layer of nesting on the exceptions, essentially trying to build byte_tokens on the slow branch, falling back to the fast branch if that fails (previously we only fell back if we couldn't instantiate the slow one or if it happened to have a byte_decoder that was bad -- note that we have other things we can still try on this branch in that case).
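In sketch form, that nesting looks roughly like the following, with `build_byte_tokens` standing in for whichever helper builds them (purely illustrative):

```python
import transformers

def load_tokenizer_and_byte_tokens(model, build_byte_tokens, **kwargs):
    """Sketch: build byte_tokens on the slow branch, falling back to fast if that fails."""
    try:
        slow_tok = transformers.AutoTokenizer.from_pretrained(model, use_fast=False, **kwargs)
        try:
            return slow_tok, build_byte_tokens(slow_tok)
        except ValueError:
            # Slow tokenizer loaded, but byte_tokens couldn't be built from it:
            # try the fast branch rather than giving up entirely.
            fast_tok = transformers.AutoTokenizer.from_pretrained(model, use_fast=True, **kwargs)
            return fast_tok, build_byte_tokens(fast_tok)
    except ImportError:
        # Missing dependency (sentencepiece, protobuf, ...): surface it, don't hide it.
        raise
```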
LGTM :). Thanks for all the iteration on this!
This PR:

- Moves getting `byte_tokens` from the original huggingface tokenizer out of `__init__` and into separate methods
- Tries several sources for byte tokens in turn (`byte_decoder`, `sp_model`, etc.)
- No longer raises if the `byte_decoder` doesn't have all bytes. Instead, having all bytes is just part of the conditional that decides whether to use the `byte_decoder` to get `byte_tokens`
- Removes the `hasattr(tokenizer, "get_vocab")` check, as this is always true (@riedgar-ms is this only in new `transformers` versions?)
- Wraps `_byte_tokens_from_vocab` (@ambisinister's code, my name) in a try-except. Since the tokenizer always has a `get_vocab` method, hitting the except is the only path to the gpt-2 fallback byte decoder (@Harsha-Nori it's also the only path to the `sentencepiece` exception that #971 "Better error messages when failing to load models with sentencepiece based tokenizers" complains is missing...)

Fixes #958 (well, at least it makes the `sentencepiece` exception get raised correctly)