
Refactor TransformersTokenizer and change fallback behavior for byte_tokens #973

Merged · 25 commits · Aug 16, 2024

Conversation

hudson-ai
Collaborator

  • Break the logic for getting byte_tokens from the original huggingface tokenizer out of __init__ and into a separate method
  • Break that method down into several smaller methods, one per approach (using byte_decoder, sp_model, etc.)
  • Don't fall back to the fast tokenizer if byte_decoder doesn't have all bytes. Instead, just make having all bytes part of the conditional that decides whether to use the byte_decoder to get byte_tokens
  • Remove the hasattr(tokenizer, "get_vocab") check, as this is always true (@riedgar-ms is this only in new transformers versions?)
  • Wrap _byte_tokens_from_vocab (@ambisinister's code, my name) in a try-except. Since the tokenizer always has a get_vocab method, hitting the except is the only path to the gpt-2 fallback byte decoder (@Harsha-Nori it's also the only path to the sentencepiece exception that #971, "Better error messages when failing to load models with sentencepiece based tokenizers", complains is missing...)

Fixes #958 (well, at least it makes the sentencepiece exception get raised correctly). The overall flow after the refactor is sketched below.
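A structural sketch of the described refactor, for orientation only: the helper names follow the bullets and later threads, but the bodies, signatures, and exact ordering here are assumptions rather than the merged code.

class TransformersTokenizer:
    def _byte_tokens(self, transformers_tokenizer):
        # Try one approach per helper method, in order of preference.
        if hasattr(transformers_tokenizer, "byte_decoder") and self._byte_decoder_has_all_bytes(
            transformers_tokenizer
        ):
            # Having all bytes is now part of this conditional rather than an assertion
            return self._byte_tokens_from_byte_decoder(transformers_tokenizer)
        if hasattr(transformers_tokenizer, "sp_model"):
            return self._byte_tokens_from_sp_model(transformers_tokenizer)
        try:
            return self._byte_tokens_by_encoding_token_strings(transformers_tokenizer)
        except ValueError:
            pass
        try:
            return self._byte_tokens_from_vocab(transformers_tokenizer)
        except Exception:
            # Hitting this except is the only path to the gpt-2 fallback byte decoder
            return self._byte_tokens_from_fallback_byte_decoder(transformers_tokenizer)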

Comment on lines 91 to 94
# except ImportError:
#     # HuggingFace needs us to install something (sentencepiece, protobuf, etc. for some non-fast tokenizers)
#     # TODO: decide whether to warn and fallback to fast tokenizer or to raise (should at least warn...)
#     raise
Collaborator Author
Uncommenting this would cause huggingface to raise an exception about sentencepiece early. With it commented out, we'll try a few approaches to getting the byte_tokens and then eventually raise our own exception if we couldn't make it work.

try:
    tokenizer = transformers_package.AutoTokenizer.from_pretrained(
        model, use_fast=False, **kwargs
    )
Collaborator Author

Previously, we asserted _byte_decoder_has_all_bytes here, potentially causing a fallback to fast. I instead moved that check into _byte_tokens.

Note: we could call _byte_tokens inside of this try/except instead of further up the stack if we want to fall back to fast on failure... I'm not sure which semantics are better...
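For reference, the relocated check plausibly looks something like this; a sketch inferred from the ByteDecoderError message quoted later in this conversation, not the merged implementation:

def _check_byte_decoder_has_all_bytes(self, byte_decoder: dict, vocab: dict) -> None:
    # "all_bytes" holds the characters that spell vocab tokens (named to match
    # the error message); each must have an entry in byte_decoder.
    all_bytes = {ch for token in vocab for ch in token}
    missing = all_bytes - set(byte_decoder.keys())
    if missing:
        raise ByteDecoderError(f"Byte decoder is missing bytes: {missing}")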

            byte_tokens[i] = byte_coded.replace(space_prefix, b" ")
        return byte_tokens

    def _byte_tokens_from_vocab(
Collaborator Author

This path used to call transformers_tokenizer.get_vocab and then do nothing with it. It was also only happening if the tokenizer actually had that method. As far as I can tell, it always has that method (again, could be a version thing?)
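A hypothetical reconstruction of what the method does after this change (names from the diff; the body is an assumption and omits the space-prefix handling shown above):

def _byte_tokens_from_vocab(self, transformers_tokenizer) -> list:
    # Iterate token ids directly instead of materializing get_vocab()
    byte_tokens = [b""] * len(transformers_tokenizer)
    for i in range(len(transformers_tokenizer)):
        token_str = transformers_tokenizer.convert_ids_to_tokens(i)
        byte_tokens[i] = token_str.encode()
    return byte_tokens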

Contributor

I think I originally used that method to get the tokenizer key/value pairs before replacing it with just iterating through a range the same length as the tokenizer. I didn't realize that we weren't using the vocabulary at all anymore, so I think removing it makes sense if the functionality stays the same.

Collaborator Author

Got it, thanks!

@codecov-commenter

codecov-commenter commented Jul 30, 2024

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 80.86957% with 22 lines in your changes missing coverage. Please review.

Project coverage is 62.45%. Comparing base (b66f2a0) to head (5ee72e2).
Report is 1 commit behind head on main.

Files                                           Patch %   Lines
guidance/models/transformers/_transformers.py   80.86%    22 Missing ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #973      +/-   ##
==========================================
+ Coverage   58.00%   62.45%   +4.45%     
==========================================
  Files          63       63              
  Lines        4848     4898      +50     
==========================================
+ Hits         2812     3059     +247     
+ Misses       2036     1839     -197     


Comment on lines +195 to +198
encoded_str = transformers_tokenizer.encode(token_str)
if len(encoded_str) != 1:
    raise ValueError(f"Round-trip encoding of tokens [{token}] failed! Got {encoded_str}")
roundtrip_id = encoded_str[0]
Collaborator Author

FYI, here's where that IndexError was happening before
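To make the old failure concrete (an illustration; the empty-encoding trigger is an assumption about how the IndexError arose, not taken from the diff):

encoded_str = transformers_tokenizer.encode(token_str)  # may be [] for some tokens
# old: roundtrip_id = encoded_str[0] raised IndexError whenever encoded_str was empty
# new: the len(encoded_str) != 1 guard raises a descriptive ValueError first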

@riedgar-ms
Collaborator

Refactoring makes sense; especially if we might want to change the order of the methods we attempt. A bit hard to test though :-/

@hudson-ai
Collaborator Author

Refactoring makes sense; especially if we might want to change the order of the methods we attempt. A bit hard to test though :-/

Definitely hard to test...

On my end, I'm getting the sentencepiece exception in the fallback method when I don't have sentencepiece installed, which is a good signal. But I just experimentally merged this with the rust PR (which drops our protobuf dependency) because I knew that the fast=False Phi-3 model requires protobuf... The same "install sentencepiece" exception gets raised even though it's not the actual source of the problem.

So I think (unless there are any objections), I'd like to uncomment the ImportError exception handler I highlighted above. We should additionally probably warn the user if we're falling back to the less preferred "fast" tokenizer... what do you think?

Comment on lines 92 to 96
except ImportError as e:
    # HuggingFace needs us to install something (sentencepiece, protobuf, etc. for some non-fast tokenizers)
    raise RuntimeError(
        f"Could not load tokenizer for model {model}. Please install the necessary dependencies for the tokenizer (see traceback for info)."
    ) from e
Collaborator Author

We'll now fail early rather than falling back to the fast tokenizer if the cause of the exception here is a missing dependency

Collaborator

Interesting -- was guidance just eating a silent ImportError when e.g. attempting to load phi-3 without sentencepiece? I really like this change if we can catch those reasonably reliably.

Collaborator Author

Yes, we were trying to use the fast=False tokenizer and had a catch-all except, which would fall back to the fast tokenizer.

Raising the ImportError seems like a good idea to me, as it gives the user something really actionable and specific to do to fix the issue. I'm still falling back to the fast tokenizer for any other exception, but maybe that's a bad idea if we know they are generally unreliable...

Best case for me would be to only catch a specific exception class, e.g. NotImplementedError if that's what's thrown when there is no non-fast implementation or something... Unbounded except statements that silently fall back are scary (hence me adding a warning...)

Collaborator Author

Now I dislike the runtime error. We should try to be as consistent as possible in this repo if we are raising for "you are missing a dependency" reasons.

Collaborator

Agreed on dropping the runtime error

    ) from e
except Exception as e:
    # Fall back for other exceptions
    warnings.warn(
        f"Falling back to fast tokenizer. Could not load tokenizer for model {model} due to exception {e.__class__.__name__}: {e}"
    )
Collaborator Author

Added a warning. 100% unsure of whether this branch is ever taken now...

Collaborator

Yeah, good question. I agree -- it feels like we shouldn't hit this fallback anymore, particularly on a pathway dependent on import issues.

Collaborator Author

As Richard said, this is sadly just really hard to test. At least without making the community do it for us...

Collaborator

... or blowing up our test matrix even more

Comment on lines 250 to 258
except Exception as e:
    msg = textwrap.dedent(
        f"""
        The tokenizer being used is unable to convert a special character in {s}.
        For models with sentencepiece based tokenizers (e.g. llama, phi-3-mini),
        installing sentencepiece often fixes this issue (pip install sentencepiece).
        """
    )
    raise ValueError(msg) from e
Collaborator Author

I'm unsure whether this method will ever be called now that we "fail early" with respect to the fast tokenizer. If we hit this exception, the message may be wrong now, as a missing import probably isn't the cause of the exception anymore...

Comment on lines 231 to 249
# run a quick spot check to verify we can rebuild complex multi-token unicode symbols
s = "’•¶∂ƒ˙∆£Ħ爨ൠᅘ∰፨"
reconstructed = b""
try:
    input_ids = transformers_tokenizer(s)["input_ids"]
    for i in input_ids:
        nxt_bytes = []
        token_str = transformers_tokenizer.convert_ids_to_tokens(i)
        for c in token_str:
            nxt_bytes.append(byte_decoder[c])
        reconstructed += bytes(nxt_bytes)
    # Check if the tokenizer has a bos_token attribute, and if it does, check
    # if it's at the start of the reconstructed bytes
    # Some tokenizers add this automatically as part of the call function, so
    # we need to remove it to compare
    if hasattr(transformers_tokenizer, "bos_token") and reconstructed.startswith(
        transformers_tokenizer.bos_token.encode()
    ):
        reconstructed = reconstructed[len(transformers_tokenizer.bos_token):]
Collaborator
@Harsha-Nori commented Aug 1, 2024

IIRC, these reconstitution checks should happen more generally than just in the fast-fallback path. It's very possible that guidance wasn't doing this before, but I think it should happen after all the paths are exhausted and a single method is picked.

Collaborator Author

I don't believe it was doing this in any case but the fallback, though I could be wrong. But agreed with you that it's reasonable to do this always... at least to give a warning if not an exception.
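One shape the "always" version could take, following the spot check quoted above (a sketch with assumed names, issuing a warning rather than an exception):

import warnings

def spot_check_byte_tokens(transformers_tokenizer, byte_tokens) -> None:
    # Re-tokenize a string of complex unicode symbols and compare the bytes
    # reconstructed from byte_tokens against the original string's bytes.
    s = "’•¶∂ƒ˙∆£Ħ爨ൠᅘ∰፨"
    input_ids = transformers_tokenizer(s)["input_ids"]
    reconstructed = b"".join(byte_tokens[i] for i in input_ids)
    # endswith() tolerates a bos token that some tokenizers prepend automatically
    if not reconstructed.endswith(s.encode()):
        warnings.warn("byte_tokens failed the unicode round-trip spot check")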

Comment on lines +252 to +254
raise ByteDecoderError(
    f"Byte decoder is missing bytes: {all_bytes - set(byte_decoder.keys())}"
)
Collaborator

What's the recommended action for users here if this does get triggered? I think we previously had fallbacks if we failed this assertion, but do we just throw a hard error here?

Collaborator Author

This will only get raised if the last-ditch fallback (the gpt2 tokenizer) doesn't have all the bytes or fails the complex unicode round-trip (it is raised on line 138 as a ValueError). We didn't have any kind of fallback for this in the original code. I'm really not sure what can be done besides opening an issue here on GH.

Collaborator

"Please ask the model creator to provide a byte decoder"

Collaborator Author

Almost but not quite. There may have been one on the slow branch that was missing bytes or something, causing the fallback to gpt2.

Contributor

@hudson-ai @Harsha-Nori @riedgar-ms Phi 3 vision is hitting this code path and it's blocking me from progressing. I'll leave a more detailed write up in our teams channel.

Collaborator Author

@nking-1 I believe that the check on the byte_decoder here is a sufficient (but not necessary) condition for creating a well-formed set of byte_tokens.

I think you could either proceed by modifying the byte_decoder so that it handles spaces/underscores correctly and generates the right byte_tokens for you (then calling this check just to verify), OR you could figure out how to translate the check on byte_decoder into one on the byte_tokens themselves... @riedgar-ms any thoughts?
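To illustrate the first option (entirely hypothetical; the space-marker mapping is one guess at what "handles spaces/underscores correctly" means):

# sentencepiece-style vocabularies often spell the word boundary as "▁" (U+2581);
# mapping it to an ASCII space is one way a patched byte_decoder might produce
# well-formed byte_tokens.
patched_byte_decoder = dict(byte_decoder)
patched_byte_decoder["▁"] = ord(" ")
# ...then re-run the all-bytes check here against patched_byte_decoder to verify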

    tokenizer = transformers_package.AutoTokenizer.from_pretrained(
        model, use_fast=False, **kwargs
    )
except ImportError:
Collaborator

Is there any point to this except if all it does is re-raise?

Collaborator

Ahh sorry, it's because of the generic 'except Exception' catch which comes next. Perhaps add a comment that you're filtering out the ImportErrors first?

Collaborator Author

Added a comment!

        )
        byte_tokens[i] = byte_coded
except ByteDecoderError:
    pass
Collaborator

Should we log a warning here?

Collaborator Author

Added. Let me know if the language looks good.

try:
    return self._byte_tokens_by_encoding_token_strings(transformers_tokenizer)
except ValueError:
    pass
Collaborator

Again, should we issue a warning?

Collaborator Author

Added, ditto on the language
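The added warnings presumably take roughly this shape (wording assumed; the exact text isn't quoted in this thread):

try:
    return self._byte_tokens_by_encoding_token_strings(transformers_tokenizer)
except ValueError as e:
    warnings.warn(
        f"Could not build byte tokens by encoding token strings ({e}); "
        "trying the next approach."
    )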

Collaborator
@riedgar-ms left a comment

A few minor suggestions for extra comments/warnings, but otherwise this LGTM.

@hudson-ai
Collaborator Author

@riedgar-ms I just added an extra layer of nesting on the exceptions, essentially trying to build byte_tokens on the slow branch and falling back to the fast branch if that fails (previously we only fell back if we couldn't instantiate the slow one, or if it happened to have a byte_decoder that was bad -- note that we have other things we can still try on this branch in that case)
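In outline, the structure described here might look as follows (a sketch; the helper name and warning text are assumptions):

def _load_tokenizer_and_byte_tokens(self, model, **kwargs):
    try:
        # Slow branch first: build byte_tokens from the non-fast tokenizer
        tokenizer = transformers_package.AutoTokenizer.from_pretrained(
            model, use_fast=False, **kwargs
        )
        return tokenizer, self._byte_tokens(tokenizer)
    except ImportError:
        # Missing dependency (sentencepiece, protobuf, ...): fail early, it's actionable
        raise
    except Exception as e:
        warnings.warn(
            f"Falling back to fast tokenizer due to {e.__class__.__name__}: {e}"
        )
        tokenizer = transformers_package.AutoTokenizer.from_pretrained(
            model, use_fast=True, **kwargs
        )
        return tokenizer, self._byte_tokens(tokenizer)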

@Harsha-Nori
Collaborator

LGTM :). Thanks for all the iteration on this!

@Harsha-Nori merged commit 1352c7d into guidance-ai:main on Aug 16, 2024
100 of 104 checks passed
Successfully merging this pull request may close these issues.

Cannot instantiate Transformers("microsoft/Phi-3-mini-4k-instruct")