Refactor TransformersTokenizer and change fallback behavior for byte_tokens #973
Conversation
…et byte_tokens from it..?
```python
# except ImportError:
#     # HuggingFace needs us to install something (sentencepiece, protobuf, etc. for some non-fast tokenizers)
#     # TODO: decide whether to warn and fallback to fast tokenizer or to raise (should at least warn...)
#     raise
```
Uncommenting this would cause huggingface to raise an exception about sentencepiece early. With it commented out, we'll try a few approaches to getting the byte_tokens and then eventually raise our own exception if we couldn't make it work.
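To make that fallback behavior concrete, here is a minimal sketch of the idea; the class, function, and strategy names are placeholders, not guidance's actual internals:

```python
# Sketch only: names are placeholders, not guidance's real API.
class ByteDecoderError(Exception):
    pass

def get_byte_tokens(tokenizer, strategies):
    """Try each strategy for building byte_tokens; raise only if all of them fail.

    `strategies` is an ordered list of callables, e.g. byte_decoder-based,
    sp_model-based, round-trip-encoding-based, and a gpt-2 byte-decoder fallback.
    """
    errors = []
    for strategy in strategies:
        try:
            return strategy(tokenizer)
        except (ByteDecoderError, ValueError) as e:
            errors.append(e)  # remember why this approach failed, then try the next one
    raise ValueError(f"Could not build byte_tokens; attempts failed with: {errors}")
```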
```python
try:
    tokenizer = transformers_package.AutoTokenizer.from_pretrained(
        model, use_fast=False, **kwargs
    )
```
Previously, we asserted `_byte_decoder_has_all_bytes` here, potentially causing a fallback to `fast`. I instead moved that check into `_byte_tokens`.

Note: we could call `_byte_tokens` inside of this try/except instead of further up the stack if we want to fall back to `fast` on failure... I'm not sure which semantics are better...
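For reference, a minimal sketch of what a completeness check of that sort can look like (assuming a GPT-2-style decoder that maps single characters to byte values; the helper name is invented):

```python
# Hypothetical helper: does the byte decoder cover every character used by the token strings?
def byte_decoder_has_all_bytes(byte_decoder: dict, token_strs) -> bool:
    needed = {ch for tok in token_strs for ch in tok}
    return needed <= set(byte_decoder.keys())
```

Called from inside `_byte_tokens`, a `False` result can simply route us to the next strategy rather than forcing the fast fallback.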
```python
byte_tokens[i] = byte_coded.replace(space_prefix, b" ")
return byte_tokens

def _byte_tokens_from_vocab(
```
This path used to call `transformers_tokenizer.get_vocab` and then do nothing with it. It was also only happening if the tokenizer actually had that method. As far as I can tell, it always has that method (again, could be a version thing?)
I think originally I may have used that method to get the tokenizer key/value pairs before replacing it with just iterating through a range the same length as the tokenizer. I guess I didn't realize that we just weren't using the vocabulary at all anymore, so I think removing it makes sense if the functionality stays the same.
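Purely for illustration, the two approaches being contrasted (gpt2 used only as an example model):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

# Older idea: fetch the vocabulary explicitly...
vocab = tok.get_vocab()  # {token_str: token_id}

# ...but the code ended up just walking ids directly, leaving get_vocab unused:
token_strs = tok.convert_ids_to_tokens(list(range(len(tok))))
```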
Got it, thanks!
Codecov Report

Attention: Patch coverage is …

```
@@            Coverage Diff             @@
##             main     #973      +/-   ##
==========================================
+ Coverage   58.00%   62.45%   +4.45%
==========================================
  Files          63       63
  Lines        4848     4898      +50
==========================================
+ Hits         2812     3059     +247
+ Misses       2036     1839     -197
```

☔ View full report in Codecov by Sentry.
```python
encoded_str = transformers_tokenizer.encode(token_str)
if len(encoded_str) != 1:
    raise ValueError(f"Round-trip encoding of tokens [{token}] failed! Got {encoded_str}")
roundtrip_id = encoded_str[0]
```
FYI, here's where that `IndexError` was happening before.
Refactoring makes sense, especially if we might want to change the order of the methods we attempt. A bit hard to test, though :-/
Definitely hard to test... On my end, I'm getting the sentencepiece exception in the fallback method when I don't have sentencepiece installed, which is a good signal. But I just experimentally merged this with the rust PR (which drops our protobuf dependency) because I knew that the … So I think (unless there are any objections), I'd like to uncomment the …
```python
except ImportError as e:
    # HuggingFace needs us to install something (sentencepiece, protobuf, etc. for some non-fast tokenizers)
    raise RuntimeError(
        f"Could not load tokenizer for model {model}. Please install the necessary dependencies for the tokenizer (see traceback for info)."
    ) from e
```
We'll now fail early rather than falling back to the fast tokenizer if the cause of the exception here is a missing dependency
Interesting -- was guidance just eating a silent `ImportFailure` when e.g. attempting to load phi-3 without sentencepiece? I really like this change if we can catch those reasonably reliably.
Yes, we were trying to use the `fast=False` tokenizer and had a catch-all except, which would fall back to the fast tokenizer.

Raising the `ImportError` seems like a good idea to me, as it gives the user something really actionable and specific to do to fix the issue. I'm still falling back to the fast tokenizer for any other exception, but maybe that's a bad idea if we know they are generally unreliable...

Best case for me would be to only catch a specific exception class, e.g. `NotImplementedError` if that's what's thrown when there is no non-fast implementation or something... Unbounded `except` statements that silently fall back are scary (hence me adding a warning...)
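A hedged sketch of that narrower version; whether slow-tokenizer failures actually surface as `NotImplementedError` is an open assumption, and `model` is just a placeholder:

```python
import warnings
import transformers

model = "some/model-id"  # placeholder

try:
    tokenizer = transformers.AutoTokenizer.from_pretrained(model, use_fast=False)
except ImportError:
    # Missing sentencepiece/protobuf/etc. -- actionable, so surface it.
    raise
except NotImplementedError as e:  # assumed failure type; may differ in practice
    warnings.warn(f"No usable slow tokenizer for {model}; falling back to fast ({e})")
    tokenizer = transformers.AutoTokenizer.from_pretrained(model, use_fast=True)
```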
Now I dislike the runtime error. We should try to be as consistent as possible in this repo if we are raising for "you are missing a dependency" reasons.
Agreed on dropping the runtime error
```python
    ) from e
except Exception as e:
    # Fall back for other exceptions
    warnings.warn(f"Falling back to fast tokenizer. Could not load tokenizer for model {model} due to exception {e.__class__.__name__}: {e}")
```
Added a warning. 100% unsure of whether this branch is ever taken now...
Yeah, good question. I agree -- it feels like we shouldn't hit this fallback anymore, particularly on a pathway dependent on import issues.
As Richard said, this is sadly just really hard to test. At least without making the community do it for us...
... or blowing up our test matrix even more
```python
except Exception as e:
    msg = textwrap.dedent(
        f"""
        The tokenizer being used is unable to convert a special character in {s}.
        For models with sentencepiece based tokenizers (e.g. llama, phi-3-mini),
        installing sentencepiece often fixes this issue (pip install sentencepiece).
        """
    )
    raise ValueError(msg) from e
```
I'm unsure of whether this method will ever be called now that we "fail early" with respect to the fast tokenizer. If we hit this exception, the message may be wrong now, as missing an import probably isn't the cause of the exception anymore...
```python
# run a quick spot check to verify we can rebuild complex multi-token unicode symbols
s = "’•¶∂ƒ˙∆£Ħ爨ൠᅘ∰፨"
reconstructed = b""
try:
    input_ids = transformers_tokenizer(s)["input_ids"]
    for i in input_ids:
        nxt_bytes = []
        token_str = transformers_tokenizer.convert_ids_to_tokens(i)
        for c in token_str:
            nxt_bytes.append(byte_decoder[c])
        reconstructed += bytes(nxt_bytes)
    # Check if the tokenizer has a bos_token attribute, and if it does, check
    # if it's at the start of the reconstructed bytes
    # Some tokenizers add this automatically as part of the call function, so
    # we need to remove it to compare
    if hasattr(transformers_tokenizer, "bos_token") and reconstructed.startswith(
        transformers_tokenizer.bos_token.encode()
    ):
        reconstructed = reconstructed[len(transformers_tokenizer.bos_token) :]
```
IIRC, these reconstitution checks should happen more generally than just in the fast-fallback path. Very possible that guidance wasn't doing this before, but I think it should happen after all the paths are exhausted and a single method is picked
I don't believe it was doing it in any case but the fallback, though I could be wrong. But agreed with you that this is reasonable to do always... at least to give a warning, if not an exception.
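As a sketch, here is a warning-only variant of the spot check above that could run after whichever path produced the byte decoder (the function name is invented):

```python
import warnings

def warn_if_round_trip_fails(tokenizer, byte_decoder, s="’•¶∂ƒ˙∆£Ħ爨ൠᅘ∰፨"):
    """Spot-check that byte_decoder can rebuild complex unicode, whichever path built it."""
    reconstructed = b""
    for i in tokenizer(s)["input_ids"]:
        token_str = tokenizer.convert_ids_to_tokens(i)
        reconstructed += bytes(byte_decoder[c] for c in token_str)
    bos = getattr(tokenizer, "bos_token", None)
    if bos and reconstructed.startswith(bos.encode()):
        reconstructed = reconstructed[len(bos):]
    if reconstructed.decode("utf-8", errors="replace") != s:
        warnings.warn("Byte decoder failed the complex-unicode round-trip check")
```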
```python
raise ByteDecoderError(
    f"Byte decoder is missing bytes: {all_bytes - set(byte_decoder.keys())}"
)
```
What's the recommended action for users here if this does get triggered? I think we previously had fallbacks if we failed this assertion, but do we just throw a hard error here?
This will only get raised if the last-ditch fallback (gpt2 tokenizer) doesn't have all the bytes or fails the complex unicode round-trip (it is raised on line 138 as a `ValueError`). We didn't have any kind of fallback to this in the original code. Really not sure what can be done besides opening an issue here on GH.
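For context, one way such a gpt2 fallback decoder can be obtained is below; this is a sketch of the idea, not necessarily the exact call the code makes (the slow `GPT2Tokenizer` exposes a `byte_decoder` dict mapping characters to byte values):

```python
import transformers

# Borrow gpt-2's byte decoder as a last resort when the model's own tokenizer
# doesn't provide a usable one.
fallback_byte_decoder = transformers.AutoTokenizer.from_pretrained(
    "gpt2", use_fast=False
).byte_decoder
```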
"Please ask the model creator to provide a byte decoder"
Almost but not quite. There may have been one on the slow branch that was missing bytes or something, causing the fallback to gpt2.
@hudson-ai @Harsha-Nori @riedgar-ms Phi 3 vision is hitting this code path and it's blocking me from progressing. I'll leave a more detailed write up in our teams channel.
@nking-1 I believe that the check on the `byte_decoder` here is a sufficient (but not necessary) condition for creating a well-formed set of `byte_tokens`.

I think you could either proceed by modifying the `byte_decoder` such that it handles spaces/underscores correctly and generates the right `byte_tokens` for you (then calling this check just to verify), OR you could figure out how to translate the check on `byte_decoder` to one on the `byte_tokens` themselves... @riedgar-ms any thoughts?
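A rough sketch of what a check on the `byte_tokens` themselves could look like; the exact invariant needed may differ, and the function name is invented:

```python
def byte_tokens_look_valid(byte_tokens, tokenizer, s="’•¶∂ƒ˙∆£Ħ爨ൠᅘ∰፨") -> bool:
    """Every token id has a byte form, and concatenating them round-trips a test string."""
    if any(bt is None for bt in byte_tokens):
        return False
    ids = tokenizer(s, add_special_tokens=False)["input_ids"]
    reconstructed = b"".join(byte_tokens[i] for i in ids)
    return reconstructed.decode("utf-8", errors="replace") == s
```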
```python
    tokenizer = transformers_package.AutoTokenizer.from_pretrained(
        model, use_fast=False, **kwargs
    )
except ImportError:
```
Is there any point to this `except` if all it does is re-raise?
Ahh sorry, it's because of the generic `Exception` catch which is next. Perhaps add a comment that you're filtering out the `ImportError`s first?
Added a comment!
```python
    )
    byte_tokens[i] = byte_coded
except ByteDecoderError:
    pass
```
Should we log a warning here?
Added. Let me know if the language looks good.
```python
try:
    return self._byte_tokens_by_encoding_token_strings(transformers_tokenizer)
except ValueError:
    pass
```
Again, should we issue a warning?
Added, ditto on the language
A few minor suggestions for extra comments/warnings, but otherwise this LGTM.
@riedgar-ms I just added an extra layer of nesting on the exceptions, essentially trying to build byte_tokens on the slow branch, falling back to the fast branch if that fails (previously we only fell back if we couldn't instantiate the slow one or if it happened to have a byte_decoder that was bad -- note that we have other things we can still try on this branch in that case).
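In sketch form, that nesting looks roughly like the following, with `build_byte_tokens` standing in for whichever helper builds them (purely illustrative):

```python
import transformers

def load_tokenizer_and_byte_tokens(model, build_byte_tokens, **kwargs):
    """Sketch: build byte_tokens on the slow branch, falling back to fast if that fails."""
    try:
        slow_tok = transformers.AutoTokenizer.from_pretrained(model, use_fast=False, **kwargs)
        try:
            return slow_tok, build_byte_tokens(slow_tok)
        except ValueError:
            # Slow tokenizer loaded, but byte_tokens couldn't be built from it:
            # try the fast branch rather than giving up entirely.
            fast_tok = transformers.AutoTokenizer.from_pretrained(model, use_fast=True, **kwargs)
            return fast_tok, build_byte_tokens(fast_tok)
    except ImportError:
        # Missing dependency (sentencepiece, protobuf, ...): surface it, don't hide it.
        raise
```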
LGTM :). Thanks for all the iteration on this!
This PR:

- Moves getting `byte_tokens` from the original huggingface tokenizer out of `__init__` and into separate methods
- Tries several sources for byte tokens in turn (`byte_decoder`, `sp_model`, etc.)
- No longer raises if the `byte_decoder` doesn't have all bytes. Instead, having all bytes is just part of the conditional that decides whether to use the `byte_decoder` to get `byte_tokens`
- Removes the `hasattr(tokenizer, "get_vocab")` check, as this is always true (@riedgar-ms is this only in new `transformers` versions?)
- Wraps `_byte_tokens_from_vocab` (@ambisinister's code, my name) in a try-except. Since the tokenizer always has a `get_vocab` method, hitting the except is the only path to the gpt-2 fallback byte decoder (@Harsha-Nori it's also the only path to the `sentencepiece` exception that #971 "Better error messages when failing to load models with sentencepiece based tokenizers" complains is missing...)

Fixes #958 (well, at least it makes the `sentencepiece` exception get raised correctly)