Updated characters, underscore and comma preprocessors to be TorchScriptable. #3602
Conversation
```python
if isinstance(v, torch.Tensor):
    raise ValueError(f"Unsupported input: {v}")

inputs: List[str] = []
```
I see that you are adapting an existing implementation, though this seems more complicated than I would expect (for example, why do we have a `get_tokens()` function that returns its own input?).

@geoffreyangus, ooc does this also look strange to you, or is this imposed on us by torchscript?
It looks like `NgramTokenizer`, which subclasses `SpaceStringToListTokenizer` (which in turn subclasses the new `StringSplitTokenizer`), overrides `get_tokens`: https://github.com/ludwig-ai/ludwig/pull/3602/files#diff-5cbace55f4f4fd07725c061b9f981b83fe43cb53b0045cf1257c9fb5d4931f0dR132-R142
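To make the hook concrete, here is a minimal sketch of that pattern, using the class names above but with simplified, assumed signatures rather than Ludwig's actual ones: the base class exposes `get_tokens` as an override point (hence the identity implementation), so a subclass like `NgramTokenizer` can post-process tokens while the shared `forward` stays TorchScriptable.

```python
import torch
from typing import List


class StringSplitTokenizer(torch.nn.Module):
    # Sketch only: the real Ludwig classes have richer signatures.
    def __init__(self, split_string: str):
        super().__init__()
        self.split_string = split_string

    def get_tokens(self, tokens: List[str]) -> List[str]:
        # Identity hook in the base class; it exists purely so that
        # subclasses can override the post-processing step.
        return tokens

    def forward(self, v: str) -> List[str]:
        return self.get_tokens(v.split(self.split_string))


class NgramTokenizer(StringSplitTokenizer):
    def __init__(self):
        super().__init__(split_string=" ")

    def get_tokens(self, tokens: List[str]) -> List[str]:
        # Example override: append bigrams to the unigram tokens.
        ngrams: List[str] = []
        for i in range(len(tokens) - 1):
            ngrams.append(tokens[i] + " " + tokens[i + 1])
        return tokens + ngrams


scripted = torch.jit.script(NgramTokenizer())
print(scripted("a b c"))  # ['a', 'b', 'c', 'a b', 'b c']
```

Because scripting compiles the concrete subclass, the overridden `get_tokens` is picked up even though `forward` lives on the base class.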
The `comma`, `underscore`, and `characters` preprocessors are not currently TorchScriptable because they do not implement the TorchScript Module methods. As part of this PR, we now have a general `StringSplitTokenizer` that can be re-used across the various defaults we support. Unfortunately, since TorchScript only supports basic data types, `CharactersToListTokenizer` had to be written as a separate class instead of using a lambda function. This will help users run into this error less often.
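As a rough illustration of that constraint (a sketch, not the PR's actual code): `torch.jit.script` cannot compile a lambda stored on a module, so something like a hypothetical `StringSplitTokenizer(split_fn=lambda s: list(s))` would fail to script, and the character tokenizer needs a dedicated class instead.

```python
import torch
from typing import List


class CharactersToListTokenizer(torch.nn.Module):
    # Sketch: splits a string into a list of its characters. A lambda
    # attribute would fail to compile under torch.jit.script, hence a
    # dedicated class with an explicit, typed forward().
    def forward(self, v: str) -> List[str]:
        tokens: List[str] = []
        for i in range(len(v)):
            tokens.append(v[i])
        return tokens
```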
This is covered by the `tests/integration_tests/test_torchscript.py::test_torchscript_e2e_text` test, because this PR updates `TORCHSCRIPT_COMPATIBLE_TOKENIZERS` to include the new TorchScriptable tokenizers.
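In spirit, that coverage amounts to a scriptability round-trip; a hypothetical smoke test along these lines (reusing the `CharactersToListTokenizer` sketch above) captures the core requirement:

```python
# Hypothetical smoke test: the tokenizer must survive torch.jit.script
# and produce the same output as eager mode.
scripted = torch.jit.script(CharactersToListTokenizer())
assert scripted("abc") == ["a", "b", "c"]
```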