
Updated characters, underscore and comma preprocessors to be TorchScriptable. #3602

Merged
7 commits merged into master from update-basic-preprocessors-to-be-torchscriptable
Sep 14, 2023

Conversation

@martindavis (Contributor) commented Sep 13, 2023

The comma, underscore, and characters preprocessors are not currently TorchScriptable because they do not implement the TorchScript Module methods. As part of this PR, we now have a general StringSplitTokenizer that can be reused across the various defaults we support. Unfortunately, since TorchScript only supports basic data types, CharactersToListTokenizer had to be written as a separate class instead of using a lambda function.
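To illustrate the constraint being described, here is a minimal sketch (not Ludwig's actual code; class bodies are illustrative) of why these tokenizers are written as `torch.nn.Module` subclasses rather than lambdas: TorchScript can script a module's `forward` with basic types like `str` and `List[str]`, but it cannot script a lambda or closure.

```python
# Hypothetical sketch, assuming a Ludwig-style tokenizer interface.
# TorchScript cannot script lambdas, so the split behavior is stored
# as module state and expressed in a typed forward() instead.
from typing import List

import torch


class StringSplitTokenizer(torch.nn.Module):
    def __init__(self, split_string: str):
        super().__init__()
        self.split_string = split_string

    def forward(self, v: str) -> List[str]:
        # str.split is supported by TorchScript, unlike arbitrary lambdas.
        return v.split(self.split_string)


class CharactersToListTokenizer(torch.nn.Module):
    """Its own class because TorchScript cannot script a lambda like
    `lambda v: [char for char in v]`."""

    def forward(self, v: str) -> List[str]:
        return [char for char in v]
```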

This should help users hit this error less often:

ValueError: comma is not supported by torchscript. Please use one of {'sentencepiece', 'space_punct', 'clip', 
'gpt2bpe', 'bert', 'space'}.

This is covered by the tests/integration_tests/test_torchscript.py::test_torchscript_e2e_text test, because this PR updates TORCHSCRIPT_COMPATIBLE_TOKENIZERS to include the new TorchScriptable tokenizers.
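A quick way to sanity-check that a tokenizer of this shape is actually TorchScriptable is to run it through torch.jit.script directly. The class below is a hypothetical stand-in for the comma tokenizer added in this PR, not the real implementation:

```python
# Hypothetical comma tokenizer; torch.jit.script will raise if any
# construct in forward() is not supported by TorchScript.
from typing import List

import torch


class CommaStringToListTokenizer(torch.nn.Module):
    def forward(self, v: str) -> List[str]:
        return v.split(",")


scripted = torch.jit.script(CommaStringToListTokenizer())
assert scripted("a,b,c") == ["a", "b", "c"]
```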

@github-actions bot commented Sep 13, 2023

Unit Test Results

6 files ±0, 6 suites ±0, 47m 17s ⏱️ (+2m 17s)
31 tests ±0: 26 ✔️ ±0, 5 💤 ±0, 0 ❌ ±0
82 runs ±0: 66 ✔️ ±0, 16 💤 ±0, 0 ❌ ±0

Results for commit 7401aa6. ± Comparison against base commit d15a0c5.

♻️ This comment has been updated with latest results.

Review thread on ludwig/utils/tokenizers.py:
if isinstance(v, torch.Tensor):
raise ValueError(f"Unsupported input: {v}")

inputs: List[str] = []
A reviewer (Contributor) commented:

I see that you are adapting an existing implementation, though this seems more complicated than I would expect (for example, why do we have a get_tokens() function that returns its own input?).

@geoffreyangus, ooc does this also look strange to you, or is this imposed on us by torchscript?

@geoffreyangus (Contributor) commented Sep 14, 2023:

It looks like NgramTokenizer, which subclasses SpaceStringToListTokenizer (which in turn subclasses the new StringSplitTokenizer), seems to override get_tokens: https://github.com/ludwig-ai/ludwig/pull/3602/files#diff-5cbace55f4f4fd07725c061b9f981b83fe43cb53b0045cf1257c9fb5d4931f0dR132-R142
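The pattern being discussed can be sketched as follows. The base class's get_tokens() returns its input unchanged, which looks redundant in isolation, but it gives subclasses such as NgramTokenizer a hook to post-process the split tokens. Class names follow the PR; the method bodies here are simplified guesses, not Ludwig's exact code:

```python
# Illustrative sketch of the identity-hook pattern; bodies are assumptions.
from typing import List

import torch


class SpaceStringToListTokenizer(torch.nn.Module):
    def forward(self, v: str) -> List[str]:
        return self.get_tokens(v.split(" "))

    def get_tokens(self, tokens: List[str]) -> List[str]:
        # Identity by default; exists so subclasses can override it.
        return tokens


class NgramTokenizer(SpaceStringToListTokenizer):
    def __init__(self, n: int = 2):
        super().__init__()
        self.n = n

    def get_tokens(self, tokens: List[str]) -> List[str]:
        # Keep the unigrams and append space-joined n-grams.
        out: List[str] = list(tokens)
        for i in range(len(tokens) - self.n + 1):
            out.append(" ".join(tokens[i : i + self.n]))
        return out
```

This keeps the TorchScript-visible forward() identical across the hierarchy while letting each subclass customize only the token post-processing step.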

@martindavis merged commit 2365de7 into master on Sep 14, 2023
16 checks passed
@martindavis deleted the update-basic-preprocessors-to-be-torchscriptable branch September 14, 2023 20:34