Fix issue with setPadding and setTruncation overriding configurations… #2741

Merged 1 commit on Aug 10, 2023

siddvenk commented Aug 9, 2023

… set in tokenizer.json

Description

This fixes an issue where our Tokenizer implementation overwrites any padding or truncation configuration set in the tokenizer.json file.

I have added a fake tokenizer.json to validate the fix with a unit test.

Fixes #2669
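
The behavior described above can be read as a precedence rule: padding settings loaded from tokenizer.json stay in effect unless the caller explicitly asks for something else. The block below is a minimal sketch of that rule only, not the code changed in this PR. The class `PaddingConfigSketch`, the `Padding` holder, and the `resolve` method are hypothetical names, and the same rule is assumed to apply to truncation.

```java
/** Hypothetical sketch: an explicit caller option wins, otherwise the file's setting is kept. */
final class PaddingConfigSketch {

    /** Illustrative holder for padding settings; the field names are assumptions. */
    static final class Padding {
        final String padToken;
        final int length;

        Padding(String padToken, int length) {
            this.padToken = padToken;
            this.length = length;
        }
    }

    /**
     * Picks the padding that should take effect.
     *
     * @param fromFile padding read from tokenizer.json, or null if the file sets none
     * @param explicit padding the caller set explicitly, or null if the caller set none
     * @return the explicit padding when present, otherwise the file's padding (may be null)
     */
    static Padding resolve(Padding fromFile, Padding explicit) {
        // Library defaults must not be applied here: doing so would silently
        // replace whatever the file configured, which is the reported bug.
        return explicit != null ? explicit : fromFile;
    }

    public static void main(String[] args) {
        Padding fromFile = new Padding("<pad>", 8);
        Padding effective = resolve(fromFile, null);
        System.out.println(effective.padToken + " / " + effective.length);
    }
}
```

With no explicit option, `resolve` returns the file's padding unchanged, so the pad token from tokenizer.json is the one used.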

String[] expected = {
"<s>", "▁", "test", "▁sentence", "</s>", "<pad>", "<pad>", "<pad>"
};
System.out.println(Arrays.toString(tokens));

Please remove the println from the test.

good catch
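
For readers following the review thread, here is a small sketch of what the expected array in the snippet under review implies: the encoded tokens are padded to a fixed length with the `<pad>` token, and the result is checked with an assertion instead of being printed. The helper `padTo`, the class `PadTokensSketch`, the unpadded input, and the target length of 8 are assumptions inferred from the expected array; none of this is the PR's actual test code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of fixed-length padding with a configured pad token. */
final class PadTokensSketch {

    /** Appends padToken until the result has at least the requested length. */
    static String[] padTo(String[] tokens, String padToken, int length) {
        List<String> out = new ArrayList<>(Arrays.asList(tokens));
        while (out.size() < length) {
            out.add(padToken);
        }
        return out.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // \u2581 is the word-boundary marker that appears in the expected tokens.
        String[] unpadded = {"<s>", "\u2581", "test", "\u2581sentence", "</s>"};
        String[] expected = {
            "<s>", "\u2581", "test", "\u2581sentence", "</s>", "<pad>", "<pad>", "<pad>"
        };
        String[] tokens = padTo(unpadded, "<pad>", 8);
        // Fail loudly on a mismatch rather than printing the tokens.
        if (!Arrays.equals(expected, tokens)) {
            throw new AssertionError("Unexpected tokens: " + Arrays.toString(tokens));
        }
    }
}
```

The check fails if a different pad token, such as `[PAD]`, is used, which is the symptom reported in the linked issue.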

codecov-commenter commented Aug 9, 2023

Codecov Report

Patch coverage: 54.72% and project coverage change: +0.06% 🎉

Comparison is base (bb5073f) 72.08% compared to head (dc89acf) 72.15%.
Report is 865 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff              @@
##             master    #2741      +/-   ##
============================================
+ Coverage     72.08%   72.15%   +0.06%     
- Complexity     5126     7029    +1903     
============================================
  Files           473      698     +225     
  Lines         21970    31282    +9312     
  Branches       2351     3228     +877     
============================================
+ Hits          15838    22570    +6732     
- Misses         4925     7171    +2246     
- Partials       1207     1541     +334     
| Files Changed | Coverage Δ |
|---|---|
| api/src/main/java/ai/djl/modality/cv/Image.java | 69.23% <ø> (-4.11%) ⬇️ |
| ...rc/main/java/ai/djl/modality/cv/MultiBoxPrior.java | 76.00% <ø> (ø) |
| .../main/java/ai/djl/modality/cv/output/Landmark.java | 100.00% <ø> (ø) |
| ...djl/modality/cv/transform/RandomFlipLeftRight.java | 25.00% <0.00%> (-25.00%) ⬇️ |
| ...djl/modality/cv/transform/RandomFlipTopBottom.java | 25.00% <0.00%> (-25.00%) ⬇️ |
| ...i/djl/modality/cv/translator/BigGANTranslator.java | 21.42% <0.00%> (-5.24%) ⬇️ |
| .../modality/cv/translator/ImageFeatureExtractor.java | 0.00% <0.00%> (ø) |
| .../ai/djl/modality/cv/translator/YoloTranslator.java | 27.77% <0.00%> (+18.95%) ⬆️ |
| ...ain/java/ai/djl/modality/cv/util/NDImageUtils.java | 67.10% <0.00%> (+7.89%) ⬆️ |
| api/src/main/java/ai/djl/modality/nlp/Decoder.java | 63.63% <ø> (ø) |

... and 226 more

... and 368 files with indirect coverage changes

siddvenk merged commit 17bfda1 into deepjavalibrary:master on Aug 10, 2023
siddvenk deleted the tokenizer-fix branch on August 10, 2023 00:25

Successfully merging this pull request may close these issues.

[tokenizer] Tokenizer always padding with [PAD], not the pad token in tokenizer.json