Fix decoding special tokens in SentencePiece tokenizer #7233

Merged

tarekgh commented Sep 5, 2024

There was a corner case when decoding special token IDs in the SentencePiece tokenizer, where the special token IDs show up at the beginning of the list being decoded. This change addresses that and ensures the special tokens are decoded as expected.
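To make the corner case concrete, here is a minimal toy sketch in Python. It is not the Microsoft.ML.Tokenizers implementation and does not reproduce the actual fix. The vocabulary, the special-token IDs, the `decode` function, and the `consider_special_tokens` flag are all hypothetical names chosen for illustration. The sketch assumes a SentencePiece-style decoder that turns the `▁` marker into a space and drops the dummy prefix space of the first regular piece, and it shows one plausible way to keep leading special tokens from interfering with that step.

```python
# Toy illustration only: hypothetical vocabulary and names,
# NOT the Microsoft.ML.Tokenizers code changed in this PR.
SPECIAL_TOKENS = {1: "<s>", 2: "</s>"}         # assumed special-token IDs
VOCAB = {10: "▁Hello", 11: "▁world", 12: "!"}  # assumed regular pieces


def decode(ids, consider_special_tokens=True):
    """Decode IDs to text, allowing special tokens anywhere in the list."""
    parts = []
    seen_regular_piece = False
    for token_id in ids:
        if token_id in SPECIAL_TOKENS:
            # Emit (or skip) the special token verbatim. It does not count as
            # the first regular piece, so leading special tokens do not change
            # how the dummy prefix space is handled below.
            if consider_special_tokens:
                parts.append(SPECIAL_TOKENS[token_id])
            continue
        piece = VOCAB[token_id].replace("▁", " ")
        if not seen_regular_piece:
            # Drop the dummy prefix space once, on the first regular piece.
            if piece.startswith(" "):
                piece = piece[1:]
            seen_regular_piece = True
        parts.append(piece)
    return "".join(parts)


print(decode([1, 10, 11, 12]))                                 # <s>Hello world!
print(decode([1, 10, 11, 12], consider_special_tokens=False))  # Hello world!
```

Whether the real tokenizer strips the prefix space after a leading special token is an assumption of this sketch. The tests added in `LlamaTests.cs` are the authoritative description of the expected behavior.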

tarekgh commented Sep 6, 2024

CC @LittleLittleCloud

codecov bot commented Sep 6, 2024

Codecov Report

Attention: Patch coverage is 77.90698% with 19 lines in your changes missing coverage. Please review.

Project coverage is 68.83%. Comparing base (4e364e4) to head (50a86c0).
Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...t.ML.Tokenizers/Model/SentencePieceBpeTokenizer.cs | 75.64% | 14 Missing and 5 partials ⚠️ |
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #7233   +/-   ##
=======================================
  Coverage   68.82%   68.83%           
=======================================
  Files        1453     1453           
  Lines      271527   271562   +35     
  Branches    28094    28094           
=======================================
+ Hits       186885   186929   +44     
+ Misses      77424    77411   -13     
- Partials     7218     7222    +4     
| Flag | Coverage Δ |
| --- | --- |
| Debug | 68.83% <77.90%> (+<0.01%) ⬆️ |
| production | 63.35% <75.64%> (+<0.01%) ⬆️ |
| test | 89.03% <100.00%> (+<0.01%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
| --- | --- |
| test/Microsoft.ML.Tokenizers.Tests/LlamaTests.cs | 99.85% <100.00%> (+<0.01%) ⬆️ |
| ...t.ML.Tokenizers/Model/SentencePieceBpeTokenizer.cs | 76.86% <75.64%> (+1.20%) ⬆️ |

... and 7 files with indirect coverage changes

@tarekgh tarekgh merged commit 87a41fa into dotnet:main Sep 9, 2024
25 checks passed