[Tokenizer] Support for loading added_tokens_decoder #8997
Conversation
Thanks for your contribution!
Codecov Report
Attention: Patch coverage is …
Additional details and impacted files

@@            Coverage Diff            @@
##           develop    #8997    +/-   ##
===========================================
- Coverage    54.51%   53.89%    -0.63%
===========================================
  Files          648      652        +4
  Lines       103473   104388      +915
===========================================
- Hits         56406    56255      -151
- Misses       47067    48133     +1066

☔ View full report in Codecov by Sentry.
@@ -158,6 +158,12 @@ def vocab_size(self):
        """
        return len(self.encoder)

    def __len__(self):
The mamba tokenizer's added_tokens_decoder contains two duplicate tokens, [0, 1]; the previous calculation counted these two tokens twice.
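Since the diff body above is truncated, here is a minimal sketch of what a deduplicating `__len__` can look like. The attribute names (`vocab_size` as a property, `added_tokens_decoder`) follow the tokenizer API visible in the hunks, but the exact body is an assumption, not the PR's verbatim code:

```python
def __len__(self):
    # added_tokens_decoder maps ids to tokens; in the mamba tokenizer
    # ids 0 and 1 also exist in the base vocab, so the old
    # `vocab_size + len(added_tokens_decoder)` counted them twice.
    # Counting the union of ids avoids the double count.
    return len(set(range(self.vocab_size)) | set(self.added_tokens_decoder.keys()))
```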
@@ -80,6 +80,18 @@ def vocab_size(self):
        """Returns vocab size"""
        return self.sp_model.get_piece_size()

    def __len__(self):
Fixes the issue where new tokens could not be added.
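A sketch of the idea for the SentencePiece-backed tokenizers (this hunk and the identical one below): without a `__len__` override, `len(tokenizer)` falls back to the base vocab size alone, so newly added tokens are either invisible or get colliding ids. The body below is an assumption about the fix, not the verbatim diff, and again assumes `vocab_size` is a property:

```python
def __len__(self):
    # Report the SentencePiece vocab plus any added tokens whose ids
    # fall outside the base vocab, so add_tokens() can assign fresh,
    # non-colliding ids to new tokens.
    added_outside_vocab = [
        idx for idx in self.added_tokens_decoder if idx >= self.vocab_size
    ]
    return self.vocab_size + len(added_outside_vocab)
```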
@@ -111,6 +111,18 @@ def vocab_size(self):
        """Returns vocab size"""
        return self.sp_model.get_piece_size()

    def __len__(self):
Fixes the issue where new tokens could not be added.
Mamba OK
* fix added_tokens_decoder load
* fix decode
* fix saving and loading added_token_decoder
* fix mamba
* fix special_tokens_map_file load
* fix gemma tokenizer
* fix llama tokenizer
* revert llama tokenizer
* fix _decode
PR types
Bug fixes
PR changes
Others
Description
The new tokenizer_config.json now includes the added_tokens_decoder mapping, and we load it in PretrainedTokenizer._pre_init.
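For context, a minimal sketch of what loading this field can look like during tokenizer init. The file layout follows the Hugging Face tokenizer_config.json convention that this PR mirrors (string ids mapping to AddedToken-style dicts); the code here is illustrative, not the actual _pre_init body:

```python
import json

with open("tokenizer_config.json", encoding="utf-8") as f:
    config = json.load(f)

# Entries look like {"0": {"content": "<unk>", "special": true, ...}, ...};
# JSON keys are strings, so convert the ids back to ints.
added_tokens_decoder = {
    int(idx): entry["content"] if isinstance(entry, dict) else entry
    for idx, entry in config.get("added_tokens_decoder", {}).items()
}
```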