
[Tokenizer] Support for loading added_tokens_decoder #8997

Conversation

@DrownFish19 DrownFish19 (Collaborator) commented Aug 23, 2024

PR types

Bug fixes

PR changes

Others

Description

The new tokenizer_config.json now includes added_tokens_decoder, and we load it in PretrainedTokenizer._pre_init.

  1. Fixes the issue that tokens could not be added to the llama, gemma, and mamba tokenizers.
  2. Both newly added tokens and the original added_tokens_decoder entries are saved into the added_tokens_decoder dict, so they can be reloaded later with their ids unchanged.
  3. The current added_tokens_decoder can be loaded by from_pretrained, ensuring the ids recorded in tokenizer_config.json stay unchanged.
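As a rough sketch of why keying by id preserves indices across a save/load round trip (the field names below follow the Hugging Face-style tokenizer_config.json schema; the exact PaddleNLP layout may differ):

```python
import json

# Hypothetical tokenizer_config.json fragment: added_tokens_decoder keys the
# entries by token id, so reloading cannot reassign or reorder the ids.
config = json.loads("""
{
  "tokenizer_class": "LlamaTokenizer",
  "added_tokens_decoder": {
    "0": {"content": "<unk>", "special": true},
    "1": {"content": "<s>", "special": true},
    "32000": {"content": "<pad>", "special": true}
  }
}
""")

# Roughly what a _pre_init-style hook would do: rebuild the {id: token} map.
added = {int(i): entry["content"] for i, entry in config["added_tokens_decoder"].items()}
assert added[32000] == "<pad>"  # id preserved exactly as saved
```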


paddle-bot bot commented Aug 23, 2024

Thanks for your contribution!

@DrownFish19 DrownFish19 changed the title [tokenizer] fix added_tokens_decoder load [Tokenizer] fix added_tokens_decoder load Aug 23, 2024

codecov bot commented Aug 28, 2024

Codecov Report

Attention: Patch coverage is 94.87179% with 2 lines in your changes missing coverage. Please review.

Project coverage is 53.89%. Comparing base (9f6b486) to head (d6f2f38).
Report is 239 commits behind head on develop.

Files with missing lines Patch % Lines
paddlenlp/transformers/gemma/tokenizer.py 81.81% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #8997      +/-   ##
===========================================
- Coverage    54.51%   53.89%   -0.63%     
===========================================
  Files          648      652       +4     
  Lines       103473   104388     +915     
===========================================
- Hits         56406    56255     -151     
- Misses       47067    48133    +1066     


@@ -158,6 +158,12 @@ def vocab_size(self):
"""
return len(self.encoder)

def __len__(self):
DrownFish19 (Collaborator, Author) commented:

The mamba tokenizer's added_tokens_decoder contains two duplicate tokens at ids [0, 1] that already exist in the base vocabulary; the previous calculation counted these two tokens twice.
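A minimal sketch of the counting fix (class and attribute names are illustrative, not the actual PaddleNLP code): only added tokens whose ids fall outside the base vocabulary should extend the tokenizer length.

```python
# Hypothetical sketch: compute tokenizer length without double-counting added
# tokens whose ids are already inside the base vocab (mamba's ids 0 and 1).
class TokenizerSketch:
    def __init__(self, vocab_size, added_tokens_decoder):
        self.base_vocab_size = vocab_size                 # e.g. the sp_model size
        self.added_tokens_decoder = added_tokens_decoder  # {id: token}

    def __len__(self):
        # Count only ids that extend past the base vocabulary.
        extra = {i for i in self.added_tokens_decoder if i >= self.base_vocab_size}
        return self.base_vocab_size + len(extra)

tok = TokenizerSketch(50277, {0: "<|endoftext|>", 1: "<|padding|>", 50277: "<new>"})
assert len(tok) == 50278  # ids 0 and 1 are not counted a second time
```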

@@ -80,6 +80,18 @@ def vocab_size(self):
"""Returns vocab size"""
return self.sp_model.get_piece_size()

def __len__(self):
DrownFish19 (Collaborator, Author) commented:

Fixes the issue that tokens could not be added.
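As a hedged illustration of why defining __len__ enables token addition (a simplified stand-in, not the actual PaddleNLP implementation): each new token takes the next free id, len(tokenizer), so ids start right after the base vocabulary and stay stable.

```python
# Hypothetical sketch of a sentencepiece-style tokenizer where __len__
# accounts for added tokens, letting add_tokens assign fresh, stable ids.
class SpTokenizerSketch:
    def __init__(self, vocab_size):
        self._vocab_size = vocab_size   # stands in for sp_model.get_piece_size()
        self.added_tokens_decoder = {}  # {id: token}

    @property
    def vocab_size(self):
        return self._vocab_size

    def __len__(self):
        return self.vocab_size + len(self.added_tokens_decoder)

    def add_tokens(self, tokens):
        for t in tokens:
            if t not in self.added_tokens_decoder.values():
                self.added_tokens_decoder[len(self)] = t  # next free id
        return len(self.added_tokens_decoder)

tok = SpTokenizerSketch(32000)
tok.add_tokens(["<pad>", "<sep>"])
assert tok.added_tokens_decoder == {32000: "<pad>", 32001: "<sep>"}
```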

@@ -111,6 +111,18 @@ def vocab_size(self):
"""Returns vocab size"""
return self.sp_model.get_piece_size()

def __len__(self):
DrownFish19 (Collaborator, Author) commented:

Fixes the issue that tokens could not be added.

@DrownFish19 DrownFish19 changed the title [Tokenizer] fix added_tokens_decoder load [Tokenizer] support added_tokens_decoder load Aug 28, 2024
@DrownFish19 DrownFish19 changed the title [Tokenizer] support added_tokens_decoder load [Tokenizer] Support for loading added_tokens_decoder Aug 28, 2024
@JunnYu JunnYu (Member) left a comment:

Mamba OK

@DrownFish19 DrownFish19 merged commit 3e7c5ca into PaddlePaddle:develop Aug 28, 2024
10 of 12 checks passed
@DrownFish19 DrownFish19 deleted the dev_20240823_fix_added_tokens_decoder_load branch August 28, 2024 12:38
Mangodadada pushed a commit to Mangodadada/PaddleNLP that referenced this pull request Sep 10, 2024
* fix added_tokens_decoder load

* fix decode

* fix saving and loading added_token_decoder

* fix mamba

* fix special_tokens_map_file load

* fix gemma tokenizer

* fix llama tokenzier

* revert llama tokenizer

* fix _decode
3 participants