
[Tokenizer] Add BertTokenizerFast, support register new tokenizer #9353

Merged

Conversation

lvdongyi
Contributor

@lvdongyi lvdongyi commented Nov 1, 2024

PR types

New features

PR changes

Models

Description

  1. Add BertTokenizerFast, support converting a slow BERT tokenizer instance into a fast tokenizer instance, and add tests for BertTokenizerFast.
  2. Support registering new tokenizers in TOKENIZER_MAPPING, and add tests for that.
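
The registration mechanism in point 2 can be illustrated with a minimal, library-free sketch. The names below (register_tokenizer, ToyTokenizer) are hypothetical and only mirror the idea of mapping a model name to a tokenizer class; this is not PaddleNLP's actual implementation:

```python
# Minimal sketch of a tokenizer registry (hypothetical names; the real
# TOKENIZER_MAPPING in this PR lives inside paddlenlp/transformers/auto).
TOKENIZER_MAPPING = {}

def register_tokenizer(config_name, tokenizer_cls, exist_ok=False):
    """Map a model/config name to a tokenizer class."""
    if config_name in TOKENIZER_MAPPING and not exist_ok:
        raise ValueError(f"'{config_name}' is already registered")
    TOKENIZER_MAPPING[config_name] = tokenizer_cls

class ToyTokenizer:
    def tokenize(self, text):
        return text.split()

register_tokenizer("toy", ToyTokenizer)
tok = TOKENIZER_MAPPING["toy"]()
print(tok.tokenize("hello world"))  # ['hello', 'world']
```

Guarding against silent re-registration (the exist_ok flag) is the usual design choice here, so that two plugins cannot overwrite each other's tokenizer for the same model name without opting in.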


paddle-bot bot commented Nov 1, 2024

Thanks for your contribution!

@lvdongyi lvdongyi marked this pull request as ready for review November 1, 2024 15:36
@lvdongyi lvdongyi changed the title from "Add BertTokenizerFast, support register new tokenizer" to "[Tokenizer] Add BertTokenizerFast, support register new tokenizer" Nov 2, 2024

# it is expected that each Chinese character is not preceded by "##"
self.assertListEqual(tokens_without_spe_char_p, list_of_commun_chinese_char)
self.assertListEqual(tokens_without_spe_char_r, list_of_commun_chinese_char)
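
The "##" expectation in the assertions above comes from WordPiece's continuation prefix: only non-initial subword pieces inside a word carry "##", and BERT's basic tokenizer pre-splits Chinese characters into single-character tokens, so none of them should ever get the prefix. A stdlib-only sketch of greedy WordPiece (toy vocabulary, not the real BERT vocab) illustrates this:

```python
def wordpiece(word, vocab):
    """Greedy longest-match WordPiece; non-initial pieces get '##'."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            sub = word[start:end]
            if start > 0:          # continuation pieces carry the prefix
                sub = "##" + sub
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]
        tokens.append(piece)
        start = end
    return tokens

vocab = {"un", "##aff", "##able", "中", "国"}
print(wordpiece("unaffable", vocab))  # ['un', '##aff', '##able']
# Chinese characters arrive pre-split as single-character "words",
# so each one is an initial piece and never receives "##":
print([wordpiece(ch, vocab) for ch in "中国"])  # [['中'], ['国']]
```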

# not yet supported in bert tokenizer
Collaborator
There are some more tests after this; are they unable to pass? As far as I can tell, the bert tokenizer now supports the corresponding parameters.

Contributor Author

They can pass.

Review threads on tests/transformers/auto/test_confiugration.py and tests/transformers/auto/test_tokenizer.py were marked resolved.
@DrownFish19
Copy link
Collaborator

Note: code without unit-test coverage needs unit tests added to cover it.

DrownFish19
DrownFish19 previously approved these changes Nov 4, 2024
Collaborator

@DrownFish19 DrownFish19 left a comment

LGTM

@DrownFish19
Collaborator

For the lint issue, install pre-commit and then format the code. Reference steps:

# install
pip install pre-commit

# register pre-commit in the project folder; the code is formatted on every commit
pre-commit install

# process existing code files individually
pre-commit run --file XXXX.py


codecov bot commented Nov 4, 2024

Codecov Report

Attention: Patch coverage is 86.48649% with 10 lines in your changes missing coverage. Please review.

Project coverage is 53.12%. Comparing base (66c5d65) to head (851af38).
Report is 3 commits behind head on develop.

Files with missing lines Patch % Lines
paddlenlp/transformers/bert/tokenizer_fast.py 72.72% 9 Missing ⚠️
paddlenlp/transformers/auto/tokenizer.py 94.11% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #9353      +/-   ##
===========================================
+ Coverage    52.24%   53.12%   +0.88%     
===========================================
  Files          673      674       +1     
  Lines       109100   107428    -1672     
===========================================
+ Hits         56998    57073      +75     
+ Misses       52102    50355    -1747     

☔ View full report in Codecov by Sentry.

@DrownFish19 DrownFish19 merged commit cd22b0d into PaddlePaddle:develop Nov 4, 2024
8 of 12 checks passed