
[Tokenizer] Add BertTokenizerFast, support register new tokenizer #9353

Merged

Conversation

lvdongyi
Contributor

@lvdongyi lvdongyi commented Nov 1, 2024

PR types

New features

PR changes

Models

Description

  1. Add BertTokenizerFast, support converting a slow BERT tokenizer instance into a fast tokenizer instance, and add tests for BertTokenizerFast.
  2. Support registering new tokenizers in TOKENIZER_MAPPING, and add tests for that.
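
The registration mechanism in point 2 can be illustrated with a minimal, library-free sketch. The names below (register_tokenizer, ToyTokenizer) are hypothetical and only mirror the idea of mapping a model name to a tokenizer class; this is not PaddleNLP's actual implementation:

```python
# Minimal sketch of a tokenizer registry (hypothetical names; the real
# TOKENIZER_MAPPING in this PR lives inside paddlenlp/transformers/auto).
TOKENIZER_MAPPING = {}

def register_tokenizer(config_name, tokenizer_cls, exist_ok=False):
    """Map a model/config name to a tokenizer class."""
    if config_name in TOKENIZER_MAPPING and not exist_ok:
        raise ValueError(f"'{config_name}' is already registered")
    TOKENIZER_MAPPING[config_name] = tokenizer_cls

class ToyTokenizer:
    def tokenize(self, text):
        return text.split()

register_tokenizer("toy", ToyTokenizer)
tok = TOKENIZER_MAPPING["toy"]()
print(tok.tokenize("hello world"))  # ['hello', 'world']
```

Guarding against silent re-registration (the exist_ok flag) is the usual design choice here, so that two plugins cannot overwrite each other's tokenizer for the same model name without opting in.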


paddle-bot bot commented Nov 1, 2024

Thanks for your contribution!

@lvdongyi lvdongyi marked this pull request as ready for review November 1, 2024 15:36
@lvdongyi lvdongyi changed the title from "Add BertTokenizerFast, support register new tokenizer" to "[Tokenizer] Add BertTokenizerFast, support register new tokenizer" Nov 2, 2024

# it is expected that each Chinese character is not preceded by "##"
self.assertListEqual(tokens_without_spe_char_p, list_of_commun_chinese_char)
self.assertListEqual(tokens_without_spe_char_r, list_of_commun_chinese_char)
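
The "##" expectation in the assertions above comes from WordPiece's continuation prefix: only non-initial subword pieces inside a word carry "##", and BERT's basic tokenizer pre-splits Chinese characters into single-character tokens, so none of them should ever get the prefix. A stdlib-only sketch of greedy WordPiece (toy vocabulary, not the real BERT vocab) illustrates this:

```python
def wordpiece(word, vocab):
    """Greedy longest-match WordPiece; non-initial pieces get '##'."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            sub = word[start:end]
            if start > 0:          # continuation pieces carry the prefix
                sub = "##" + sub
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]
        tokens.append(piece)
        start = end
    return tokens

vocab = {"un", "##aff", "##able", "中", "国"}
print(wordpiece("unaffable", vocab))  # ['un', '##aff', '##able']
# Chinese characters arrive pre-split as single-character "words",
# so each one is an initial piece and never receives "##":
print([wordpiece(ch, vocab) for ch in "中国"])  # [['中'], ['国']]
```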

# not yet supported in bert tokenizer
Collaborator
There are some more tests after this; are they unable to pass? As far as I can tell, the bert tokenizer now supports the corresponding parameters.

Contributor Author

They can pass.

Review threads on tests/transformers/auto/test_confiugration.py and tests/transformers/auto/test_tokenizer.py were marked resolved.
@DrownFish19
Copy link
Collaborator

Note: code without unit-test coverage needs unit tests added to cover it.

DrownFish19
DrownFish19 previously approved these changes Nov 4, 2024
Collaborator

@DrownFish19 DrownFish19 left a comment

LGTM

@DrownFish19
Collaborator

For the lint issue, install pre-commit and then format the code. Reference steps:

# install
pip install pre-commit

# register pre-commit in the project folder; the code is formatted on every commit
pre-commit install

# process existing code files individually
pre-commit run --file XXXX.py


codecov bot commented Nov 4, 2024

Codecov Report

Attention: Patch coverage is 86.48649% with 10 lines in your changes missing coverage. Please review.

Project coverage is 53.12%. Comparing base (66c5d65) to head (851af38).
Report is 3 commits behind head on develop.

Files with missing lines Patch % Lines
paddlenlp/transformers/bert/tokenizer_fast.py 72.72% 9 Missing ⚠️
paddlenlp/transformers/auto/tokenizer.py 94.11% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #9353      +/-   ##
===========================================
+ Coverage    52.24%   53.12%   +0.88%     
===========================================
  Files          673      674       +1     
  Lines       109100   107428    -1672     
===========================================
+ Hits         56998    57073      +75     
+ Misses       52102    50355    -1747     

☔ View full report in Codecov by Sentry.

@DrownFish19 DrownFish19 merged commit cd22b0d into PaddlePaddle:develop Nov 4, 2024
8 of 12 checks passed