[Tokenizer] Add BertTokenizerFast, support register new tokenizer #9353
Conversation
Thanks for your contribution!

```python
# it is expected that each Chinese character is not preceded by "##"
self.assertListEqual(tokens_without_spe_char_p, list_of_commun_chinese_char)
self.assertListEqual(tokens_without_spe_char_r, list_of_commun_chinese_char)
```

```python
# not yet supported in bert tokenizer
```
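The expectation behind that assertion can be illustrated with a minimal, self-contained sketch: BERT-style basic tokenization surrounds each CJK character with spaces before WordPiece runs, so a Chinese character is always a standalone token and never a `##`-prefixed continuation. (This is a simplified re-implementation for illustration, not the library code.)

```python
def pad_chinese_chars(text: str) -> str:
    """Surround each CJK character with spaces, mimicking BERT's basic tokenization step."""
    out = []
    for ch in text:
        if 0x4E00 <= ord(ch) <= 0x9FFF:  # basic CJK Unified Ideographs block (simplified check)
            out.extend([" ", ch, " "])
        else:
            out.append(ch)
    return "".join(out)

tokens = pad_chinese_chars("ab中文cd").split()
print(tokens)  # each Chinese character comes out as its own token, with no "##" prefix
```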
There are some more tests after this — is it that they cannot pass? As far as I can see, the current bert tokenizer supports the corresponding parameters.
They can pass.
Note that any code without unit tests needs unit tests added to cover it.
LGTM

For the lint issue, install pre-commit and format the code. The steps are as follows:

```shell
# install
pip install pre-commit
# register pre-commit in the project folder; code will be formatted on every commit
pre-commit install
# process pre-existing code files separately
pre-commit run --file XXXX.py
```
Codecov Report

Attention: Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           develop    #9353      +/-  ##
==========================================
+ Coverage    52.24%   53.12%   +0.88%
==========================================
  Files          673      674       +1
  Lines       109100   107428    -1672
==========================================
+ Hits         56998    57073      +75
+ Misses       52102    50355    -1747
```

View full report in Codecov by Sentry.
PR types

New features

PR changes

Models

Description

- Add `BertTokenizerFast`, support converting a slow bert tokenizer instance into a fast tokenizer instance, and add tests for `BertTokenizerFast`.
- Support registering a new tokenizer to `TOKENIZER_MAPPING`, and add tests for it.
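The registration mechanism described above can be sketched with a minimal registry pattern. All names below (`register_tokenizer`, `MyBertTokenizerFast`) are hypothetical illustrations of the idea, not the actual PaddleNLP API:

```python
# A TOKENIZER_MAPPING-style registry: classes register themselves under a
# name, and callers look the class up by that name later.
TOKENIZER_MAPPING = {}

def register_tokenizer(name):
    """Decorator that records a tokenizer class under `name` in the registry."""
    def wrapper(cls):
        TOKENIZER_MAPPING[name] = cls
        return cls
    return wrapper

@register_tokenizer("bert-fast")
class MyBertTokenizerFast:
    def __call__(self, text):
        # trivial whitespace "tokenizer" for illustration only
        return text.split()

tok_cls = TOKENIZER_MAPPING["bert-fast"]
print(tok_cls()("hello world"))  # ['hello', 'world']
```

A decorator-based registry like this lets new tokenizers be added without editing the mapping itself, which is the usual motivation for supporting user-registered tokenizers.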