[Tokenizer] Support reading Tiktoken tokenizer.model. #9215
Conversation
…d from pretrained, update method to get attr from a module
Thanks for your contribution!
Codecov Report

Attention: Patch coverage is …

Additional details and impacted files

@@            Coverage Diff             @@
##           develop    #9215     +/-   ##
===========================================
- Coverage    53.19%   52.80%   -0.40%
===========================================
  Files          673      673
  Lines       108855   107657    -1198
===========================================
- Hits         57909    56849    -1060
+ Misses       50946    50808     -138

View full report in Codecov by Sentry.
@@ -176,7 +324,7 @@ def _get_tokenizer_class_from_config(cls, pretrained_model_name_or_path, config_
         return tokenizer_class

     @classmethod
-    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):
+    def from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs):
Be careful with renaming this parameter: check whether the other from_pretrained methods use the same parameter name.
None of the other tokenizers override the from_pretrained method, so this should have no impact.
The problem here is that code calling auto.from_pretrained() and Qwen2XXX.from_pretrained() may end up being written differently; it would be better to unify the two.
I've changed this back to model_args for now.
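The signature question above can be sketched as follows. This is a minimal, hypothetical illustration of the split described in the PR description; the method bodies are placeholders, not PaddleNLP's actual implementation:

```python
class PretrainedTokenizerBase:
    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):
        # Keeping the historical *model_args name means positional arguments
        # behave the same whether callers go through AutoTokenizer or a
        # concrete tokenizer class such as Qwen2XXX.
        return cls._from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)

    @classmethod
    def _from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):
        # Placeholder body: the real method resolves files and builds the tokenizer.
        return cls()
```

Because every subclass inherits the same public signature, renaming the varargs parameter in the base class alone would change how positional arguments are documented and passed at every call site, which is why the rename was reverted.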
paddlenlp/utils/download/__init__.py
@@ -272,7 +272,7 @@ def resolve_file_path(
                 f"'{log_endpoint}' for available revisions."
             )
         except EntryNotFoundError:
-            return None
+            raise EnvironmentError(f"Does not appear one of the {filenames} in {repo_id}.")
Shouldn't this error type be EntryNotFoundError?
It was already like this before my change.
It was probably a mistake when it was originally written; this error type can be changed.
If we were going to raise EntryNotFoundError, there would be no need to catch EntryNotFoundError with the except in the first place; presumably there was a reason it was done this way (probably).
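The pattern being debated can be sketched like this. It is a minimal illustration, not the actual resolve_file_path implementation; `_fetch` and the exception class are stand-ins:

```python
class EntryNotFoundError(Exception):
    """Stand-in for the download backend's entry-not-found exception."""

def _fetch(repo_id, filenames):
    # Hypothetical low-level downloader that signals a missing entry.
    raise EntryNotFoundError(f"{filenames} not found")

def resolve_file_path(repo_id, filenames):
    try:
        return _fetch(repo_id, filenames)
    except EntryNotFoundError:
        # Catching and re-raising as a *different* type is deliberate here:
        # callers see a generic, user-facing EnvironmentError instead of the
        # backend-specific exception. Re-raising EntryNotFoundError itself
        # would make the except clause pointless.
        raise EnvironmentError(f"Does not appear one of the {filenames} in {repo_id}.")
```

The translation boundary is the design choice under discussion: keeping EntryNotFoundError internal avoids leaking the download library's exception hierarchy to tokenizer-loading code.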
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
This level is the model level, isn't it? Why split it into two folders?
Currently AutoTokenizer matches a Tokenizer by the name of the model directory. For example, albert, albert_chinese, and albert_english all used to live under the albert directory, but name-based matching (the TOKENIZER_MAPPING_NAMES table) allows only one Tokenizer and one TokenizerFast per name. If they were not split, albert_chinese and albert_english could not be loaded through AutoTokenizer, because the three need different Tokenizer classes.
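The constraint described above can be shown with a toy mapping. The structure and class names here are illustrative, not PaddleNLP's actual TOKENIZER_MAPPING_NAMES contents:

```python
# One (Tokenizer, TokenizerFast) pair per directory name: a dict can hold
# only one value per key, so "albert", "albert_chinese" and "albert_english"
# must be separate directories to resolve to separate tokenizer classes.
TOKENIZER_MAPPING_NAMES = {
    "albert": ("AlbertTokenizer", "AlbertTokenizerFast"),
    "albert_chinese": ("AlbertChineseTokenizer", None),
    "albert_english": ("AlbertEnglishTokenizer", None),
}

def tokenizer_class_for(model_dir_name, use_fast=False):
    slow, fast = TOKENIZER_MAPPING_NAMES[model_dir_name]
    # Fall back to the slow tokenizer when no fast variant is registered.
    return fast if use_fast and fast is not None else slow
```

If all three variants shared the "albert" key, the table could only ever return one class for them, which is exactly the conflict the folder split avoids.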
        tokenizer.Load(self.vocab_file)
        return tokenizer

        with open(self.vocab_file, "rb") as f:
Has the default now fully switched to the fast tokenizer?
It shouldn't have; tokenizer_fast is only used when use_fast=True.
"Has the default now fully switched to the fast tokenizer?"

The default value has been changed; the previous Load path is used by default.
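The opt-in behavior settled on here can be sketched with stand-in classes. This is a toy dispatch, not AutoTokenizer's real loading logic:

```python
class SlowTokenizer:
    """Stands in for the classic tokenizer built via sentencepiece Load."""

class FastTokenizer:
    """Stands in for a TokenizerFast implementation."""

def auto_from_pretrained(name_or_path, use_fast=False, **kwargs):
    # use_fast defaults to False, so the previous slow path remains the
    # default; callers must opt in explicitly to get the fast tokenizer.
    return FastTokenizer() if use_fast else SlowTokenizer()
```

Keeping False as the default preserves backward compatibility for existing callers, which is what the review thread converged on.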
LGTM
If this needs to be merged, feel free to @ me.
For some unknown reason, PaddleNLP-CI is stuck at "running P0case 2/4: albert".
OK, let's wait for CI for a bit; there's also a conflict that can be resolved.
Resolved.
PR types
New features

PR changes
APIs

Description
- Support reading Tiktoken tokenizer.model.
- Split PretrainedTokenizerBase.from_pretrained into two separate methods: from_pretrained and _from_pretrained.
- Prefer not to use FastTokenizer even if it is available. (To load a TokenizerFast through AutoTokenizer, explicitly set use_fast=True.)
- Use LazyMapping to load keys and values only when they are accessed.
- Modify tests/transformers/test_modeling_common.py to support LlamaTokenizerFast.
- TOKENIZER_MAPPING_NAMES, MODEL_NAMES_MAPPING, and CONFIG_MAPPING_NAMES should be reviewed carefully.
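For reference, a tiktoken-style tokenizer.model stores one base64-encoded token and its integer rank per line. A minimal reader (independent of PaddleNLP's actual implementation, offered only as a sketch of the file format) might look like:

```python
import base64

def load_tiktoken_bpe(path):
    # Each non-empty line is "<base64 token> <rank>"; decode them into the
    # mergeable-ranks table used to construct a tiktoken encoding.
    ranks = {}
    with open(path, "rb") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            token_b64, rank = line.split()
            ranks[base64.b64decode(token_b64)] = int(rank)
    return ranks
```

The resulting dict maps raw token bytes to merge ranks, which is the shape tiktoken expects when building an Encoding.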