[Tokenizer] Support reading Tiktoken tokenizer.model. #9215
Conversation
…d from pretrained, update method to get attr from a module
Thanks for your contribution!
Codecov Report

Attention: Patch coverage is …

Additional details and impacted files

@@            Coverage Diff             @@
##           develop    #9215     +/-   ##
===========================================
- Coverage    53.19%   52.80%   -0.40%
===========================================
  Files          673      673
  Lines       108855   107657    -1198
===========================================
- Hits         57909    56849    -1060
+ Misses       50946    50808     -138

View full report in Codecov by Sentry.
@@ -176,7 +324,7 @@ def _get_tokenizer_class_from_config(cls, pretrained_model_name_or_path, config_
         return tokenizer_class

     @classmethod
-    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):
+    def from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs):
Be careful with renaming this parameter: check whether the other from_pretrained methods use the same parameter name.
None of the other tokenizers override the from_pretrained method, so this should have no impact.
The problem here is that code calling auto.from_pretrained() and Qwen2XXX.from_pretrained() may end up being written differently; it would be better to unify the two.
I've changed this back to model_args for now.
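The signature question above can be sketched as follows. This is a minimal, hypothetical illustration of the split described in the PR description; the method bodies are placeholders, not PaddleNLP's actual implementation:

```python
class PretrainedTokenizerBase:
    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):
        # Keeping the historical *model_args name means positional arguments
        # behave the same whether callers go through AutoTokenizer or a
        # concrete tokenizer class such as Qwen2XXX.
        return cls._from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)

    @classmethod
    def _from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):
        # Placeholder body: the real method resolves files and builds the tokenizer.
        return cls()
```

Because every subclass inherits the same public signature, renaming the varargs parameter in the base class alone would change how positional arguments are documented and passed at every call site, which is why the rename was reverted.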
paddlenlp/utils/download/__init__.py
@@ -272,7 +272,7 @@ def resolve_file_path(
                 f"'{log_endpoint}' for available revisions."
             )
         except EntryNotFoundError:
-            return None
+            raise EnvironmentError(f"Does not appear one of the {filenames} in {repo_id}.")
Shouldn't this error type be EntryNotFoundError?
It was already like this before my change.
It was probably a mistake when it was originally written; this error type can be changed.
If we were going to raise EntryNotFoundError, there would be no need to catch EntryNotFoundError with the except in the first place; presumably there was a reason it was done this way (probably).
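The pattern being debated can be sketched like this. It is a minimal illustration, not the actual resolve_file_path implementation; `_fetch` and the exception class are stand-ins:

```python
class EntryNotFoundError(Exception):
    """Stand-in for the download backend's entry-not-found exception."""

def _fetch(repo_id, filenames):
    # Hypothetical low-level downloader that signals a missing entry.
    raise EntryNotFoundError(f"{filenames} not found")

def resolve_file_path(repo_id, filenames):
    try:
        return _fetch(repo_id, filenames)
    except EntryNotFoundError:
        # Catching and re-raising as a *different* type is deliberate here:
        # callers see a generic, user-facing EnvironmentError instead of the
        # backend-specific exception. Re-raising EntryNotFoundError itself
        # would make the except clause pointless.
        raise EnvironmentError(f"Does not appear one of the {filenames} in {repo_id}.")
```

The translation boundary is the design choice under discussion: keeping EntryNotFoundError internal avoids leaking the download library's exception hierarchy to tokenizer-loading code.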
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
This level is the model level, isn't it? Why split it into two folders?
Currently AutoTokenizer matches a Tokenizer by the name of the model directory. For example, albert, albert_chinese, and albert_english all used to live under the albert directory, but name-based matching (the TOKENIZER_MAPPING_NAMES table) allows only one Tokenizer and one TokenizerFast per name. If they were not split, albert_chinese and albert_english could not be loaded through AutoTokenizer, because the three need different Tokenizer classes.
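The constraint described above can be shown with a toy mapping. The structure and class names here are illustrative, not PaddleNLP's actual TOKENIZER_MAPPING_NAMES contents:

```python
# One (Tokenizer, TokenizerFast) pair per directory name: a dict can hold
# only one value per key, so "albert", "albert_chinese" and "albert_english"
# must be separate directories to resolve to separate tokenizer classes.
TOKENIZER_MAPPING_NAMES = {
    "albert": ("AlbertTokenizer", "AlbertTokenizerFast"),
    "albert_chinese": ("AlbertChineseTokenizer", None),
    "albert_english": ("AlbertEnglishTokenizer", None),
}

def tokenizer_class_for(model_dir_name, use_fast=False):
    slow, fast = TOKENIZER_MAPPING_NAMES[model_dir_name]
    # Fall back to the slow tokenizer when no fast variant is registered.
    return fast if use_fast and fast is not None else slow
```

If all three variants shared the "albert" key, the table could only ever return one class for them, which is exactly the conflict the folder split avoids.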
        tokenizer.Load(self.vocab_file)
        return tokenizer

        with open(self.vocab_file, "rb") as f:
Has the default now fully switched to the fast tokenizer?
It shouldn't have; tokenizer_fast is only used when use_fast=True.
"Has the default now fully switched to the fast tokenizer?"

The default value has been changed; the previous Load path is used by default.
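The opt-in behavior settled on here can be sketched with stand-in classes. This is a toy dispatch, not AutoTokenizer's real loading logic:

```python
class SlowTokenizer:
    """Stands in for the classic tokenizer built via sentencepiece Load."""

class FastTokenizer:
    """Stands in for a TokenizerFast implementation."""

def auto_from_pretrained(name_or_path, use_fast=False, **kwargs):
    # use_fast defaults to False, so the previous slow path remains the
    # default; callers must opt in explicitly to get the fast tokenizer.
    return FastTokenizer() if use_fast else SlowTokenizer()
```

Keeping False as the default preserves backward compatibility for existing callers, which is what the review thread converged on.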
LGTM
If this needs to be merged, feel free to @ me.
For some unknown reason, PaddleNLP-CI is stuck at "running P0case 2/4: albert".
OK, let's wait for CI for a bit; there's also a conflict that can be resolved.
Resolved.
PR types
New features

PR changes
APIs

Description
- Support reading Tiktoken tokenizer.model.
- Split PretrainedTokenizerBase.from_pretrained into two separate methods: from_pretrained and _from_pretrained.
- Prefer not to use FastTokenizer even if it is available. (To load a TokenizerFast through AutoTokenizer, explicitly set use_fast=True.)
- Use LazyMapping to load keys and values only when they are accessed.
- Modify tests/transformers/test_modeling_common.py to support LlamaTokenizerFast.
- TOKENIZER_MAPPING_NAMES, MODEL_NAMES_MAPPING, and CONFIG_MAPPING_NAMES should be reviewed carefully.
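For reference, a tiktoken-style tokenizer.model stores one base64-encoded token and its integer rank per line. A minimal reader (independent of PaddleNLP's actual implementation, offered only as a sketch of the file format) might look like:

```python
import base64

def load_tiktoken_bpe(path):
    # Each non-empty line is "<base64 token> <rank>"; decode them into the
    # mergeable-ranks table used to construct a tiktoken encoding.
    ranks = {}
    with open(path, "rb") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            token_b64, rank = line.split()
            ranks[base64.b64decode(token_b64)] = int(rank)
    return ranks
```

The resulting dict maps raw token bytes to merge ranks, which is the shape tiktoken expects when building an Encoding.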