Download refactor #8020

LOVE-YOURSELF-1 · 2024-02-26T06:53:21Z

PR types

Function optimization

PR changes

Extract the download code into a standalone module and refactor the download logic

Description

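Extract the download-related code in PaddleNLP into a standalone module, so that every other component that needs to download files calls this module.
The download logic in PaddleNLP was fairly tangled, so this PR sorts it out and unifies it.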

paddle-bot · 2024-02-26T06:53:25Z

Thanks for your contribution!

CLAassistant · 2024-02-26T06:53:26Z

All committers have signed the CLA.

JunnYu · 2024-02-26T07:11:21Z

paddlenlp/utils/download/__init__.py

+    # log_filename = os.path.join(download_kwargs["subfolder"], filename)
+
+    # Add an option to download from modelscope
+    from_modelscope = os.environ.get("from_modelscope", False)


from paddlenlp.trainer import strtobool
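
The reason for this suggestion is that os.environ.get returns a string whenever the variable is set, so a value such as "False" or "0" is still truthy. Below is a minimal sketch of the parsing step. The helper name env_flag and the accepted value sets are assumptions for illustration; they follow the usual strtobool convention and are not the PaddleNLP implementation.

```python
import os

_TRUE = {"y", "yes", "t", "true", "on", "1"}
_FALSE = {"n", "no", "f", "false", "off", "0"}


def env_flag(name, default=False):
    """Read a boolean flag from the environment (hypothetical helper)."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    value = raw.strip().lower()
    if value in _TRUE:
        return True
    if value in _FALSE:
        return False
    raise ValueError(f"invalid truth value {raw!r} for {name}")


# The raw string "False" is truthy, which is the pitfall being flagged.
os.environ["from_modelscope"] = "False"
naive = os.environ.get("from_modelscope", False)
parsed = env_flag("from_modelscope")
```

With the naive lookup, setting from_modelscope=False would still take the modelscope branch. The parsed value would not.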

codecov · 2024-03-06T10:48:24Z

Codecov Report

Attention: Patch coverage is 69.15688%, with 289 lines in your changes missing coverage. Please review.

Project coverage is 56.47%. Comparing base (e34cbe9) to head (119c648).
Report is 2 commits behind head on develop.

Files	Patch %	Lines
paddlenlp/utils/download/aistudio_hub_download.py	64.35%	103 Missing ⚠️
paddlenlp/utils/download/common.py	71.03%	73 Missing ⚠️
paddlenlp/utils/download/__init__.py	74.60%	32 Missing ⚠️
paddlenlp/utils/download/bos_download.py	79.61%	21 Missing ⚠️
paddlenlp/transformers/model_utils.py	65.00%	14 Missing ⚠️
paddlenlp/experimental/model_utils.py	10.00%	9 Missing ⚠️
paddlenlp/transformers/ernie_gen/modeling.py	10.00%	9 Missing ⚠️
paddlenlp/transformers/auto/image_processing.py	12.50%	7 Missing ⚠️
paddlenlp/transformers/auto/processing.py	12.50%	7 Missing ⚠️
...dlenlp/experimental/transformers/llama/modeling.py	0.00%	4 Missing ⚠️
... and 6 more

Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #8020      +/-   ##
===========================================
- Coverage    56.51%   56.47%   -0.05%     
===========================================
  Files          592      596       +4     
  Lines        91114    91546     +432     
===========================================
+ Hits         51494    51698     +204     
- Misses       39620    39848     +228

ZHUI · 2024-03-07T03:03:33Z

paddlenlp/experimental/model_utils.py

@@ -96,6 +95,11 @@ def from_pretrained(cls, pretrained_model_name_or_path, *args, **kwargs):
        pretrained_models = list(cls.pretrained_init_configuration.keys())
        resource_files = {}
        init_configuration = {}
+        pretrained_model_name_or_path = str(pretrained_model_name_or_path)


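Is the code in paddlenlp/experimental/model_utils.py covered by CI tests?

No dedicated unit tests were added under the experimental directory, but new unit tests were added under transformers. Adding those tests makes CI fail, although they run fine locally.

@JunnYu Can CE cover this? The risk is fairly high for inference.

My CE runs are all dynamic-graph, so they don't touch the experimental part.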

ZHUI · 2024-03-07T03:04:01Z

model_zoo/bert/run_pretrain_trainer.py

@@ -60,7 +60,7 @@ class ModelArguments:
        default=80, metadata={"help": "The maximum total of masked tokens in input sequence"}
    )

-    to_static: strtobool = field(default=False, metadata={"help": "Enable training under @to_static."})


Why disable it?

It was removed because earlier testing showed that it raised an error here.

Try changing model_args.to_static to training_args.to_static and see whether that works.

ZHUI · 2024-03-07T03:08:40Z

paddlenlp/utils/download/__init__.py

+    from_aistudio: bool = False,
+    from_hf_hub: bool = False,
+    from_bos: bool = True,
+) -> str:


This is a core function, so please add a docstring for it.

Are there any requirements for the docstring?

ZHUI · 2024-03-07T03:24:59Z

paddlenlp/utils/download/__init__.py

+        )
+
+
+def get_file(


Would a name like resolve_file_path be more appropriate?

Suggested change

-def get_file(
+def resolve_file_path(

ZHUI · 2024-03-07T03:26:43Z

paddlenlp/transformers/auto/configuration.py


+        if os.path.exists(config_file):


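Is the file guaranteed to exist here? If it doesn't, is the error raised inside get_file?

If the download fails, the error is raised inside get_file. If the repo doesn't contain the file, get_file returns None and the error is raised here.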
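
The behaviour described in this reply (a None return when the repo lacks the file) means the caller has to handle None before using the path. Below is a minimal sketch of such a guard. require_file is a hypothetical helper for illustration, not code from the PR.

```python
import os
import tempfile


def require_file(resolved_path, filename):
    """Return resolved_path, or raise a clear error when the lookup yielded nothing.

    resolved_path stands in for the value returned by a resolver such as
    get_file, which per the discussion may be None when the repo does not
    contain the file.
    """
    if resolved_path is None or not os.path.exists(resolved_path):
        raise FileNotFoundError(f"{filename} was not found in the repo or local cache")
    return resolved_path


# Usage: an existing file passes through unchanged.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "config.json")
    with open(path, "w") as f:
        f.write("{}")
    checked = require_file(path, "config.json")
    same_path = checked == path
```

Checking for None first keeps the failure message about the missing file rather than about the type of the argument.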

gongel · 2024-02-28T07:07:13Z

paddlenlp/transformers/auto/tokenizer.py

@@ -149,7 +150,7 @@ class AutoTokenizer:
    _tokenizer_mapping = MAPPING_NAMES
    _name_mapping = TOKENIZER_MAPPING_NAMES
    _fast_name_mapping = FAST_TOKENIZER_MAPPING_NAMES
-    tokenizer_config_file = "tokenizer_config.json"
+    tokenizer_config_file = ["tokenizer_config.json", "config.json", "model_config.json"]


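Can there be more than one of these?

This is to handle the case where the repo has no tokenizer_config.json when loading through auto. We could also drop this compatibility handling.

If it's missing, do we then load "config.json" or "model_config.json"? That doesn't look very reasonable. Is there anything in config.json that the tokenizer can use?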
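
The fallback under discussion amounts to trying each candidate name in order and using the first one that exists. Below is a minimal local-directory sketch. first_existing is a hypothetical helper; the actual change resolves files through the download module rather than a plain directory listing.

```python
import os
import tempfile

CANDIDATES = ["tokenizer_config.json", "config.json", "model_config.json"]


def first_existing(directory, candidates):
    """Return the path of the first candidate present in directory, else None."""
    for name in candidates:
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            return path
    return None


with tempfile.TemporaryDirectory() as tmp:
    # Only config.json is present, so the lookup falls through to it.
    with open(os.path.join(tmp, "config.json"), "w") as f:
        f.write("{}")
    found = os.path.basename(first_existing(tmp, CANDIDATES))
    missing = first_existing(tmp, ["tokenizer_config.json"])
```

Whether config.json carries anything the tokenizer can use is the open question raised in the comment above. The sketch only shows the lookup order.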

gongel · 2024-02-28T07:16:45Z

requirements-dev.txt

+tensorboard
+modelscope


Can modelscope be supported as well?

Downloading from modelscope is currently supported.

gongel · 2024-02-28T07:18:27Z

tests/transformers/from_pretrained/test_tokenizer.py

+class TokenizerLoadTester(unittest.TestCase):
+
+    # Which files are downloaded for the built-in models
+    @parameterized.expand(


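For the built-in models, add all of the large LLMs here. Small models are low priority.

OK, I'll take a look, add the test cases, and run the tests locally.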

wawltor

LGTM

JunnYu · 2024-03-08T08:05:44Z

paddlenlp/utils/download/__init__.py

+        elif from_modelscope:
+            for index, filename in enumerate(filenames):
+                try:
+                    from modelscope.hub.file_download import (


Wrap the from modelscope.hub.file_download import in a try.
If it raises an ImportError, show the user a hint to install modelscope.
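
The pattern requested here can be sketched as follows. import_optional is a hypothetical helper and the wording of the hint is an assumption. In the PR the module would be modelscope.hub.file_download and the package name modelscope, as listed in requirements-dev.txt.

```python
import importlib


def import_optional(module_name, pip_name):
    """Import module_name, or raise ImportError with an install hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        raise ImportError(
            f"{module_name} is required for this download source. "
            f"Install it with: pip install {pip_name}"
        ) from err


# Usage with a module that is always available in the standard library.
json_module = import_optional("json", "json")
```

Chaining with from err keeps the original traceback available while putting the install hint in front of the user.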

LOVE-YOURSELF-1 added 4 commits February 23, 2024 16:24

download

66744bb

modified file

40b27c4

modified from_pretrained

68b5f8c

modified config

e342983

paddle-bot bot added the contributor label Feb 26, 2024

paddle-bot bot assigned wawltor Feb 26, 2024

JunnYu reviewed Feb 26, 2024

modified download

fcc392b

LOVE-YOURSELF-1 closed this Feb 26, 2024

LOVE-YOURSELF-1 reopened this Feb 26, 2024

LOVE-YOURSELF-1 closed this Feb 26, 2024

LOVE-YOURSELF-1 reopened this Feb 26, 2024

test_tokenizer

3aa76ab

LOVE-YOURSELF-1 closed this Feb 27, 2024

LOVE-YOURSELF-1 reopened this Feb 27, 2024

LOVE-YOURSELF-1 and others added 9 commits February 26, 2024 20:12

Delete tests/transformers/from_pretrained/run.sh

d6dfcf0

Update test_tokenizer.py

0705617

Update tokenizer_utils_base.py

f9c5af7

test_model

275e52b

test_model

76cd0da

test_model

9bdc94e

Remove comments

df82769

Remove comments

5148bc6

add requirements

6a0085b

JunnYu requested review from wawltor and gongel February 28, 2024 07:42

JunnYu and others added 2 commits February 28, 2024 17:32

update bos download

7006332

Update test_model.py

620aacc

LOVE-YOURSELF-1 closed this Mar 1, 2024

LOVE-YOURSELF-1 reopened this Mar 1, 2024

LOVE-YOURSELF-1 and others added 8 commits March 1, 2024 17:04

fix bug

e392644

Merge branch 'PaddlePaddle:develop' into download

40842fd

add \n

b44f8ed

Update __init__.py

a18ca41

Merge branch 'PaddlePaddle:develop' into download

03d5047

Merge branch 'PaddlePaddle:develop' into download

6bb0544

Merge branch 'PaddlePaddle:develop' into download

0364a65

add requestion

b60d218

LOVE-YOURSELF-1 force-pushed the download branch from ffab8d0 to b60d218 Compare March 5, 2024 15:16

LOVE-YOURSELF-1 and others added 8 commits March 5, 2024 23:27

modified download

850796f

Retest

8ce5dfe

Merge branch 'PaddlePaddle:develop' into download

af7bb9d

Update test_tokenizer.py

3109368

Update requirements-dev.txt

d25e6cd

Update requirements.txt

ee497e5

Merge branch 'PaddlePaddle:develop' into download

ed4d372

delete from_pretrained

d829bc5

Merge branch 'PaddlePaddle:develop' into download

eb06571

ZHUI reviewed Mar 7, 2024

gongel reviewed Mar 7, 2024

LOVE-YOURSELF-1 and others added 3 commits March 7, 2024 15:45

make superior

793784f

Merge branch 'PaddlePaddle:develop' into download

286b80a

Update run_pretrain_trainer.py

119c648

wawltor approved these changes Mar 8, 2024

JunnYu reviewed Mar 8, 2024

ZHUI approved these changes Mar 8, 2024

wawltor merged commit 95c8b24 into PaddlePaddle:develop Mar 8, 2024
7 of 10 checks passed
