Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Download重构 #8020

Merged
merged 44 commits into from
Mar 8, 2024
Merged

Download重构 #8020

merged 44 commits into from
Mar 8, 2024

Conversation

LOVE-YOURSELF-1
Copy link
Contributor

@LOVE-YOURSELF-1 LOVE-YOURSELF-1 commented Feb 26, 2024

PR types

Function optimization

PR changes

独立下载模块,重构下载逻辑

Description

将PaddleNLP项目中涉及下载的模块独立出来,其他利用到下载操作的均统一调用该模块;
PaddleNLP项目下载逻辑比较混乱,故进行梳理和统一

Copy link

paddle-bot bot commented Feb 26, 2024

Thanks for your contribution!

@CLAassistant
Copy link

CLAassistant commented Feb 26, 2024

CLA assistant check
All committers have signed the CLA.

# log_filename = os.path.join(download_kwargs["subfolder"], filename)

# 增加 modelscope 下载的选项
from_modelscope = os.environ.get("from_modelscope", False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from paddlenlp.trainer import strtobool

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@JunnYu JunnYu requested review from wawltor and gongel February 28, 2024 07:42
Copy link

codecov bot commented Mar 6, 2024

Codecov Report

Attention: Patch coverage is 69.15688% with 289 lines in your changes are missing coverage. Please review.

Project coverage is 56.47%. Comparing base (e34cbe9) to head (119c648).
Report is 2 commits behind head on develop.

Files Patch % Lines
paddlenlp/utils/download/aistudio_hub_download.py 64.35% 103 Missing ⚠️
paddlenlp/utils/download/common.py 71.03% 73 Missing ⚠️
paddlenlp/utils/download/__init__.py 74.60% 32 Missing ⚠️
paddlenlp/utils/download/bos_download.py 79.61% 21 Missing ⚠️
paddlenlp/transformers/model_utils.py 65.00% 14 Missing ⚠️
paddlenlp/experimental/model_utils.py 10.00% 9 Missing ⚠️
paddlenlp/transformers/ernie_gen/modeling.py 10.00% 9 Missing ⚠️
paddlenlp/transformers/auto/image_processing.py 12.50% 7 Missing ⚠️
paddlenlp/transformers/auto/processing.py 12.50% 7 Missing ⚠️
...dlenlp/experimental/transformers/llama/modeling.py 0.00% 4 Missing ⚠️
... and 6 more
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #8020      +/-   ##
===========================================
- Coverage    56.51%   56.47%   -0.05%     
===========================================
  Files          592      596       +4     
  Lines        91114    91546     +432     
===========================================
+ Hits         51494    51698     +204     
- Misses       39620    39848     +228     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@@ -96,6 +95,11 @@ def from_pretrained(cls, pretrained_model_name_or_path, *args, **kwargs):
pretrained_models = list(cls.pretrained_init_configuration.keys())
resource_files = {}
init_configuration = {}
pretrained_model_name_or_path = str(pretrained_model_name_or_path)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

paddlenlp/experimental/model_utils.py 这些代码有CI测试覆盖吗?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

experimental目录下没有专门新增单测,但是transformers下有新增单测,只是加上单测会导致ci失败,但是在本地是可以正常运行的

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JunnYu 这里CE可以覆盖吗?对推理而言风向比较大。

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

我那里的CE都是动态图的,不会涉及到experimental的部分

@@ -60,7 +60,7 @@ class ModelArguments:
default=80, metadata={"help": "The maximum total of masked tokens in input sequence"}
)

to_static: strtobool = field(default=False, metadata={"help": "Enable training under @to_static."})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why disable it ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

因为之前测试发现这里会报错,所以去除了

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

model_args.to_static 改成 training_args.to_static 你看一下

from_aistudio: bool = False,
from_hf_hub: bool = False,
from_bos: bool = True,
) -> str:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个核心函数,写一下注释吧

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

注释有什么要求吗

)


def get_file(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

换成 resolve_file_path 之类的是否更合适?

Suggested change
def get_file(
def resolve_file_path(

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

好的

paddlenlp/transformers/auto/configuration.py Outdated Show resolved Hide resolved

if os.path.exists(config_file):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

是否一定是 exists 的?不存在的话,报错是不是在 get_file 内部?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

如果下载失败的话是在get_file内部报错,如果repo没有该文件get_file会返回None,会在这报错

paddlenlp/transformers/auto/image_processing.py Outdated Show resolved Hide resolved
paddlenlp/transformers/auto/modeling.py Outdated Show resolved Hide resolved
paddlenlp/transformers/auto/tokenizer.py Outdated Show resolved Hide resolved
@@ -149,7 +150,7 @@ class AutoTokenizer:
_tokenizer_mapping = MAPPING_NAMES
_name_mapping = TOKENIZER_MAPPING_NAMES
_fast_name_mapping = FAST_TOKENIZER_MAPPING_NAMES
tokenizer_config_file = "tokenizer_config.json"
tokenizer_config_file = ["tokenizer_config.json", "config.json", "model_config.json"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

会有多个的情况吗?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

为了适配auto加载时repo没有tokenizer_config.json的情况,也可以不做这个兼容

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

没有的话,去加载 "config.json", "model_config.json" 吗?看着不是很合理。config.json 里面有什么东西tokenier可用吗?

tensorboard
modelscope
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

modelscope也能支持吗?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

目前是支持modelscope下载的

class TokenizerLoadTester(unittest.TestCase):

# 这是内置的是下载哪些文件
@parameterized.expand(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

内置的可以LLM的大模型都加进来,小模型低优。

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

好的我看下,添加测试样式在本地完成测试

Copy link
Collaborator

@wawltor wawltor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

elif from_modelscope:
for index, filename in enumerate(filenames):
try:
from modelscope.hub.file_download import (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

加个try 导入from modelscope.hub.file_download import
如果是 import error给他提示一个装modelscope的提示

@wawltor wawltor merged commit 95c8b24 into PaddlePaddle:develop Mar 8, 2024
7 of 10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants