-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Download重构 #8020
Download重构 #8020
Conversation
Thanks for your contribution! |
paddlenlp/utils/download/__init__.py
Outdated
# log_filename = os.path.join(download_kwargs["subfolder"], filename) | ||
|
||
# 增加 modelscope 下载的选项 | ||
from_modelscope = os.environ.get("from_modelscope", False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from paddlenlp.trainer import strtobool
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
ffab8d0
to
b60d218
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## develop #8020 +/- ##
===========================================
- Coverage 56.51% 56.47% -0.05%
===========================================
Files 592 596 +4
Lines 91114 91546 +432
===========================================
+ Hits 51494 51698 +204
- Misses 39620 39848 +228 ☔ View full report in Codecov by Sentry. |
@@ -96,6 +95,11 @@ def from_pretrained(cls, pretrained_model_name_or_path, *args, **kwargs): | |||
pretrained_models = list(cls.pretrained_init_configuration.keys()) | |||
resource_files = {} | |||
init_configuration = {} | |||
pretrained_model_name_or_path = str(pretrained_model_name_or_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
paddlenlp/experimental/model_utils.py 这些代码有CI测试覆盖吗?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
experimental目录下没有专门新增单测,但是transformers下有新增单测,只是加上单测会导致ci失败,但是在本地是可以正常运行的
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@JunnYu 这里CE可以覆盖吗?对推理而言风向比较大。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
我那里的CE都是动态图的,不会涉及到experimental的部分
@@ -60,7 +60,7 @@ class ModelArguments: | |||
default=80, metadata={"help": "The maximum total of masked tokens in input sequence"} | |||
) | |||
|
|||
to_static: strtobool = field(default=False, metadata={"help": "Enable training under @to_static."}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why disable it ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
因为之前测试发现这里会报错,所以去除了
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
model_args.to_static 改成 training_args.to_static 你看一下
from_aistudio: bool = False, | ||
from_hf_hub: bool = False, | ||
from_bos: bool = True, | ||
) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这个核心函数,写一下注释吧
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
注释有什么要求吗
paddlenlp/utils/download/__init__.py
Outdated
) | ||
|
||
|
||
def get_file( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
换成 resolve_file_path 之类的是否更合适?
def get_file( | |
def resolve_file_path( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
好的
|
||
if os.path.exists(config_file): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
是否一定是 exists 的?不存在的话,报错是不是在 get_file 内部?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
如果下载失败的话是在get_file内部报错,如果repo没有该文件get_file会返回None,会在这报错
@@ -149,7 +150,7 @@ class AutoTokenizer: | |||
_tokenizer_mapping = MAPPING_NAMES | |||
_name_mapping = TOKENIZER_MAPPING_NAMES | |||
_fast_name_mapping = FAST_TOKENIZER_MAPPING_NAMES | |||
tokenizer_config_file = "tokenizer_config.json" | |||
tokenizer_config_file = ["tokenizer_config.json", "config.json", "model_config.json"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
会有多个的情况吗?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
为了适配auto加载时repo没有tokenizer_config.json的情况,也可以不做这个兼容
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
没有的话,去加载 "config.json", "model_config.json"
吗?看着不是很合理。config.json 里面有什么东西tokenier可用吗?
requirements-dev.txt
Outdated
tensorboard | ||
modelscope |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
modelscope也能支持吗?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
目前是支持modelscope下载的
class TokenizerLoadTester(unittest.TestCase): | ||
|
||
# 这是内置的是下载哪些文件 | ||
@parameterized.expand( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
内置的可以LLM的大模型都加进来,小模型低优。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
好的我看下,添加测试样式在本地完成测试
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
elif from_modelscope: | ||
for index, filename in enumerate(filenames): | ||
try: | ||
from modelscope.hub.file_download import ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
加个try 导入from modelscope.hub.file_download import
如果是 import error给他提示一个装modelscope的提示
PR types
Function optimization
PR changes
独立下载模块,重构下载逻辑
Description
将PaddleNLP项目中涉及下载的模块独立出来,其他利用到下载操作的均统一调用该模块;
PaddleNLP项目下载逻辑比较混乱,故进行梳理和统一