Enable Model and Tokenizer to directly load paddlepaddle models from huggingface hub #3786
Conversation
I think this PR is great and quite exciting. Beyond my inline comments, a few points I'd like to discuss:
- You've built this on top of the old `from_pretrained` (the new version is `from_pretrained_v2`), but mainstream models will all migrate to the new version going forward, so would it be better to make these changes directly on the new interface? The core concern of `from_hf_hub` is the data source, so could that logic be factored out into a standalone function that handles the download?
- Could we drop the `from_hf_hub` parameter altogether and simply fall back to the Hugging Face Hub by default when the model is found neither in the framework (`pretrained_init_configuration`) locally nor in the community models?
- Should we add some unit tests for these changes to guard their stability?
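The fallback order suggested in the second point could be factored out roughly like this (a sketch only; `resolve_model_source` and the two stand-in collections are hypothetical, not PaddleNLP APIs):

```python
# Hypothetical sketch of the suggested resolution order: built-in configs,
# then the community server, then the Hugging Face Hub as a last resort.
BUILTIN_CONFIGS = {"bert-base-uncased"}   # stand-in for pretrained_init_configuration
COMMUNITY_MODELS = {"user/custom-bert"}   # stand-in for the community model list

def resolve_model_source(name_or_path: str) -> str:
    """Return where a model name should be loaded from."""
    if name_or_path in BUILTIN_CONFIGS:
        return "builtin"
    if name_or_path in COMMUNITY_MODELS:
        return "community"
    # Nothing matched locally or on the community server: fall back to
    # the Hugging Face Hub instead of requiring an explicit from_hf_hub flag.
    return "hf_hub"
```

With this shape, the download step stays in one place and the caller never needs a `from_hf_hub` argument.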
```python
def _get_model_class_from_config(cls, pretrained_model_name_or_path,
                                 config_file_path):
    with io.open(config_file_path, encoding="utf-8") as f:
        init_kwargs = json.load(f)
    # class name corresponding to this configuration
    init_class = init_kwargs.pop("init_class", None)
    # guard against a missing "init_class" key before calling endswith
    if init_class and init_class.endswith("Model"):
        init_class = init_class[:-5]
    if init_class:
        for model_flag, name in MAPPING_NAMES.items():
            if model_flag in init_class:
                model_name = model_flag + 'Model'
                break
    else:
        # fall back to matching against pretrained_model_name_or_path
        for model_flag, name in MAPPING_NAMES.items():
            if name in pretrained_model_name_or_path.lower():
                model_name = model_flag + 'Model'
                break
    init_class = cls._name_mapping[model_name + '_Import_Class']
    class_name = cls._name_mapping[init_class]
    import_class = importlib.import_module(
        f"paddlenlp.transformers.{class_name}.modeling")
    try:
        model_class = getattr(import_class, init_class)
        return model_class
    except AttributeError as err:
        logger.error(err)
        all_model_classes = import_class.__all__
        all_tasks = {
            get_task_name(m)
            for m in all_model_classes if get_task_name(m) is not None
        }
        raise AttributeError(
            f"module '{import_class.__name__}' only supports the following classes: "
            + ", ".join(all_model_classes) + "\n"
            "Hint: you can use interface " +
            " or ".join(task + ".from_pretrained" for task in all_tasks) +
            f" to load '{pretrained_model_name_or_path}'\n")
```
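In isolation, the two-stage name lookup above behaves like the following sketch (`guess_model_name` and this tiny `MAPPING_NAMES` are illustrative stand-ins, not the real mapping):

```python
# Toy stand-in for MAPPING_NAMES: model flag -> name fragment expected
# in the model identifier.
MAPPING_NAMES = {"Gpt": "gpt", "Bert": "bert", "Ernie": "ernie"}

def guess_model_name(init_class, name_or_path):
    """Mirror the two-stage lookup: try init_class first, then the path."""
    if init_class:
        # stage 1: match the class name read from the config file
        for model_flag in MAPPING_NAMES:
            if model_flag in init_class:
                return model_flag + "Model"
        return None
    # stage 2: config had no init_class, match the model identifier instead
    for model_flag, fragment in MAPPING_NAMES.items():
        if fragment in name_or_path.lower():
            return model_flag + "Model"
    return None
```

Note the two stages never mix: the identifier is consulted only when the config carries no `init_class` at all.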
This method already exists in my cli PR; such helpers could later be extracted into a shared utils module.
Since this method is fairly generic, I'd suggest moving it into the `paddlenlp/transformers/utils.py` module so that other modules can reuse it. What do you think?
Also, for alignment with HF we are compatible with the `architectures` field, so this module should take that field's information into account as well, especially for future new models.
Out of scope for this PR, but I added a TODO.
```python
if os.path.exists(config_file):
    model_class = cls._get_model_class_from_config(
        pretrained_model_name_or_path, config_file)
    logger.info("We are using %s to load '%s'." %
                (model_class, pretrained_model_name_or_path))
    return model_class.from_pretrained(
        pretrained_model_name_or_path,
        from_hf_hub=from_hf_hub,
        *model_args,
        **kwargs)
```
If `config_file` does not exist, should we emit a warning or an error? If it is missing, the model-initialization logic should in principle be aborted.
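The suggested guard could look roughly like this (a sketch; the function name and message are illustrative, and `warnings.warn` stands in for the project's logger):

```python
import os
import warnings

def load_model_class(config_file):
    """Guard sketch: warn and stop early when the config file is missing."""
    if not os.path.exists(config_file):
        # Without a config there is no way to resolve the model class, so
        # surface the problem now instead of failing later with a less
        # obvious error.
        warnings.warn(f"Config file '{config_file}' not found; "
                      "aborting model initialization.")
        return None
    ...  # proceed with _get_model_class_from_config as in the diff
```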
Added a warning.
```python
print('We use pattern recognition to recognize the Tokenizer class.')
for key, pattern in cls._name_mapping.items():
    if pattern in pretrained_model_name_or_path.lower():
```
I wonder whether the `pattern in name_or_path` check here is too loose. For example, it breaks in a case like this:

```python
cls._name_mapping = {
    "BertTokenizer": "bert",
    "AlbertTokenizer": "albert",
    "RobertaTokenizer": "roberta",
}
pretrained_model_name_or_path = "tinybert-4l-312d"
```
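This looseness is easy to reproduce with a runnable stand-in for the loop (`match_tokenizer` is a hypothetical helper, not PaddleNLP code): `tinybert-4l-312d` matches `BertTokenizer` even though TinyBERT may need its own tokenizer, and because `"bert"` is a substring of `"albert"`, dict insertion order alone decides which class wins:

```python
# Stand-in for cls._name_mapping: tokenizer class -> substring pattern.
name_mapping = {
    "BertTokenizer": "bert",
    "AlbertTokenizer": "albert",
    "RobertaTokenizer": "roberta",
}

def match_tokenizer(name_or_path):
    """First-match substring lookup, mirroring the original loop."""
    for key, pattern in name_mapping.items():
        if pattern in name_or_path.lower():
            return key
    return None

# "tinybert-4l-312d" is caught by the "bert" pattern:
print(match_tokenizer("tinybert-4l-312d"))     # BertTokenizer
# "bert" also occurs inside "albert", so an ALBERT model is misrouted:
print(match_tokenizer("albert-chinese-base"))  # BertTokenizer, not AlbertTokenizer
```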
Since this is the original code, it's out of scope for this PR; I added a TODO and referenced your comment.
PR types
New features
PR changes
APIs
Description
Enable Model and Tokenizer to load PaddlePaddle models directly from the Hugging Face Hub.
Test Tokenizer
Test Model