
Enable Model and Tokenizer to directly load paddlepaddle models from huggingface hub #3786

Merged · 18 commits into develop · Nov 20, 2022

Conversation

@sijunhe (Collaborator) commented Nov 16, 2022

PR types

New features

PR changes

APIs

Description

Enable Model and Tokenizer to directly load paddlepaddle models from huggingface hub

Test Tokenizer

```python
# loading from the HF Hub works
tokenizer = AutoTokenizer.from_pretrained("PaddlePaddle/ernie-3.0-nano-zh", from_hf_hub=True)
# loading an official pre-trained tokenizer works
tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-nano-zh")
tokenizer.save_pretrained("temp_test_tokenizer")
# loading from a local directory works
tokenizer = AutoTokenizer.from_pretrained("temp_test_tokenizer")
```

Test Model

```python
# loading from the HF Hub works
model = AutoModel.from_pretrained("PaddlePaddle/ernie-3.0-nano-zh", from_hf_hub=True)
# loading an official pre-trained model works
model = AutoModel.from_pretrained("ernie-3.0-nano-zh")
model.save_pretrained("temp_test_model")
# loading from a local directory works
model = AutoModel.from_pretrained("temp_test_model")
```

@sijunhe sijunhe requested a review from ZeyuChen November 16, 2022 12:39
@sijunhe sijunhe self-assigned this Nov 16, 2022
@sijunhe sijunhe requested a review from wj-Mcat November 16, 2022 13:02
@wj-Mcat (Contributor) left a comment
I think this PR is great and quite exciting. Besides my inline comments, I have a few points to discuss with you:

  1. You built this on the old `from_pretrained` (the new version is `from_pretrained_v2`), but mainstream models will all migrate to the new version going forward. Would it be better to make this change directly on the new interface?
  2. The core logic of `from_hf_hub` concerns the data source. Could that logic be extracted into a standalone function that handles the download?
  3. Instead of adding the `from_hf_hub` parameter, could we fall back to searching the Hugging Face Hub by default when the model is found neither in the framework (`pretrained_init_configuration`), locally, nor among the community models?
  4. Should we add some unit tests for these changes to ensure stability?
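The reviewer's second point, extracting the data-source logic into a standalone download helper, could be sketched roughly as follows. `resolve_file` is a hypothetical name for illustration, not the function PaddleNLP actually ended up with:

```python
import os

def resolve_file(pretrained_model_name_or_path, filename,
                 from_hf_hub=False, cache_dir=None):
    """Return a local path for `filename`, fetching from the HF Hub if asked.

    Hypothetical helper sketching the reviewer's suggestion; the real
    implementation in PaddleNLP may differ.
    """
    if from_hf_hub:
        # Imported lazily so environments without huggingface_hub still work.
        from huggingface_hub import hf_hub_download
        # Downloads (and caches) the file from the Hugging Face Hub.
        return hf_hub_download(repo_id=pretrained_model_name_or_path,
                               filename=filename, cache_dir=cache_dir)
    # Otherwise treat the identifier as a local directory.
    local_path = os.path.join(pretrained_model_name_or_path, filename)
    return local_path if os.path.isfile(local_path) else None
```

Isolating the source resolution this way would let `from_pretrained` and `from_pretrained_v2` share one code path for local, community, and Hub downloads.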

Comment on lines +174 to +213

```python
def _get_model_class_from_config(cls, pretrained_model_name_or_path,
                                 config_file_path):
    with io.open(config_file_path, encoding="utf-8") as f:
        init_kwargs = json.load(f)
    # class name corresponds to this configuration
    init_class = init_kwargs.pop("init_class", None)
    init_class = init_class[:-5] if init_class.endswith(
        "Model") else init_class
    if init_class:
        for model_flag, name in MAPPING_NAMES.items():
            if model_flag in init_class:
                model_name = model_flag + 'Model'
                break
    else:
        # From pretrained_model_name_or_path
        for model_flag, name in MAPPING_NAMES.items():
            if name in pretrained_model_name_or_path.lower():
                model_name = model_flag + 'Model'
                break
    init_class = cls._name_mapping[model_name + '_Import_Class']
    class_name = cls._name_mapping[init_class]
    import_class = importlib.import_module(
        f"paddlenlp.transformers.{class_name}.modeling")
    try:
        model_class = getattr(import_class, init_class)
        return model_class
    except AttributeError as err:
        logger.error(err)
        all_model_classes = import_class.__all__
        all_tasks = {
            get_task_name(m)
            for m in all_model_classes if get_task_name(m) is not None
        }
        raise AttributeError(
            f"module '{import_class.__name__}' only supports the following classes: "
            + ", ".join(m for m in all_model_classes) + "\n"
            "Hint: you can use interface " +
            " or ".join(task + ".from_pretrained" for task in all_tasks) +
            f" to load '{pretrained_model_name_or_path}'\n")
```

@wj-Mcat (Contributor):

This method also exists in my CLI PR; later on, methods like this can be extracted into a shared utils module.

@wj-Mcat (Contributor):

Since this method is fairly generic, I suggest placing it in the paddlenlp/transformers/utils.py module so that other modules can reuse it. What do you think?

@wj-Mcat (Contributor):

Also, to stay aligned with HF we support the `architectures` field as well, so this module should take that field into account too, especially for future new models.

@sijunhe (Collaborator, Author):

out of the scope for this PR but added a TODO

Comment on lines 284 to 293

```python
if os.path.exists(config_file):
    model_class = cls._get_model_class_from_config(
        pretrained_model_name_or_path, config_file)
    logger.info("We are using %s to load '%s'." %
                (model_class, pretrained_model_name_or_path))
    return model_class.from_pretrained(
        pretrained_model_name_or_path,
        from_hf_hub=from_hf_hub,
        *model_args,
        **kwargs)
```
@wj-Mcat (Contributor):

If config_file does not exist, should we emit a warning or an error?

Because if it does not exist, the model initialization logic should, in principle, be aborted.

@sijunhe (Collaborator, Author):

added a warning
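A minimal sketch of what such a warning branch could look like, pulled out as a standalone helper for illustration. The function name and message wording are assumptions, not the code actually merged:

```python
import logging
import os

logger = logging.getLogger(__name__)

def warn_if_config_missing(config_file, pretrained_model_name_or_path):
    """Return True if the config file exists; otherwise log a warning.

    Hypothetical helper illustrating the reviewer's suggestion; the merged
    PR's actual message and control flow may differ.
    """
    if os.path.exists(config_file):
        return True
    # Warn instead of failing silently, so users see why loading stopped.
    logger.warning(
        "Cannot find config file %s for '%s'; skipping model class "
        "resolution.", config_file, pretrained_model_name_or_path)
    return False
```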

paddlenlp/transformers/auto/tokenizer.py
```python
print(
    'We use pattern recognition to recognize the Tokenizer class.')
for key, pattern in cls._name_mapping.items():
    if pattern in pretrained_model_name_or_path.lower():
```
@wj-Mcat (Contributor):

I wonder whether the `pattern in name_or_path` check here is too loose. For example, it breaks in a case like this:

```python
cls._name_mapping = {
    "BertTokenizer": "bert",
    "AlbertTokenizer": "albert",
    "RobertaTokenizer": "roberta",
}
pretrained_model_name_or_path = "tinybert-4l-312d"
```
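The problem the reviewer describes is easy to reproduce with plain substring matching. The mapping below mirrors the reviewer's example, and `match_tokenizer` is an illustrative stand-in for the real lookup loop:

```python
# Plain substring matching returns the first pattern found anywhere in the
# name, so the winner depends only on dict insertion order: "tinybert-4l-312d"
# and even "albert-base" both match "bert" before "albert" is ever tried.
name_mapping = {
    "BertTokenizer": "bert",
    "AlbertTokenizer": "albert",
    "RobertaTokenizer": "roberta",
}

def match_tokenizer(name_or_path):
    for key, pattern in name_mapping.items():
        if pattern in name_or_path.lower():
            return key
    return None

print(match_tokenizer("tinybert-4l-312d"))  # BertTokenizer
print(match_tokenizer("albert-base"))       # BertTokenizer, though AlbertTokenizer was intended
```

Matching on word boundaries, or trying the longest pattern first, would avoid these collisions.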

@sijunhe (Collaborator, Author):

since this is the original code, it's out of scope for this PR. I added a TODO and referenced your comment.

@sijunhe (Collaborator, Author) commented Nov 17, 2022

> I think this PR is great and quite exciting. Besides my inline comments, I have a few points to discuss with you:
>
> 1. You built this on the old `from_pretrained` (the new version is `from_pretrained_v2`), but mainstream models will all migrate to the new version going forward. Would it be better to make this change directly on the new interface?
> 2. The core logic of `from_hf_hub` concerns the data source. Could that logic be extracted into a standalone function that handles the download?
> 3. Instead of adding the `from_hf_hub` parameter, could we fall back to searching the Hugging Face Hub by default when the model is found neither in the framework (`pretrained_init_configuration`), locally, nor among the community models?
> 4. Should we add some unit tests for these changes to ensure stability?

  1. added the logic in `from_pretrained_v2` as well, which is tested through the bert model
  2. discussed offline that we need `from_hf_hub`
  3. same as above
  4. added integration test

@sijunhe sijunhe requested a review from wj-Mcat November 17, 2022 13:22
@wj-Mcat previously approved these changes Nov 18, 2022
@sijunhe sijunhe merged commit b33eadb into develop Nov 20, 2022
@sijunhe sijunhe deleted the hf_hub_integration branch November 20, 2022 06:40