
Enable Model and Tokenizer to directly load paddlepaddle models from huggingface hub #3786

Merged · 18 commits into develop · Nov 20, 2022

Conversation

@sijunhe (Collaborator) commented Nov 16, 2022

PR types

New features

PR changes

APIs

Description

Enable Model and Tokenizer to directly load paddlepaddle models from huggingface hub

Test Tokenizer

```python
# loading from the HF Hub works
tokenizer = AutoTokenizer.from_pretrained("PaddlePaddle/ernie-3.0-nano-zh", from_hf_hub=True)
# loading an official pre-trained tokenizer works
tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-nano-zh")
tokenizer.save_pretrained("temp_test_tokenizer")
# loading from a local directory works
tokenizer = AutoTokenizer.from_pretrained("temp_test_tokenizer")
```

Test Model

```python
# loading from the HF Hub works
model = AutoModel.from_pretrained("PaddlePaddle/ernie-3.0-nano-zh", from_hf_hub=True)
# loading an official pre-trained model works
model = AutoModel.from_pretrained("ernie-3.0-nano-zh")
model.save_pretrained("temp_test_model")
# loading from a local directory works
model = AutoModel.from_pretrained("temp_test_model")
```

@sijunhe sijunhe requested a review from ZeyuChen November 16, 2022 12:39
@sijunhe sijunhe self-assigned this Nov 16, 2022
@sijunhe sijunhe requested a review from wj-Mcat November 16, 2022 13:02
@wj-Mcat (Contributor) left a comment
I think this PR is great and quite exciting. Besides my inline comments, I have a few points to discuss with you:

  1. You built this on the old `from_pretrained` (the new version is `from_pretrained_v2`), but mainstream models will all migrate to the new version going forward. Would it be better to make this change directly on the new interface?
  2. The core logic of `from_hf_hub` concerns the data source. Could that logic be extracted into a standalone function that handles the download?
  3. Instead of adding the `from_hf_hub` parameter, could we fall back to searching the Hugging Face Hub by default when the model is found neither in the framework (`pretrained_init_configuration`), locally, nor among the community models?
  4. Should we add some unit tests for these changes to ensure stability?
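The reviewer's second point, extracting the data-source logic into a standalone download helper, could be sketched roughly as follows. `resolve_file` is a hypothetical name for illustration, not the function PaddleNLP actually ended up with:

```python
import os

def resolve_file(pretrained_model_name_or_path, filename,
                 from_hf_hub=False, cache_dir=None):
    """Return a local path for `filename`, fetching from the HF Hub if asked.

    Hypothetical helper sketching the reviewer's suggestion; the real
    implementation in PaddleNLP may differ.
    """
    if from_hf_hub:
        # Imported lazily so environments without huggingface_hub still work.
        from huggingface_hub import hf_hub_download
        # Downloads (and caches) the file from the Hugging Face Hub.
        return hf_hub_download(repo_id=pretrained_model_name_or_path,
                               filename=filename, cache_dir=cache_dir)
    # Otherwise treat the identifier as a local directory.
    local_path = os.path.join(pretrained_model_name_or_path, filename)
    return local_path if os.path.isfile(local_path) else None
```

Isolating the source resolution this way would let `from_pretrained` and `from_pretrained_v2` share one code path for local, community, and Hub downloads.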

Comment on lines +174 to +213

```python
def _get_model_class_from_config(cls, pretrained_model_name_or_path,
                                 config_file_path):
    with io.open(config_file_path, encoding="utf-8") as f:
        init_kwargs = json.load(f)
    # class name corresponds to this configuration
    init_class = init_kwargs.pop("init_class", None)
    init_class = init_class[:-5] if init_class.endswith(
        "Model") else init_class
    if init_class:
        for model_flag, name in MAPPING_NAMES.items():
            if model_flag in init_class:
                model_name = model_flag + 'Model'
                break
    else:
        # From pretrained_model_name_or_path
        for model_flag, name in MAPPING_NAMES.items():
            if name in pretrained_model_name_or_path.lower():
                model_name = model_flag + 'Model'
                break
    init_class = cls._name_mapping[model_name + '_Import_Class']
    class_name = cls._name_mapping[init_class]
    import_class = importlib.import_module(
        f"paddlenlp.transformers.{class_name}.modeling")
    try:
        model_class = getattr(import_class, init_class)
        return model_class
    except AttributeError as err:
        logger.error(err)
        all_model_classes = import_class.__all__
        all_tasks = {
            get_task_name(m)
            for m in all_model_classes if get_task_name(m) is not None
        }
        raise AttributeError(
            f"module '{import_class.__name__}' only supports the following classes: "
            + ", ".join(m for m in all_model_classes) + "\n"
            "Hint: you can use interface " +
            " or ".join(task + ".from_pretrained" for task in all_tasks) +
            f" to load '{pretrained_model_name_or_path}'\n")
```

@wj-Mcat (Contributor):

This method also exists in my CLI PR; later on, methods like this can be extracted into a shared utils module.

@wj-Mcat (Contributor):

Since this method is fairly generic, I suggest placing it in the paddlenlp/transformers/utils.py module so that other modules can reuse it. What do you think?

@wj-Mcat (Contributor):

Also, to stay aligned with HF we support the `architectures` field as well, so this module should take that field into account too, especially for future new models.

@sijunhe (Collaborator, Author):

out of the scope for this PR but added a TODO

Comment on lines 284 to 293

```python
if os.path.exists(config_file):
    model_class = cls._get_model_class_from_config(
        pretrained_model_name_or_path, config_file)
    logger.info("We are using %s to load '%s'." %
                (model_class, pretrained_model_name_or_path))
    return model_class.from_pretrained(
        pretrained_model_name_or_path,
        from_hf_hub=from_hf_hub,
        *model_args,
        **kwargs)
```
@wj-Mcat (Contributor):

If config_file does not exist, should we emit a warning or an error?

Because if it does not exist, the model initialization logic should, in principle, be aborted.

@sijunhe (Collaborator, Author):

added a warning
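A minimal sketch of what such a warning branch could look like, pulled out as a standalone helper for illustration. The function name and message wording are assumptions, not the code actually merged:

```python
import logging
import os

logger = logging.getLogger(__name__)

def warn_if_config_missing(config_file, pretrained_model_name_or_path):
    """Return True if the config file exists; otherwise log a warning.

    Hypothetical helper illustrating the reviewer's suggestion; the merged
    PR's actual message and control flow may differ.
    """
    if os.path.exists(config_file):
        return True
    # Warn instead of failing silently, so users see why loading stopped.
    logger.warning(
        "Cannot find config file %s for '%s'; skipping model class "
        "resolution.", config_file, pretrained_model_name_or_path)
    return False
```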

paddlenlp/transformers/auto/tokenizer.py
```python
print(
    'We use pattern recognition to recognize the Tokenizer class.')
for key, pattern in cls._name_mapping.items():
    if pattern in pretrained_model_name_or_path.lower():
```
@wj-Mcat (Contributor):

I wonder whether the `pattern in name_or_path` check here is too loose. For example, it breaks in a case like this:

```python
cls._name_mapping = {
    "BertTokenizer": "bert",
    "AlbertTokenizer": "albert",
    "RobertaTokenizer": "roberta",
}
pretrained_model_name_or_path = "tinybert-4l-312d"
```
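The problem the reviewer describes is easy to reproduce with plain substring matching. The mapping below mirrors the reviewer's example, and `match_tokenizer` is an illustrative stand-in for the real lookup loop:

```python
# Plain substring matching returns the first pattern found anywhere in the
# name, so the winner depends only on dict insertion order: "tinybert-4l-312d"
# and even "albert-base" both match "bert" before "albert" is ever tried.
name_mapping = {
    "BertTokenizer": "bert",
    "AlbertTokenizer": "albert",
    "RobertaTokenizer": "roberta",
}

def match_tokenizer(name_or_path):
    for key, pattern in name_mapping.items():
        if pattern in name_or_path.lower():
            return key
    return None

print(match_tokenizer("tinybert-4l-312d"))  # BertTokenizer
print(match_tokenizer("albert-base"))       # BertTokenizer, though AlbertTokenizer was intended
```

Matching on word boundaries, or trying the longest pattern first, would avoid these collisions.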

@sijunhe (Collaborator, Author):

since this is the original code, it's out of scope for this PR. I added a TODO and referenced your comment.

@sijunhe (Collaborator, Author) commented Nov 17, 2022

> I think this PR is great and quite exciting. Besides my inline comments, I have a few points to discuss with you:
>
> 1. You built this on the old `from_pretrained` (the new version is `from_pretrained_v2`), but mainstream models will all migrate to the new version going forward. Would it be better to make this change directly on the new interface?
> 2. The core logic of `from_hf_hub` concerns the data source. Could that logic be extracted into a standalone function that handles the download?
> 3. Instead of adding the `from_hf_hub` parameter, could we fall back to searching the Hugging Face Hub by default when the model is found neither in the framework (`pretrained_init_configuration`), locally, nor among the community models?
> 4. Should we add some unit tests for these changes to ensure stability?

  1. added the logic in `from_pretrained_v2` as well, which is tested through the bert model
  2. discussed offline that we need `from_hf_hub`
  3. same as above
  4. added integration test

@sijunhe sijunhe requested a review from wj-Mcat November 17, 2022 13:22
@wj-Mcat previously approved these changes Nov 18, 2022
@sijunhe sijunhe merged commit b33eadb into develop Nov 20, 2022
@sijunhe sijunhe deleted the hf_hub_integration branch November 20, 2022 06:40