[utc] fix loading local model in taskflow #4505

Merged: 5 commits, Jan 17, 2023
Changes from 3 commits
10 changes: 6 additions & 4 deletions applications/zero_shot_text_classification/README.md
@@ -114,7 +114,8 @@ python run_train.py \
--disable_tqdm True \
--metric_for_best_model macro_f1 \
--load_best_model_at_end True \
- --save_total_limit 1
+ --save_total_limit 1 \
+ --save_plm
```

When running in a GPU environment, you can set the gpus argument to train on multiple GPUs:
@@ -143,7 +144,8 @@ python -u -m paddle.distributed.launch --gpus "0,1" run_train.py \
--disable_tqdm True \
--metric_for_best_model macro_f1 \
--load_best_model_at_end True \
- --save_total_limit 1
+ --save_total_limit 1 \
+ --save_plm
```

Because the example sets the `--do_eval` flag, evaluation runs automatically after training finishes.
@@ -204,7 +206,7 @@ python run_eval.py \
>>> from pprint import pprint
>>> from paddlenlp import Taskflow
>>> schema = ["病情诊断", "治疗方案", "病因分析", "指标解读", "就医建议", "疾病表述", "后果表述", "注意事项", "功效作用", "医疗费用", "其他"]
- >>> my_cls = Taskflow("zero_shot_text_classification", schema=schema, task_path='./checkpoint/model_best', precision="fp16")
+ >>> my_cls = Taskflow("zero_shot_text_classification", schema=schema, task_path='./checkpoint/model_best/plm', precision="fp16")
>>> pprint(my_cls("中性粒细胞比率偏低"))
```

@@ -221,7 +223,7 @@ from paddlenlp import SimpleServer, Taskflow
schema = ["病情诊断", "治疗方案", "病因分析", "指标解读", "就医建议"]
utc = Taskflow("zero_shot_text_classification",
schema=schema,
- task_path="../../checkpoint/model_best/",
+ task_path="../../checkpoint/model_best/plm",
precision="fp32")
app = SimpleServer()
app.register_taskflow("taskflow/utc", utc)
(second changed file; path not shown in this view)
@@ -41,15 +41,15 @@ schema = ["病情诊断", "治疗方案", "病因分析", "指标解读", "就

```python
# Default task_path
- utc = Taskflow("zero_shot_text_classification", task_path="../../checkpoint/model_best/", schema=schema)
+ utc = Taskflow("zero_shot_text_classification", task_path="../../checkpoint/model_best/plm", schema=schema)
```

#### Multi-GPU serving prediction
PaddleNLP SimpleServing supports load-balanced prediction across multiple GPUs: simply register two Taskflow tasks when registering the service. Example code is shown below.

```python
- utc1 = Taskflow("zero_shot_text_classification", task_path="../../checkpoint/model_best/", schema=schema)
- utc2 = Taskflow("zero_shot_text_classification", task_path="../../checkpoint/model_best/", schema=schema)
+ utc1 = Taskflow("zero_shot_text_classification", task_path="../../checkpoint/model_best/plm", schema=schema)
+ utc2 = Taskflow("zero_shot_text_classification", task_path="../../checkpoint/model_best/plm", schema=schema)
service.register_taskflow("taskflow/utc", [utc1, utc2])
```

(third changed file; path not shown in this view)
@@ -17,7 +17,7 @@
# The schema changed to your defined schema
schema = ["病情诊断", "治疗方案", "病因分析", "指标解读", "就医建议", "疾病表述", "后果表述", "注意事项", "功效作用", "医疗费用", "其他"]
# The task path changed to your best model path
- utc = Taskflow("zero_shot_text_classification", task_path="../../checkpoint/model_best/", schema=schema)
+ utc = Taskflow("zero_shot_text_classification", task_path="../../checkpoint/model_best/plm", schema=schema)
# If you want to define the finetuned utc service
app = SimpleServer()
app.register_taskflow("taskflow/utc", utc)
2 changes: 1 addition & 1 deletion paddlenlp/taskflow/zero_shot_text_classification.py
@@ -105,7 +105,7 @@ def _construct_model(self, model):
if self.from_hf_hub:
model_instance = UTC.from_pretrained(self._task_path, from_hf_hub=self.from_hf_hub)
else:
- model_instance = UTC.from_pretrained(model)
+ model_instance = UTC.from_pretrained(self._task_path)
Contributor:
@LemonNoel After switching to the from_pretrained("{local_path}") form, you also need to define resource_files_names and resource_files_urls and call self._check_task_files() in __init__; see https://github.com/PaddlePaddle/PaddleNLP/blob/develop/paddlenlp/taskflow/information_extraction.py#L115 for reference.

As for the issue that a model first loaded with from_pretrained("utc_large") can then no longer be loaded with from_pretrained("{local_path}"), could @wj-Mcat please take a look as well? We can check later whether this gap can be closed.
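For reference, a rough sketch of the resource-file pattern the first paragraph points to, following the style of information_extraction.py; the file names, URLs, and md5 values below are placeholders, not the actual UTC resources, and the class body is abbreviated:

```python
from paddlenlp.taskflow.task import Task


class UTCTask(Task):
    # Placeholder file names and URLs for illustration; the real values
    # belong in the task module, one entry per supported model name.
    resource_files_names = {
        "model_state": "model_state.pdparams",
        "config": "config.json",
        "vocab_file": "vocab.txt",
    }
    resource_files_urls = {
        "utc-large": {
            "model_state": ["https://example.com/utc-large/model_state.pdparams", "<md5>"],
            "config": ["https://example.com/utc-large/config.json", "<md5>"],
            "vocab_file": ["https://example.com/utc-large/vocab.txt", "<md5>"],
        },
    }

    def __init__(self, task, model, schema=None, **kwargs):
        super().__init__(task=task, model=model, **kwargs)
        # Downloads (or verifies) the files declared above into self._task_path,
        # so that from_pretrained(self._task_path) can load a local checkpoint.
        self._check_task_files()
```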

Collaborator:
If it is changed this way, the if is no longer needed; a single line model_instance = UTC.from_pretrained(self._task_path, from_hf_hub=self.from_hf_hub) is enough.

@linjieccc Does the UTC taskflow download any files outside of from_pretrained? For a model and taskflow that already integrate a pretrained config like this one, I suggest letting from_pretrained handle all downloading rather than keeping a separate download path.
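A sketch of the suggested simplification inside _construct_model, assuming self.from_hf_hub stays available as an attribute and keeping the trailing lines from the diff above:

```python
def _construct_model(self, model):
    """Construct the inference model for the predictor."""
    # A single call covers both cases: from_hf_hub=True pulls from the hub,
    # otherwise self._task_path is treated as a local checkpoint directory.
    model_instance = UTC.from_pretrained(self._task_path, from_hf_hub=self.from_hf_hub)
    self._model = model_instance
    self._model.eval()
```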

Contributor:
@sijunhe Agreed, it would indeed be better to have from_pretrained handle all downloading here; that also avoids downloading model files twice. We will unify and upgrade this logic in a follow-up.

For non-transformers models inside Taskflow, such as GRU-CRF, the models currently live under $PPNLP_HOME/.taskflow/{task_name}/{model_name}. Should those models also be managed uniformly under $PPNLP_HOME/.paddlenlp/models, with the loading code implemented in Taskflow?

self._model = model_instance
self._model.eval()
