-
-| Hardware | Hardware API | Available Inference Engine | Inference Engine API | Supports Paddle new-format quantized models | Supports FP16 mode |
-|:---:|:---:|:---:|:---:|:---:|:---:|
-| CPU | use_cpu() | Paddle Inference | use_paddle_infer_backend() | ✅ | N/A |
-| CPU | use_cpu() | ONNX Runtime | use_ort_backend() | ✅ | N/A |
-| CPU | use_cpu() | OpenVINO | use_openvino_backend() | ❔ | N/A |
-| GPU | use_gpu() | Paddle Inference | use_paddle_infer_backend() | ✅ | N/A |
-| GPU | use_gpu() | ONNX Runtime | use_ort_backend() | ✅ | ❔ |
-| GPU | use_gpu() | Paddle TensorRT | use_paddle_infer_backend() + paddle_infer_option.enable_trt = True | ✅ | ✅ |
-| GPU | use_gpu() | TensorRT | use_trt_backend() | ✅ | ✅ |
-| Kunlunxin XPU | use_kunlunxin() | Paddle Lite | use_paddle_lite_backend() | N/A | ✅ |
-| Huawei Ascend | use_ascend() | Paddle Lite | use_paddle_lite_backend() | ❔ | ✅ |
-| Graphcore IPU | use_ipu() | Paddle Inference | use_paddle_infer_backend() | ❔ | N/A |
-
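-The switches in the table all configure a single `fastdeploy.RuntimeOption`. A minimal sketch, mirroring the `seq_cls_infer.py` script below (the model paths are placeholders):
-
-```python
-import fastdeploy as fd
-
-option = fd.RuntimeOption()
-option.set_model_path("model.pdmodel", "model.pdiparams")  # placeholder paths
-option.use_gpu(0)                              # or option.use_cpu()
-option.use_paddle_infer_backend()              # pick a backend from the table above
-option.paddle_infer_option.enable_trt = True   # the "Paddle TensorRT" row
-option.trt_option.enable_fp16 = True           # FP16 mode, where the table marks it supported
-# For dynamic shapes, TensorRT also needs option.trt_option.set_shape(...), as in the script below.
-runtime = fd.Runtime(option)
-```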
diff --git a/model_zoo/ernie-m/deploy/python/seq_cls_infer.py b/model_zoo/ernie-m/deploy/python/seq_cls_infer.py
deleted file mode 100644
index 9b4662e6798a..000000000000
--- a/model_zoo/ernie-m/deploy/python/seq_cls_infer.py
+++ /dev/null
@@ -1,144 +0,0 @@
-# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import distutils.util
-import os
-
-import fastdeploy as fd
-import numpy as np
-
-from paddlenlp.transformers import AutoTokenizer
-
-
-def parse_arguments():
- import argparse
-
- parser = argparse.ArgumentParser()
-    parser.add_argument("--model_dir", required=True, help="The directory of the model.")
- parser.add_argument("--vocab_path", type=str, default="", help="The path of tokenizer vocab.")
- parser.add_argument("--model_prefix", type=str, default="model", help="The model and params file prefix.")
- parser.add_argument(
- "--device",
- type=str,
- default="cpu",
- choices=["gpu", "cpu"],
-        help="Type of inference device; supports 'cpu' or 'gpu'.",
- )
- parser.add_argument(
- "--backend",
- type=str,
- default="paddle",
- choices=["onnx_runtime", "paddle", "openvino", "tensorrt", "paddle_tensorrt"],
- help="The inference runtime backend.",
- )
- parser.add_argument("--cpu_threads", type=int, default=1, help="Number of threads to predict when using cpu.")
-    parser.add_argument("--device_id", type=int, default=0, help="The GPU device id to use for inference.")
- parser.add_argument("--batch_size", type=int, default=1, help="The batch size of data.")
- parser.add_argument("--max_length", type=int, default=128, help="The max length of sequence.")
- parser.add_argument("--log_interval", type=int, default=10, help="The interval of logging.")
-    parser.add_argument("--use_fp16", type=distutils.util.strtobool, default=False, help="Whether to use FP16 mode")
- return parser.parse_args()
-
-
-def batchfy_text(texts, batch_size):
- batch_texts = []
- batch_start = 0
- while batch_start < len(texts):
- batch_texts += [texts[batch_start : min(batch_start + batch_size, len(texts))]]
- batch_start += batch_size
- return batch_texts
-
-
-class Predictor(object):
- def __init__(self, args):
- self.tokenizer = AutoTokenizer.from_pretrained(args.model_dir)
- self.runtime = self.create_fd_runtime(args)
- self.batch_size = args.batch_size
- self.max_length = args.max_length
-
- def create_fd_runtime(self, args):
- option = fd.RuntimeOption()
- model_path = os.path.join(args.model_dir, args.model_prefix + ".pdmodel")
- params_path = os.path.join(args.model_dir, args.model_prefix + ".pdiparams")
- option.set_model_path(model_path, params_path)
- if args.device == "cpu":
- option.use_cpu()
- option.set_cpu_thread_num(args.cpu_threads)
- else:
- option.use_gpu(args.device_id)
- if args.backend == "paddle":
- option.use_paddle_infer_backend()
- elif args.backend == "onnx_runtime":
- option.use_ort_backend()
- elif args.backend == "openvino":
- option.use_openvino_backend()
- else:
- option.use_trt_backend()
- if args.backend == "paddle_tensorrt":
- option.use_paddle_infer_backend()
- option.paddle_infer_option.collect_trt_shape = True
- option.paddle_infer_option.enable_trt = True
- trt_file = os.path.join(args.model_dir, "model.trt")
- option.trt_option.set_shape(
- "input_ids", [1, 1], [args.batch_size, args.max_length], [args.batch_size, args.max_length]
- )
- if args.use_fp16:
- option.trt_option.enable_fp16 = True
- trt_file = trt_file + ".fp16"
- option.trt_option.serialize_file = trt_file
- return fd.Runtime(option)
-
- def preprocess(self, text, text_pair):
- data = self.tokenizer(text, text_pair, max_length=self.max_length, padding=True, truncation=True)
- input_ids_name = self.runtime.get_input_info(0).name
- input_map = {
- input_ids_name: np.array(data["input_ids"], dtype="int64"),
- }
- return input_map
-
- def infer(self, input_map):
- results = self.runtime.infer(input_map)
- return results
-
- def postprocess(self, infer_data):
- logits = np.array(infer_data[0])
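-        # Numerically stable softmax: subtract the per-row max before exponentiating.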
- max_value = np.max(logits, axis=1, keepdims=True)
- exp_data = np.exp(logits - max_value)
- probs = exp_data / np.sum(exp_data, axis=1, keepdims=True)
- out_dict = {"label": probs.argmax(axis=-1), "confidence": probs.max(axis=-1)}
- return out_dict
-
- def predict(self, texts, texts_pair=None):
- input_map = self.preprocess(texts, texts_pair)
- infer_result = self.infer(input_map)
- output = self.postprocess(infer_result)
- return output
-
-
-if __name__ == "__main__":
- args = parse_arguments()
- predictor = Predictor(args)
- text = ["他们告诉我,呃,我最后会被叫到一个人那里去见面。"] * 3
- text_pair = ["我从来没有被告知任何与任何人会面。", "我被告知将有一个人被叫进来与我见面。", "那个人来得有点晚。"]
- batch_texts = batchfy_text(text, args.batch_size)
- batch_texts_pair = batchfy_text(text_pair, args.batch_size)
- label_list = ["entailment", "neutral", "contradiction"]
-
- for bs, (texts, texts_pair) in enumerate(zip(batch_texts, batch_texts_pair)):
- outputs = predictor.predict(texts, texts_pair)
- for i, (sentence1, sentence2) in enumerate(zip(texts, texts_pair)):
- print(
- f'Batch id:{bs}, example id:{i}, sentence1:"{sentence1}", sentence2:"{sentence2}", '
- f"label:{label_list[outputs['label'][i]]}, confidence:{outputs['confidence'][i]:.4f}"
- )
diff --git a/model_zoo/ernie-m/deploy/simple_serving/README.md b/model_zoo/ernie-m/deploy/simple_serving/README.md
deleted file mode 100644
index 30da3c4f796a..000000000000
--- a/model_zoo/ernie-m/deploy/simple_serving/README.md
+++ /dev/null
@@ -1,37 +0,0 @@
-# Service Deployment with PaddleNLP SimpleServing
-
-## Table of Contents
-- [Environment Setup](#environment-setup)
-- [Launching the Server](#launching-the-server)
-- [Other Parameters](#other-parameters)
-
-## Environment Setup
-
-paddlenlp >= 2.5.0
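-
-For example, with pip:
-
-```bash
-pip install --upgrade "paddlenlp>=2.5.0"
-```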
-
-## Launching the Server
-### Text classification task
-#### Start the text classification server
-```bash
-paddlenlp server server_seq_cls:app --host 0.0.0.0 --port 8189
-```
-
-#### Send requests to the classification service
-```bash
-python client_seq_cls.py --language zh
-```
-
-## Other Parameters
-The `max_seq_len` and `batch_size` parameters can be set on the client side:
-```python
- data = {
- 'data': {
- 'text': texts,
- 'text_pair': text_pairs
- },
- 'parameters': {
- 'max_seq_len': args.max_seq_len,
- 'batch_size': args.batch_size
- }
- }
-```
diff --git a/model_zoo/ernie-m/deploy/simple_serving/client_seq_cls.py b/model_zoo/ernie-m/deploy/simple_serving/client_seq_cls.py
deleted file mode 100644
index 5fc1de30fa04..000000000000
--- a/model_zoo/ernie-m/deploy/simple_serving/client_seq_cls.py
+++ /dev/null
@@ -1,43 +0,0 @@
-# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import argparse
-import json
-
-import requests
-from datasets import load_dataset
-
-# yapf: disable
-parser = argparse.ArgumentParser()
-parser.add_argument("--language", required=True, type=str, help="The language for the simple serving")
-parser.add_argument("--max_seq_len", default=256, type=int, help="The maximum total input sequence length after tokenization.")
-parser.add_argument("--batch_size", default=1, type=int, help="Batch size per GPU/CPU for predicting.")
-args = parser.parse_args()
-# yapf: enable
-
-url = "http://0.0.0.0:8189/models/ernie_m_cls"
-headers = {"Content-Type": "application/json"}
-
-
-if __name__ == "__main__":
- examples = load_dataset("xnli", args.language, split="validation")[:10]
- texts = [text for text in examples["premise"]]
- text_pairs = [text for text in examples["hypothesis"]]
-
- data = {
- "data": {"text": texts, "text_pair": text_pairs},
- "parameters": {"max_seq_len": args.max_seq_len, "batch_size": args.batch_size},
- }
- r = requests.post(url=url, headers=headers, data=json.dumps(data))
- print(r.text)
diff --git a/model_zoo/ernie-m/run_classifier.py b/model_zoo/ernie-m/run_classifier.py
deleted file mode 100644
index 0d1886c6dd6f..000000000000
--- a/model_zoo/ernie-m/run_classifier.py
+++ /dev/null
@@ -1,322 +0,0 @@
-# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import json
-import os
-import random
-from dataclasses import dataclass, field
-from functools import partial
-from typing import Optional
-
-import numpy as np
-import paddle
-from datasets import load_dataset
-from paddle.io import Dataset
-from paddle.metric import Accuracy
-
-import paddlenlp
-from paddlenlp.data import DataCollatorWithPadding
-from paddlenlp.trainer import (
- PdArgumentParser,
- Trainer,
- TrainingArguments,
- get_last_checkpoint,
-)
-from paddlenlp.transformers import (
- AutoModelForSequenceClassification,
- AutoTokenizer,
- ErnieMForSequenceClassification,
-)
-from paddlenlp.utils.log import logger
-
-all_languages = ["ar", "bg", "de", "el", "en", "es", "fr", "hi", "ru", "sw", "th", "tr", "ur", "vi", "zh"]
-task_type_list = ["cross-lingual-transfer", "translate-train-all"]
-
-
-@dataclass
-class ModelArguments:
- task_type: str = field(
- default=None,
- metadata={"help": "The type of the task to finetune selected in the list: " + ", ".join(task_type_list)},
- )
- model_name_or_path: str = field(
- default=None,
- metadata={
- "help": "Path to pre-trained model or shortcut name selected in the list: "
- + ", ".join(list(ErnieMForSequenceClassification.pretrained_init_configuration.keys()))
- },
- )
- max_seq_length: Optional[int] = field(
- default=256,
- metadata={
- "help": "The maximum total input sequence length after tokenization. Sequences longer "
- "than this will be truncated, sequences shorter will be padded."
- },
- )
- classifier_dropout: Optional[float] = field(default=0.1, metadata={"help": "Dropout rate."})
- layerwise_decay: Optional[float] = field(default=0.8, metadata={"help": "Layerwise decay ratio."})
- export_model_dir: Optional[str] = field(
- default="./best_models",
- metadata={"help": "Path to directory to store the exported inference model."},
- )
- use_test_data: Optional[bool] = field(
- default=False, metadata={"help": "Whether to use a tiny dataset for CI test."}
- )
- test_data_path: Optional[str] = field(default=None, metadata={"help": "Path to tiny dataset."})
-
-
-def set_seed(seed):
-    # Use the same data seed (for data shuffling) on all procs to guarantee data
-    # consistency after sharding.
- random.seed(seed)
- np.random.seed(seed)
-    # Different op seeds (for dropout) on different procs may be preferable, e.g.:
-    # `paddle.seed(seed + paddle.distributed.get_rank())`
- paddle.seed(seed)
-
-
-def convert_example(example, tokenizer, max_seq_length=256):
-    """Convert an example into the necessary features."""
- # Convert raw text to feature
- tokenized_example = tokenizer(
- example["premise"],
- text_pair=example["hypothesis"],
- max_length=max_seq_length,
- padding=False,
- truncation=True,
- return_position_ids=True,
- return_attention_mask=True,
- return_token_type_ids=False,
- )
- return tokenized_example
-
-
-def load_xnli_dataset(args, path, lang, split=None):
-    """Load the dataset for the specified language."""
- if args.use_test_data:
- if args.test_data_path is None:
-            raise ValueError("Should specify `test_data_path` for test datasets when `use_test_data` is True.")
- data_files = {
- "train": args.test_data_path,
- "validation": args.test_data_path,
- "test": args.test_data_path,
- }
- return load_dataset("json", data_files=data_files, split=split)
- else:
- return load_dataset(path, lang, split=split)
-
-
-class XnliDataset(Dataset):
- """
-    Load the datasets of all languages in lazy mode.
- """
-
- def __init__(self, datasets):
- self.datasets = datasets
-        # The ar language split has 2000 empty samples.
- self.num_samples = [len(i) for i in datasets]
- self.cumsum_len = np.cumsum(self.num_samples)
-
- def __getitem__(self, idx):
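-        # Map the global index to (language dataset, local sample index) using the cumulative lengths.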
- language_idx = np.argmax(self.cumsum_len > idx)
- last = language_idx - 1 if language_idx > 0 else language_idx
- sample_idx = idx - self.cumsum_len[last] if idx >= self.cumsum_len[last] else idx
- return self.datasets[int(language_idx)][int(sample_idx)]
-
- def __len__(self):
- return self.cumsum_len[-1]
-
-
-def do_train():
- training_args, model_args = PdArgumentParser([TrainingArguments, ModelArguments]).parse_args_into_dataclasses()
- training_args: TrainingArguments = training_args
- model_args: ModelArguments = model_args
-
- training_args.print_config(model_args, "Model")
-
- paddle.set_device(training_args.device)
-
- set_seed(training_args.seed)
-
- # Detecting last checkpoint.
- last_checkpoint = None
- if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir:
- last_checkpoint = get_last_checkpoint(training_args.output_dir)
- if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0:
- raise ValueError(
- f"Output directory ({training_args.output_dir}) already exists and is not empty. "
- "Use --overwrite_output_dir to overcome."
- )
- elif last_checkpoint is not None and training_args.resume_from_checkpoint is None:
- logger.info(
- f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change "
- "the `--output_dir` or add `--overwrite_output_dir` to train from scratch."
- )
-
- # Dataset pre-process
- tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path)
- trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=model_args.max_seq_length)
- remove_columns = ["premise", "hypothesis"]
-
- def collect_all_languages_dataset(split):
- all_ds = []
- for language in all_languages:
- ds = load_xnli_dataset(model_args, "xnli", language, split=split)
- all_ds.append(ds.map(trans_func, batched=True, remove_columns=remove_columns))
- return XnliDataset(all_ds)
-
- if model_args.task_type == "cross-lingual-transfer":
- raw_datasets = load_xnli_dataset(model_args, "xnli", "en")
- if training_args.do_train:
- train_ds = raw_datasets["train"].map(trans_func, batched=True, remove_columns=remove_columns)
- if training_args.do_eval:
- eval_ds = raw_datasets["validation"].map(trans_func, batched=True, remove_columns=remove_columns)
- if training_args.do_predict:
- test_ds = raw_datasets["test"].map(trans_func, batched=True, remove_columns=remove_columns)
- elif model_args.task_type == "translate-train-all":
- if training_args.do_train:
- train_ds = collect_all_languages_dataset("train")
- if training_args.do_eval:
- eval_ds = collect_all_languages_dataset("validation")
- if training_args.do_predict:
- test_ds = collect_all_languages_dataset("test")
- else:
- raise ValueError(
-            f"task_type should be 'cross-lingual-transfer' or 'translate-train-all' but '{model_args.task_type}' is specified."
- )
-
- data_collator = DataCollatorWithPadding(tokenizer)
-
- num_labels = 3
- model = AutoModelForSequenceClassification.from_pretrained(
- model_args.model_name_or_path, num_labels=num_labels, classifier_dropout=model_args.classifier_dropout
- )
-
- # Define the metrics of tasks.
- def compute_metrics(p):
- # Define the metrics of tasks.
- preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
-
- preds = paddle.to_tensor(preds)
- label = paddle.to_tensor(p.label_ids)
-
- metric = Accuracy()
- result = metric.compute(preds, label)
- metric.update(result)
- accu = metric.accumulate()
- return {"accuracy": accu}
-
- trainer = Trainer(
- model=model,
- args=training_args,
- data_collator=data_collator,
- train_dataset=train_ds if training_args.do_train else None,
- eval_dataset=eval_ds if training_args.do_eval else None,
- tokenizer=tokenizer,
- compute_metrics=compute_metrics,
- # optimizers=[optimizer, lr_scheduler],
- )
-
- def using_layerwise_lr_decay(layerwise_decay, model, training_args):
- """
-        Build per-parameter groups with layer-wise learning-rate decay.
-        Bias and LayerNorm parameters are excluded from weight decay.
- """
- # params_list = [{"params": param, "learning_rate": lr * decay_ratio}, ... ]
- params_list = []
- n_layers = model.config.num_hidden_layers
- for name, param in model.named_parameters():
- ratio = 1.0
- param_to_train = {"params": param, "dygraph_key_name": name}
- if any(nd in name for nd in ["bias", "norm"]):
- param_to_train["weight_decay"] = 0.0
- else:
- param_to_train["weight_decay"] = training_args.weight_decay
-
- if "encoder.layers" in name:
- idx = name.find("encoder.layers.")
- layer = int(name[idx:].split(".")[2])
- ratio = layerwise_decay ** (n_layers - layer)
- elif "embedding" in name:
- ratio = layerwise_decay ** (n_layers + 1)
-
- param_to_train["learning_rate"] = ratio
-
- params_list.append(param_to_train)
- return params_list
-
- params_to_train = using_layerwise_lr_decay(model_args.layerwise_decay, model, training_args)
-
- trainer.set_optimizer_grouped_parameters(params_to_train)
-
- checkpoint = None
- if training_args.resume_from_checkpoint is not None:
- checkpoint = training_args.resume_from_checkpoint
- elif last_checkpoint is not None:
- checkpoint = last_checkpoint
-
- # training
- if training_args.do_train:
- train_result = trainer.train(resume_from_checkpoint=checkpoint)
- metrics = train_result.metrics
- trainer.save_model()
- trainer.log_metrics("train", metrics)
- trainer.save_metrics("train", metrics)
- trainer.save_state()
-
- # Evaluating
- if training_args.do_eval:
- combined = {}
- for language in all_languages:
- eval_ds = load_xnli_dataset(model_args, "xnli", language, split="validation")
- eval_ds = eval_ds.map(trans_func, batched=True, remove_columns=remove_columns, load_from_cache_file=True)
- metrics = trainer.evaluate(eval_dataset=eval_ds)
- metrics = {k + f"_{language}": v for k, v in metrics.items()}
- combined.update({f"eval_accuracy_{language}": metrics.get(f"eval_accuracy_{language}", 0.0)})
- trainer.log_metrics("eval", metrics)
-
- combined.update({"eval_accuracy_average": np.mean(list(combined.values()))})
- trainer.log_metrics("eval", combined)
- trainer.save_metrics("eval", combined)
-
- # Predicting
- if training_args.do_predict:
- test_ret = trainer.predict(test_ds)
- trainer.log_metrics("test", test_ret.metrics)
- logits = test_ret.predictions
- max_value = np.max(logits, axis=1, keepdims=True)
- exp_data = np.exp(logits - max_value)
- probs = exp_data / np.sum(exp_data, axis=1, keepdims=True)
- out_dict = {"label": probs.argmax(axis=-1).tolist(), "confidence": probs.max(axis=-1).tolist()}
-        with open(os.path.join(training_args.output_dir, "test_results.json"), "w") as out_file:
-            json.dump(out_dict, out_file)
-
- # Export inference model
- if training_args.do_export and paddle.distributed.get_rank() == 0:
- # You can also load from certain checkpoint
- # trainer.load_state_dict_from_checkpoint("/path/to/checkpoint/")
- model_to_save = trainer.model
- model_to_save = model_to_save._layers if isinstance(model_to_save, paddle.DataParallel) else model_to_save
- input_spec = [
- paddle.static.InputSpec(shape=[None, None], dtype="int64"),
- ]
- model_args.export_model_dir = os.path.join(model_args.export_model_dir, "export")
- paddlenlp.transformers.export_model(
- model=model_to_save, input_spec=input_spec, path=model_args.export_model_dir
- )
- trainer.tokenizer.save_pretrained(model_args.export_model_dir)
-
-
-if __name__ == "__main__":
- do_train()
diff --git a/model_zoo/plato-xl/README.md b/model_zoo/plato-xl/README.md
deleted file mode 100644
index 9d8fa5be7275..000000000000
--- a/model_zoo/plato-xl/README.md
+++ /dev/null
@@ -1,148 +0,0 @@
-# PLATO-XL: Exploring the Large-scale Pre-training of Dialogue Generation
-
-## Model Introduction
-
-Building a high-quality open-domain chatbot that can converse freely with humans in natural language has long been one of the ultimate goals of natural language processing.
-
-PLATO-XL is the industry's first open-sourced open-domain dialogue pre-training model with over ten billion parameters. It uses the parameter-efficient UnifiedTransformer (prefix LM) architecture with encoder-decoder parameter sharing, scales the model to 11B parameters, and is pre-trained on billions of dialogue samples. It also introduces role embeddings to distinguish the speakers in multi-party conversations, which improves pre-training quality. The resulting model surpasses many representative dialogue models in chit-chat evaluations and can be used directly to build high-quality open-domain chatbots.
-
-PaddleNLP ships the English PLATO-XL pre-trained model out of the box. Because PLATO-XL is so large, generating a dialogue reply at inference time is slow, and its 11B parameters may exceed the memory of some GPU models; these are common, key problems for large-model inference and deployment. PaddleNLP FastGeneration provides high-performance generation acceleration for PLATO-XL on GPU and supports model-parallel (tensor-parallel) inference, so the ten-billion-parameter model can be served on several GPUs with smaller memory. Compared with the single-GPU code, only one extra line, `enable_ft_para()`, is needed, and model parallelism further improves prediction speed.
-
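-As a rough illustration of that one-line switch, here is a hedged sketch (not the repository's `infer.py`; the import path and the plato-xl pretrained weight name are assumptions):
-
-```python
-from paddlenlp.ops import enable_ft_para  # assumed import path for the switch named above
-from paddlenlp.transformers import (
-    UnifiedTransformerLMHeadModel,
-    UnifiedTransformerTokenizer,
-)
-
-# Called before the model is built so that the weights are sharded across the MPI ranks.
-enable_ft_para()
-
-tokenizer = UnifiedTransformerTokenizer.from_pretrained("plato-xl")  # assumed weight name
-model = UnifiedTransformerLMHeadModel.from_pretrained("plato-xl")
-model.eval()
-```
-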
-This project provides an example of high-performance inference for the English PLATO-XL model with PaddleNLP FastGeneration. For PLATO-XL training and more details, please refer to [PaddlePaddle/Knover](https://github.com/PaddlePaddle/Knover/tree/develop/projects/PLATO-XL).
-
-## Getting Started
-### High-performance inference on a single GPU
-
-`infer.py` is the example script for high-performance inference with PLATO-XL. It can be run with the following command:
-
-```shell
-python infer.py --topk 4 --max_out_len 64 --use_faster --use_fp16
-```
-
-The script's arguments are as follows:
-
-- `topk`: used for Top-K sampling; only the K highest-probability tokens are sampled. Defaults to 1, i.e. greedy search.
-- `topp`: used for Top-P sampling; sampling is restricted to the highest-probability tokens whose cumulative probability does not exceed this value. Defaults to 1.0.
-- `max_out_len`: the maximum generation length. Defaults to 64.
-- `min_out_len`: the minimum generation length. Defaults to 1.
-- `temperature`: rescales the predicted probability distribution. Defaults to 1.0, i.e. the model's original probabilities are kept.
-- `use_faster`: use FastGeneration.
-- `use_fp16`: use FP16; only takes effect when FastGeneration is used.
-
-The script uses a multi-turn dialogue sample like the one below, represented as a `List[str]` in which each `str` is one utterance; the reply is generated from this conversation history.
-
-```python
- history = [
- "hi , Mary ! What do you usually like to do in your spare time ?",
- "well , I spend a lot of time watching movies .",
-        "what a confidence ! I always watch a lot of movies , too .",
- "oh really , Frank ? What kind of movies do you like ?"
- ]
-```
-
-**Note:** Because the PLATO-XL model is large, single-GPU inference needs at least 22 GB of GPU memory (with FP16), and downloading the model takes some time (the FP32 weight file is about 41 GB).
-
-### Multi-GPU parallel inference
-
-Multi-GPU parallel inference currently depends on MPI (either [MPICH](https://www.mpich.org) or [OpenMPI](https://www.open-mpi.org) works) and [NCCL](https://developer.nvidia.com/nccl); please install these dependencies first. Afterwards, `infer.py` is still used for prediction; the only difference from the single-GPU case is that it is launched through mpi, as follows:
-
-```shell
-mpirun -n 4 python infer.py --topk 4 --max_out_len 64 --use_faster --use_fp16
-```
-
-Here `-n 4` specifies the number of processes and GPUs to use; with `n` set to 1, single-GPU inference is still performed. Since multi-GPU parallel inference uses different dependency libraries from single-GPU inference, the first run will trigger JIT compilation again.
-
-### Performance benchmarking
-
-`infer.py` also supports performance benchmarking; simply add `--profile` to the prediction command above, for example:
-
-```shell
-mpirun -n 4 python infer.py --batch_size 8 --min_out_len 20 --max_out_len 20 --topk 1 --use_faster --use_fp16 --profile
-```
-
-In addition, `batch_size` and `min_out_len` are specified to measure performance for a particular input/output size; the benchmark reports the average latency over repeated runs. Below are performance numbers for single-GPU high-performance inference and 4-GPU tensor-parallel inference (V100, CUDA 10.2, input length 60, output length 20); 4-GPU parallelism is roughly 2x as fast as a single GPU.
-
-