-
-| Hardware | Hardware API | Available Inference Engine | Inference Engine API | Supports Paddle new-format quantized models | Supports FP16 mode |
-|:---:|:---:|:---:|:---:|:---:|:---:|
-| CPU | use_cpu() | Paddle Inference | use_paddle_infer_backend() | ✅ | N/A |
-| CPU | use_cpu() | ONNX Runtime | use_ort_backend() | ✅ | N/A |
-| CPU | use_cpu() | OpenVINO | use_openvino_backend() | ❔ | N/A |
-| GPU | use_gpu() | Paddle Inference | use_paddle_infer_backend() | ✅ | N/A |
-| GPU | use_gpu() | ONNX Runtime | use_ort_backend() | ✅ | ❔ |
-| GPU | use_gpu() | Paddle TensorRT | use_paddle_infer_backend() + paddle_infer_option.enable_trt = True | ✅ | ✅ |
-| GPU | use_gpu() | TensorRT | use_trt_backend() | ✅ | ✅ |
-| Kunlunxin XPU | use_kunlunxin() | Paddle Lite | use_paddle_lite_backend() | N/A | ✅ |
-| Huawei Ascend | use_ascend() | Paddle Lite | use_paddle_lite_backend() | ❔ | ✅ |
-| Graphcore IPU | use_ipu() | Paddle Inference | use_paddle_infer_backend() | ❔ | N/A |
-
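-The switches in the table all configure a single `fastdeploy.RuntimeOption`. A minimal sketch, mirroring the `seq_cls_infer.py` script below (the model paths are placeholders):
-
-```python
-import fastdeploy as fd
-
-option = fd.RuntimeOption()
-option.set_model_path("model.pdmodel", "model.pdiparams")  # placeholder paths
-option.use_gpu(0)                              # or option.use_cpu()
-option.use_paddle_infer_backend()              # pick a backend from the table above
-option.paddle_infer_option.enable_trt = True   # the "Paddle TensorRT" row
-option.trt_option.enable_fp16 = True           # FP16 mode, where the table marks it supported
-# For dynamic shapes, TensorRT also needs option.trt_option.set_shape(...), as in the script below.
-runtime = fd.Runtime(option)
-```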
diff --git a/model_zoo/ernie-m/deploy/python/seq_cls_infer.py b/model_zoo/ernie-m/deploy/python/seq_cls_infer.py
deleted file mode 100644
index 9b4662e6798a..000000000000
--- a/model_zoo/ernie-m/deploy/python/seq_cls_infer.py
+++ /dev/null
@@ -1,144 +0,0 @@
-# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import distutils.util
-import os
-
-import fastdeploy as fd
-import numpy as np
-
-from paddlenlp.transformers import AutoTokenizer
-
-
-def parse_arguments():
- import argparse
-
- parser = argparse.ArgumentParser()
-    parser.add_argument("--model_dir", required=True, help="The directory of the model.")
- parser.add_argument("--vocab_path", type=str, default="", help="The path of tokenizer vocab.")
- parser.add_argument("--model_prefix", type=str, default="model", help="The model and params file prefix.")
- parser.add_argument(
- "--device",
- type=str,
- default="cpu",
- choices=["gpu", "cpu"],
-        help="Type of inference device; supports 'cpu' or 'gpu'.",
- )
- parser.add_argument(
- "--backend",
- type=str,
- default="paddle",
- choices=["onnx_runtime", "paddle", "openvino", "tensorrt", "paddle_tensorrt"],
- help="The inference runtime backend.",
- )
- parser.add_argument("--cpu_threads", type=int, default=1, help="Number of threads to predict when using cpu.")
-    parser.add_argument("--device_id", type=int, default=0, help="The GPU device id to use for inference.")
- parser.add_argument("--batch_size", type=int, default=1, help="The batch size of data.")
- parser.add_argument("--max_length", type=int, default=128, help="The max length of sequence.")
- parser.add_argument("--log_interval", type=int, default=10, help="The interval of logging.")
-    parser.add_argument("--use_fp16", type=distutils.util.strtobool, default=False, help="Whether to use FP16 mode")
- return parser.parse_args()
-
-
-def batchfy_text(texts, batch_size):
- batch_texts = []
- batch_start = 0
- while batch_start < len(texts):
- batch_texts += [texts[batch_start : min(batch_start + batch_size, len(texts))]]
- batch_start += batch_size
- return batch_texts
-
-
-class Predictor(object):
- def __init__(self, args):
- self.tokenizer = AutoTokenizer.from_pretrained(args.model_dir)
- self.runtime = self.create_fd_runtime(args)
- self.batch_size = args.batch_size
- self.max_length = args.max_length
-
- def create_fd_runtime(self, args):
- option = fd.RuntimeOption()
- model_path = os.path.join(args.model_dir, args.model_prefix + ".pdmodel")
- params_path = os.path.join(args.model_dir, args.model_prefix + ".pdiparams")
- option.set_model_path(model_path, params_path)
- if args.device == "cpu":
- option.use_cpu()
- option.set_cpu_thread_num(args.cpu_threads)
- else:
- option.use_gpu(args.device_id)
- if args.backend == "paddle":
- option.use_paddle_infer_backend()
- elif args.backend == "onnx_runtime":
- option.use_ort_backend()
- elif args.backend == "openvino":
- option.use_openvino_backend()
- else:
- option.use_trt_backend()
- if args.backend == "paddle_tensorrt":
- option.use_paddle_infer_backend()
- option.paddle_infer_option.collect_trt_shape = True
- option.paddle_infer_option.enable_trt = True
- trt_file = os.path.join(args.model_dir, "model.trt")
- option.trt_option.set_shape(
- "input_ids", [1, 1], [args.batch_size, args.max_length], [args.batch_size, args.max_length]
- )
- if args.use_fp16:
- option.trt_option.enable_fp16 = True
- trt_file = trt_file + ".fp16"
- option.trt_option.serialize_file = trt_file
- return fd.Runtime(option)
-
- def preprocess(self, text, text_pair):
- data = self.tokenizer(text, text_pair, max_length=self.max_length, padding=True, truncation=True)
- input_ids_name = self.runtime.get_input_info(0).name
- input_map = {
- input_ids_name: np.array(data["input_ids"], dtype="int64"),
- }
- return input_map
-
- def infer(self, input_map):
- results = self.runtime.infer(input_map)
- return results
-
- def postprocess(self, infer_data):
- logits = np.array(infer_data[0])
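-        # Numerically stable softmax: subtract the per-row max before exponentiating.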
- max_value = np.max(logits, axis=1, keepdims=True)
- exp_data = np.exp(logits - max_value)
- probs = exp_data / np.sum(exp_data, axis=1, keepdims=True)
- out_dict = {"label": probs.argmax(axis=-1), "confidence": probs.max(axis=-1)}
- return out_dict
-
- def predict(self, texts, texts_pair=None):
- input_map = self.preprocess(texts, texts_pair)
- infer_result = self.infer(input_map)
- output = self.postprocess(infer_result)
- return output
-
-
-if __name__ == "__main__":
- args = parse_arguments()
- predictor = Predictor(args)
- text = ["他们告诉我,呃,我最后会被叫到一个人那里去见面。"] * 3
- text_pair = ["我从来没有被告知任何与任何人会面。", "我被告知将有一个人被叫进来与我见面。", "那个人来得有点晚。"]
- batch_texts = batchfy_text(text, args.batch_size)
- batch_texts_pair = batchfy_text(text_pair, args.batch_size)
- label_list = ["entailment", "neutral", "contradiction"]
-
- for bs, (texts, texts_pair) in enumerate(zip(batch_texts, batch_texts_pair)):
- outputs = predictor.predict(texts, texts_pair)
- for i, (sentence1, sentence2) in enumerate(zip(texts, texts_pair)):
- print(
- f'Batch id:{bs}, example id:{i}, sentence1:"{sentence1}", sentence2:"{sentence2}", '
- f"label:{label_list[outputs['label'][i]]}, confidence:{outputs['confidence'][i]:.4f}"
- )
diff --git a/model_zoo/ernie-m/deploy/simple_serving/README.md b/model_zoo/ernie-m/deploy/simple_serving/README.md
deleted file mode 100644
index 30da3c4f796a..000000000000
--- a/model_zoo/ernie-m/deploy/simple_serving/README.md
+++ /dev/null
@@ -1,37 +0,0 @@
-# Service Deployment with PaddleNLP SimpleServing
-
-## Table of Contents
-- [Environment Setup](#environment-setup)
-- [Launching the Server](#launching-the-server)
-- [Other Parameters](#other-parameters)
-
-## Environment Setup
-
-paddlenlp >= 2.5.0
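-
-For example, with pip:
-
-```bash
-pip install --upgrade "paddlenlp>=2.5.0"
-```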
-
-## Launching the Server
-### Text classification task
-#### Start the text classification server
-```bash
-paddlenlp server server_seq_cls:app --host 0.0.0.0 --port 8189
-```
-
-#### Send requests to the classification service
-```bash
-python client_seq_cls.py --language zh
-```
-
-## Other Parameters
-The `max_seq_len` and `batch_size` parameters can be set on the client side:
-```python
- data = {
- 'data': {
- 'text': texts,
- 'text_pair': text_pairs
- },
- 'parameters': {
- 'max_seq_len': args.max_seq_len,
- 'batch_size': args.batch_size
- }
- }
-```
diff --git a/model_zoo/ernie-m/deploy/simple_serving/client_seq_cls.py b/model_zoo/ernie-m/deploy/simple_serving/client_seq_cls.py
deleted file mode 100644
index 5fc1de30fa04..000000000000
--- a/model_zoo/ernie-m/deploy/simple_serving/client_seq_cls.py
+++ /dev/null
@@ -1,43 +0,0 @@
-# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import argparse
-import json
-
-import requests
-from datasets import load_dataset
-
-# yapf: disable
-parser = argparse.ArgumentParser()
-parser.add_argument("--language", required=True, type=str, help="The language for the simple serving")
-parser.add_argument("--max_seq_len", default=256, type=int, help="The maximum total input sequence length after tokenization.")
-parser.add_argument("--batch_size", default=1, type=int, help="Batch size per GPU/CPU for predicting.")
-args = parser.parse_args()
-# yapf: enable
-
-url = "http://0.0.0.0:8189/models/ernie_m_cls"
-headers = {"Content-Type": "application/json"}
-
-
-if __name__ == "__main__":
- examples = load_dataset("xnli", args.language, split="validation")[:10]
- texts = [text for text in examples["premise"]]
- text_pairs = [text for text in examples["hypothesis"]]
-
- data = {
- "data": {"text": texts, "text_pair": text_pairs},
- "parameters": {"max_seq_len": args.max_seq_len, "batch_size": args.batch_size},
- }
- r = requests.post(url=url, headers=headers, data=json.dumps(data))
- print(r.text)
diff --git a/model_zoo/ernie-m/run_classifier.py b/model_zoo/ernie-m/run_classifier.py
deleted file mode 100644
index 0d1886c6dd6f..000000000000
--- a/model_zoo/ernie-m/run_classifier.py
+++ /dev/null
@@ -1,322 +0,0 @@
-# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import json
-import os
-import random
-from dataclasses import dataclass, field
-from functools import partial
-from typing import Optional
-
-import numpy as np
-import paddle
-from datasets import load_dataset
-from paddle.io import Dataset
-from paddle.metric import Accuracy
-
-import paddlenlp
-from paddlenlp.data import DataCollatorWithPadding
-from paddlenlp.trainer import (
- PdArgumentParser,
- Trainer,
- TrainingArguments,
- get_last_checkpoint,
-)
-from paddlenlp.transformers import (
- AutoModelForSequenceClassification,
- AutoTokenizer,
- ErnieMForSequenceClassification,
-)
-from paddlenlp.utils.log import logger
-
-all_languages = ["ar", "bg", "de", "el", "en", "es", "fr", "hi", "ru", "sw", "th", "tr", "ur", "vi", "zh"]
-task_type_list = ["cross-lingual-transfer", "translate-train-all"]
-
-
-@dataclass
-class ModelArguments:
- task_type: str = field(
- default=None,
- metadata={"help": "The type of the task to finetune selected in the list: " + ", ".join(task_type_list)},
- )
- model_name_or_path: str = field(
- default=None,
- metadata={
- "help": "Path to pre-trained model or shortcut name selected in the list: "
- + ", ".join(list(ErnieMForSequenceClassification.pretrained_init_configuration.keys()))
- },
- )
- max_seq_length: Optional[int] = field(
- default=256,
- metadata={
- "help": "The maximum total input sequence length after tokenization. Sequences longer "
- "than this will be truncated, sequences shorter will be padded."
- },
- )
- classifier_dropout: Optional[float] = field(default=0.1, metadata={"help": "Dropout rate."})
- layerwise_decay: Optional[float] = field(default=0.8, metadata={"help": "Layerwise decay ratio."})
- export_model_dir: Optional[str] = field(
- default="./best_models",
- metadata={"help": "Path to directory to store the exported inference model."},
- )
- use_test_data: Optional[bool] = field(
- default=False, metadata={"help": "Whether to use a tiny dataset for CI test."}
- )
- test_data_path: Optional[str] = field(default=None, metadata={"help": "Path to tiny dataset."})
-
-
-def set_seed(seed):
-    # Use the same data seed (for data shuffling) on all procs to guarantee data
-    # consistency after sharding.
- random.seed(seed)
- np.random.seed(seed)
-    # Different op seeds (for dropout) on different procs may be preferable, e.g.:
-    # `paddle.seed(seed + paddle.distributed.get_rank())`
- paddle.seed(seed)
-
-
-def convert_example(example, tokenizer, max_seq_length=256):
-    """Convert an example into the necessary features."""
- # Convert raw text to feature
- tokenized_example = tokenizer(
- example["premise"],
- text_pair=example["hypothesis"],
- max_length=max_seq_length,
- padding=False,
- truncation=True,
- return_position_ids=True,
- return_attention_mask=True,
- return_token_type_ids=False,
- )
- return tokenized_example
-
-
-def load_xnli_dataset(args, path, lang, split=None):
-    """Load the dataset for the specified language."""
- if args.use_test_data:
- if args.test_data_path is None:
-            raise ValueError("Should specify `test_data_path` for test datasets when `use_test_data` is True.")
- data_files = {
- "train": args.test_data_path,
- "validation": args.test_data_path,
- "test": args.test_data_path,
- }
- return load_dataset("json", data_files=data_files, split=split)
- else:
- return load_dataset(path, lang, split=split)
-
-
-class XnliDataset(Dataset):
- """
-    Load the datasets of all languages in lazy mode.
- """
-
- def __init__(self, datasets):
- self.datasets = datasets
-        # The ar language split has 2000 empty samples.
- self.num_samples = [len(i) for i in datasets]
- self.cumsum_len = np.cumsum(self.num_samples)
-
- def __getitem__(self, idx):
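-        # Map the global index to (language dataset, local sample index) using the cumulative lengths.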
- language_idx = np.argmax(self.cumsum_len > idx)
- last = language_idx - 1 if language_idx > 0 else language_idx
- sample_idx = idx - self.cumsum_len[last] if idx >= self.cumsum_len[last] else idx
- return self.datasets[int(language_idx)][int(sample_idx)]
-
- def __len__(self):
- return self.cumsum_len[-1]
-
-
-def do_train():
- training_args, model_args = PdArgumentParser([TrainingArguments, ModelArguments]).parse_args_into_dataclasses()
- training_args: TrainingArguments = training_args
- model_args: ModelArguments = model_args
-
- training_args.print_config(model_args, "Model")
-
- paddle.set_device(training_args.device)
-
- set_seed(training_args.seed)
-
- # Detecting last checkpoint.
- last_checkpoint = None
- if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir:
- last_checkpoint = get_last_checkpoint(training_args.output_dir)
- if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0:
- raise ValueError(
- f"Output directory ({training_args.output_dir}) already exists and is not empty. "
- "Use --overwrite_output_dir to overcome."
- )
- elif last_checkpoint is not None and training_args.resume_from_checkpoint is None:
- logger.info(
- f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change "
- "the `--output_dir` or add `--overwrite_output_dir` to train from scratch."
- )
-
- # Dataset pre-process
- tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path)
- trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=model_args.max_seq_length)
- remove_columns = ["premise", "hypothesis"]
-
- def collect_all_languages_dataset(split):
- all_ds = []
- for language in all_languages:
- ds = load_xnli_dataset(model_args, "xnli", language, split=split)
- all_ds.append(ds.map(trans_func, batched=True, remove_columns=remove_columns))
- return XnliDataset(all_ds)
-
- if model_args.task_type == "cross-lingual-transfer":
- raw_datasets = load_xnli_dataset(model_args, "xnli", "en")
- if training_args.do_train:
- train_ds = raw_datasets["train"].map(trans_func, batched=True, remove_columns=remove_columns)
- if training_args.do_eval:
- eval_ds = raw_datasets["validation"].map(trans_func, batched=True, remove_columns=remove_columns)
- if training_args.do_predict:
- test_ds = raw_datasets["test"].map(trans_func, batched=True, remove_columns=remove_columns)
- elif model_args.task_type == "translate-train-all":
- if training_args.do_train:
- train_ds = collect_all_languages_dataset("train")
- if training_args.do_eval:
- eval_ds = collect_all_languages_dataset("validation")
- if training_args.do_predict:
- test_ds = collect_all_languages_dataset("test")
- else:
- raise ValueError(
-            f"task_type should be 'cross-lingual-transfer' or 'translate-train-all' but '{model_args.task_type}' is specified."
- )
-
- data_collator = DataCollatorWithPadding(tokenizer)
-
- num_labels = 3
- model = AutoModelForSequenceClassification.from_pretrained(
- model_args.model_name_or_path, num_labels=num_labels, classifier_dropout=model_args.classifier_dropout
- )
-
- # Define the metrics of tasks.
- def compute_metrics(p):
- # Define the metrics of tasks.
- preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
-
- preds = paddle.to_tensor(preds)
- label = paddle.to_tensor(p.label_ids)
-
- metric = Accuracy()
- result = metric.compute(preds, label)
- metric.update(result)
- accu = metric.accumulate()
- return {"accuracy": accu}
-
- trainer = Trainer(
- model=model,
- args=training_args,
- data_collator=data_collator,
- train_dataset=train_ds if training_args.do_train else None,
- eval_dataset=eval_ds if training_args.do_eval else None,
- tokenizer=tokenizer,
- compute_metrics=compute_metrics,
- # optimizers=[optimizer, lr_scheduler],
- )
-
- def using_layerwise_lr_decay(layerwise_decay, model, training_args):
- """
-        Build per-parameter groups with layer-wise learning-rate decay.
-        Bias and LayerNorm parameters are excluded from weight decay.
- """
- # params_list = [{"params": param, "learning_rate": lr * decay_ratio}, ... ]
- params_list = []
- n_layers = model.config.num_hidden_layers
- for name, param in model.named_parameters():
- ratio = 1.0
- param_to_train = {"params": param, "dygraph_key_name": name}
- if any(nd in name for nd in ["bias", "norm"]):
- param_to_train["weight_decay"] = 0.0
- else:
- param_to_train["weight_decay"] = training_args.weight_decay
-
- if "encoder.layers" in name:
- idx = name.find("encoder.layers.")
- layer = int(name[idx:].split(".")[2])
- ratio = layerwise_decay ** (n_layers - layer)
- elif "embedding" in name:
- ratio = layerwise_decay ** (n_layers + 1)
-
- param_to_train["learning_rate"] = ratio
-
- params_list.append(param_to_train)
- return params_list
-
- params_to_train = using_layerwise_lr_decay(model_args.layerwise_decay, model, training_args)
-
- trainer.set_optimizer_grouped_parameters(params_to_train)
-
- checkpoint = None
- if training_args.resume_from_checkpoint is not None:
- checkpoint = training_args.resume_from_checkpoint
- elif last_checkpoint is not None:
- checkpoint = last_checkpoint
-
- # training
- if training_args.do_train:
- train_result = trainer.train(resume_from_checkpoint=checkpoint)
- metrics = train_result.metrics
- trainer.save_model()
- trainer.log_metrics("train", metrics)
- trainer.save_metrics("train", metrics)
- trainer.save_state()
-
- # Evaluating
- if training_args.do_eval:
- combined = {}
- for language in all_languages:
- eval_ds = load_xnli_dataset(model_args, "xnli", language, split="validation")
- eval_ds = eval_ds.map(trans_func, batched=True, remove_columns=remove_columns, load_from_cache_file=True)
- metrics = trainer.evaluate(eval_dataset=eval_ds)
- metrics = {k + f"_{language}": v for k, v in metrics.items()}
- combined.update({f"eval_accuracy_{language}": metrics.get(f"eval_accuracy_{language}", 0.0)})
- trainer.log_metrics("eval", metrics)
-
- combined.update({"eval_accuracy_average": np.mean(list(combined.values()))})
- trainer.log_metrics("eval", combined)
- trainer.save_metrics("eval", combined)
-
- # Predicting
- if training_args.do_predict:
- test_ret = trainer.predict(test_ds)
- trainer.log_metrics("test", test_ret.metrics)
- logits = test_ret.predictions
- max_value = np.max(logits, axis=1, keepdims=True)
- exp_data = np.exp(logits - max_value)
- probs = exp_data / np.sum(exp_data, axis=1, keepdims=True)
- out_dict = {"label": probs.argmax(axis=-1).tolist(), "confidence": probs.max(axis=-1).tolist()}
-        with open(os.path.join(training_args.output_dir, "test_results.json"), "w") as out_file:
-            json.dump(out_dict, out_file)
-
- # Export inference model
- if training_args.do_export and paddle.distributed.get_rank() == 0:
- # You can also load from certain checkpoint
- # trainer.load_state_dict_from_checkpoint("/path/to/checkpoint/")
- model_to_save = trainer.model
- model_to_save = model_to_save._layers if isinstance(model_to_save, paddle.DataParallel) else model_to_save
- input_spec = [
- paddle.static.InputSpec(shape=[None, None], dtype="int64"),
- ]
- model_args.export_model_dir = os.path.join(model_args.export_model_dir, "export")
- paddlenlp.transformers.export_model(
- model=model_to_save, input_spec=input_spec, path=model_args.export_model_dir
- )
- trainer.tokenizer.save_pretrained(model_args.export_model_dir)
-
-
-if __name__ == "__main__":
- do_train()
diff --git a/model_zoo/plato-xl/README.md b/model_zoo/plato-xl/README.md
deleted file mode 100644
index 9d8fa5be7275..000000000000
--- a/model_zoo/plato-xl/README.md
+++ /dev/null
@@ -1,148 +0,0 @@
-# PLATO-XL: Exploring the Large-scale Pre-training of Dialogue Generation
-
-## Model Introduction
-
-Building a high-quality open-domain chatbot that can converse freely with humans in natural language has long been one of the ultimate goals of natural language processing.
-
-PLATO-XL is the industry's first open-sourced open-domain dialogue pre-training model with over ten billion parameters. It uses the parameter-efficient UnifiedTransformer (prefix LM) architecture with encoder-decoder parameter sharing, scales the model to 11B parameters, and is pre-trained on billions of dialogue samples. It also introduces role embeddings to distinguish the speakers in multi-party conversations, which improves pre-training quality. The resulting model surpasses many representative dialogue models in chit-chat evaluations and can be used directly to build high-quality open-domain chatbots.
-
-PaddleNLP ships the English PLATO-XL pre-trained model out of the box. Because PLATO-XL is so large, generating a dialogue reply at inference time is slow, and its 11B parameters may exceed the memory of some GPU models; these are common, key problems for large-model inference and deployment. PaddleNLP FastGeneration provides high-performance generation acceleration for PLATO-XL on GPU and supports model-parallel (tensor-parallel) inference, so the ten-billion-parameter model can be served on several GPUs with smaller memory. Compared with the single-GPU code, only one extra line, `enable_ft_para()`, is needed, and model parallelism further improves prediction speed.
-
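-As a rough illustration of that one-line switch, here is a hedged sketch (not the repository's `infer.py`; the import path and the plato-xl pretrained weight name are assumptions):
-
-```python
-from paddlenlp.ops import enable_ft_para  # assumed import path for the switch named above
-from paddlenlp.transformers import (
-    UnifiedTransformerLMHeadModel,
-    UnifiedTransformerTokenizer,
-)
-
-# Called before the model is built so that the weights are sharded across the MPI ranks.
-enable_ft_para()
-
-tokenizer = UnifiedTransformerTokenizer.from_pretrained("plato-xl")  # assumed weight name
-model = UnifiedTransformerLMHeadModel.from_pretrained("plato-xl")
-model.eval()
-```
-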
-This project provides an example of high-performance inference for the English PLATO-XL model with PaddleNLP FastGeneration. For PLATO-XL training and more details, please refer to [PaddlePaddle/Knover](https://github.com/PaddlePaddle/Knover/tree/develop/projects/PLATO-XL).
-
-## Getting Started
-### High-performance inference on a single GPU
-
-`infer.py` is the example script for high-performance inference with PLATO-XL. It can be run with the following command:
-
-```shell
-python infer.py --topk 4 --max_out_len 64 --use_faster --use_fp16
-```
-
-The script's arguments are as follows:
-
-- `topk`: used for Top-K sampling; only the K highest-probability tokens are sampled. Defaults to 1, i.e. greedy search.
-- `topp`: used for Top-P sampling; sampling is restricted to the highest-probability tokens whose cumulative probability does not exceed this value. Defaults to 1.0.
-- `max_out_len`: the maximum generation length. Defaults to 64.
-- `min_out_len`: the minimum generation length. Defaults to 1.
-- `temperature`: rescales the predicted probability distribution. Defaults to 1.0, i.e. the model's original probabilities are kept.
-- `use_faster`: use FastGeneration.
-- `use_fp16`: use FP16; only takes effect when FastGeneration is used.
-
-The script uses a multi-turn dialogue sample like the one below, represented as a `List[str]` in which each `str` is one utterance; the reply is generated from this conversation history.
-
-```python
- history = [
- "hi , Mary ! What do you usually like to do in your spare time ?",
- "well , I spend a lot of time watching movies .",
-        "what a confidence ! I always watch a lot of movies , too .",
- "oh really , Frank ? What kind of movies do you like ?"
- ]
-```
-
-**Note:** Because the PLATO-XL model is large, single-GPU inference needs at least 22 GB of GPU memory (with FP16), and downloading the model takes some time (the FP32 weight file is about 41 GB).
-
-### Multi-GPU parallel inference
-
-Multi-GPU parallel inference currently depends on MPI (either [MPICH](https://www.mpich.org) or [OpenMPI](https://www.open-mpi.org) works) and [NCCL](https://developer.nvidia.com/nccl); please install these dependencies first. Afterwards, `infer.py` is still used for prediction; the only difference from the single-GPU case is that it is launched through mpi, as follows:
-
-```shell
-mpirun -n 4 python infer.py --topk 4 --max_out_len 64 --use_faster --use_fp16
-```
-
-Here `-n 4` specifies the number of processes and GPUs to use; with `n` set to 1, single-GPU inference is still performed. Since multi-GPU parallel inference uses different dependency libraries from single-GPU inference, the first run will trigger JIT compilation again.
-
-### Performance benchmarking
-
-`infer.py` also supports performance benchmarking; simply add `--profile` to the prediction command above, for example:
-
-```shell
-mpirun -n 4 python infer.py --batch_size 8 --min_out_len 20 --max_out_len 20 --topk 1 --use_faster --use_fp16 --profile
-```
-
-In addition, `batch_size` and `min_out_len` are specified to measure performance for a particular input/output size; the benchmark reports the average latency over repeated runs. Below are performance numbers for single-GPU high-performance inference and 4-GPU tensor-parallel inference (V100, CUDA 10.2, input length 60, output length 20); 4-GPU parallelism is roughly 2x as fast as a single GPU.
-
-