diff --git a/.gitignore b/.gitignore new file mode 100644 index 00000000..86bce3a2 --- /dev/null +++ b/.gitignore @@ -0,0 +1,26 @@ +__pycache__ +.idea/ +logs/ +test_tokenizer.py +samples_text2image/ +generate_contexts/ +venv/ +*__pycache__ +.DS_Store +.vscode +*.swo +*.swp +*log +build +dist +eazybigmodel.egg-info +flagai.egg-info +test_report +/data/ +/tests/*/data +checkpoints +state_dict +checkpoints* +vocabs +tensorboard* +datasets \ No newline at end of file diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 8bc35940..944c124e 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -1,21 +1,21 @@ # Contributing to FlagAI -We are happy to accept your contributions to make `FlagAI` better and more awesome! To avoid unnecessary work on either +We are happy to accept your contributions to make `FlagAI` better and more awesome! To avoid unnecessary work on either side, please stick to the following process: 1. Check if there is already [an issue](https://github.com/BAAI-WuDao/Sailing/issues) for your concern. 2. If there is not, open a new one to start a discussion. We hate to close finished PRs! -3. If we decide your concern needs code changes, we would be happy to accept a pull request. Please consider the +3. If we decide your concern needs code changes, we would be happy to accept a pull request. Please consider the commit guidelines below. ## Git Commit Guidelines -If there is already a ticket, use this number at the start of your commit message. +If there is already a ticket, use this number at the start of your commit message. Use meaningful commit messages that described what you did. -**Example:** `GH-42: Added new type of embeddings: DocumentEmbedding.` -**Example:** `ISSUE#123: Fix typo in README.` +**Example:** `GH-42: Added new type of embeddings: DocumentEmbedding.` +**Example:** `ISSUE#123: Fix typo in README.` ## Developing locally @@ -42,7 +42,7 @@ To run all basic tests execute: python test.py ``` -To check the test results in +To check the test results in ``` tests/test_report ``` diff --git a/README.md b/README.md index 43c8d2de..aae2c6ea 100644 --- a/README.md +++ b/README.md @@ -1,5 +1,5 @@ ![FlagAI](logo.png) -[![CII Best Practices](https://bestpractices.coreinfrastructure.org/projects/6052/badge)](https://bestpractices.coreinfrastructure.org/projects/6052) +[![CII Best Practices](https://bestpractices.coreinfrastructure.org/projects/6052/badge)](https://bestpractices.coreinfrastructure.org/projects/6052) [简体中文](README_zh.md) -------------------------------------------------------------------------------- @@ -8,7 +8,7 @@ FlagAI aims to help researchers and developers to freely train and test large-scale models for NLP tasks.

-* Now it supports GLM, Bert, RoBerta, GPT2, T5 models and models from Huggingface Transformers. +* Now it supports GLM, Bert, RoBerta, GPT2, T5 and models from Huggingface Transformers. * It provides APIs to quickly download and use those pretrained models on a given text, fine-tune them on your own datasets and then share them with the community on our model hub. @@ -76,7 +76,7 @@ Host 127.0.0.1 ## Quick Start We provide many models which are trained to perform different tasks. You can load these models by AutoLoader to make prediction. -## Load model and tokenizer +## Load model and tokenizer We provide the AutoLoad class to load the model and tokenizer quickly, for example: ```python from flagai.auto_model.auto_loader import AutoLoader @@ -93,7 +93,7 @@ tokenizer = auto_loader.get_tokenizer() This example is for the `title_generation` task, and you can also model other tasks by modifying the `task_name`. Then you can use the model and tokenizer to finetune or test. -## Predictor +## Predictor We provide the `Predictor` class to predict for different tasks, for example: ```python diff --git a/README_zh.md b/README_zh.md index a129abd6..6fe78597 100644 --- a/README_zh.md +++ b/README_zh.md @@ -1,8 +1,8 @@ ![FlagAI](logo.png) -[![CII Best Practices](https://bestpractices.coreinfrastructure.org/projects/6052/badge)](https://bestpractices.coreinfrastructure.org/projects/6052) +[![CII Best Practices](https://bestpractices.coreinfrastructure.org/projects/6052/badge)](https://bestpractices.coreinfrastructure.org/projects/6052) [English](README.md) --------------------------------------------------------------------------------- +-------------------------------------------------------------------------------- FlagAI 旨在帮助研究人员和开发人员自由地训练和测试用于 NLP 任务的大规模模型。

@@ -34,7 +34,7 @@ FlagAI 旨在帮助研究人员和开发人员自由地训练和测试用于 NLP - [许可证](#许可证) -# 安装 +# 安装 * PyTorch version >= 1.8.0 * Python version >= 3.8 * 使用GPUs进行训练和测试, 你需要安装CUDA 和 NCCL @@ -248,7 +248,7 @@ for text in test_data: * [所有支持的任务](docs/AllSupportedTasks.md) -本节解释了本项目中基础NLP类是如何工作的,如何加载预先训练的模型来标记您的文本,如何使用不同的词或文档嵌入来得到表示,以及如何训练自己的语言模型、序列标注模型和文本分类模型。 +本节解释了本项目中基础NLP类是如何工作的,如何加载预先训练的模型来标记您的文本,如何使用不同的词或文档嵌入来得到表示,以及如何训练自己的语言模型、序列标注模型和文本分类模型。 # 教程 diff --git a/SUPPORT.md b/SUPPORT.md index eee11293..16d9641a 100644 --- a/SUPPORT.md +++ b/SUPPORT.md @@ -3,7 +3,7 @@ _Use this file to detail how to make get support forthe project. The structure b # Getting Help There are a few ways to connect with the PROJECT NAME project: -* https://github.com/BAAI-WuDao/Sailing/issues +* https://github.com/BAAI-WuDao/Sailing/issues ## How to Ask for Help diff --git a/doc_zh/APPENDIX_GLM_IO.md b/doc_zh/APPENDIX_GLM_IO.md index 4579ebc7..2632fe13 100644 --- a/doc_zh/APPENDIX_GLM_IO.md +++ b/doc_zh/APPENDIX_GLM_IO.md @@ -2,16 +2,16 @@ a) 如下图所示,原文包含6个token,两个区间被屏蔽:第一个区间包含第3个token,第二个区间包含第5个和第6个token。 -![results1](img/glm_io_1.png) +![results1](img/glm_io_1.png) b) 将输入分成两个部分: A 部分 (将遮挡区间遮盖掉后的文本)和B部分(被遮挡的区间). 注意所有被遮挡区间的顺序会被重新打乱 -![results1](img/glm_io_2.png) +![results1](img/glm_io_2.png) c) GLM的输入和输出,输入包括tokens和2个位置编码 -![results1](img/glm_io_3.png) +![results1](img/glm_io_3.png) d) 下图里的自注意力机制既通过遮挡文本实现了自编码, 也在预测遮挡区间内文本的过程里实现了自回归 -![results1](img/glm_io_4.png) \ No newline at end of file +![results1](img/glm_io_4.png) \ No newline at end of file diff --git a/doc_zh/AllSupportedTasks.md b/doc_zh/AllSupportedTasks.md index 5a9f774b..e9f33e3f 100644 --- a/doc_zh/AllSupportedTasks.md +++ b/doc_zh/AllSupportedTasks.md @@ -30,11 +30,11 @@ tokenizer = auto_loader.get_tokenizer() 所有支持的模型都可以在 **model hub** 中找到。 不同的模型适应不同的任务。 -#### Transfomrer编码器: +#### Transfomrer编码器: 例如 model_name="bert-base-chinese" or "RoBERTa-wwm-ext"时, 这些模型支持上一节中提到的所有任务 -#### Transformer解码器: +#### Transformer解码器: 例如 model_name="gpt2-chinese"时, 模型支持 "seq2seq" 任务. diff --git a/doc_zh/EnvironmentSetup.md b/doc_zh/EnvironmentSetup.md index 87c7dbcd..8712c78f 100644 --- a/doc_zh/EnvironmentSetup.md +++ b/doc_zh/EnvironmentSetup.md @@ -1,6 +1,6 @@ # 预训练分布式环境搭建 -# 一. Docker +# 一. Docker ## 1.安装docker @@ -117,7 +117,7 @@ systemctl restart docker ```dockerfile #pull base image -FROM nvidia/cuda:10.2-devel-ubuntu18.04 +FROM nvidia/cuda:10.2-devel-ubuntu18.04 #maintainer MAINTAINER deepspeed @@ -135,7 +135,7 @@ RUN mkdir -p ${STAGE_DIR} #Installation/Basic Utilities ############################################################################## RUN sed -i s@/archive.ubuntu.com/@/mirrors.tuna.tsinghua.edu.cn/@g /etc/apt/sources.list -RUN sed -i s@/security.ubuntu.com/@/mirrors.tuna.tsinghua.edu.cn/@g /etc/apt/sources.list +RUN sed -i s@/security.ubuntu.com/@/mirrors.tuna.tsinghua.edu.cn/@g /etc/apt/sources.list RUN apt-get update && \ DEBIAN_FRONTEND="noninteractive" apt-get install -y --no-install-recommends \ software-properties-common build-essential autotools-dev \ @@ -149,8 +149,8 @@ RUN apt-get update && \ libjpeg-dev \ libpng-dev \ screen jq psmisc dnsutils lsof musl-dev systemd -``` - +``` + ### c. 安装最新版git(创建镜像clone 安装包) ```dockerfile @@ -178,7 +178,7 @@ RUN cd ${STAGE_DIR} && \ PATH=/usr/bin:$PATH ./mlnxofedinstall --user-space-only --without-fw-update --umad-dev-rw --all -q && \ cd ${STAGE_DIR} && \ rm -rf ${STAGE_DIR}/MLNX_OFED_LINUX-${MLNX_OFED_VERSION}-ubuntu18.04-x86_64* -``` +``` ### e. 
安装 nv_peer_mem @@ -200,7 +200,7 @@ RUN cd ${STAGE_DIR}/nv_peer_memory && \ apt-get update && \ apt-get install -y dkms && \ dpkg-buildpackage -us -uc && \ - dpkg -i ${STAGE_DIR}/nvidia-peer-memory_${NV_PEER_MEM_TAG}_all.deb + dpkg -i ${STAGE_DIR}/nvidia-peer-memory_${NV_PEER_MEM_TAG}_all.deb ``` ### f. 安装openmpi, 需先安装libevent 依赖包 @@ -237,7 +237,7 @@ RUN mv /usr/local/mpi/bin/mpirun /usr/local/mpi/bin/mpirun.real && \ echo 'mpirun.real --allow-run-as-root --prefix /usr/local/mpi "$@"' >> /usr/local/mpi/bin/mpirun && \ chmod a+x /usr/local/mpi/bin/mpirun ``` - + ### g.安装 python ```dockerfile @@ -249,24 +249,24 @@ RUN curl -o ~/miniconda.sh https://mirrors.tuna.tsinghua.edu.cn/anaconda/minicon chmod +x ~/miniconda.sh && \ ~/miniconda.sh -b -p /opt/conda && \ rm ~/miniconda.sh && \ - /opt/conda/bin/conda install -y python=$PYTHON_VERSION numpy pyyaml scipy ipython mkl mkl-include ninja cython typing + /opt/conda/bin/conda install -y python=$PYTHON_VERSION numpy pyyaml scipy ipython mkl mkl-include ninja cython typing ``` - + ### h.安装 magma-cuda ```dockerfile ########################################################################### #Install magma-cuda ############################################################################## -COPY magma-cuda102-2.5.2-1.tar.bz2 ${STAGE_DIR} +COPY magma-cuda102-2.5.2-1.tar.bz2 ${STAGE_DIR} RUN cd ${STAGE_DIR} && \ /opt/conda/bin/conda install -y -c pytorch --use-local magma-cuda102-2.5.2-1.tar.bz2 && \ /opt/conda/bin/conda clean -ya ####optional##### #RUN /opt/conda/bin/conda install -y -c pytorch magma-cuda102 && \ -#/opt/conda/bin/conda clean -ya +#/opt/conda/bin/conda clean -ya ``` - + ### i.配置路径 ```dockerfile @@ -278,7 +278,7 @@ RUN echo "export PATH=/opt/conda/bin:\$PATH" >> /root/.bashrc RUN pip install --upgrade pip setuptools RUN wget https://tuna.moe/oh-my-tuna/oh-my-tuna.py && python oh-my-tuna.py ``` - + ### j.安装一些pip 包 ```dockerfile @@ -314,8 +314,8 @@ RUN pip install psutil \ rouge_score \ cupy-cuda102\ setuptools==60.0.3 -``` - +``` + ### k.安装mpi4py (需下载到本地安装,pip 安装可能因为版本兼容问题报错) ```dockerfile @@ -327,7 +327,7 @@ RUN apt-get update && \ COPY mpi4py-3.1.3.tar.gz ${STAGE_DIR} RUN cd ${STAGE_DIR} && tar zxvf mpi4py-3.1.3.tar.gz && \ cd mpi4py-3.1.3 &&\ - python setup.py build && python setup.py install + python setup.py build && python setup.py install ``` ### l.安装pytorch, 版本可替换, 需先下载项目到本地,国内安装容易因为网速原因,造成终止, pytorch git clone 过程中可能有些子包下载过程中会终止。可以多 git clone 几次 @@ -347,8 +347,8 @@ ENV NCCL_LIBRARY=/usr/lib/x86_64-linux-gnu ENV NCCL_INCLUDE_DIR=/usr/include RUN cd /opt/pytorch && TORCH_NVCC_FLAGS="-Xfatbin -compress-all" \ CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" USE_SYSTEM_NCCL=1 \ - pip install -v . && rm -rf /opt/pytorch - + pip install -v . 
&& rm -rf /opt/pytorch + ############################################################################## @@ -380,7 +380,7 @@ RUN cd ${STAGE_DIR}/apex && pip install -v --no-cache-dir --global-option="--cpp #Install deepSpeed ############################################################################# RUN pip install py-cpuinfo -RUN apt-get install -y libaio-dev +RUN apt-get install -y libaio-dev ENV TORCH_CUDA_ARCH_LIST="6.0 6.1 7.0+PTX" RUN git clone https://github.com/microsoft/DeepSpeed.git ${STAGE_DIR}/DeepSpeed #COPY DeepSpeed ${STAGE_DIR}/DeepSpeed @@ -406,7 +406,7 @@ RUN python -c "import deepspeed; print(deepspeed.__version__)" #ENV NCCL_VERSION=2.9.6 ``` -### p.配置网络端口、公钥和ssh +### p.配置网络端口、公钥和ssh ```dockerfile ############################################################################# @@ -447,7 +447,7 @@ docker load --input deepspeed-cuda102.tar.gz ```shell docker build -f cuda102.dockerfile -t deepspeed/cuda102:1221 . #cuda102.dockerfile 参考 dockerfile 文件制作流程 -``` +``` # 二. 在每个机器节点构建容器 ```shell @@ -501,9 +501,9 @@ docker rm 容器名/容器id ## 2.将各节点容器生成的公钥文件 -~/.ssh/id_rsa.pub +~/.ssh/id_rsa.pub 中的内容收集,并同步到各机器的文件 -~/.ssh/authorized_keys +~/.ssh/authorized_keys ## 3.免密登陆 diff --git a/doc_zh/Huggingface_t5.md b/doc_zh/Huggingface_t5.md index 67f728c2..dcb0b18a 100644 --- a/doc_zh/Huggingface_t5.md +++ b/doc_zh/Huggingface_t5.md @@ -8,7 +8,7 @@ from transformers import T5ForConditionalGeneration, T5Tokenizer from torch.utils.data import Dataset import torch -## Inheriant the Trainer +## Inheriant the Trainer ## overload the forward_step function class MyTrainer(Trainer): @@ -116,7 +116,7 @@ def seq2seq_collate_fn(batch): sents_src, sents_tgt = read_file() data_len = len(sents_tgt) -train_size = int(data_len * 0.8) +train_size = int(data_len * 0.8) train_src = sents_src[:train_size] train_tgt = sents_tgt[:train_size] @@ -141,7 +141,7 @@ trainer.train(model, ## 加速训练的技巧 我们可能不会在V100 32G上运行t5-3b。所以,我们需要一些技巧来减少GPU内存的使用。 ### 第一步:fp16 -把模型参数变为 `fp16` +把模型参数变为 `fp16` ```python trainer = MyTrainer( env_type='pytorch', @@ -154,7 +154,7 @@ trainer = MyTrainer( load_dir=None, lr=1e-4, fp16=True) # change to `True` -``` +``` ### 第二步:梯度重计算(checkpoint) 在forward阶段不将中间结果保存。我们可以运行`batch size`=1的t5-3b。 现在,我们可以用 `gradient_accumulation_steps` train/finetune 一个 t5-3b。 @@ -171,9 +171,9 @@ trainer = MyTrainer( lr=1e-4, fp16=True checkpoint_activations = True) # setting as `True` -``` +``` ### 第三步:数据并行(DDP) -为了增加batch size,我们可以在多个GPU上使用数据并行。 +为了增加batch size,我们可以在多个GPU上使用数据并行。 ```python trainer = Trainer( env_type="pytorchDDP", @@ -184,7 +184,7 @@ trainer = Trainer( experiment_name='t5-3b', load_dir=None, lr=1e-4, - fp16=True + fp16=True checkpoint_activations=False, # The following six options is for pytorchDDP master_ip='127.0.0.1', @@ -200,7 +200,7 @@ trainer = Trainer( ```python trainer = Trainer( - env_type="deepspeed", # env_type + env_type="deepspeed", # env_type epochs=1, batch_size=1, eval_interval=10, @@ -208,7 +208,7 @@ trainer = Trainer( experiment_name='t5-3b', load_dir=None, lr=1e-4, - fp16=True + fp16=True checkpoint_activations=False, # parallel settings master_ip='127.0.0.1', @@ -218,14 +218,14 @@ trainer = Trainer( hostfile='hostfile', training_script=__file__, # deepspeed - deepspeed_config='deepspeed.json' + deepspeed_config='deepspeed.json' ) ``` ### 第五步:模型并行(deepspeed + megatron-lm) ```python trainer = Trainer( - env_type="deepspeed", # env_type + env_type="deepspeed", # env_type epochs=1, batch_size=1, eval_interval=10, @@ -233,7 +233,7 @@ trainer = Trainer( experiment_name='t5-3b', 
load_dir=None, lr=1e-4, - fp16=True + fp16=True checkpoint_activations=False, # parallel settings master_ip='127.0.0.1', diff --git a/doc_zh/TUTORIAL_10_BERT_EXAMPLE_TITLE_GENERATION.md b/doc_zh/TUTORIAL_10_BERT_EXAMPLE_TITLE_GENERATION.md index ff5aae6e..c5281c48 100644 --- a/doc_zh/TUTORIAL_10_BERT_EXAMPLE_TITLE_GENERATION.md +++ b/doc_zh/TUTORIAL_10_BERT_EXAMPLE_TITLE_GENERATION.md @@ -46,16 +46,16 @@ def read_file(): return src,tgt ``` -### 2.Load model and tokenizer +### 2.Load model and tokenizer ```python from flagai.auto_model.auto_loader import AutoLoader -# the model dir, which contains the 1.config.json, 2.pytorch_model.bin, 3.vocab.txt, +# the model dir, which contains the 1.config.json, 2.pytorch_model.bin, 3.vocab.txt, # or we will download these files from the model hub to this dir. # Autoloader can build the model and tokenizer automatically. # 'seq2seq' is the task_name. -auto_loader = AutoLoader("seq2seq", +auto_loader = AutoLoader("seq2seq", model_dir="./state_dict/", model_name="RoBERTa-wwm-ext") model = auto_loader.get_model() @@ -70,7 +70,7 @@ python ./train.py Modify the training configuration by this code: ```python from flagai.trainer import Trainer -import torch +import torch device = torch.device("cuda" if torch.cuda.is_available() else "cpu") trainer = Trainer(env_type="pytorch", experiment_name="roberta_seq2seq", diff --git a/doc_zh/TUTORIAL_11_BERT_EXAMPLE_SEMANTIC_MATCHING.md b/doc_zh/TUTORIAL_11_BERT_EXAMPLE_SEMANTIC_MATCHING.md index d2e8b473..e58e62ef 100644 --- a/doc_zh/TUTORIAL_11_BERT_EXAMPLE_SEMANTIC_MATCHING.md +++ b/doc_zh/TUTORIAL_11_BERT_EXAMPLE_SEMANTIC_MATCHING.md @@ -50,13 +50,13 @@ def read_file(data_path): ```python from flagai.auto_model.auto_loader import AutoLoader -# the model dir, which contains the 1.config.json, 2.pytorch_model.bin, 3.vocab.txt, +# the model dir, which contains the 1.config.json, 2.pytorch_model.bin, 3.vocab.txt, # or we will download these files from the model hub to this dir. -model_dir = "./state_dict/" +model_dir = "./state_dict/" # Autoloader can build the model and tokenizer automatically. # 'cls' is the task_name. -auto_loader = AutoLoader("cls", - model_dir, +auto_loader = AutoLoader("cls", + model_dir, model_name="RoBERTa-wwm-ext") model = auto_loader.get_model() tokenizer = auto_loader.get_tokenizer() @@ -70,7 +70,7 @@ python ./train.py 调整训练参数: ```python from flagai.trainer import Trainer -import torch +import torch device = torch.device("cuda" if torch.cuda.is_available() else "cpu") trainer = Trainer(env_type="pytorch", experiment_name="roberta_semantic_matching", diff --git a/doc_zh/TUTORIAL_12_INSTRUCTIONS_FOR_AutoLoader.md b/doc_zh/TUTORIAL_12_INSTRUCTIONS_FOR_AutoLoader.md index 352a78ba..3971743b 100644 --- a/doc_zh/TUTORIAL_12_INSTRUCTIONS_FOR_AutoLoader.md +++ b/doc_zh/TUTORIAL_12_INSTRUCTIONS_FOR_AutoLoader.md @@ -10,11 +10,11 @@ Autoloader会根据**model_name**从ModelHub中自动搜索预训练模型和Tok ## target包含所有目标分类 ## 0 代表两句话的意思相似 ## 1 代表两句话有着不同的意思 -target = [0, 1] +target = [0, 1] auto_loader = AutoLoader(task_name="cls", ## 任务名 model_name="RoBERTa-wwm-ext", ## 模型名字 model_dir=model_dir, ## 模型下载的目录 - load_pretrain_params=True, ## 是否要加载已有的预训练模型参数. + load_pretrain_params=True, ## 是否要加载已有的预训练模型参数. target_size=len(target) ## 最终输出的维度,用来进行分类任务. 
) ``` @@ -31,11 +31,11 @@ auto_loader = AutoLoader(task_name="cls", ## 任务名 所有支持的模型都可以在 **model hub** 中找到。 不同的模型适应不同的任务。 -#### Transfomrer编码器: +#### Transfomrer编码器: 例如 model_name="bert-base-chinese" or "RoBERTa-wwm-ext"时, 这些模型支持上一节中提到的所有任务 -#### Transformer解码器: +#### Transformer解码器: 例如 model_name="gpt2-chinese"时, 模型支持 "seq2seq" 任务. diff --git a/doc_zh/TUTORIAL_14_BERT_EXAMPLE_NER.md b/doc_zh/TUTORIAL_14_BERT_EXAMPLE_NER.md index a0203338..2765d4b6 100644 --- a/doc_zh/TUTORIAL_14_BERT_EXAMPLE_NER.md +++ b/doc_zh/TUTORIAL_14_BERT_EXAMPLE_NER.md @@ -83,7 +83,7 @@ python ./train.py 调整训练参数: ```python from flagai.trainer import Trainer -import torch +import torch device = torch.device("cuda" if torch.cuda.is_available() else "cpu") trainer = Trainer(env_type="pytorch", experiment_name="roberta_ner", diff --git a/doc_zh/TUTORIAL_16_T5_EXAMPLE_TITLE_GENERATION.md b/doc_zh/TUTORIAL_16_T5_EXAMPLE_TITLE_GENERATION.md index 95e1f82e..cfa424f7 100644 --- a/doc_zh/TUTORIAL_16_T5_EXAMPLE_TITLE_GENERATION.md +++ b/doc_zh/TUTORIAL_16_T5_EXAMPLE_TITLE_GENERATION.md @@ -47,7 +47,7 @@ def read_file(): return src,tgt ``` -### 2.Load model and tokenizer +### 2.Load model and tokenizer ```python from flagai.auto_model.auto_loader import AutoLoader @@ -66,7 +66,7 @@ The configuration support multi-gpus training. Modify the training configuration by this code: ```python from flagai.trainer import Trainer -import torch +import torch device = torch.device("cuda" if torch.cuda.is_available() else "cpu") trainer = Trainer( env_type="deepspeed+mpu", diff --git a/doc_zh/TUTORIAL_1_BASICS.md b/doc_zh/TUTORIAL_1_BASICS.md index 6018fbdd..a01707de 100644 --- a/doc_zh/TUTORIAL_1_BASICS.md +++ b/doc_zh/TUTORIAL_1_BASICS.md @@ -5,7 +5,7 @@ Transformer可以从输入中提取更好的特征,这点使得基于Transform Transformer的另一个特点是它与并行计算的兼容性,这是其与RNN等时序模型相比起来的优点。 Transformer的结构如下图所示 -
在编码器步骤中,首先将输入编码与位置编码相加, 然后将相加的结果传递给多头注意力机制,该机制能够考虑不同位置token之间的相关性信息。 diff --git a/doc_zh/TUTORIAL_2_PROJECT_STRUCTURE.md b/doc_zh/TUTORIAL_2_PROJECT_STRUCTURE.md index 9e8d611b..46f529b0 100644 --- a/doc_zh/TUTORIAL_2_PROJECT_STRUCTURE.md +++ b/doc_zh/TUTORIAL_2_PROJECT_STRUCTURE.md @@ -9,14 +9,14 @@ Sailing/ | |--fp16/ # contains fp16 tools | |--mpu/ # fork from megatron-lm | |--docs/ # documentations - | |--trainer.py/ # trainer for pytorch, deepspeed+mpu - | |--logging.py # + | |--trainer.py/ # trainer for pytorch, deepspeed+mpu + | |--logging.py # | |--metrics.py/ # contains frequently-used metrics, e.g., accuracy - | |--optimizers.py + | |--optimizers.py | |--schedulers.py.py | |--test_utils.py # contains tools for testing | |--utils.py - |--setup.py + |--setup.py |--test.py # for excuting all tests |--README.md |--requirements.txt diff --git a/doc_zh/TUTORIAL_4_DATASET.md b/doc_zh/TUTORIAL_4_DATASET.md index a6fa871f..97b46276 100644 --- a/doc_zh/TUTORIAL_4_DATASET.md +++ b/doc_zh/TUTORIAL_4_DATASET.md @@ -2,14 +2,14 @@ ## 支持的下游数据集列表 -| 数据集名称 | 数据集简称 | 语言 | 所属评测基准 | +| 数据集名称 | 数据集简称 | 语言 | 所属评测基准 | |----------------------------------------------|----------|-----|----------| | Broadcoverage Diagnostics | BoolQ | 英文 | SuperGLUE | | CommitmentBank | CB | 英文 | SuperGLUE | -| Choice of Plausible Alternatives | COPA | 英文 | SuperGLUE | +| Choice of Plausible Alternatives | COPA | 英文 | SuperGLUE | | Multi-Sentence Reading Comprehension | MultiRC | 英文 | SuperGLUE | -| Recognizing Textual Entailment | RTE | 英文 | SuperGLUE | -| Words in Context | WiC | 英文 | SuperGLUE | +| Recognizing Textual Entailment | RTE | 英文 | SuperGLUE | +| Words in Context | WiC | 英文 | SuperGLUE | | The Winograd Schema Challenge | WSC | 英文 | SuperGLUE | | Ant Financial Question Matching Corpus | AFQMC | 中文 | CLUE | | Short Text Classificaiton for News | TNEWS | 中文 | CLUE | diff --git a/doc_zh/TUTORIAL_8_TRAINING.md b/doc_zh/TUTORIAL_8_TRAINING.md index 1d960549..35fc935c 100644 --- a/doc_zh/TUTORIAL_8_TRAINING.md +++ b/doc_zh/TUTORIAL_8_TRAINING.md @@ -73,7 +73,7 @@ for task_name in ['afqmc', 'tnews']: log_interval=50, experiment_name='glm_large', load_dir=None - + # parallel settings master_ip='127.0.0.1', master_port=17750, diff --git a/doc_zh/TUTORIAL_9_GLM_EXAMPLE_PEOTRY_GENERATION.md b/doc_zh/TUTORIAL_9_GLM_EXAMPLE_PEOTRY_GENERATION.md index 3360c83d..8af7e311 100644 --- a/doc_zh/TUTORIAL_9_GLM_EXAMPLE_PEOTRY_GENERATION.md +++ b/doc_zh/TUTORIAL_9_GLM_EXAMPLE_PEOTRY_GENERATION.md @@ -36,7 +36,7 @@ def read_file(): ## src = ["春晓:五言绝句", "标题:五言律诗",......] ## tgt = ["春眠不觉晓,处处闻啼鸟。夜来风雨声,花落知多少。", "诗句...", ......] ## no matter what data you use, you need to construct the right src and tgt. - + return src,tgt ``` 2)定义数据迭代器(DataLoader): @@ -52,7 +52,7 @@ class BertSeq2seqDataset(Dataset): target_text = self.sents_tgt[i] data=tokenizer.encode_plus(source_text, target_text=target_text) - return data + return data def __len__(self): return len(self.sents_src) @@ -115,9 +115,9 @@ train_dataset = BertSeq2seqDataset(train_src, ```python from flagai.auto_model.auto_loader import AutoLoader -# the model dir, which contains the 1.config.json, 2.pytorch_model.bin, 3.vocab.txt, +# the model dir, which contains the 1.config.json, 2.pytorch_model.bin, 3.vocab.txt, # or we will download these files from the model hub to this dir. -model_dir = "./state_dict/glm/" +model_dir = "./state_dict/glm/" # Autoloader can build the model and tokenizer automatically. # 'seq2seq' is the task_name. 
AutoLoader("seq2seq",model_name="glm_large_ch",model_dir=model_dir) @@ -138,7 +138,7 @@ trainer = Trainer( lr=2e-4,#2e-4 weight_decay=2e-8,#1e-3 epochs=100, - log_interval=10, + log_interval=10, tensorboard_dir="tbsummary", eval_interval=2000000, load_dir="", @@ -159,7 +159,7 @@ trainer = Trainer( trainer.train(model, train_dataset=train_dataset, collate_fn=my_collate_fn) -``` +``` diff --git a/docs/APPENDIX_GLM_IO.md b/docs/APPENDIX_GLM_IO.md index 81f8b85a..a23c1856 100644 --- a/docs/APPENDIX_GLM_IO.md +++ b/docs/APPENDIX_GLM_IO.md @@ -2,16 +2,16 @@ a) The original text contains 6 tokens, and the two spans are masked: first span contains the 3rd token and the second span contains the 5th and 6th token. -![results1](img/glm_io_1.png) +![results1](img/glm_io_1.png) b) Divide the input into 2 parts, part A (corrupted text) and part B (masked spans). The order of spans is shuffled here. -![results1](img/glm_io_2.png) +![results1](img/glm_io_2.png) c) Input and output of GLM, the input includes tokens and 2 positional encodings -![results1](img/glm_io_3.png) +![results1](img/glm_io_3.png) d) The self-attention mask that realizes both autoencoding upon corrupted text and autoregressive upon the masked spans -![results1](img/glm_io_4.png) \ No newline at end of file +![results1](img/glm_io_4.png) \ No newline at end of file diff --git a/docs/APPENDIX_TASK.md b/docs/APPENDIX_TASK.md index f0f65cdd..b8e67deb 100644 --- a/docs/APPENDIX_TASK.md +++ b/docs/APPENDIX_TASK.md @@ -1,33 +1,33 @@ ## Introduction to Pattern-Exploiting Training (PET) -A pre-trained models somehow has the ability to 'understand' natural languages. However, primary pre-trained models can only perform the original +A pre-trained models somehow has the ability to 'understand' natural languages. However, primary pre-trained models can only perform the original training task (like autoencoding and autoregressive task), which usually does not fit the various tasks we face in reality. -Therefore, we need to come up with a way to adapt the pre-trained models to downstream tasks. +Therefore, we need to come up with a way to adapt the pre-trained models to downstream tasks. -Initially, an extra task-speficic architecture was added to the final layer of the pretrained-model, which is called finetuning. Nonetheless, the performance of -pretraining-fituning paradigm is not satisfatory especially when the pretraining task deviate from downstream tasks. As a result, [prompt-learning](https://arxiv.org/abs/2111.01998) is -proposed, where the core idea is to design a cloze-style task-specific prompt to combine pretrained-model and downstream tasks. +Initially, an extra task-speficic architecture was added to the final layer of the pretrained-model, which is called finetuning. Nonetheless, the performance of +pretraining-fituning paradigm is not satisfatory especially when the pretraining task deviate from downstream tasks. As a result, [prompt-learning](https://arxiv.org/abs/2111.01998) is +proposed, where the core idea is to design a cloze-style task-specific prompt to combine pretrained-model and downstream tasks. [Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference](https://arxiv.org/abs/2001.07676) proposed PET, which can reconstruct the input texts as cloze-style phrases, so that it is easier for pre-trained models to understand given tasks. Due to different settings of input texts and labels among different tasks, we need to design appropriate cloze-style patterns. 
The paper introduced a sentiment classification example as shown below. -The given sentiment analysis task is: given an input context (eg. 'Best pizza ever!'), and limited categories (here there's 2 categories, -1 and -1 denotes positive and negative sentiment, respectively). Our model needs to automatically predict the probability of each +The given sentiment analysis task is: given an input context (eg. 'Best pizza ever!'), and limited categories (here there's 2 categories, +1 and -1 denotes positive and negative sentiment, respectively). Our model needs to automatically predict the probability of each sentiment category given the context.
-As shown in the picture, firstly the positive and negative labels are verbalized as 'great' and 'bad'. Then the pattern is designed as: +As shown in the picture, firstly the positive and negative labels are verbalized as 'great' and 'bad'. Then the pattern is designed as: context + 'It was' + masked sentiment label + '.' -The probability for filling 'great' and 'bad' into the masked position will be returned by the finetuned model. +The probability for filling 'great' and 'bad' into the masked position will be returned by the finetuned model. -Here we can see that it requires users to design a pattern-verbalizer pair (PVP) for each given task. -In our project, this part is put in flagai/data/dataset/superglue/pvp.py. Under the class for a given task, -normally there exists two functions named 'verbalize' and 'get_parts', which represent the verbalizer and pattern design step. +Here we can see that it requires users to design a pattern-verbalizer pair (PVP) for each given task. +In our project, this part is put in flagai/data/dataset/superglue/pvp.py. Under the class for a given task, +normally there exists two functions named 'verbalize' and 'get_parts', which represent the verbalizer and pattern design step. An example is shown below. Note that for one task, there can be multiple patterns. ``` diff --git a/docs/AllSupportedTasks.md b/docs/AllSupportedTasks.md index 197f387f..a763ffd4 100644 --- a/docs/AllSupportedTasks.md +++ b/docs/AllSupportedTasks.md @@ -29,11 +29,11 @@ The **task_name** parameter supports: All supported models is can be found in **model hub**. Different models adapt to different tasks. -#### Transfomrer encoder: +#### Transfomrer encoder: For example, model_name="bert-base-chinese" or "RoBERTa-wwm-ext" These models support all of the tasks mentioned in the previous section, such as NER(sequence labeling), text classification, semantic matching, seq2seq and so on. -#### Transformer decoder: +#### Transformer decoder: For example, model_name="gpt2-chinese", the model support "seq2seq" task. Input a beginning of a sentence, the model can continue writing. diff --git a/docs/ChangeToMegatron.md b/docs/ChangeToMegatron.md index a3bddb46..329e5768 100644 --- a/docs/ChangeToMegatron.md +++ b/docs/ChangeToMegatron.md @@ -7,7 +7,7 @@ Three steps: Most of the process in `parallel` are taken from `Megatron-LM`,and is put in `mpu`module ## 1.Turn MLP layer into column/rowParallel版本 -Location:`flagai/model/layers/embeddings_mpu.py` +Location:`flagai/model/layers/embeddings_mpu.py` Key idea: split the two forward1 layers in 1linear1 layer, following the column-fist principle diff --git a/docs/EnvironmentSetup.md b/docs/EnvironmentSetup.md index 090a4d4a..49f02aa7 100644 --- a/docs/EnvironmentSetup.md +++ b/docs/EnvironmentSetup.md @@ -1,11 +1,11 @@ # Pre-training distributed environment setup -# 一. Docker +# 一. 
Docker ## 1.install docker ```shell -# Since the docker version in the apt official library in Ubuntu may be relatively low, +# Since the docker version in the apt official library in Ubuntu may be relatively low, # uninstall the old version with the following command line apt-get remove docker docker-engine docker-ce docker.io @@ -30,7 +30,7 @@ apt-get install -y docker-ce ## 2.Docker source change -### (https://xxxx.mirror.aliyuncs.com) is your own docker source +### (https://xxxx.mirror.aliyuncs.com) is your own docker source ```shell mkdir -p /etc/docker @@ -91,7 +91,7 @@ apt-get install -y nvidia-docker2 } } ``` -### final content of /etc/docker/daemon.json +### final content of /etc/docker/daemon.json ```json { @@ -118,7 +118,7 @@ systemctl restart docker ```dockerfile #pull base image -FROM nvidia/cuda:10.2-devel-ubuntu18.04 +FROM nvidia/cuda:10.2-devel-ubuntu18.04 #maintainer MAINTAINER deepspeed @@ -136,7 +136,7 @@ RUN mkdir -p ${STAGE_DIR} #Installation/Basic Utilities ############################################################################## RUN sed -i s@/archive.ubuntu.com/@/mirrors.tuna.tsinghua.edu.cn/@g /etc/apt/sources.list -RUN sed -i s@/security.ubuntu.com/@/mirrors.tuna.tsinghua.edu.cn/@g /etc/apt/sources.list +RUN sed -i s@/security.ubuntu.com/@/mirrors.tuna.tsinghua.edu.cn/@g /etc/apt/sources.list RUN apt-get update && \ DEBIAN_FRONTEND="noninteractive" apt-get install -y --no-install-recommends \ software-properties-common build-essential autotools-dev \ @@ -150,8 +150,8 @@ RUN apt-get update && \ libjpeg-dev \ libpng-dev \ screen jq psmisc dnsutils lsof musl-dev systemd -``` - +``` + ### c. Install the latest version of GIT (create an image clone installation package) ```dockerfile @@ -180,7 +180,7 @@ RUN cd ${STAGE_DIR} && \ PATH=/usr/bin:$PATH ./mlnxofedinstall --user-space-only --without-fw-update --umad-dev-rw --all -q && \ cd ${STAGE_DIR} && \ rm -rf ${STAGE_DIR}/MLNX_OFED_LINUX-${MLNX_OFED_VERSION}-ubuntu18.04-x86_64* -``` +``` ### e. Install nv_peer_mem @@ -202,7 +202,7 @@ RUN cd ${STAGE_DIR}/nv_peer_memory && \ apt-get update && \ apt-get install -y dkms && \ dpkg-buildpackage -us -uc && \ - dpkg -i ${STAGE_DIR}/nvidia-peer-memory_${NV_PEER_MEM_TAG}_all.deb + dpkg -i ${STAGE_DIR}/nvidia-peer-memory_${NV_PEER_MEM_TAG}_all.deb ``` ### f. 
Install openmpi, You need to install the libevent dependency package first @@ -239,7 +239,7 @@ RUN mv /usr/local/mpi/bin/mpirun /usr/local/mpi/bin/mpirun.real && \ echo 'mpirun.real --allow-run-as-root --prefix /usr/local/mpi "$@"' >> /usr/local/mpi/bin/mpirun && \ chmod a+x /usr/local/mpi/bin/mpirun ``` - + ### g.Install python ```dockerfile @@ -251,24 +251,24 @@ RUN curl -o ~/miniconda.sh https://mirrors.tuna.tsinghua.edu.cn/anaconda/minicon chmod +x ~/miniconda.sh && \ ~/miniconda.sh -b -p /opt/conda && \ rm ~/miniconda.sh && \ - /opt/conda/bin/conda install -y python=$PYTHON_VERSION numpy pyyaml scipy ipython mkl mkl-include ninja cython typing + /opt/conda/bin/conda install -y python=$PYTHON_VERSION numpy pyyaml scipy ipython mkl mkl-include ninja cython typing ``` - + ### h.Install magma-cuda ```dockerfile ########################################################################### #Install magma-cuda ############################################################################## -COPY magma-cuda102-2.5.2-1.tar.bz2 ${STAGE_DIR} +COPY magma-cuda102-2.5.2-1.tar.bz2 ${STAGE_DIR} RUN cd ${STAGE_DIR} && \ /opt/conda/bin/conda install -y -c pytorch --use-local magma-cuda102-2.5.2-1.tar.bz2 && \ /opt/conda/bin/conda clean -ya ####optional##### #RUN /opt/conda/bin/conda install -y -c pytorch magma-cuda102 && \ -#/opt/conda/bin/conda clean -ya +#/opt/conda/bin/conda clean -ya ``` - + ### i.Configuration path ```dockerfile @@ -280,7 +280,7 @@ RUN echo "export PATH=/opt/conda/bin:\$PATH" >> /root/.bashrc RUN pip install --upgrade pip setuptools RUN wget https://tuna.moe/oh-my-tuna/oh-my-tuna.py && python oh-my-tuna.py ``` - + ### j.Install some packages ```dockerfile @@ -316,8 +316,8 @@ RUN pip install psutil \ rouge_score \ cupy-cuda102\ setuptools==60.0.3 -``` - +``` + ### k.Install mpi4py (need to download to local installation, pip installation may report an error due to version compatibility) ```dockerfile @@ -329,7 +329,7 @@ RUN apt-get update && \ COPY mpi4py-3.1.3.tar.gz ${STAGE_DIR} RUN cd ${STAGE_DIR} && tar zxvf mpi4py-3.1.3.tar.gz && \ cd mpi4py-3.1.3 &&\ - python setup.py build && python setup.py install + python setup.py build && python setup.py install ``` ### l.Install pytorch, the version can be replaced, need to download locally first. Installation is easy to be terminated due to network speed. Some sub packages may be terminated during the download process of pytorch git clone. You can try few more times. @@ -349,8 +349,8 @@ ENV NCCL_LIBRARY=/usr/lib/x86_64-linux-gnu ENV NCCL_INCLUDE_DIR=/usr/include RUN cd /opt/pytorch && TORCH_NVCC_FLAGS="-Xfatbin -compress-all" \ CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" USE_SYSTEM_NCCL=1 \ - pip install -v . && rm -rf /opt/pytorch - + pip install -v . && rm -rf /opt/pytorch + ############################################################################## @@ -382,7 +382,7 @@ RUN cd ${STAGE_DIR}/apex && pip install -v --no-cache-dir --global-option="--cpp #Install deepSpeed ############################################################################# RUN pip install py-cpuinfo -RUN apt-get install -y libaio-dev +RUN apt-get install -y libaio-dev ENV TORCH_CUDA_ARCH_LIST="6.0 6.1 7.0+PTX" RUN git clone https://github.com/microsoft/DeepSpeed.git ${STAGE_DIR}/DeepSpeed #COPY DeepSpeed ${STAGE_DIR}/DeepSpeed @@ -449,16 +449,16 @@ docker load --input deepspeed-cuda102.tar.gz ```shell docker build -f cuda102.dockerfile -t deepspeed/cuda102:1221 . # cuda102.dockerfile Refer to the production process of dockerfile file -``` +``` # 二. 
Build containers at each machine node ```shell -# Create a container (NVIDIA-docker), +# Create a container (NVIDIA-docker), # hostname=the host name inside the container # network = host share with the several machine -# ipc = host This is required for cluster training. -# shm_size shared memory, name container external name, -# --gpus specifies GPU, multi data volume: +# ipc = host This is required for cluster training. +# shm_size shared memory, name container external name, +# --gpus specifies GPU, multi data volume: # -v local folder: folder in container -v local folder: folder in container -v local folder: folder in container deepspeed / cuda102:1221 image name: tag nvidia-docker run -id --hostname=glm_dist16 --network=host --ipc=host --shm-size=16gb --name=glm_dist16 --gpus '"device=0,1,2,3"' -v /data1/docker/containers:/data deepspeed/cuda102:1221 ``` @@ -506,7 +506,7 @@ docker rm container_name/container_id ~/.ssh/id_rsa.pub The contents in are collected and synchronized to the files of each machine -~/.ssh/authorized_keys +~/.ssh/authorized_keys ## 3.login without password diff --git a/docs/Huggingface_t5.md b/docs/Huggingface_t5.md index ed44fe59..dce3553b 100644 --- a/docs/Huggingface_t5.md +++ b/docs/Huggingface_t5.md @@ -8,7 +8,7 @@ from transformers import T5ForConditionalGeneration, T5Tokenizer from torch.utils.data import Dataset import torch -## Inheriant the Trainer +## Inheriant the Trainer ## overload the forward_step function class MyTrainer(Trainer): @@ -116,7 +116,7 @@ def seq2seq_collate_fn(batch): sents_src, sents_tgt = read_file() data_len = len(sents_tgt) -train_size = int(data_len * 0.8) +train_size = int(data_len * 0.8) train_src = sents_src[:train_size] train_tgt = sents_tgt[:train_size] @@ -141,7 +141,7 @@ trainer.train(model, ## Tricks for speedup training We may not run a t5-3b on a V100 32G. So, we need some tricks to cut down the GPU memory usage. ### step1.fp16 -Model parameters turned to `fp16` +Model parameters turned to `fp16` ```python trainer = MyTrainer( env_type='pytorch', @@ -154,7 +154,7 @@ trainer = MyTrainer( load_dir=None, lr=1e-4, fp16=True) # change to `True` -``` +``` ### step2.gradient recomputation(checkpoints) Do not save the itermedia results in forward stage. Now you may run t5-3b with `batch size`=1. Now, we can train/finetune a t5-3b with `gradient_accumulation_steps`. @@ -171,9 +171,9 @@ trainer = MyTrainer( lr=1e-4, fp16=True checkpoint_activations = True) # setting as `True` -``` +``` ### step3. data parallel (DDP) -To multiply your batch size, we can use data paralle on multiple GPUs. +To multiply your batch size, we can use data paralle on multiple GPUs. ```python trainer = Trainer( env_type="pytorchDDP", @@ -184,7 +184,7 @@ trainer = Trainer( experiment_name='t5-3b', load_dir=None, lr=1e-4, - fp16=True + fp16=True checkpoint_activations=False, # The following six options is for pytorchDDP master_ip='127.0.0.1', @@ -199,7 +199,7 @@ trainer = Trainer( With `cpuoffload` and `stage2`, increase the `batch size` on single gpu to `4`. 
```python trainer = Trainer( - env_type="deepspeed", # env_type + env_type="deepspeed", # env_type epochs=1, batch_size=1, eval_interval=10, @@ -207,7 +207,7 @@ trainer = Trainer( experiment_name='t5-3b', load_dir=None, lr=1e-4, - fp16=True + fp16=True checkpoint_activations=False, # parallel settings master_ip='127.0.0.1', @@ -217,7 +217,7 @@ trainer = Trainer( hostfile='hostfile', training_script=__file__, # deepspeed - deepspeed_config='deepspeed.json' + deepspeed_config='deepspeed.json' ) ``` @@ -225,7 +225,7 @@ trainer = Trainer( Open your imagenation. ```python trainer = Trainer( - env_type="deepspeed", # env_type + env_type="deepspeed", # env_type epochs=1, batch_size=1, eval_interval=10, @@ -233,7 +233,7 @@ trainer = Trainer( experiment_name='t5-3b', load_dir=None, lr=1e-4, - fp16=True + fp16=True checkpoint_activations=False, # parallel settings master_ip='127.0.0.1', diff --git a/docs/QuickTour.md b/docs/QuickTour.md index 64e14c66..f5485750 100644 --- a/docs/QuickTour.md +++ b/docs/QuickTour.md @@ -103,7 +103,7 @@ for text_pair in test_data: print(predictor.predict_cls_classifier(text_pair)) ``` -## Load model and tokenizer +## Load model and tokenizer We provide the AutoLoad class to load the model and tokenizer quickly, for example: ```python from flagai.auto_model.auto_loader import AutoLoader @@ -117,7 +117,7 @@ tokenizer = auto_loader.get_tokenizer() This example is for the Classification task, and you can also model other tasks by modifying the task_name. Target_size is the number of categories for a classification task. Then you can use the model and tokenizer to finetune or test. -## Predictor +## Predictor We provide the Predictor class to predict for different tasks, for example: ```python predictor = Predictor(model, tokenizer) diff --git a/docs/Seq2seqMethod.md b/docs/Seq2seqMethod.md index c945b671..a0e0be5e 100644 --- a/docs/Seq2seqMethod.md +++ b/docs/Seq2seqMethod.md @@ -1,4 +1,4 @@ -# Seq2seq Method +# Seq2seq Method ## Encoder model We provide the encoder model to perform the seq2seq task, for example, Bert, Roberta, GLM, and so on. @@ -12,7 +12,7 @@ The inputs to this model are two sentences: [cls] sentence_1 [sep] sentence_2 [s Where, sentence_1 does not use mask, and sentence_2 uses autoregressive mask. -## Decoder model +## Decoder model We also provide the decoder model for seq2seq task, such as gpt-2 models. @@ -20,7 +20,7 @@ We also provide the decoder model for seq2seq task, such as gpt-2 models. Giving a start text, this model can be a good continuation of the text. -## Encoder-Decoder model +## Encoder-Decoder model We also provide the encoder-decoder model for seq2seq task, such as T5 models. ![encoder_decoder_mask](./img/encoder_decoder_mask.png) diff --git a/docs/TUTORIAL_10_BERT_EXAMPLE_TITLE_GENERATION.md b/docs/TUTORIAL_10_BERT_EXAMPLE_TITLE_GENERATION.md index 60640743..41bf023a 100644 --- a/docs/TUTORIAL_10_BERT_EXAMPLE_TITLE_GENERATION.md +++ b/docs/TUTORIAL_10_BERT_EXAMPLE_TITLE_GENERATION.md @@ -46,16 +46,16 @@ def read_file(): return src,tgt ``` -### 2.Load model and tokenizer +### 2.Load model and tokenizer ```python from flagai.auto_model.auto_loader import AutoLoader -# the model dir, which contains the 1.config.json, 2.pytorch_model.bin, 3.vocab.txt, +# the model dir, which contains the 1.config.json, 2.pytorch_model.bin, 3.vocab.txt, # or we will download these files from the model hub to this dir. # Autoloader can build the model and tokenizer automatically. # 'seq2seq' is the task_name. 
-auto_loader = AutoLoader("seq2seq", +auto_loader = AutoLoader("seq2seq", model_dir="./state_dict/", model_name="RoBERTa-wwm-ext") model = auto_loader.get_model() @@ -70,7 +70,7 @@ python ./train.py Modify the training configuration by this code: ```python from flagai.trainer import Trainer -import torch +import torch device = torch.device("cuda" if torch.cuda.is_available() else "cpu") trainer = Trainer(env_type="pytorch", experiment_name="roberta_seq2seq", diff --git a/docs/TUTORIAL_11_BERT_EXAMPLE_SEMANTIC_MATCHING.md b/docs/TUTORIAL_11_BERT_EXAMPLE_SEMANTIC_MATCHING.md index 1c1a1f09..8830c64c 100644 --- a/docs/TUTORIAL_11_BERT_EXAMPLE_SEMANTIC_MATCHING.md +++ b/docs/TUTORIAL_11_BERT_EXAMPLE_SEMANTIC_MATCHING.md @@ -46,18 +46,18 @@ def read_file(data_path): return src,tgt ``` -### 2.Load model and tokenizer +### 2.Load model and tokenizer ```python from flash_tran.auto_model.auto_loader import AutoLoader -# the model dir, which contains the 1.config.json, 2.pytorch_model.bin, 3.vocab.txt, +# the model dir, which contains the 1.config.json, 2.pytorch_model.bin, 3.vocab.txt, # or we will download these files from the model hub to this dir. -model_dir = "./state_dict/" +model_dir = "./state_dict/" # Autoloader can build the model and tokenizer automatically. # 'cls' is the task_name. -auto_loader = AutoLoader("cls", - model_dir, +auto_loader = AutoLoader("cls", + model_dir, model_name="RoBERTa-wwm-ext") model = auto_loader.get_model() tokenizer = auto_loader.get_tokenizer() @@ -71,7 +71,7 @@ python ./train.py Modify the training configuration by this code: ```python from flagai.trainer import Trainer -import torch +import torch device = torch.device("cuda" if torch.cuda.is_available() else "cpu") trainer = Trainer(env_type="pytorch", experiment_name="roberta_semantic_matching", diff --git a/docs/TUTORIAL_12_INSTRUCTIONS_FOR_AutoLoader.md b/docs/TUTORIAL_12_INSTRUCTIONS_FOR_AutoLoader.md index e53d3cfe..81c879f7 100644 --- a/docs/TUTORIAL_12_INSTRUCTIONS_FOR_AutoLoader.md +++ b/docs/TUTORIAL_12_INSTRUCTIONS_FOR_AutoLoader.md @@ -8,7 +8,7 @@ Take semantic matching tasks as an example: ## Target contains all target categories ## 0 means that two sentences have the same meaning ## 1 means that two sentences have different meanings -target = [0, 1] +target = [0, 1] auto_loader = AutoLoader(task_name="cls", ## The task name model_name="RoBERTa-wwm-ext", ## The model name. model_dir=model_dir, ## Model download folder @@ -29,11 +29,11 @@ The **task_name** parameter supports: All supported models is can be found in **model hub**. Different models adapt to different tasks. -#### Transfomrer encoder: +#### Transfomrer encoder: For example, model_name="bert-base-chinese" or "RoBERTa-wwm-ext" These models support all of the tasks mentioned in the previous section -#### Transformer decoder: +#### Transformer decoder: For example, model_name="gpt2-chinese", the model support "seq2seq" task. diff --git a/docs/TUTORIAL_13_INSTRUCTIONS_FOR_PREDICTOR.md b/docs/TUTORIAL_13_INSTRUCTIONS_FOR_PREDICTOR.md index 0646a7de..30126815 100644 --- a/docs/TUTORIAL_13_INSTRUCTIONS_FOR_PREDICTOR.md +++ b/docs/TUTORIAL_13_INSTRUCTIONS_FOR_PREDICTOR.md @@ -1,7 +1,7 @@ ## Predictor ### Get prediction result quickly by the Predictor -By the Predictor, you only need to enter a **text**, and you can get the output of the corresponding task directly. +By the Predictor, you only need to enter a **text**, and you can get the output of the corresponding task directly. 
Take gpt2 writing tasks as an example: ```python diff --git a/docs/TUTORIAL_14_BERT_EXAMPLE_NER.md b/docs/TUTORIAL_14_BERT_EXAMPLE_NER.md index 2eeb634d..0508bd64 100644 --- a/docs/TUTORIAL_14_BERT_EXAMPLE_NER.md +++ b/docs/TUTORIAL_14_BERT_EXAMPLE_NER.md @@ -3,7 +3,7 @@ ## Background The task of named entity recognition is to judge whether there are different entities in the input sentences and mark them. -The BERT model supports three ways to complete NER tasks: +The BERT model supports three ways to complete NER tasks: 1. Sequence labeling 2. Sequence labeling + CRF 3. GlobalPointer @@ -83,7 +83,7 @@ python ./train.py Adjust training parameters: ```python from flagai.trainer import Trainer -import torch +import torch device = torch.device("cuda" if torch.cuda.is_available() else "cpu") trainer = Trainer(env_type="pytorch", experiment_name="roberta_ner", diff --git a/docs/TUTORIAL_16_T5_EXAMPLE_TITLE_GENERATION.md b/docs/TUTORIAL_16_T5_EXAMPLE_TITLE_GENERATION.md index 18dd2ecd..0df023a0 100644 --- a/docs/TUTORIAL_16_T5_EXAMPLE_TITLE_GENERATION.md +++ b/docs/TUTORIAL_16_T5_EXAMPLE_TITLE_GENERATION.md @@ -47,7 +47,7 @@ def read_file(): return src,tgt ``` -### 2.Load model and tokenizer +### 2.Load model and tokenizer ```python from flagai.auto_model.auto_loader import AutoLoader @@ -66,7 +66,7 @@ The configuration support multi-gpus training. Modify the training configuration by this code: ```python from flagai.trainer import Trainer -import torch +import torch device = torch.device("cuda" if torch.cuda.is_available() else "cpu") trainer = Trainer( env_type="deepspeed+mpu", diff --git a/docs/TUTORIAL_1_BASICS.md b/docs/TUTORIAL_1_BASICS.md index d6b59245..18e5e613 100644 --- a/docs/TUTORIAL_1_BASICS.md +++ b/docs/TUTORIAL_1_BASICS.md @@ -1,18 +1,18 @@ ## Transformer basics -Briefly, the Transformer, as proposed in [Attention Is All You Need](https://arxiv.org/abs/1706.03762), -can be regarded as an integration of an encoder block and a decoder block. With the use of attention mechanism, -Transformer can extract better features from input, and consequently Transformer-based models have achieved state-of-the-art -performance in most of the language tasks. Another feature of transformer is its compatibility with parallel computing, which is a major -superiority compared with sequential models like RNN. +Briefly, the Transformer, as proposed in [Attention Is All You Need](https://arxiv.org/abs/1706.03762), +can be regarded as an integration of an encoder block and a decoder block. With the use of attention mechanism, +Transformer can extract better features from input, and consequently Transformer-based models have achieved state-of-the-art +performance in most of the language tasks. Another feature of transformer is its compatibility with parallel computing, which is a major +superiority compared with sequential models like RNN. The Transformer architecture is shown below
-In the encoder step, firstly the embedding encoded from token positions is added to embeddings encoded from tokens themselves. -Then the input embeddings are passed to a multi-head attention mechanism, which is able to take mutual information among tokens into consideration. -The computed output will add the original input embeddings, followed by a layer normalization. -The rest of the encoder part is a feed-forward layer with addition and a layer normalization. +In the encoder step, firstly the embedding encoded from token positions is added to embeddings encoded from tokens themselves. +Then the input embeddings are passed to a multi-head attention mechanism, which is able to take mutual information among tokens into consideration. +The computed output will add the original input embeddings, followed by a layer normalization. +The rest of the encoder part is a feed-forward layer with addition and a layer normalization. The decoder architecture is similar to the encoder, with the following differences: @@ -26,11 +26,11 @@ Step-by-step details of transformers can be viewed in [this article](https://tow ## Transformer Applications -Currently, there exist several different pre-training model architectures: autoencoding models that only implement encoder architecture (e.g., BERT), +Currently, there exist several different pre-training model architectures: autoencoding models that only implement encoder architecture (e.g., BERT), autoregressive models that only implement decoder (e.g., GPT), and encoder-decoder models that implement both encoder and decoder (e.g., T5). -The **GLM model**, proposed in [All NLP Tasks Are Generation Tasks: A General Pretraining Framework](https://arxiv.org/abs/2103.10360), -claims to have good performance in classification, unconditional generation, and conditional generation tasks with the use of a new pre-training method. +The **GLM model**, proposed in [All NLP Tasks Are Generation Tasks: A General Pretraining Framework](https://arxiv.org/abs/2103.10360), +claims to have good performance in classification, unconditional generation, and conditional generation tasks with the use of a new pre-training method. The key features of GLM include: @@ -40,7 +40,7 @@ The key features of GLM include: - [Pre-LN](http://proceedings.mlr.press/v119/xiong20b.html) - 2D positional encoding: Each token has two positional encodings: the global position in the sentence, and the local position inside the masked span. - Feed-forward network is replaced with a linear layer - + An example showing the autoencoding and autoregressive of GLM can be viewed [here](APPENDIX_GLM_IO.md). 
diff --git a/docs/TUTORIAL_2_PROJECT_STRUCTURE.md b/docs/TUTORIAL_2_PROJECT_STRUCTURE.md index c8d56026..881c918d 100644 --- a/docs/TUTORIAL_2_PROJECT_STRUCTURE.md +++ b/docs/TUTORIAL_2_PROJECT_STRUCTURE.md @@ -9,14 +9,14 @@ Sailing/ | |--fp16/ # contains fp16 tools | |--mpu/ # fork from megatron-lm | |--docs/ # documentations - | |--trainer.py/ # trainer for pytorch, deepspeed+mpu - | |--logging.py # + | |--trainer.py/ # trainer for pytorch, deepspeed+mpu + | |--logging.py # | |--metrics.py/ # contains frequently-used metrics, e.g., accuracy - | |--optimizers.py + | |--optimizers.py | |--schedulers.py.py | |--test_utils.py # contains tools for testing | |--utils.py - |--setup.py + |--setup.py |--test.py # for excuting all tests |--README.md |--requirements.txt diff --git a/docs/TUTORIAL_4_DATASET.md b/docs/TUTORIAL_4_DATASET.md index 646e1c3f..ce593e3b 100644 --- a/docs/TUTORIAL_4_DATASET.md +++ b/docs/TUTORIAL_4_DATASET.md @@ -3,20 +3,20 @@ ## Supported downstream datasets Our project now supports 12 datasets as listed below: -| Dataset Name | Identifier | Language | Source Benchmark | +| Dataset Name | Identifier | Language | Source Benchmark | |------------------------|------------|----------|------------------| | Broadcoverage Diagnostics | BoolQ | English | SuperGLUE | | CommitmentBank | CB | English | SuperGLUE | -| Choice of Plausible Alternatives | COPA | English | SuperGLUE | +| Choice of Plausible Alternatives | COPA | English | SuperGLUE | | Multi-Sentence Reading Comprehension | MultiRC | English | SuperGLUE | -| Recognizing Textual Entailment | RTE | English | SuperGLUE | -| Words in Context | WiC | English | SuperGLUE | +| Recognizing Textual Entailment | RTE | English | SuperGLUE | +| Words in Context | WiC | English | SuperGLUE | | The Winograd Schema Challenge | WSC | English | SuperGLUE | | Ant Financial Question Matching Corpus | AFQMC | Chinese | CLUE | | Short Text Classificaiton for News | TNEWS | Chinese | CLUE | | Reading Comprehension for Simplified Chinese | CMRC2018 | Chinese | CLUE | -## Introduction to prompt learning +## Introduction to prompt learning ## Load datasets @@ -51,7 +51,7 @@ loader = torch.utils.data.DataLoader(dataset, pin_memory=False, collate_fn=collate_fn) -# Iterating loader +# Iterating loader it = iter(loader) next(it) batch = next(it) diff --git a/docs/TUTORIAL_8_TRAINING.md b/docs/TUTORIAL_8_TRAINING.md index 235c4cb6..386587ae 100644 --- a/docs/TUTORIAL_8_TRAINING.md +++ b/docs/TUTORIAL_8_TRAINING.md @@ -6,7 +6,7 @@ Let's run finetuning over SuperGlue Dataset, as following `train.py`: ```python from flagai.trainer import Trainer from flagai.model.glm_model import GLMModel,GLMForSequenceClassification -from flagai.data.tokenizer import GLMBertWordPieceTokenizer +from flagai.data.tokenizer import GLMBertWordPieceTokenizer from flagai.metrics import accuracy_metric from flagai.data.dataset import SuperGlueDataset from flagai.test_utils import CollateArguments @@ -39,7 +39,7 @@ valid_dataset = SuperGlueDataset(task_name="boolq", data_dir='/mnt/datasets/yan/ trainer.train(model, collate_fn=collate_fn, - train_dataset=train_dataset, valid_dataset=valid_dataset, + train_dataset=train_dataset, valid_dataset=valid_dataset, eval_metrics=accuracy_metric) ``` @@ -51,7 +51,7 @@ Done! 
The results may be around 80% ## Finetunine with deepspeed -we only change some settings in Trainer & run +we only change some settings in Trainer & run ```python trainer = Trainer(env_type='deepspeed', # change env_type @@ -61,7 +61,7 @@ we only change some settings in Trainer & run log_interval=50, experiment_name='glm_large', load_dir=None - + # parallel settings master_ip='127.0.0.1', master_port=17750, diff --git a/docs/TUTORIAL_9_GLM_EXAMPLE_PEOTRY_GENERATION.md b/docs/TUTORIAL_9_GLM_EXAMPLE_PEOTRY_GENERATION.md index a2ac9c3c..2e60cdda 100644 --- a/docs/TUTORIAL_9_GLM_EXAMPLE_PEOTRY_GENERATION.md +++ b/docs/TUTORIAL_9_GLM_EXAMPLE_PEOTRY_GENERATION.md @@ -36,7 +36,7 @@ def read_file(): ## src = ["春晓:五言绝句", "标题:五言律诗",......] ## tgt = ["春眠不觉晓,处处闻啼鸟。夜来风雨声,花落知多少。", "诗句...", ......] ## no matter what data you use, you need to construct the right src and tgt. - + return src,tgt ``` 2)Define the DataLoader: @@ -52,7 +52,7 @@ class BertSeq2seqDataset(Dataset): target_text = self.sents_tgt[i] data=tokenizer.encode_plus(source_text, target_text=target_text) - return data + return data def __len__(self): return len(self.sents_src) @@ -115,9 +115,9 @@ train_dataset = BertSeq2seqDataset(train_src, ```python from flagai.auto_model.auto_loader import AutoLoader -# the model dir, which contains the 1.config.json, 2.pytorch_model.bin, 3.vocab.txt, +# the model dir, which contains the 1.config.json, 2.pytorch_model.bin, 3.vocab.txt, # or we will download these files from the model hub to this dir. -model_dir = "./state_dict/glm/" +model_dir = "./state_dict/glm/" # Autoloader can build the model and tokenizer automatically. # 'seq2seq' is the task_name. AutoLoader("seq2seq",model_name="glm_large_ch",model_dir=model_dir) @@ -138,7 +138,7 @@ trainer = Trainer( lr=2e-4,#2e-4 weight_decay=2e-8,#1e-3 epochs=100, - log_interval=10, + log_interval=10, tensorboard_dir="tbsummary", eval_interval=2000000, load_dir="", @@ -159,7 +159,7 @@ Pass the model, data, and collate_fn into the trainer to start training: trainer.train(model, train_dataset=train_dataset, collate_fn=my_collate_fn) -``` +``` diff --git a/docs/docs_zh/TUTORIAL_10_BERT_EXAMPLE_TITLE_GENERATION.md b/docs/docs_zh/TUTORIAL_10_BERT_EXAMPLE_TITLE_GENERATION.md index 053aa4a5..8f3af7d8 100644 --- a/docs/docs_zh/TUTORIAL_10_BERT_EXAMPLE_TITLE_GENERATION.md +++ b/docs/docs_zh/TUTORIAL_10_BERT_EXAMPLE_TITLE_GENERATION.md @@ -46,16 +46,16 @@ def read_file(): return src,tgt ``` -### 2.Load model and tokenizer +### 2.Load model and tokenizer ```python from flash_tran.auto_model.auto_loader import AutoLoader -# the model dir, which contains the 1.config.json, 2.pytorch_model.bin, 3.vocab.txt, +# the model dir, which contains the 1.config.json, 2.pytorch_model.bin, 3.vocab.txt, # or we will download these files from the model hub to this dir. # Autoloader can build the model and tokenizer automatically. # 'seq2seq' is the task_name. 
-auto_loader = AutoLoader("seq2seq", +auto_loader = AutoLoader("seq2seq", model_dir="./state_dict/", model_name="RoBERTa-wwm-ext") model = auto_loader.get_model() @@ -70,7 +70,7 @@ python ./train.py Modify the training configuration by this code: ```python from flash_tran.trainer import Trainer -import torch +import torch device = torch.device("cuda" if torch.cuda.is_available() else "cpu") trainer = Trainer(env_type="pytorch", experiment_name="roberta_seq2seq", diff --git a/docs/docs_zh/TUTORIAL_11_BERT_EXAMPLE_SEMANTIC_MATCHING.md b/docs/docs_zh/TUTORIAL_11_BERT_EXAMPLE_SEMANTIC_MATCHING.md index 73d0e552..1c989f09 100644 --- a/docs/docs_zh/TUTORIAL_11_BERT_EXAMPLE_SEMANTIC_MATCHING.md +++ b/docs/docs_zh/TUTORIAL_11_BERT_EXAMPLE_SEMANTIC_MATCHING.md @@ -50,13 +50,13 @@ def read_file(data_path): ```python from flash_tran.auto_model.auto_loader import AutoLoader -# the model dir, which contains the 1.config.json, 2.pytorch_model.bin, 3.vocab.txt, +# the model dir, which contains the 1.config.json, 2.pytorch_model.bin, 3.vocab.txt, # or we will download these files from the model hub to this dir. -model_dir = "./state_dict/" +model_dir = "./state_dict/" # Autoloader can build the model and tokenizer automatically. # 'cls' is the task_name. -auto_loader = AutoLoader("cls", - model_dir, +auto_loader = AutoLoader("cls", + model_dir, model_name="RoBERTa-wwm-ext") model = auto_loader.get_model() tokenizer = auto_loader.get_tokenizer() @@ -70,7 +70,7 @@ python ./train.py 调整训练参数: ```python from flash_tran.trainer import Trainer -import torch +import torch device = torch.device("cuda" if torch.cuda.is_available() else "cpu") trainer = Trainer(env_type="pytorch", experiment_name="roberta_semantic_matching", diff --git a/docs/docs_zh/TUTORIAL_14_BERT_EXAMPLE_NER.md b/docs/docs_zh/TUTORIAL_14_BERT_EXAMPLE_NER.md index 672014ee..fb6294fd 100644 --- a/docs/docs_zh/TUTORIAL_14_BERT_EXAMPLE_NER.md +++ b/docs/docs_zh/TUTORIAL_14_BERT_EXAMPLE_NER.md @@ -83,7 +83,7 @@ python ./train.py 调整训练参数: ```python from flash_tran.trainer import Trainer -import torch +import torch device = torch.device("cuda" if torch.cuda.is_available() else "cpu") trainer = Trainer(env_type="pytorch", experiment_name="roberta_ner", diff --git a/docs/docs_zh/TUTORIAL_16_T5_EXAMPLE_TITLE_GENERATION.md b/docs/docs_zh/TUTORIAL_16_T5_EXAMPLE_TITLE_GENERATION.md index 02d33312..fa15992b 100644 --- a/docs/docs_zh/TUTORIAL_16_T5_EXAMPLE_TITLE_GENERATION.md +++ b/docs/docs_zh/TUTORIAL_16_T5_EXAMPLE_TITLE_GENERATION.md @@ -47,7 +47,7 @@ def read_file(): return src,tgt ``` -### 2.Load model and tokenizer +### 2.Load model and tokenizer ```python from flash_tran.auto_model.auto_loader import AutoLoader @@ -66,7 +66,7 @@ The configuration support multi-gpus training. 
Modify the training configuration by this code: ```python from flash_tran.trainer import Trainer -import torch +import torch device = torch.device("cuda" if torch.cuda.is_available() else "cpu") trainer = Trainer( env_type="deepspeed+mpu", diff --git a/examples/bert_faq/1_construct_data.py b/examples/bert_faq/1_construct_data.py index 0cc0b82b..18af2df7 100644 --- a/examples/bert_faq/1_construct_data.py +++ b/examples/bert_faq/1_construct_data.py @@ -1,5 +1,5 @@ -## construct data -## data from https://github.com/murufeng/ChineseNlpCorpus +# construct data +# data from https://github.com/murufeng/ChineseNlpCorpus import torch from flagai.auto_model.auto_loader import AutoLoader from flagai.model.predictor.predictor import Predictor diff --git a/examples/bert_faq/README.md b/examples/bert_faq/README.md index 204f497b..9e0b900b 100644 --- a/examples/bert_faq/README.md +++ b/examples/bert_faq/README.md @@ -7,7 +7,7 @@ The task consists of two steps: 1. construct database 2. index by the faiss pack faiss package is : https://github.com/facebookresearch/faiss -## Construct database +## Construct database In "1_construct_data.py", you need to set some parameters, such as "faq_data_path", "answer_save_path", "embedding_save_path". diff --git a/examples/bert_ner/train_crf.py b/examples/bert_ner/train_crf.py index 678a1857..25b963ea 100644 --- a/examples/bert_ner/train_crf.py +++ b/examples/bert_ner/train_crf.py @@ -69,7 +69,7 @@ def load_data(filename): print(f"target is {target}") -## 自定义dataset +# custom dataset class NERDataset(Dataset): def __init__(self, data): diff --git a/examples/bert_title_generation/train_multi_gpu.py b/examples/bert_title_generation/train_multi_gpu.py index 85a3c648..ab7bd5df 100644 --- a/examples/bert_title_generation/train_multi_gpu.py +++ b/examples/bert_title_generation/train_multi_gpu.py @@ -31,14 +31,12 @@ num_gpus=2, checkpoint_activations=False, model_parallel_size=1, - hostfile= - '/data/liuguang/test_Sailing/Sailing/examples/bert_title_generation/hostfile', - deepspeed_config= - '/data/liuguang/test_Sailing/Sailing/examples/bert_title_generation/deepspeed.json', + hostfile='./hostfile', + deepspeed_config='./deepspeed.json', training_script=__file__, ) -src_dir = '/data/auto_title/train.src' -tgt_dir = '/data/auto_title/train.tgt' +src_dir = './data/train.src' +tgt_dir = './data/train.tgt' model_dir = "./state_dict/" os.makedirs(model_dir, exist_ok=True) maxlen = 256 diff --git a/examples/glm_title_generation/train.py b/examples/glm_title_generation/train.py index 1999c186..0bc42ded 100644 --- a/examples/glm_title_generation/train.py +++ b/examples/glm_title_generation/train.py @@ -27,10 +27,8 @@ master_port=17750, num_nodes=1, num_gpus=2, - hostfile= - '/data/liuguang/test_Sailing/Sailing/examples/bert_title_generation/hostfile', - deepspeed_config= - '/data/liuguang/test_Sailing/Sailing/examples/bert_title_generation/deepspeed.json', + hostfile='./hostfile', + deepspeed_config='./deepspeed.json', training_script=__file__, ) src_dir = './examples/glm_title_generation/data/train.src' @@ -102,7 +100,6 @@ def __getitem__(self, i): target_tokens = target_tokens + [eop_id] if len(target_tokens) > self.max_tgt_length: target_tokens = target_tokens[:self.max_tgt_length] - target_truncated = True loss_mask = [1] * len(target_tokens) if len(target_tokens) < self.max_tgt_length: loss_mask += [0] * (self.max_tgt_length - len(target_tokens)) diff --git a/examples/gpt2_title_generation/train_multi_gpu.py b/examples/gpt2_title_generation/train_multi_gpu.py index 
2cdb263a..fcffdea1 100644 --- a/examples/gpt2_title_generation/train_multi_gpu.py +++ b/examples/gpt2_title_generation/train_multi_gpu.py @@ -31,10 +31,8 @@ num_gpus=2, checkpoint_activations=False, model_parallel_size=1, - hostfile= - '/data/liuguang/test_Sailing/Sailing/examples/bert_title_generation/hostfile', - deepspeed_config= - '/data/liuguang/test_Sailing/Sailing/examples/bert_title_generation/deepspeed.json', + hostfile='./hostfile', + deepspeed_config='./deepspeed.json', training_script=__file__, ) src_dir = '/data/auto_title/train.src' diff --git a/examples/t5_huggingface/run_train.sh b/examples/t5_huggingface/run_train.sh index 4503c557..08015933 100644 --- a/examples/t5_huggingface/run_train.sh +++ b/examples/t5_huggingface/run_train.sh @@ -6,5 +6,5 @@ python -m launch --launcher distributed_deepspeed \ --num_nodes 1 \ ./train_t5.py \ --deepspeed_config ./deepspeed.json \ - --model_parallel_size 1 + --model_parallel_size 1 diff --git a/examples/t5_kgclue/finetune_t5_triple2question.py b/examples/t5_kgclue/finetune_t5_triple2question.py index 34607561..02ef78fe 100644 --- a/examples/t5_kgclue/finetune_t5_triple2question.py +++ b/examples/t5_kgclue/finetune_t5_triple2question.py @@ -67,7 +67,6 @@ def __getitem__(self, index): input_ids = source["input_ids"].squeeze() attention_mask = source["attention_mask"].squeeze() target_ids = target["input_ids"].squeeze() - target_mask = target["attention_mask"].squeeze() decoder_input_ids = target_ids[:-1] labels = target_ids[1:].clone().detach() labels[target_ids[1:] == tokenizer.pad_token_id] diff --git a/examples/t5_title_generation/train.py b/examples/t5_title_generation/train.py index 82517690..d2cc1a23 100644 --- a/examples/t5_title_generation/train.py +++ b/examples/t5_title_generation/train.py @@ -30,10 +30,8 @@ master_port=17750, num_nodes=1, num_gpus=2, - hostfile= - '/data/liuguang/test_Sailing/Sailing/examples/bert_title_generation/hostfile', - deepspeed_config= - '/data/liuguang/test_Sailing/Sailing/examples/bert_title_generation/deepspeed.json', + hostfile='./hostfile', + deepspeed_config='./deepspeed.json', training_script=__file__, ) src_dir = '/data/auto_title/train.src' diff --git a/flagai/auto_model/auto_loader.py b/flagai/auto_model/auto_loader.py index 47d3bf14..b2bec2d7 100644 --- a/flagai/auto_model/auto_loader.py +++ b/flagai/auto_model/auto_loader.py @@ -3,7 +3,6 @@ class LazyImport(object): - def __init__(self, name): self.cache = {} self.mod_name = name @@ -68,7 +67,6 @@ def __getattr__(self, name): class AutoLoader: - def __init__(self, task_name: str, model_name: str, @@ -77,20 +75,32 @@ def __init__(self, **kwargs): """ Args: - task_name: The task name, for example, "cls" for classification, "sequence_labeling" for ner, part-of-speech tagging and so on, "seq2seq" for sequence to sequence task. - model_name: The model name, for example, "bert-base-chinese", "RoBERTa-wwm-ext", "gpt2-chinese", "t5-base-chinese" and so on. + task_name: The task name, for example, "cls" for classification, + "sequence_labeling" for ner, part-of-speech tagging + and so on, "seq2seq" for sequence to sequence task. + model_name: The model name, for example, "bert-base-chinese", + "RoBERTa-wwm-ext", "gpt2-chinese", + "t5-base-chinese" and so on. model_dir: The first level of the model download directory. load_pretrain_params: Whether to load the downloaded parameters. target_size: For the classification task, all labels size. - inner_dim: For global pointer ner task, inner_dim is the representation dim of start and end tokens. 
+ inner_dim: For global pointer ner task, inner_dim is the + representation dim of start and end tokens. Examples:: - # load bert-base-chinese model and tokenizer to do the two classification task of text. - # Then the download path of config, model, vocab files is the "./checkpoints/bert-base-chinese" - >>> auto_loader = AutoLoader(task_name, model_name="bert-base-chinese", model_dir="checkpoints", load_pretrain_params=True, class_num=2) + # load bert-base-chinese model and tokenizer to do the + # binary classification task of text. + # Then the download path of config, model, vocab files is the + # "./checkpoints/bert-base-chinese" + >>> auto_loader = AutoLoader(task_name, + model_name="bert-base-chinese", + model_dir="checkpoints", + load_pretrain_params=True, + class_num=2) """ - # Get the brief_model_name by the model_name, to decide the model to use. + # Get the brief_model_name by the model_name, + # to decide the model to use. brief_model_name = "" if model_name not in MODEL_DICT: for k in MODEL_DICT.keys(): @@ -100,7 +110,7 @@ def __init__(self, else: brief_model_name = model_name - ## The dir to save config, vocab and model. + # The dir to save config, vocab and model. download_path = os.path.join(model_dir, model_name) os.makedirs(download_path, exist_ok=True) @@ -112,13 +122,11 @@ def __init__(self, tokenizer_name = brief_model_name vocab_file = os.path.join(model_dir, model_name, "vocab.txt") - #model_path = os.path.join(model_dir, model_name, - # "pytorch_model.bin") - self.model = ALL_TASK.get(f"{brief_model_name}_{task_name}", None) if self.model is None: print( - f"For the model_name: {model_name}, task_name: {task_name} is not be supported." + f"For the model_name: {model_name}, task_name: {task_name} \ + is not supported." ) os._exit(0) self.model = getattr(LazyImport(self.model[0]), diff --git a/flagai/data/collate_utils.py b/flagai/data/collate_utils.py index cbc93ac5..2cad78c4 100644 --- a/flagai/data/collate_utils.py +++ b/flagai/data/collate_utils.py @@ -12,16 +12,14 @@ def padding(indice, max_length, pad_idx=0): def bert_sequence_label_gp_collate_fn(batch): """ - 动态padding, batch为一部分sample + Dynamic padding """ - def sequence_padding(inputs, length=None, value=0, seq_dims=1, mode='post'): - """Numpy函数,将序列padding到同一长度 - """ + """ padding sequences to the same length""" if length is None: length = np.max([np.shape(x)[:seq_dims] for x in inputs], axis=0) elif not hasattr(length, '__getitem__'): @@ -96,7 +94,7 @@ def bert_cls_collate_fn(batch): def bert_sequence_label_collate_fn(batch): """ - 动态padding, batch为一部分sample + Dynamic padding """ token_ids = [data["input_ids"] for data in batch] diff --git a/flagai/data/dataset/block/blocklm_utils.py b/flagai/data/dataset/block/blocklm_utils.py index c02f0187..7c806779 100644 --- a/flagai/data/dataset/block/blocklm_utils.py +++ b/flagai/data/dataset/block/blocklm_utils.py @@ -28,7 +28,6 @@ def index_in_list(lst, val, start=None): class ConstructBlockStrategy: - def __init__(self, tokenizer, max_seq_length, @@ -505,8 +504,7 @@ def __call__(self, samples): multiple_doc = index_in_list( sample['text'], self.tokenizer.get_command('eos').Id) not in [ - -1, len(sample['text']) - 1 - ] + -1, len(sample['text']) - 1] if multiple_doc or rng.random() < self.infill_prob: division = len(sample['text']) - generation_length tokens, loss_masks = sample['text'], sample['loss_mask'] diff --git a/flagai/data/dataset/block/corpora.py b/flagai/data/dataset/block/corpora.py index ffff2c7e..c656b049 100644 ---
a/flagai/data/dataset/block/corpora.py +++ b/flagai/data/dataset/block/corpora.py @@ -153,7 +153,6 @@ def write_result(data, writers): class PromptDataset(Dataset): - def __init__(self, prompt_loader, text_loader, diff --git a/flagai/data/dataset/block/data_utils.py b/flagai/data/dataset/block/data_utils.py index 46ebb409..ea686c8c 100644 --- a/flagai/data/dataset/block/data_utils.py +++ b/flagai/data/dataset/block/data_utils.py @@ -42,7 +42,6 @@ class SplitDataset(Dataset): ds (Dataset or array-like): List of datasets to be subindexed split_inds (1D array-like): List of indices part of subset """ - def __init__(self, ds, split_inds, **kwargs): self.split_inds = list(split_inds) self.wrapped_data = ds diff --git a/flagai/data/dataset/block/dataset.py b/flagai/data/dataset/block/dataset.py index 3ae54b01..a1641215 100644 --- a/flagai/data/dataset/block/dataset.py +++ b/flagai/data/dataset/block/dataset.py @@ -18,7 +18,6 @@ class BlockDataset(Dataset): - def __init__(self, ds, tokenizer, diff --git a/flagai/data/dataset/block/lazy_loader.py b/flagai/data/dataset/block/lazy_loader.py index 1cd19a1c..861e7247 100644 --- a/flagai/data/dataset/block/lazy_loader.py +++ b/flagai/data/dataset/block/lazy_loader.py @@ -60,7 +60,6 @@ def exists_scatter(path, scatter_num=64, data_type='data'): class LazyWriter: - def __init__(self, path, data_type, @@ -93,11 +92,7 @@ def write(self, s): encoded = s.encode('utf-8') self.output.write(encoded) self.lengths.append(len(encoded)) - # print(self.datapath) - # with open(self.datapath, 'rb') as f: - # file = f.read() - # file = np.ndarray(shape=(len(file) // data_type_size,), dtype=array_data_type, buffer=self.file, - # order='C') + def close(self): self.output.close() with open(self.lenpath, 'wb') as f: @@ -119,7 +114,6 @@ class ProcessorTokenizer: callable class that runs a preprocessing, as well as tokenization step, on input text. 
""" - def __init__(self, tokenizer, process_fn=None): self.tokenizer = tokenizer self.process_fn = process_fn @@ -150,7 +144,6 @@ class LazyLoader(object): data_type2 data_type2.len.pkl """ - def __init__(self, path, data_type='data', diff --git a/flagai/data/dataset/data_collator/collate_fn.py b/flagai/data/dataset/data_collator/collate_fn.py index 3e8e7e15..8d79340b 100644 --- a/flagai/data/dataset/data_collator/collate_fn.py +++ b/flagai/data/dataset/data_collator/collate_fn.py @@ -76,7 +76,6 @@ def pad_choice_dim(data, choice_num): class ConstructSuperglueStrategy: - def __init__(self, args, tokenizer, task_name): # pattern_id, seq_length, num_prompt_tokens, multi_token, segment_length, fast_decode, dataset_type, cloze_val=True self.tokenizer = tokenizer @@ -108,7 +107,6 @@ def __call__(self, examples): class ConstructSeq2seqStrategy: - def __init__(self, args, tokenizer, task_name): # pattern_id, seq_length, num_prompt_tokens, multi_token, segment_length, fast_decode, dataset_type, cloze_val=True self.tokenizer = tokenizer @@ -129,14 +127,10 @@ def __init__(self, args, tokenizer, task_name): def __call__(self, examples): samples = [] - for example in examples: - source_text = example.texta + ' ' + example.textb - target_text = example['meta']['answer']['text'] return my_collate(samples) class ConstructBlockStrategy: - def __init__(self, tokenizer, max_seq_length, @@ -614,8 +608,7 @@ def __call__(self, samples): multiple_doc = index_in_list( sample['input_ids'], self.tokenizer.get_command('eos').Id) not in [ - -1, len(sample['input_ids']) - 1 - ] + -1, len(sample['input_ids']) - 1] if multiple_doc or rng.random() < self.infill_prob: division = len(sample['input_ids']) - generation_length tokens, loss_masks = sample['input_ids'], sample[ diff --git a/flagai/data/dataset/data_utils.py b/flagai/data/dataset/data_utils.py index 9c8bcf8d..3b7fb125 100755 --- a/flagai/data/dataset/data_utils.py +++ b/flagai/data/dataset/data_utils.py @@ -21,14 +21,12 @@ import numpy as np import random import torch -import sys - -sys.path.append("../../../") from flagai import mpu - -from flagai.data.dataset.block.lazy_loader import LazyWriter, LazyLoader, exists_lazy -from flagai.data.dataset.block.corpora import WuDaoCorpus, PromptDataset -# +try: + import deepspeed +except: + pass +import os def clean_text(text): @@ -44,7 +42,6 @@ def clean_text(text): class InputExample(object): """A raw input example consisting of one or two segments of text and a label""" - def __init__(self, guid, text_a, @@ -239,7 +236,6 @@ def build_decoder_input(enc_ids, answer_ids, max_seq_length, mask_id = tokenizer.get_command('MASK').Id eos_id = tokenizer.get_command('eos').Id sop_id = tokenizer.get_command('sop').Id - enc_len = len(enc_ids) masks = [] # TODO: it probably takes too much memory # for i in range(max_dec_seq_length): @@ -332,24 +328,6 @@ def build_decoder_sample(sample, dec_ids, dec_position, dec_masks, dec_target, return sample -# -# -# -# -# -# class FakeDataloader: -# def __init__(self, num_iters): -# self.num_iters = num_iters -# -# def __iter__(self): -# if self.num_iters is not None: -# for _ in range(self.num_iters): -# yield None -# else: -# while True: -# yield None -# -# def punctuation_standardization(string: str): punctuation_dict = { "\u201c": "\"", @@ -363,16 +341,6 @@ def punctuation_standardization(string: str): return string -# -# -# -try: - import deepspeed -except: - pass -import os - - def set_deepspeed_activation_checkpointing(args): deepspeed.checkpointing.configure(mpu, 
deepspeed_config=args.deepspeed_config, @@ -421,138 +389,3 @@ def set_random_seed(seed): np.random.seed(seed) torch.manual_seed(seed) mpu.model_parallel_cuda_manual_seed(seed) - - -# -# -# from operator import itemgetter -# from torch.utils.data import Dataset -# class SplitDataset(Dataset): -# """ -# Dataset wrapper to access a subset of another dataset. -# Purpose: useful to index into existing datasets, possibly -# large-scale datasets as the subindexing operation is done in an -# on-the-fly manner. -# Arguments: -# ds (Dataset or array-like): List of datasets to be subindexed -# split_inds (1D array-like): List of indices part of subset -# """ -# -# def __init__(self, ds, split_inds, **kwargs): -# self.split_inds = list(split_inds) -# self.wrapped_data = ds -# self.is_lazy = isinstance(ds, LazyLoader) or (hasattr(ds, 'is_lazy') and ds.is_lazy) -# self._X = None -# self._Y = None -# -# def __len__(self): -# return len(self.split_inds) -# -# def get_text_len(self, idx): -# return self.wrapped_data.get_text_len(self.split_inds[idx]) -# -# def __getitem__(self, index): -# return self.wrapped_data[self.split_inds[index]] -# -# def SetTokenizer(self, tokenizer): -# self.wrapped_data.SetTokenizer(tokenizer) -# -# def GetTokenizer(self): -# return self.wrapped_data.GetTokenizer() -# -# @property -# def X(self): -# if self._X is None: -# self._X = itemgetter(*self.split_inds)(self.wrapped_data.X) -# return self._X -# -# @property -# def Y(self): -# if self._Y is None: -# self._Y = np.array(itemgetter(*self.split_inds)(self.wrapped_data.Y)) -# return self._Y -# -# def __iter__(self): -# for idx in self.split_inds: -# yield self.wrapped_data[idx] -# -# -# def split_ds(ds, split=None, shuffle=True, save_splits=None, load_splits=None): -# """ -# Split a dataset into subsets given proportions of how -# much to allocate per split. If a split is 0% returns None for that split. -# Purpose: Useful for creating train/val/test splits -# Arguments: -# ds (Dataset or array-like): Data to be split. -# split (1D array-like): proportions to split `ds`. `sum(splits) != 0` -# shuffle (boolean): Randomly split dataset. 
Default: True -# save_splits: save split indices to file -# load_splits: load split indices from file -# """ -# if split is None: -# split = [.8, .2, .0] -# split_sum = sum(split) -# if split_sum == 0: -# raise Exception('Split cannot sum to 0.') -# split = np.array(split) -# split /= split_sum -# ds_len = len(ds) -# inds = np.arange(ds_len) -# if shuffle: -# rng = np.random.RandomState(1234) -# rng.shuffle(inds) -# # if load_splits is not None: -# # inds = np.load(load_splits) -# # assert len(inds) == ds_len -# # elif save_splits is not None: -# # if torch.distributed.get_rank() == 0: -# # np.save(save_splits, inds) -# # print(f"Save split indices to {save_splits}") -# start_idx = 0 -# residual_idx = 0 -# rtn_ds = [None] * len(split) -# for i, f in enumerate(split): -# if f != 0: -# proportion = ds_len * split[i] -# residual_idx += proportion % 1 -# split_ = int(int(proportion) + residual_idx) -# split_inds = inds[start_idx:start_idx + max(split_, 1)] -# rtn_ds[i] = SplitDataset(ds, split_inds) -# start_idx += split_ -# residual_idx %= 1 -# return rtn_ds -# -# -# def add_args(ds_args, tokenizer): -# ds_args.eod_token = tokenizer.get_command('eos').Id -# assert ds_args.eod_token == tokenizer.get_command('pad').Id -# -# if ds_args.block_lm: -# ds_args.data_set_type = "Block" -# if ds_args.sentinel_token: -# ds_args.add_sentinel_token = ds_args.max_position_embeddings -# return ds_args -# -# -# def get_dataset_lazy(path, tokenizer, pre_tokenize,num_processes, no_lazy_loader=False): -# if not (exists_lazy(path, data_type='prompt') and exists_lazy(path, data_type='text')): -# # print(f"Creating lazy loader for dataset {name}") -# prompt_writer = LazyWriter(path, data_type='prompt', is_array=pre_tokenize) -# text_writer = LazyWriter(path, data_type='text', is_array=pre_tokenize) -# writers = {'prompt': prompt_writer, 'text': text_writer} -# reader = WuDaoCorpus(writers=writers, tokenizer=tokenizer, tokenize=pre_tokenize, path=path) -# reader.process(num_processes) -# prompt_writer.close() -# text_writer.close() -# -# map_fn = (lambda x: x.tolist()) if pre_tokenize else None -# -# prompts = LazyLoader(path, data_type='prompt', map_fn=map_fn, mem_map=True, -# is_array=pre_tokenize, load_memory=no_lazy_loader) -# texts = LazyLoader(path, data_type='text', map_fn=map_fn, mem_map=True, -# is_array=pre_tokenize, load_memory=no_lazy_loader) -# -# text = PromptDataset(prompt_loader=prompts, text_loader=texts, tokenizer=tokenizer, -# to_tokenize=not pre_tokenize) -# -# return text diff --git a/flagai/data/dataset/language_model/dataset.py b/flagai/data/dataset/language_model/dataset.py index a6a2ad46..28e1a7fa 100755 --- a/flagai/data/dataset/language_model/dataset.py +++ b/flagai/data/dataset/language_model/dataset.py @@ -10,7 +10,6 @@ class LMDataset(torch.utils.data.Dataset): - def __init__(self, args, documents, tokenizer, num_original_tokens, num_tokenized_tokens): self.args = args @@ -100,7 +99,6 @@ def __getitem__(self, idx): class LambadaDataset(torch.utils.data.Dataset): - def __init__(self, args, tokenizer, strict=True): data_path = args.valid_data[0] print_rank_0( diff --git a/flagai/data/dataset/seq2seq/dataset.py b/flagai/data/dataset/seq2seq/dataset.py index 8ba433a4..f1bca0c6 100755 --- a/flagai/data/dataset/seq2seq/dataset.py +++ b/flagai/data/dataset/seq2seq/dataset.py @@ -81,7 +81,6 @@ def blanklm_detokenize(string, is_target=False): class SummmaryProcessor: - def __init__(self, task_name, data_dir, tokenizer): self.task_name = task_name self.data_dir = data_dir @@ -147,7 +146,6 @@ def 
create_examples(self, split): class WSCProcessor: - def __init__(self, data_dir, tokenizer): self.data_dir = data_dir self.tokenizer = tokenizer @@ -189,7 +187,6 @@ def create_examples(self, split): class CMRCProcessor: - def __init__(self, data_dir, tokenizer): self.data_dir = data_dir self.tokenizer = tokenizer @@ -242,7 +239,6 @@ def create_examples(self, split): class SQuADProcessor: - def __init__(self, data_dir, tokenizer): self.data_dir = data_dir self.tokenizer = tokenizer @@ -299,7 +295,6 @@ def create_examples(self, split): class XSumProcessor: - def __init__(self, data_dir, tokenizer): self.data_dir = data_dir self.tokenizer = tokenizer @@ -374,7 +369,6 @@ def create_examples(self, split): class Seq2SeqDataset(torch.utils.data.Dataset): - def __init__(self, task_name='lang8_hsk', data_dir='./data', @@ -512,7 +506,6 @@ def sub_finder(mylist, pattern): target_tokens = target_tokens + [eop_id] if len(target_tokens) > self.max_tgt_length: target_tokens = target_tokens[:self.max_tgt_length] - target_truncated = True loss_mask = [1] * len(target_tokens) if len(target_tokens) < self.max_tgt_length: loss_mask += [0] * (self.max_tgt_length - len(target_tokens)) @@ -550,7 +543,6 @@ def sub_finder(mylist, pattern): class ExtractionDataset(torch.utils.data.Dataset): - def __init__(self, args, split, tokenizer): self.args = args task = args.task.lower() @@ -674,7 +666,6 @@ def pad_to(text, max_len, pad_id): class BlankLMDataset(torch.utils.data.Dataset): - def __init__(self, args, split, tokenizer): self.args = args task, data_dir = args.task.lower(), args.data_dir @@ -734,7 +725,7 @@ def __len__(self): def __getitem__(self, idx): example = self.example_list[idx] - source_text, target_text = example.text_a, example.text_b + source_text = example.text_a mask_token = 'gMASK' if self.args.task_mask else 'MASK' mask_id = self.tokenizer.get_command(mask_token).Id sop_id = self.tokenizer.get_command('sop').Id diff --git a/flagai/data/dataset/superglue/control.py b/flagai/data/dataset/superglue/control.py index f88156e1..6cddeecd 100644 --- a/flagai/data/dataset/superglue/control.py +++ b/flagai/data/dataset/superglue/control.py @@ -118,7 +118,6 @@ class SuperGlueProcessor: - def __init__(self): self.processdict = PROCESSOR_DICT @@ -142,13 +141,13 @@ def _download_data(self, dirname, dname): os.makedirs(dirname) with open(zip_file, "wb") as code: code.write(r.content) - except Exception as e: + except Exception: raise ConnectionError('Dataset downloading failure!') try: self._unzip_file(zip_file, dirname) os.remove(zip_file) - except Exception as e: + except Exception: raise ValueError('file unzip failure!') files = [f for f in os.listdir(dirname)] @@ -178,7 +177,6 @@ def get_processor(self, dirname, dname): class ExampleProcessor(DataProcessor): - def get_train_examples(self, data_dir): # Assign the filename of train set return self._create_examples(os.path.join(data_dir, "train.tsv"), diff --git a/flagai/data/dataset/superglue/dataset.py b/flagai/data/dataset/superglue/dataset.py index 74f5309f..18074883 100644 --- a/flagai/data/dataset/superglue/dataset.py +++ b/flagai/data/dataset/superglue/dataset.py @@ -16,7 +16,6 @@ class SuperGlueDataset(Dataset): - def __init__(self, task_name, data_dir, @@ -43,7 +42,7 @@ def __init__(self, """ try: from datasets import load_dataset - except Exception as e: + except Exception: raise Exception("datasets is required! 
pip install datasets") self.processor = SuperGlueProcessor().get_processor( diff --git a/flagai/data/dataset/superglue/processor.py b/flagai/data/dataset/superglue/processor.py index 6471b4ec..13b6cc28 100644 --- a/flagai/data/dataset/superglue/processor.py +++ b/flagai/data/dataset/superglue/processor.py @@ -31,7 +31,6 @@ class DataProcessor(ABC): Abstract class that provides methods for loading training, testing, development and unlabeled examples for a given task """ - def __init__(self, few_superglue): self.num_truncated = 0 @@ -120,7 +119,6 @@ def encode(self, example: InputExample, tokenizer, seq_length, args): class SuperGLUEProcessor(DataProcessor): - def __init__(self, few_superglue): super(SuperGLUEProcessor, self).__init__(few_superglue) self.few_superglue = few_superglue @@ -155,7 +153,6 @@ def _create_examples(self, *args, **kwargs): class RteProcessor(SuperGLUEProcessor): """Processor for the RTE data set.""" - def get_labels(self): return ["entailment", "not_entailment"] @@ -195,7 +192,6 @@ def _create_examples(self, class AxGProcessor(RteProcessor): """Processor for the AX-G diagnostic data set.""" - def get_train_examples(self, data_dir): return self._create_examples(os.path.join(data_dir, "AX-g.jsonl"), "train") @@ -207,7 +203,6 @@ def get_test_examples(self, data_dir): class AxBProcessor(RteProcessor): """Processor for the AX-B diagnostic data set.""" - def get_train_examples(self, data_dir): return self._create_examples(os.path.join(data_dir, "AX-b.jsonl"), "train") @@ -227,14 +222,12 @@ def _create_examples(self, class CbProcessor(RteProcessor): """Processor for the CB data set.""" - def get_labels(self): return ["entailment", "contradiction", "neutral"] class WicProcessor(SuperGLUEProcessor): """Processor for the WiC data set.""" - def get_labels(self): return ["false", "true"] @@ -319,7 +312,7 @@ def _create_examples(self, # candidates = list(set(candidates)) filtered = [] for i, cand in enumerate(candidates): - if not cand in candidates[:i]: + if cand not in candidates[:i]: filtered.append(cand) candidates = filtered @@ -394,7 +387,6 @@ def _create_examples(self, class BoolQProcessor(SuperGLUEProcessor): """Processor for the BoolQ data set.""" - def get_labels(self): return ["false", "true"] @@ -423,7 +415,6 @@ def _create_examples(path: str, set_type: str) -> List[InputExample]: class CopaProcessor(SuperGLUEProcessor): """Processor for the COPA data set.""" - def get_labels(self): return [0, 1] @@ -530,7 +521,6 @@ def _create_examples(path: str, set_type: str) -> List[InputExample]: class MultiRcProcessor(SuperGLUEProcessor): """Processor for the MultiRC data set.""" - def get_labels(self): return [0, 1] @@ -620,7 +610,6 @@ def get_classifier_input(self, example: InputExample, tokenizer): class RaceProcessor(DataProcessor): - @property def variable_num_choices(self): return True @@ -690,7 +679,6 @@ def clean_text(text): class RecordProcessor(SuperGLUEProcessor): """Processor for the ReCoRD data set.""" - def get_dev_examples(self, data_dir, for_train=False): return self._create_examples(os.path.join(data_dir, "val.jsonl"), "dev", @@ -869,7 +857,6 @@ def _create_examples(path, class MnliProcessor(DataProcessor): """Processor for the MultiNLI data set (GLUE version).""" - def get_train_examples(self, data_dir): return self._create_examples(os.path.join(data_dir, "train.tsv"), "train") @@ -908,7 +895,6 @@ def _create_examples(path: str, set_type: str) -> List[InputExample]: class CLUEProcessor(DataProcessor): - def get_train_examples(self, data_dir): return 
self._create_examples(os.path.join(data_dir, "train.json"), "train") @@ -971,7 +957,6 @@ def encode(self, example: InputExample, tokenizer, seq_length, args): class TNewsProcessor(CLUEProcessor): """Processor for the TNews data set (CLUE version).""" - def get_labels(self): return [ "100", "101", "102", "103", "104", "106", "107", "108", "109", @@ -1001,7 +986,6 @@ def _create_examples(path: str, set_type: str) -> List[InputExample]: class AFQMCProcessor(CLUEProcessor): """Processor for the AFQMC data set (CLUE version).""" - def get_labels(self): return ["0", "1"] @@ -1028,7 +1012,6 @@ def _create_examples(path: str, set_type: str) -> List[InputExample]: class MnliMismatchedProcessor(MnliProcessor): """Processor for the MultiNLI mismatched data set (GLUE version).""" - def get_dev_examples(self, data_dir, for_train=False): return self._create_examples( os.path.join(data_dir, "dev_mismatched.tsv"), "dev_mismatched") @@ -1040,7 +1023,6 @@ def get_test_examples(self, data_dir) -> List[InputExample]: class AgnewsProcessor(DataProcessor): """Processor for the AG news data set.""" - def get_train_examples(self, data_dir): return self._create_examples(os.path.join(data_dir, "train.csv"), "train") @@ -1081,7 +1063,6 @@ def _create_examples(path: str, set_type: str) -> List[InputExample]: class YahooAnswersProcessor(DataProcessor): """Processor for the Yahoo Answers data set.""" - def get_train_examples(self, data_dir): return self._create_examples(os.path.join(data_dir, "train.csv"), "train") @@ -1126,7 +1107,6 @@ def _create_examples(path: str, set_type: str) -> List[InputExample]: class YelpPolarityProcessor(DataProcessor): """Processor for the YELP binary classification set.""" - def get_train_examples(self, data_dir): return self._create_examples(os.path.join(data_dir, "train.csv"), "train") @@ -1163,7 +1143,6 @@ def _create_examples(path: str, set_type: str) -> List[InputExample]: class YelpFullProcessor(YelpPolarityProcessor): """Processor for the YELP full classification set.""" - def get_test_examples(self, data_dir) -> List[InputExample]: raise NotImplementedError() @@ -1173,7 +1152,6 @@ def get_labels(self): class XStanceProcessor(DataProcessor): """Processor for the X-Stance data set.""" - def __init__(self, args, language: str = None): super().__init__(args) if language is not None: @@ -1220,7 +1198,6 @@ def _create_examples(self, path: str) -> List[InputExample]: class Sst2Processor(DataProcessor): - def get_train_examples(self, data_dir): return self._create_examples(os.path.join(data_dir, "train.tsv"), "train") @@ -1251,7 +1228,6 @@ def _create_examples(path: str, set_type: str) -> List[InputExample]: class ColaProcessor(Sst2Processor): - def get_labels(self): return ["0", "1"] @@ -1278,7 +1254,6 @@ def _create_examples(path: str, set_type: str) -> List[InputExample]: class MrpcProcessor(Sst2Processor): - def get_labels(self): return ["0", "1"] @@ -1302,7 +1277,6 @@ def _create_examples(path: str, set_type: str) -> List[InputExample]: class QqpProcessor(Sst2Processor): - def get_labels(self): return ["0", "1"] @@ -1326,7 +1300,6 @@ def _create_examples(path: str, set_type: str) -> List[InputExample]: class QnliProcessor(Sst2Processor): - def get_labels(self): return ["entailment", "not_entailment"] @@ -1350,7 +1323,6 @@ def _create_examples(path: str, set_type: str) -> List[InputExample]: class SquadProcessor(DataProcessor): - def get_train_examples(self, data_dir): return self._create_examples(os.path.join(data_dir, "train-v2.0.json"), "train") @@ -1386,7 +1358,6 @@ def 
_create_examples(path: str, set_type: str) -> List[InputExample]: class CLUEWSCProcessor(DataProcessor): - def get_train_examples(self, data_dir): return self._create_examples(os.path.join(data_dir, "train.json"), "train") @@ -1438,7 +1409,6 @@ def _create_examples(self, class CMRCProcessor(DataProcessor): - def get_train_examples(self, data_dir): return self._create_examples(os.path.join(data_dir, "train.json"), "train") @@ -1487,7 +1457,6 @@ def _create_examples(path: str, set_type: str) -> List[InputExample]: class WankeProcessor(CLUEProcessor): """Processor for the TNews data set (CLUE version).""" - def get_train_examples(self, data_dir): return self._create_examples(os.path.join(data_dir, "train.tsv"), "train") @@ -1514,9 +1483,7 @@ def _create_examples(path: str, set_type: str) -> List[InputExample]: if idx == 0: continue guid = f"{set_type}-{idx}" - heads = fin.readline().split('\t') items = [e.strip() for e in line.split('\t')] - # data = json.loads(line) text_a = items[0] label = items[1] example = InputExample(guid=guid, @@ -1529,7 +1496,6 @@ def _create_examples(path: str, set_type: str) -> List[InputExample]: class CRMC2018Processor(SuperGLUEProcessor): """Processor for the MultiRC data set.""" - def get_labels(self): return [0, 1] diff --git a/flagai/data/dataset/superglue/pvp.py b/flagai/data/dataset/superglue/pvp.py index c0a1cfa4..26fc3b3e 100644 --- a/flagai/data/dataset/superglue/pvp.py +++ b/flagai/data/dataset/superglue/pvp.py @@ -34,7 +34,6 @@ class PVP(ABC): This class contains functions to apply patterns and verbalizers as required by PET. Each task requires its own custom implementation of a PVP. """ - def __init__(self, args, tokenizer, @@ -558,7 +557,6 @@ def verbalize(label) -> List[str]: class CopaPVP(PVP): - @staticmethod def available_patterns(): return [0, 1] @@ -705,7 +703,6 @@ def encode(self, class WscPVP(PVP): - @staticmethod def available_patterns(): return [0, 1, 2] @@ -864,7 +861,6 @@ def verbalize(self, label) -> List[str]: class RecordPVP(PVP): - @property def is_multi_token(self): return True @@ -887,7 +883,6 @@ def verbalize(self, label) -> List[str]: class RacePVP(PVP): - @property def is_multi_token(self): return True @@ -1543,7 +1538,6 @@ def verbalize(self, label) -> List[str]: class SquadPVP(PVP): - @property def is_multi_token(self): return True @@ -1617,7 +1611,6 @@ def verbalize(self, label) -> List[str]: class CMRCPVP(PVP): - @staticmethod def available_patterns(): return [0, 1] diff --git a/flagai/data/file_utils.py b/flagai/data/file_utils.py index af053efe..d0d9b317 100644 --- a/flagai/data/file_utils.py +++ b/flagai/data/file_utils.py @@ -8,7 +8,6 @@ """ from __future__ import (absolute_import, division, print_function, unicode_literals) - import json import logging import os @@ -133,7 +132,6 @@ def s3_request(func): Wrapper function for s3 requests in order to create more helpful error messages. """ - @wraps(func) def wrapper(url, *args, **kwargs): try: diff --git a/flagai/data/tokenizer/bert/bert_tokenizer.py b/flagai/data/tokenizer/bert/bert_tokenizer.py index d18944e6..f9f795f3 100644 --- a/flagai/data/tokenizer/bert/bert_tokenizer.py +++ b/flagai/data/tokenizer/bert/bert_tokenizer.py @@ -42,7 +42,6 @@ class BertWordPieceTokenizer(Tokenizer): Loads a pretrained WordPiece tokenizer from `cache_dir` for tokenization in BERT training. Default to bert-large-uncased tokenizer. 
""" - def __init__(self, tokenizer_model_type=None, cache_dir=None): # default to bert-large-uncased tokenizer if tokenizer_model_type not in PRETRAINED_VOCAB_ARCHIVE_MAP: @@ -84,7 +83,6 @@ def encode_plus( truncation=True, max_length=None, ): - def get_input_ids(text): tokens = self.text_tokenizer.tokenize(text) return self.text_tokenizer.convert_tokens_to_ids(tokens) diff --git a/flagai/data/tokenizer/bert/wordpiece.py b/flagai/data/tokenizer/bert/wordpiece.py index ec9cbc15..7f95cad5 100644 --- a/flagai/data/tokenizer/bert/wordpiece.py +++ b/flagai/data/tokenizer/bert/wordpiece.py @@ -73,7 +73,6 @@ def whitespace_tokenize(text): class BertTokenizer(object): """Runs end-to-end tokenization: punctuation splitting + wordpiece""" - def __init__(self, vocab_file, do_lower_case=True, @@ -190,7 +189,6 @@ def from_pretrained(cls, if cache_dir is None: cache_dir = os.path.join(os.path.dirname(__file__), 'vocabs') # TODO: the logic of file cache is not clear - #resolved_vocab_file = cached_path(vocab_file, cache_dir=cache_dir) resolved_vocab_file = os.path.join(cache_dir, vocab_file) except EnvironmentError: logger.error( @@ -279,7 +277,6 @@ def encode_plus( truncation=True, max_length=None, ): - def get_input_ids(text): tokens = self.tokenize(text) return self.convert_tokens_to_ids(tokens) @@ -420,7 +417,6 @@ def _is_special(ch): class BasicTokenizer(object): """Runs basic tokenization (punctuation splitting, lower casing, etc.).""" - def __init__(self, do_lower_case=True, never_split=("[UNK]", "[SEP]", "[PAD]", "[CLS]", "[MASK]")): @@ -554,7 +550,6 @@ def _clean_text(self, text): class WordpieceTokenizer(object): """Runs WordPiece tokenization.""" - def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=100): self.vocab = vocab self.unk_token = unk_token diff --git a/flagai/data/tokenizer/glm_10b_en/glm_10b_en_bpe_tokenizer.py b/flagai/data/tokenizer/glm_10b_en/glm_10b_en_bpe_tokenizer.py index 52902d1f..b6ad2190 100644 --- a/flagai/data/tokenizer/glm_10b_en/glm_10b_en_bpe_tokenizer.py +++ b/flagai/data/tokenizer/glm_10b_en/glm_10b_en_bpe_tokenizer.py @@ -20,7 +20,6 @@ class GLM10bENBPETokenizer(GLMTokenizer): - def __init__(self, tokenizer_model_type="gpt2", add_block_symbols=False, diff --git a/flagai/data/tokenizer/glm_10b_en/glm_10b_en_tokenizer.py b/flagai/data/tokenizer/glm_10b_en/glm_10b_en_tokenizer.py index b621fe00..1e354924 100644 --- a/flagai/data/tokenizer/glm_10b_en/glm_10b_en_tokenizer.py +++ b/flagai/data/tokenizer/glm_10b_en/glm_10b_en_tokenizer.py @@ -105,7 +105,6 @@ class GLM10bENTokenizer(object): GPT-2 BPE tokenizer. 
Peculiarities: - Byte-level BPE """ - @classmethod def from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs): """ diff --git a/flagai/data/tokenizer/glm_large_ch/glm_large_ch.py b/flagai/data/tokenizer/glm_large_ch/glm_large_ch.py index c0dd6578..02680289 100644 --- a/flagai/data/tokenizer/glm_large_ch/glm_large_ch.py +++ b/flagai/data/tokenizer/glm_large_ch/glm_large_ch.py @@ -161,9 +161,10 @@ def get_encoder(encoder_file, bpe_file): def from_pretrained(pretrained_model_file=None): vocab_file = 'cog-pretrain.vocab' model_file = 'cog-pretrain.model' - cache_dir = os.path.join(os.path.dirname(__file__), 'vocabs') - model_id = _get_model_id("glm_large_ch") - _get_vocab_path(cache_dir, vocab_file, model_id, rank=0) - _get_vocab_path(cache_dir, model_file, model_id, rank=0) + if pretrained_model_file is None: + cache_dir = os.path.join(os.path.dirname(__file__), 'vocabs') + model_id = _get_model_id("glm_large_ch") + _get_vocab_path(cache_dir, vocab_file, model_id, rank=0) + _get_vocab_path(cache_dir, model_file, model_id, rank=0) return get_encoder(pretrained_model_file, "") diff --git a/flagai/data/tokenizer/glm_large_ch/glm_large_ch_tokenizer.py b/flagai/data/tokenizer/glm_large_ch/glm_large_ch_tokenizer.py index f3faec65..cde22a39 100644 --- a/flagai/data/tokenizer/glm_large_ch/glm_large_ch_tokenizer.py +++ b/flagai/data/tokenizer/glm_large_ch/glm_large_ch_tokenizer.py @@ -24,6 +24,7 @@ class GLMLargeChTokenizer(GLMTokenizer): def __init__(self, + vocab_path=None, add_block_symbols=True, add_task_mask=True, add_decoder_mask=True, @@ -40,7 +41,7 @@ def __init__(self, fix_command_token: (bool) When add_task_mask, setting fix_command """ - self.text_tokenizer = glm_large_ch.from_pretrained() + self.text_tokenizer = glm_large_ch.from_pretrained(vocab_path) self.num_command_tokens = 0 self.num_text_tokens = self.text_tokenizer.sp.vocab_size() self.num_tokens = self.num_text_tokens @@ -133,59 +134,58 @@ def __init__(self, def _encode(self, text): ids = self.text_tokenizer.encode(text) return ids + def encode_plus( #for Seq2seq + self, + source_text, + target_text=None, + ): - def encode_plus( #for Seq2seq - self, - source_text, - target_text=None, - ): - - sop_id = self.get_command('sop').Id #start of piece - eop_id = self.get_command('eop').Id #end of piece - sep_id = self.get_command('sep').Id #seperation - + sop_id = self.get_command('sop').Id #start of piece + eop_id = self.get_command('eop').Id #end of piece + sep_id = self.get_command('sep').Id #seperation + source_tokens = self.EncodeAsIds(source_text) - source_tokens = [sop_id] + source_tokens + [sep_id] - + source_tokens = [sop_id] + source_tokens + [sep_id] + # no pading for consistency - len_source = len(source_tokens) - sop_pos = source_tokens.index(sop_id) - loss_mask = [0] * len_source - block_position_ids = [0] * len_source + len_source = len(source_tokens) + sop_pos = source_tokens.index(sop_id) + loss_mask = [0]*len_source + block_position_ids = [0]*len_source position_ids = list(range(len_source)) if target_text: target_tokens = self.EncodeAsIds(target_text) target_tokens = target_tokens + [eop_id] - loss_mask += [1] * len(target_tokens) - block_position_ids += [0] * len(target_tokens) - position_ids += [x + len_source for x in range(len(target_tokens))] + loss_mask += [1] * len(target_tokens) + block_position_ids += [0]*len(target_tokens) + position_ids += [x+len_source for x in range(len(target_tokens))] tokens = source_tokens + target_tokens position_ids = [position_ids[:-1], block_position_ids[:-1]] sample = 
{ 'input_ids': tokens[:-1], 'target_ids': tokens[1:], 'attention_mask': sop_pos, - 'loss_mask': loss_mask[:-1], + 'loss_mask': loss_mask[:-1], "position_ids": position_ids } else: position_ids = [position_ids, block_position_ids] sample = { - 'input_ids': source_tokens, - 'attention_mask': sop_pos, - "position_ids": position_ids, + 'input_ids': source_tokens, + 'attention_mask': sop_pos, + "position_ids": position_ids, 'loss_mask': loss_mask, } - return sample + return sample - def MultiWordId(self, exception=None): + def MultiWordId(self,exception=None): #get multi word tokens' ids #return ids list #exception token: string list - result = [] + result=[] for i in range(self.num_text_tokens): - word = self.IdToToken(i) + word=self.IdToToken(i) if exception: if word not in exception and len(word) > 2: result.append(i) @@ -193,17 +193,15 @@ def MultiWordId(self, exception=None): if len(word) > 2: result.append(i) return result - - def CommandTokenIds(self, exception=None): + def CommandTokenIds(self,exception=None): #get command tokens' ids #return ids list #exception token: string list - result = [] + result=[] for s in self._command_tokens: if not exception or (exception and s.name not in exception): result.append(s.Id) - return (result) - + return(result) def EncodeAsTokens(self, text, process_fn=None): processed_text = text if process_fn is not None: diff --git a/flagai/data/tokenizer/glm_large_en/glm_large_en_tokenizer.py b/flagai/data/tokenizer/glm_large_en/glm_large_en_tokenizer.py index 6ba07365..4e4081c8 100644 --- a/flagai/data/tokenizer/glm_large_en/glm_large_en_tokenizer.py +++ b/flagai/data/tokenizer/glm_large_en/glm_large_en_tokenizer.py @@ -23,7 +23,6 @@ class GLMLargeEnWordPieceTokenizer(GLMTokenizer): Loads a pretrained WordPiece tokenizer from `cache_dir` for tokenization in BERT training. Default to bert-large-uncased tokenizer. """ - def __init__(self, tokenizer_model_type='glm_large_en', cache_dir=None, diff --git a/flagai/data/tokenizer/glm_large_en/wordpiece.py b/flagai/data/tokenizer/glm_large_en/wordpiece.py index ecd90ab5..49a021d2 100644 --- a/flagai/data/tokenizer/glm_large_en/wordpiece.py +++ b/flagai/data/tokenizer/glm_large_en/wordpiece.py @@ -53,7 +53,6 @@ def whitespace_tokenize(text): class GLMLargeEnTokenizer(object): """Runs end-to-end tokenization: punctuation splitting + wordpiece""" - def __init__(self, vocab_file, do_lower_case=True, @@ -130,7 +129,7 @@ def from_pretrained(cls, Download and cache the pre-trained model file if needed. 
""" vocab_file = 'vocab.txt' - if cache_dir == None: + if cache_dir is None: cache_dir = os.path.join(os.path.dirname(__file__), 'vocabs') model_id = _get_model_id(pretrained_model_name_or_path) @@ -147,7 +146,6 @@ def from_pretrained(cls, class BasicTokenizer(object): """Runs basic tokenization (punctuation splitting, lower casing, etc.).""" - def __init__(self, do_lower_case=True, never_split=("[UNK]", "[SEP]", "[PAD]", "[CLS]", "[MASK]")): @@ -263,7 +261,6 @@ def _clean_text(self, text): class WordpieceTokenizer(object): """Runs WordPiece tokenization.""" - def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=100): self.vocab = vocab self.unk_token = unk_token diff --git a/flagai/data/tokenizer/roberta/roberta_tokenizer.py b/flagai/data/tokenizer/roberta/roberta_tokenizer.py index 4ed21cb6..ad118c6f 100644 --- a/flagai/data/tokenizer/roberta/roberta_tokenizer.py +++ b/flagai/data/tokenizer/roberta/roberta_tokenizer.py @@ -21,7 +21,6 @@ class ROBERTATokenizer(Tokenizer): - def __init__(self, tokenizer_model_type="roberta-base", cache_dir=None): self.text_tokenizer = RobertaTokenizer.from_pretrained( tokenizer_model_type, cache_dir=cache_dir) diff --git a/flagai/data/tokenizer/t5/t5_pegasus_tokenizer.py b/flagai/data/tokenizer/t5/t5_pegasus_tokenizer.py index 8a062670..87cc7020 100644 --- a/flagai/data/tokenizer/t5/t5_pegasus_tokenizer.py +++ b/flagai/data/tokenizer/t5/t5_pegasus_tokenizer.py @@ -4,7 +4,6 @@ class T5PegasusTokenizer(Tokenizer): - def __init__(self, vocab_path, pre_tokenizer=lambda x: jieba.cut(x, HMM=False), @@ -34,9 +33,8 @@ def encode_plus( truncation, max_length) -## TODO T5BatchPegasusTokenizer could be mereged into T5PegasusTokenizer +# TODO T5BatchPegasusTokenizer could be mereged into T5PegasusTokenizer class T5BatchPegasusTokenizer(BertTokenizer): - def __init__(self, pre_tokenizer=lambda x: jieba.cut(x), *args, **kwargs): super().__init__(*args, **kwargs) self.pre_tokenizer = pre_tokenizer diff --git a/flagai/data/tokenizer/t5/t5_tokenizer.py b/flagai/data/tokenizer/t5/t5_tokenizer.py index f7752750..97ac7d4c 100644 --- a/flagai/data/tokenizer/t5/t5_tokenizer.py +++ b/flagai/data/tokenizer/t5/t5_tokenizer.py @@ -25,7 +25,6 @@ class T5BPETokenizer(Tokenizer): - def __init__(self, tokenizer_model_type="t5-base", cache_dir=None): self.text_tokenizer = T5Tokenizer.from_pretrained(tokenizer_model_type, @@ -47,8 +46,7 @@ def _encode(self, text): return ids -class T5BPETokenizer(Tokenizer): - +class T5KGBPETokenizer(Tokenizer): def __init__(self, tokenizer_model_type="t5-base", cache_dir=None): """初始化 """ @@ -186,7 +184,6 @@ def _tokenize(self, text): class T5JiebaTokenizer(T5BPETokenizer): - def __init__(self, token_dict, pre_tokenizer=lambda x: jieba.cut(x, HMM=False)): @@ -203,9 +200,7 @@ def __init__(self, setattr(self, "_token_" + str(token) + "_id", _token_id) self.token_start_id = self._token_cls_id self.token_end_id = self._token_sep_id - except Exception as e: - # print(e) - # print("err") + except Exception: pass self._vocab_size = len(token_dict) @@ -251,7 +246,7 @@ def decode(self, ids, tokens=None): tokens = tokens or self.ids_to_tokens(ids) tokens = [token for token in tokens if not self._is_special(token)] - text, flag = '', False + text = '' for i, token in enumerate(tokens): if token[:2] == '##': text += token[2:] diff --git a/flagai/data/tokenizer/tokenizer.py b/flagai/data/tokenizer/tokenizer.py index 25b9a944..8bec4e8c 100644 --- a/flagai/data/tokenizer/tokenizer.py +++ b/flagai/data/tokenizer/tokenizer.py @@ -31,7 +31,6 @@ def 
prep_command_tokens(tokenlist, token_format=token_format): class CommandToken(object): - def __init__(self, name, token, Id, lstrip=False, rstrip=False): self.name = name self.token = token @@ -67,7 +66,6 @@ def prep_type_tokens(tokenlist, token_format=token_format): class TypeToken(object): - def __init__(self, name, token, Id): self.name = name self.token = token @@ -103,7 +101,6 @@ class GLMTokenizer(object): Token types are stored in a separate mapping of size `len(type_tokens)`. """ - def __init__(self, text_tokenizer, command_tokens=None, type_tokens=None): # set text tokenizer self.text_tokenizer = text_tokenizer @@ -377,7 +374,6 @@ class Tokenizer(object): Token types are stored in a separate mapping of size `len(type_tokens)`. """ - def __init__(self, text_tokenizer): # set text tokenizer self.text_tokenizer = text_tokenizer @@ -457,7 +453,6 @@ class TextTokenizer(object): """ Interface for text tokenizer """ - def __init__(self): if not hasattr(self, 'num_text_tokens'): self.num_text_tokens = 0 diff --git a/flagai/fp16/fp16.py b/flagai/fp16/fp16.py index 6db7605a..435e8012 100755 --- a/flagai/fp16/fp16.py +++ b/flagai/fp16/fp16.py @@ -38,7 +38,6 @@ def conversion_helper(val, conversion): def fp32_to_fp16(val): """Convert fp32 `val` to fp16""" - def half_conversion(val): val_typecheck = val if isinstance(val_typecheck, (Parameter, Variable)): @@ -52,7 +51,6 @@ def half_conversion(val): def fp16_to_fp32(val): """Convert fp16 `val` to fp32""" - def float_conversion(val): val_typecheck = val if isinstance(val_typecheck, (Parameter, Variable)): @@ -65,7 +63,6 @@ def float_conversion(val): class FP16_Module(nn.Module): - def __init__(self, module): super(FP16_Module, self).__init__() self.add_module('module', module.half()) @@ -86,7 +83,7 @@ def load_state_dict(self, state_dict, strict=True): # TODO: Update overflow check + downscale to use Carl's fused kernel. class FP16_Optimizer(object): """ - :class:`FP16_Optimizer` is designed to wrap an existing PyTorch optimizer, + :class:`FP16_Optimizer` is designed to wrap an existing PyTorch optimizer, and manage static or dynamic loss scaling and master weights in a manner transparent to the user. For standard use, only two lines must be changed: creating the :class:`FP16_Optimizer` instance, and changing the call to ``backward``. @@ -109,45 +106,45 @@ class FP16_Optimizer(object): optimizer = FP16_Optimizer(optimizer, dynamic_loss_scale=True) # optional arg to control dynamic loss scaling behavior # dynamic_loss_args={'scale_window' : 500}) - # Usually, dynamic_loss_args is not necessary. + # Usually, dynamic_loss_args is not necessary. Args: - init_optimizer (torch.optim.optimizer): Existing optimizer created with the parameters to optimize. Internally, :class:`FP16_Optimizer` replaces the passed optimizer's fp16 parameters, if any, with fp32 master parameters copied from the original ones. :class:`FP16_Optimizer` also stores references to the original fp16 parameters, and updates these fp16 parameters from the master fp32 copy at the end of each :attr:`step`. + init_optimizer (torch.optim.optimizer): Existing optimizer created with the parameters to optimize. Internally, :class:`FP16_Optimizer` replaces the passed optimizer's fp16 parameters, if any, with fp32 master parameters copied from the original ones. :class:`FP16_Optimizer` also stores references to the original fp16 parameters, and updates these fp16 parameters from the master fp32 copy at the end of each :attr:`step`. 
static_loss_scale (float, optional, default=1.0): Loss scale used internally to scale gradients computed by the model. Any fp16 gradients will be copied to fp32, then downscaled before being applied to the fp32 master params, so ``static_loss_scale`` should not affect learning rate. dynamic_loss_scale (bool, optional, default=False): Use dynamic loss scaling. If True, this will override any ``static_loss_scale`` option. dynamic_loss_args (dict, optional, default=None): Dict of kwargs that will be forwarded to the internal :class:`DynamicLossScaler` instance's constructor. Keys of this dict must match kwargs accepted by :class:`DynamicLossScaler`'s constructor. If ``dynamic_loss_args`` is unspecified, :class:`DynamicLossScaler`'s defaults will be used. verbose (bool, optional, default=True): By default, FP16_Optimizer's constructor prints out the parameters and parameter groups it is ingesting, as a sanity check. If this becomes annoying (e.g. for large models), it can be disabled by passing ``verbose=False``. ``verbose=False`` will not disable printing when the loss scale is readjusted during dynamic loss scaling. - ``init_optimizer`` is expected to have been constructed in the ordinary way. - It is recommended (although not required) that the newly constructed :class:`FP16_Optimizer` instance be - named to replace ``init_optimizer``, for two reasons: + ``init_optimizer`` is expected to have been constructed in the ordinary way. + It is recommended (although not required) that the newly constructed :class:`FP16_Optimizer` instance be + named to replace ``init_optimizer``, for two reasons: First, it means that references to the same name - later in the file will not have to change. - Second, :class:`FP16_Optimizer` reserves the right (as an implementation detail) to + later in the file will not have to change. + Second, :class:`FP16_Optimizer` reserves the right (as an implementation detail) to modify ``init_optimizer``. If you do choose a unique name for the new :class:`FP16_Optimizer` instance, you should only work with this new instance, because the preexisting optimizer might no longer behave as expected. - ``init_optimizer`` may be any Pytorch optimizer. - It may contain a mixture of fp16 and fp32 parameters organized into any number of - ``param_groups`` with different hyperparameters. The :class:`FP16_Optimizer` constructor will - ingest these ``param_groups`` and remember them. + ``init_optimizer`` may be any Pytorch optimizer. + It may contain a mixture of fp16 and fp32 parameters organized into any number of + ``param_groups`` with different hyperparameters. The :class:`FP16_Optimizer` constructor will + ingest these ``param_groups`` and remember them. Calls to :: - loss.backward() + loss.backward() must be replaced with :: - optimizer.backward(loss) + optimizer.backward(loss) - because :class:`FP16_Optimizer` requires ownership of the backward pass to implement + because :class:`FP16_Optimizer` requires ownership of the backward pass to implement loss scaling and copies to master gradients. .. note:: Loss scaling, either static or dynamic, is orthogonal to learning rate, because gradients are downscaled before being applied. This means that adjusting the loss scale, or using - dynamic loss scaling, should not require retuning the learning rate or any other + dynamic loss scaling, should not require retuning the learning rate or any other hyperparameters. @@ -157,7 +154,7 @@ class FP16_Optimizer(object): See docstring for :attr:`step`. 
**Gradient clipping**: Use :attr:`clip_master_grads`. - + **Multiple losses**: If your model accumulates gradients from multiple losses, this can be made more efficient by supplying ``update_master_grads=False`` to :attr:`backward`. See docstring for :attr:`backward`. @@ -168,16 +165,15 @@ class FP16_Optimizer(object): optimizer.loss_scale = new_loss_scale For static loss scaling, manually adjusting the loss scale over time is a reasonable - thing to do. During later epochs, gradients may become smaller, and a + thing to do. During later epochs, gradients may become smaller, and a higher loss scale may be required, analogous to scheduling the learning rate. Dynamic loss - scaling is more subtle (see :class:`DynamicLossScaler`) and in this case, manually adjusting + scaling is more subtle (see :class:`DynamicLossScaler`) and in this case, manually adjusting the loss scale is not recommended. **Multi_GPU training**: If the wrapped ``init_optimizer`` was created from a model wrapped in - Pytorch DistributedDataParallel or Apex DistributedDataParallel, :class:`FP16_Optimizer` + Pytorch DistributedDataParallel or Apex DistributedDataParallel, :class:`FP16_Optimizer` should still work as intended. """ - def __init__(self, init_optimizer, static_loss_scale=1.0, @@ -381,9 +377,9 @@ def state_dict(self): def load_state_dict(self, state_dict): """ - Loads a state_dict created by an earlier call to state_dict(). - If ``fp16_optimizer_instance`` was constructed from some ``init_optimizer``, - whose parameters in turn came from ``model``, it is expected that the user + Loads a state_dict created by an earlier call to state_dict(). + If ``fp16_optimizer_instance`` was constructed from some ``init_optimizer``, + whose parameters in turn came from ``model``, it is expected that the user will call ``model.load_state_dict()`` before ``fp16_optimizer_instance.load_state_dict()`` is called. @@ -425,14 +421,14 @@ def load_state_dict(self, state_dict): def step(self, closure=None): # could add clip option. """ - If no closure is supplied, :attr:`step` should be called after + If no closure is supplied, :attr:`step` should be called after ``fp16_optimizer_obj.backward(loss)``. :attr:`step` updates the fp32 master copy of parameters using the optimizer supplied to :class:`FP16_Optimizer`'s constructor, then copies the updated fp32 params into the fp16 params originally referenced by :class:`FP16_Optimizer`'s constructor, so the user may immediately run another forward pass using their model. - If a closure is supplied, :attr:`step` may be called without a prior call to + If a closure is supplied, :attr:`step` may be called without a prior call to :attr:`backward(loss)`. This control flow is identical to `ordinary Pytorch optimizer use`_ with closures. However, the user should take care that any ``loss.backward()`` call within the closure @@ -443,7 +439,7 @@ def step(self, closure=None): # could add clip option. Example with closure:: - # optimizer is assumed to be an FP16_Optimizer object, previously constructed from an + # optimizer is assumed to be an FP16_Optimizer object, previously constructed from an # existing pytorch optimizer. 
for input, target in dataset: def closure(): @@ -481,7 +477,6 @@ def closure(): return retval def _step_with_closure(self, closure): - def wrapped_closure(): # helpful for debugging # print("Calling wrapped_closure, first_closure_call_this_step = {}" @@ -522,7 +517,7 @@ def wrapped_closure(): return retval def backward(self, loss, update_master_grads=True, retain_graph=False): - """ + """ :attr:`backward` performs the following conceptual steps: 1. fp32_loss = loss.float() (see first Note below) @@ -536,19 +531,19 @@ def backward(self, loss, update_master_grads=True, retain_graph=False): .. note:: :attr:`backward` internally converts the loss to fp32 before applying the loss scale. - This provides some additional safety against overflow if the user has supplied an - fp16 loss value. + This provides some additional safety against overflow if the user has supplied an + fp16 loss value. However, for maximum overflow safety, the user should - compute the loss criterion (MSE, cross entropy, etc) in fp32 before supplying it to + compute the loss criterion (MSE, cross entropy, etc) in fp32 before supplying it to :attr:`backward`. .. warning:: - The gradients found in a model's leaves after the call to - :attr:`backward` should not be regarded as valid in general, - because it's possible - they have been scaled (and in the case of dynamic loss scaling, - the scale factor may change over time). - If the user wants to inspect gradients after a call to :attr:`backward`, + The gradients found in a model's leaves after the call to + :attr:`backward` should not be regarded as valid in general, + because it's possible + they have been scaled (and in the case of dynamic loss scaling, + the scale factor may change over time). + If the user wants to inspect gradients after a call to :attr:`backward`, only the master gradients should be regarded as valid. These can be retrieved via :attr:`inspect_master_grad_data()`. @@ -563,13 +558,13 @@ def backward(self, loss, update_master_grads=True, retain_graph=False): optimizer.backward(loss) # Naive operation with multiple losses (technically valid, but less efficient): - # fp32 grads will be correct after the second call, but + # fp32 grads will be correct after the second call, but # the first call incurs an unnecessary fp16->fp32 grad copy. optimizer.backward(loss1) optimizer.backward(loss2) # More efficient way to handle multiple losses: - # The fp16->fp32 grad copy is delayed until fp16 grads from all + # The fp16->fp32 grad copy is delayed until fp16 grads from all # losses have been accumulated. optimizer.backward(loss1, update_master_grads=False) optimizer.backward(loss2, update_master_grads=False) @@ -585,26 +580,27 @@ def backward(self, loss, update_master_grads=True, retain_graph=False): def update_master_grads(self): """ - Copy the ``.grad`` attribute from stored references to fp16 parameters to - the ``.grad`` attribute of the fp32 master parameters that are directly + Copy the ``.grad`` attribute from stored references to fp16 parameters to + the ``.grad`` attribute of the fp32 master parameters that are directly updated by the optimizer. :attr:`update_master_grads` only needs to be called if ``fp16_optimizer_obj.backward`` was called with ``update_master_grads=False``. 
""" if self.dynamic_loss_scale: self._check_overflow() - if self.overflow: return + if self.overflow: + return self._model_grads_to_master_grads() self._downscale_master() def inspect_master_grad_data(self): """ - When running with :class:`FP16_Optimizer`, + When running with :class:`FP16_Optimizer`, ``.grad`` attributes of a model's fp16 leaves should not be - regarded as truthful, because they might be scaled. + regarded as truthful, because they might be scaled. After a call to :attr:`fp16_optimizer_obj.backward(loss)`, if no overflow was encountered, the fp32 master params' ``.grad`` - attributes will contain valid gradients properly divided by the loss scale. However, - because :class:`FP16_Optimizer` flattens some parameters, accessing them may be + attributes will contain valid gradients properly divided by the loss scale. However, + because :class:`FP16_Optimizer` flattens some parameters, accessing them may be nonintuitive. :attr:`inspect_master_grad_data` allows those gradients to be viewed with shapes corresponding to their associated model leaves. diff --git a/flagai/fp16/fp16util.py b/flagai/fp16/fp16util.py index efe29fc2..4a4b1837 100644 --- a/flagai/fp16/fp16util.py +++ b/flagai/fp16/fp16util.py @@ -26,7 +26,6 @@ class tofp16(nn.Module): def forward(self, input): return input.half() """ - def __init__(self): super(tofp16, self).__init__() @@ -90,7 +89,6 @@ class FP16Model(nn.Module): """ Convert model to half precision in a batchnorm-safe way. """ - def __init__(self, network): super(FP16Model, self).__init__() self.network = convert_network(network, dtype=torch.half) @@ -162,7 +160,7 @@ def model_grads_to_master_grads(model_params, master_params, flat_master=False): """ - Copy model gradients to master gradients. + Copy model gradients to master gradients. Args: model_params: List of model parameters created by :func:`prep_param_lists`. @@ -222,7 +220,3 @@ def to_python_float(t): import torch.nn.utils as mpu clip_grad_norm = mpu.clip_grad_norm -#elif TORCH_MAJOR == 0 and TORCH_MINOR <= 4: -# clip_grad_norm = torch.nn.utils.clip_grad_norm -#else: -# clip_grad_norm = torch.nn.utils.clip_grad_norm_ diff --git a/flagai/fp16/loss_scaler.py b/flagai/fp16/loss_scaler.py index dbc003aa..1c310dcb 100755 --- a/flagai/fp16/loss_scaler.py +++ b/flagai/fp16/loss_scaler.py @@ -30,13 +30,12 @@ class LossScaler: Class that manages a static loss scale. This class is intended to interact with :class:`FP16_Optimizer`, and should not be directly manipulated by the user. - Use of :class:`LossScaler` is enabled via the ``static_loss_scale`` argument to + Use of :class:`LossScaler` is enabled via the ``static_loss_scale`` argument to :class:`FP16_Optimizer`'s constructor. Args: scale (float, optional, default=1.0): The loss scale. """ - def __init__(self, scale=1): self.cur_scale = scale @@ -66,7 +65,7 @@ def backward(self, loss, retain_graph=False): class DynamicLossScaler: """ Class that manages dynamic loss scaling. It is recommended to use :class:`DynamicLossScaler` - indirectly, by supplying ``dynamic_loss_scale=True`` to the constructor of + indirectly, by supplying ``dynamic_loss_scale=True`` to the constructor of :class:`FP16_Optimizer`. However, it's important to understand how :class:`DynamicLossScaler` operates, because the default options can be changed using the the ``dynamic_loss_args`` argument to :class:`FP16_Optimizer`'s constructor. 
@@ -74,21 +73,20 @@ class DynamicLossScaler: Loss scaling is designed to combat the problem of underflowing gradients encountered at long times when training fp16 networks. Dynamic loss scaling begins by attempting a very high loss scale. Ironically, this may result in OVERflowing gradients. If overflowing gradients are - encountered, :class:`DynamicLossScaler` informs :class:`FP16_Optimizer` that an overflow has + encountered, :class:`DynamicLossScaler` informs :class:`FP16_Optimizer` that an overflow has occurred. :class:`FP16_Optimizer` then skips the update step for this particular iteration/minibatch, - and :class:`DynamicLossScaler` adjusts the loss scale to a lower value. + and :class:`DynamicLossScaler` adjusts the loss scale to a lower value. If a certain number of iterations occur without overflowing gradients detected, :class:`DynamicLossScaler` increases the loss scale once more. - In this way :class:`DynamicLossScaler` attempts to "ride the edge" of + In this way :class:`DynamicLossScaler` attempts to "ride the edge" of always using the highest loss scale possible without incurring overflow. Args: init_scale (float, optional, default=2**32): Initial loss scale attempted by :class:`DynamicLossScaler.` - scale_factor (float, optional, default=2.0): Factor used when adjusting the loss scale. If an overflow is encountered, the loss scale is readjusted to loss scale/``scale_factor``. If ``scale_window`` consecutive iterations take place without an overflow, the loss scale is readjusted to loss_scale*``scale_factor``. + scale_factor (float, optional, default=2.0): Factor used when adjusting the loss scale. If an overflow is encountered, the loss scale is readjusted to loss scale/``scale_factor``. If ``scale_window`` consecutive iterations take place without an overflow, the loss scale is readjusted to loss_scale*``scale_factor``. scale_window (int, optional, default=1000): Number of consecutive iterations without an overflow to wait before increasing the loss scale. """ - def __init__(self, init_scale=2**32, scale_factor=2., @@ -187,58 +185,3 @@ def scale_gradient(self, module, grad_in, grad_out): def backward(self, loss, retain_graph=False): scaled_loss = loss * self.loss_scale scaled_loss.backward(retain_graph=retain_graph) - - -############################################################## -# Example usage below here -- assuming it's in a separate file -############################################################## -""" -TO-DO separate out into an example. -if __name__ == "__main__": - import torch - from torch.autograd import Variable - from dynamic_loss_scaler import DynamicLossScaler - - # N is batch size; D_in is input dimension; - # H is hidden dimension; D_out is output dimension. - N, D_in, H, D_out = 64, 1000, 100, 10 - - # Create random Tensors to hold inputs and outputs, and wrap them in Variables. 
- x = Variable(torch.randn(N, D_in), requires_grad=False) - y = Variable(torch.randn(N, D_out), requires_grad=False) - - w1 = Variable(torch.randn(D_in, H), requires_grad=True) - w2 = Variable(torch.randn(H, D_out), requires_grad=True) - parameters = [w1, w2] - - learning_rate = 1e-6 - optimizer = torch.optim.SGD(parameters, lr=learning_rate) - loss_scaler = DynamicLossScaler() - - for t in range(500): - y_pred = x.mm(w1).clamp(min=0).mm(w2) - loss = (y_pred - y).pow(2).sum() * loss_scaler.loss_scale - print('Iter {} loss scale: {}'.format(t, loss_scaler.loss_scale)) - print('Iter {} scaled loss: {}'.format(t, loss.data[0])) - print('Iter {} unscaled loss: {}'.format(t, loss.data[0] / loss_scaler.loss_scale)) - - # Run backprop - optimizer.zero_grad() - loss.backward() - - # Check for overflow - has_overflow = DynamicLossScaler.has_overflow(parameters) - - # If no overflow, unscale grad and update as usual - if not has_overflow: - for param in parameters: - param.grad.data.mul_(1. / loss_scaler.loss_scale) - optimizer.step() - # Otherwise, don't do anything -- ie, skip iteration - else: - print('OVERFLOW!') - - # Update loss scale for next iteration - loss_scaler.update_scale(has_overflow) - -""" diff --git a/flagai/launch.py b/flagai/launch.py deleted file mode 100644 index ca8856d7..00000000 --- a/flagai/launch.py +++ /dev/null @@ -1,230 +0,0 @@ -# Copyright 2020 The Microsoft DeepSpeed Team -""" -sailing runner is the main front-end to launching multi-worker -training jobs with DeepSpeed. By default this uses pdsh to parallel -ssh into multiple worker nodes and launch all the necessary processes -per rank for training. -""" - -import os -import sys -import json -import subprocess -import collections -import socket -from flagai.logger import log_dist - - -def fetch_hostfile(hostfile_path): - if not os.path.isfile(hostfile_path): - log_dist("Unable to find hostfile, will proceed with training " - "with local resources only.") - return None - # e.g., worker-0 slots=16 - with open(hostfile_path, 'r') as fd: - resource_pool = collections.OrderedDict() - for line in fd.readlines(): - line = line.strip() - if line == '': - # skip empty lines - continue - try: - hostname, slots = line.split() - _, slot_count = slots.split("=") - slot_count = int(slot_count) - except ValueError as err: - raise err - if hostname in resource_pool: - raise ValueError(f"host {hostname} is already defined") - resource_pool[hostname] = slot_count - - return resource_pool - - -def cmd_load_hyperparam(config_path=None, format="json", encoding="utf-8"): - """ - shell load arguments form argparse and config file - """ - # config_path='config/config_block_large_chinese.json' - format = config_path.rsplit('.')[-1] - with open(config_path, 'r', encoding=encoding) as f: - if format == "json": - config_dict = json.load(f) - else: - raise NameError("current format%s for hyperparam file is invalid" % - format) - config_cmd = [] - for key in config_dict: - if len(str(config_dict[key])) == 0: - config_cmd.append('--' + key) - else: - config_cmd.append('--' + key) - config_cmd.append(str(config_dict[key])) - return config_cmd - - -def launch_dist(launcher='distributed_deepspeed', - num_nodes=1, - gpus_per_node=1, - master_addr='localhost', - master_port=17500, - hostfile='hostfile', - nccl_info=False, - training_script='train.py', - training_script_paras=None): - try: - resource_pool = fetch_hostfile(hostfile) - except: - raise RuntimeError("hostfile is not valid") - # respect CUDA_VISIBLE_DEVICES for a single node and no explicit 
resource filters - cuda_visible_devices = os.environ.get("CUDA_VISIBLE_DEVICES", "0") - - if num_nodes > 1 and launcher == 'distributed_torch': - node_rank = 0 - for host, slots in resource_pool.items(): - cmd_launch = ['pdsh', '-f', '1024', '-w'] - cmd_launch.append('ssh:' + host) - cmd_launch.append('"') - if nccl_info: - cmd_launch.extend([ - 'export NCCL_DEBUG=info;', 'export NCCL_IB_DISABLE=0;', - 'export NCCL_NET_GDR_LEVEL=2;' - ]) - cmd_launch.extend([ - 'export NUM_NODES=' + str(num_nodes) + ';', - 'export GPUS_PER_NODE=' + str(gpus_per_node) + ';', - 'export NCCL_NET_GDR_LEVEL=2;', sys.executable, '-m', - 'torch.distributed.launch' - ]) - torch_distributed_args = [ - '--nproc_per_node', - str(gpus_per_node), - '--nnodes', - str(num_nodes), - '--node_rank', - str(node_rank), - '--master_addr', - master_addr, - '--master_port', - str(master_port), - ] - cmd_launch.extend(torch_distributed_args) - cmd_launch.append(training_script) - - for para in training_script_paras: - if 'training_script_config' in para: - para_index = training_script_paras.index(para) - training_script_args = cmd_load_hyperparam( - training_script_paras[para_index + 1]) - cmd_launch.extend(training_script_args) - del training_script_paras[para_index:para_index + 2] - if len(training_script_paras) > 0: - cmd_launch.extend(training_script_paras) - cmd_launch.append('--not_call_launch') - cmd_launch.append('"') - run_cmd = ' '.join(cmd_launch) - log_dist(run_cmd) - subprocess.Popen(run_cmd, shell=True) - node_rank += 1 - - elif num_nodes == 1 and launcher == 'distributed_torch': - cmd_launch = [] - cmd_launch.extend([ - 'export NUM_NODES=' + str(num_nodes) + ';', - 'export GPUS_PER_NODE=' + str(gpus_per_node) + ';', sys.executable, - '-m', 'torch.distributed.launch' - ]) - torch_distributed_args = [ - '--nproc_per_node', - str(gpus_per_node), - '--nnodes', - str(num_nodes), - '--node_rank', - str(0), - '--master_addr', - master_addr, - '--master_port', - str(master_port), - ] - cmd_launch.extend(torch_distributed_args) - cmd_launch.append(training_script) - cmd_launch.append('--not_call_launch') - run_cmd = ' '.join(cmd_launch) - log_dist(run_cmd) - subprocess.Popen(run_cmd, shell=True) - - elif launcher == 'distributed_deepspeed': - if hostfile is None: - log_dist( - 'Unable to find hostfile, will proceed with training with local resources only.' 
- ) - - with open('/tmp/hostfile', 'w') as w: - w.write(socket.gethostname() + ' slots=2') - hostfile = '/tmp/hostfile' - - if nccl_info: - cmd_launch = [ - 'NCCL_DEBUG=info NCCL_IB_DISABLE=0 NCCL_NET_GDR_LEVEL=2 deepspeed' - ] - else: - cmd_launch = ['deepspeed'] - - cmd_launch.extend([ - '--master_port', - str(master_port), - '--num_nodes', - str(num_nodes), - '--num_gpus', - str(gpus_per_node), - '--hostfile', - hostfile, - ]) - - cmd_launch.append(training_script) - if training_script_paras: - for para in training_script_paras: - if 'training_script_config' in para: - para_index = training_script_paras.index(para) - training_script_args = cmd_load_hyperparam( - training_script_paras[para_index + 1]) - cmd_launch.extend(training_script_args) - del training_script_paras[para_index:para_index + 2] - if len(training_script_paras) > 0: - cmd_launch.extend(training_script_paras) - - cmd_launch.append('--not_call_launch') - run_cmd = ' '.join(cmd_launch) - log_dist(run_cmd) - subprocess.Popen(run_cmd, shell=True) - - elif num_nodes == 1 and launcher == 'simple_torch': - # This launcher - for gpu_id in range(gpus_per_node): - cmd_launch = [] - cmd_launch.extend([ - 'export MASTER_ADDR=' + str(master_addr) + ';', - 'export MASTER_PORT=' + str(master_port) + ';', sys.executable - ]) - cmd_launch.append(training_script) - torch_distributed_args = [ - '--gpu_nums', - str(gpus_per_node), '--local_rank', - str(gpu_id) - ] - cmd_launch.extend(torch_distributed_args) - for para in training_script_paras: - if 'training_script_config' in para: - para_index = training_script_paras.index(para) - training_script_args = cmd_load_hyperparam( - training_script_paras[para_index + 1]) - cmd_launch.extend(training_script_args) - del training_script_paras[para_index:para_index + 2] - if len(training_script_paras) > 0: - cmd_launch.extend(training_script_paras) - - run_cmd = ' '.join(cmd_launch) - log_dist(run_cmd) - subprocess.Popen(run_cmd, shell=True) - else: - raise Exception('No aviable launcher') diff --git a/flagai/logger.py b/flagai/logger.py index 2dbbecb6..61ea1045 100644 --- a/flagai/logger.py +++ b/flagai/logger.py @@ -13,7 +13,6 @@ class LoggerFactory: - @staticmethod def create_logger(name=None, level=logging.INFO): """create a logger @@ -44,7 +43,7 @@ def create_logger(name=None, level=logging.INFO): if 'logger' not in dir(): logger = LoggerFactory.create_logger(name="DeepSpeed", level=logging.INFO) while len(logger.handlers) > 1: - #Why is this happening? + # Why is this happening? 
logger.removeHandler(logger.handlers[-1]) diff --git a/flagai/metrics.py b/flagai/metrics.py index 9b15e267..2e51d334 100644 --- a/flagai/metrics.py +++ b/flagai/metrics.py @@ -71,7 +71,6 @@ def multirc_em(predictions, labels, meta): def normalize_answer(s): """Lower text and remove punctuation, articles and extra whitespace.""" - def remove_articles(text): return re.sub(r'\b(a|an|the)\b', ' ', text) diff --git a/flagai/model/base_model.py b/flagai/model/base_model.py index 605f599b..9f89e3f7 100644 --- a/flagai/model/base_model.py +++ b/flagai/model/base_model.py @@ -10,7 +10,6 @@ # The base model for models class BaseModel(Module): - def __init__(self, config, **kwargs): super().__init__() self.config = config @@ -51,11 +50,11 @@ def from_pretrain(cls, config_path = None download_path = os.path.join(download_path, model_name) checkpoint_path = os.path.join(download_path, "pytorch_model.bin") - ## prepare the download path - ## downloading the files + # prepare the download path + # downloading the files model: Union[Module, None] if model_id != "null": - vocab_path = _get_vocab_path(download_path, "vocab.txt", model_id) + _get_vocab_path(download_path, "vocab.txt", model_id) if not only_download_config: checkpoint_path = _get_checkpoint_path(download_path, 'pytorch_model.bin', diff --git a/flagai/model/bert_model.py b/flagai/model/bert_model.py index 08b1bf2d..c8a08df3 100755 --- a/flagai/model/bert_model.py +++ b/flagai/model/bert_model.py @@ -41,7 +41,6 @@ def init_bert_weights(module): class BertStack(torch.nn.Module): - def __init__(self, num_hidden_layers, hidden_size, num_attention_heads, attention_probs_dropout_prob, initializer_range, layernorm_epsilon, hidden_dropout_prob, intermediate_size, @@ -67,7 +66,6 @@ def forward(self, if checkpoint_fn is not None: def create_custom_forward(module): - def custom_forward(*inputs): return module(*inputs) @@ -88,7 +86,6 @@ def custom_forward(*inputs): class BertModel(BaseModel): - def __init__(self, config, **kwargs): super(BertModel, self).__init__(config, **kwargs) @@ -177,7 +174,7 @@ def load_huggingface_weights(self, checkpoint_path): checkpoint = torch.load(checkpoint_path, map_location=torch.device("cpu")) if "module" in checkpoint: - ## ddp + # ddp checkpoint = checkpoint["module"] checkpoint_new = self.convert_checkpoint_to_load(checkpoint) self.load_state_dict(checkpoint_new, strict=False) @@ -233,7 +230,6 @@ def convert_checkpoint_to_load(self, checkpoint): class Predictions(nn.Module): - def __init__(self, vocab_size, hidden_size, layer_nrom_eps, hidden_act): super().__init__() self.transform = BertPredictionHeadTransform(hidden_size, @@ -249,7 +245,6 @@ def forward(self, x): class CLS(nn.Module): - def __init__(self, vocab_size, hidden_size, layer_norm_eps, hidden_act): super().__init__() self.predictions = Predictions(vocab_size, hidden_size, layer_norm_eps, @@ -260,7 +255,6 @@ def forward(self, x): class BertPredictionHeadTransform(nn.Module): - def __init__(self, hidden_size, layer_norm_eps, hidden_act): super().__init__() self.dense = nn.Linear(hidden_size, hidden_size) @@ -306,7 +300,6 @@ def load_extend_layer_weight(self, checkpoints, extend_layer: List[str]): class BertForSeq2seq(BaseModel): - def __init__(self, config, **kwargs): super(BertForSeq2seq, self).__init__(config, **kwargs) self.model = BertModel(config) @@ -377,7 +370,6 @@ def load_weights(self, model_path): class BertForMaskLM(BaseModel): - def __init__(self, config, **kwargs): super(BertForMaskLM, self).__init__(config, **kwargs) self.model = 
BertModel(config) @@ -424,7 +416,6 @@ def load_weights(self, model_path): class BertForClsClassifier(BaseModel): - def __init__(self, config, **kwargs): super(BertForClsClassifier, self).__init__(config, **kwargs) assert config['class_num'] != -1 and config['class_num'] is not None @@ -473,7 +464,6 @@ def load_weights(self, checkpoint_path): class BertForSequenceLabeling(BaseModel): - def __init__(self, config, **kwargs): super(BertForSequenceLabeling, self).__init__(config, **kwargs) self.model = BertModel(config) @@ -524,7 +514,6 @@ def load_weights(self, model_path): class BertForSequenceLabelingCRF(BaseModel): """ """ - def __init__(self, config, **kwargs): super(BertForSequenceLabelingCRF, self).__init__(config, **kwargs) self.model = BertModel(config) @@ -576,7 +565,6 @@ def load_weights(self, model_path): class BertForSequenceLabelingGP(BaseModel): """ """ - def __init__(self, config, **kwargs): super(BertForSequenceLabelingGP, self).__init__(config, **kwargs) self.model = BertModel(config) @@ -614,7 +602,6 @@ def load_weights(self, model_path): class BertForEmbedding(BaseModel): - def __init__(self, config, **kwargs): super(BertForEmbedding, self).__init__(config, **kwargs) self.model = BertModel(config) @@ -638,4 +625,4 @@ def forward(self, **data): return return_data def load_weights(self, model_path): - checkpoints = self.model.load_huggingface_weights(model_path) + self.model.load_huggingface_weights(model_path) diff --git a/flagai/model/blocks/bert_block.py b/flagai/model/blocks/bert_block.py index 151ee93f..6ce58c9e 100644 --- a/flagai/model/blocks/bert_block.py +++ b/flagai/model/blocks/bert_block.py @@ -25,7 +25,6 @@ class BertOutput(torch.nn.Module): - def __init__(self, intermediate_size, hidden_size, layernorm_epsilon, hidden_dropout_prob, initializer_range): super(BertOutput, self).__init__() @@ -57,7 +56,6 @@ def forward(self, hidden_states, input_tensor): class BertIntermediate(torch.nn.Module): - def __init__(self, hidden_size, intermediate_size, initializer_range, hidden_act): super(BertIntermediate, self).__init__() @@ -85,7 +83,6 @@ def forward(self, hidden_states): class BertBlock(torch.nn.Module): - def __init__(self, hidden_size, num_attention_heads, attention_probs_dropout_prob, initializer_range, layernorm_epsilon, hidden_dropout_prob, intermediate_size, diff --git a/flagai/model/blocks/glm_block.py b/flagai/model/blocks/glm_block.py index 64d76143..b66f2adc 100644 --- a/flagai/model/blocks/glm_block.py +++ b/flagai/model/blocks/glm_block.py @@ -54,7 +54,6 @@ class GLMBlock(torch.nn.Module): mlp output) initialization. If None, use `init_method`. """ - def __init__(self, hidden_size, num_attention_heads, @@ -154,7 +153,6 @@ class GLMDecoderBlock(torch.nn.Module): mlp output) initialization. If None, use `init_method`. 
""" - def __init__(self, hidden_size, num_attention_heads, diff --git a/flagai/model/blocks/gpt2_block.py b/flagai/model/blocks/gpt2_block.py index 1bad7b40..ca8c4041 100644 --- a/flagai/model/blocks/gpt2_block.py +++ b/flagai/model/blocks/gpt2_block.py @@ -4,7 +4,6 @@ class GPT2Block(nn.Module): - def __init__(self, n_ctx, config, scale=False): super().__init__() hidden_size = config['n_embd'] diff --git a/flagai/model/blocks/t5_block.py b/flagai/model/blocks/t5_block.py index 7e8f9858..8569d061 100644 --- a/flagai/model/blocks/t5_block.py +++ b/flagai/model/blocks/t5_block.py @@ -6,7 +6,6 @@ class T5Block(nn.Module): - def __init__(self, config, has_relative_attention_bias=False): super().__init__() self.is_decoder = config['is_decoder'] diff --git a/flagai/model/config.py b/flagai/model/config.py index 9cae27e0..3176a243 100644 --- a/flagai/model/config.py +++ b/flagai/model/config.py @@ -8,7 +8,6 @@ class Config(object): - def __init__(self): super().__init__() diff --git a/flagai/model/file_utils.py b/flagai/model/file_utils.py index 85837976..9d7a0c14 100644 --- a/flagai/model/file_utils.py +++ b/flagai/model/file_utils.py @@ -17,12 +17,12 @@ def download_from_url(url, size=0, rank=0, to_path=None, file_pname=None): """ try: requests.get(url, stream=True, verify=True) - except Exception as e: + except Exception: raise ValueError('please check the download file names') total_size = size if to_path is None: to_path = './checkpoints/' - if file_pname == None: + if file_pname is None: file_path = os.path.join(to_path, url.split('/')[-1]) else: file_path = os.path.join(to_path, file_pname) diff --git a/flagai/model/glm_model.py b/flagai/model/glm_model.py index 2159539d..9884fb81 100755 --- a/flagai/model/glm_model.py +++ b/flagai/model/glm_model.py @@ -17,18 +17,16 @@ import torch from torch import nn import torch.nn.functional as F - from flagai.model.blocks.glm_block import GLMBlock -from flagai.model.file_utils import _get_config_path, _get_checkpoint_path, _get_model_id from flagai.model.utils import scaled_init_method, divide, unscaled_init_method - -print_rank_0 = print from flagai.model.layers.embeddings import VocabParallelEmbedding from flagai.model.base_model import BaseModel from flagai.model.layers.embeddings import PositionalEmbedding from flagai.mpu.random import checkpoint from flagai.model.utils import normal_init_method from torch.nn import LayerNorm + +print_rank_0 = print if os.getenv('ENV_TYPE') == 'deepspeed+mpu': from flagai.mpu import copy_to_model_parallel_region @@ -67,7 +65,6 @@ class GLMStack(torch.nn.Module): scaling for the output weights ( output of self attention and mlp). 
""" - def __init__( self, num_layers, @@ -210,9 +207,7 @@ def build_mask_matrix(seq_length, sep, memory_length=0): m = m.masked_fill(mask.unsqueeze(1).expand_as(m), 1) if memory_length > 0: m = m.expand(batch_size, -1, -1) - m = torch.cat((hidden_states.new_ones( - (batch_size, seq_length, memory_length)), m), - dim=2) + m = torch.cat((hidden_states.new_ones((batch_size, seq_length, memory_length)), m), dim=2) m = m.unsqueeze(1) return m @@ -257,13 +252,10 @@ def check_detach(_hidden_states): mem_layers = [] for i, layer in enumerate(self.layers): - args = [hidden_states, attention_mask - ] if not self.use_decoder_layer else [ - hidden_states, encoder_states, attention_mask - ] + args = [hidden_states, attention_mask] if not self.use_decoder_layer else \ + [hidden_states, encoder_states, attention_mask] def create_custom_forward(module): - def custom_forward(*inputs): # None for past_key_value return module(*inputs) @@ -321,7 +313,6 @@ class GLMModel(BaseModel): The output of the forward method are the logits (parallel or serial depending on the `parallel_output` flag. """ - def __init__(self, config, **kwargs): super(GLMModel, self).__init__(config, **kwargs) @@ -448,7 +439,7 @@ def load_weights_glm(self, checkpoint_path): checkpoint = torch.load(checkpoint_path, map_location=torch.device("cpu")) if "module" in checkpoint: - ## ddp + # ddp checkpoint = checkpoint["module"] checkpoint_load = {} for k, v in checkpoint.items(): @@ -464,7 +455,6 @@ def load_weights(self, checkpoint_path): class GLMForMultiTokenCloze(BaseModel): - def __init__(self, config, take_softmax=True, @@ -506,7 +496,7 @@ def forward(self, target_ids ''' - if target_ids == None: + if target_ids is None: return self.model(input_ids, position_ids, attention_mask) num_choices = 1 # modified from none @@ -574,7 +564,7 @@ def compute_loss(self, detach_memory=detach_memory, prompt_pos=prompt_pos, **kwargs) - assert labels != None, "labels must not None!" + assert labels is not None, "labels must not None!" logits = model_output['logits'] loss = F.cross_entropy(logits.contiguous().float(), labels.long()) @@ -586,7 +576,6 @@ def compute_loss(self, class GLMForMultiTokenClozeFast(BaseModel): - def __init__(self, config, take_softmax=True, @@ -680,7 +669,6 @@ def build_dec_mask_matrix(seq_length, sep, memory_length=0): class GLMForSingleTokenCloze(BaseModel): - def __init__(self, config, take_softmax=False, **kwargs): super().__init__(config, **kwargs) self.config = config @@ -775,7 +763,7 @@ def compute_loss(self, detach_memory=detach_memory, prompt_pos=prompt_pos, **kwargs) - assert labels != None, "labels must not None!" + assert labels is not None, "labels must not None!" logits = model_output['logits'] loss = F.cross_entropy(logits.contiguous().float(), labels.long()) return { @@ -786,7 +774,6 @@ def compute_loss(self, class GLMForSequenceClassification(BaseModel): - def __init__(self, config, hidden_dropout=0.1, pool_token='cls', **kwargs): super().__init__(config, **kwargs) self.config = config @@ -856,7 +843,7 @@ def compute_loss(self, model_output = self.forward(input_ids=input_ids, position_ids=position_ids, attention_mask=attention_mask) - assert labels != None, "labels must not None!" + assert labels is not None, "labels must not None!" 
logits = model_output['logits'] loss = F.cross_entropy(logits.contiguous().float(), labels.long()) return { @@ -871,7 +858,6 @@ def load_weights(self, checkpoint_path): class GLMForSeq2Seq(BaseModel): - def __init__(self, config, take_softmax=True, **kwargs): super().__init__(config, **kwargs) self.config = config diff --git a/flagai/model/gpt2_model.py b/flagai/model/gpt2_model.py index 880a8257..9d2825e7 100644 --- a/flagai/model/gpt2_model.py +++ b/flagai/model/gpt2_model.py @@ -14,7 +14,6 @@ class GPT2Stack(nn.Module): - def __init__(self, config): self.config = config super().__init__() @@ -115,7 +114,6 @@ def forward( class GPT2Model(BaseModel): - def __init__(self, config, **kwargs): super().__init__(config, **kwargs) self.config = config @@ -180,7 +178,7 @@ def load_weights(self, checkpoint_path): checkpoint = torch.load(checkpoint_path, map_location=torch.device("cpu")) if "module" in checkpoint: - ## ddp + # ddp checkpoint = checkpoint["module"] checkpoint = self.transpose_weight(checkpoint) diff --git a/flagai/model/layers/activations.py b/flagai/model/layers/activations.py index 6ba38cf4..7bc0fd08 100644 --- a/flagai/model/layers/activations.py +++ b/flagai/model/layers/activations.py @@ -23,7 +23,6 @@ class NewGELUActivation(nn.Module): Implementation of the GELU activation function currently in Google BERT repo (identical to OpenAI GPT). Also see the Gaussian Error Linear Units paper: https://arxiv.org/abs/1606.08415 """ - def forward(self, input: Tensor) -> Tensor: return 0.5 * input * (1.0 + torch.tanh( math.sqrt(2.0 / math.pi) * @@ -37,7 +36,6 @@ class GELUActivation(nn.Module): torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3)))) This is now written in C in nn.functional Also see the Gaussian Error Linear Units paper: https://arxiv.org/abs/1606.08415 """ - def __init__(self, use_gelu_python: bool = False): super().__init__() if version.parse( @@ -57,7 +55,6 @@ class FastGELUActivation(nn.Module): """ Applies GELU approximation that is slower than QuickGELU but more accurate. See: https://github.com/hendrycks/GELUs """ - def forward(self, input: Tensor) -> Tensor: return 0.5 * input * (1.0 + torch.tanh(input * 0.7978845608 * @@ -68,7 +65,6 @@ class QuickGELUActivation(nn.Module): """ Applies GELU approximation that is fast but somewhat inaccurate. See: https://github.com/hendrycks/GELUs """ - def forward(self, input: Tensor) -> Tensor: return input * torch.sigmoid(1.702 * input) @@ -83,7 +79,6 @@ class ClippedGELUActivation(nn.Module): For information: OpenAI GPT's gelu is slightly different (and gives slightly different results): 0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3)))). See https://arxiv.org/abs/1606.08415 """ - def __init__(self, min: float, max: float): if min > max: raise ValueError( @@ -105,7 +100,6 @@ class SiLUActivation(nn.Module): Activation Function (Ramachandran et al., https://arxiv.org/abs/1710.05941v1) where the SiLU was experimented with later. """ - def __init__(self): super().__init__() if version.parse(torch.__version__) < version.parse("1.7"): @@ -125,7 +119,6 @@ class MishActivation(nn.Module): See Mish: A Self-Regularized Non-Monotonic Activation Function (Misra., https://arxiv.org/abs/1908.08681). 
Also visit the official repository for the paper: https://github.com/digantamisra98/Mish """ - def __init__(self): super().__init__() if version.parse(torch.__version__) < version.parse("1.9"): @@ -144,7 +137,6 @@ class LinearActivation(nn.Module): """ Applies the linear activation function, i.e. forwarding input directly to output. """ - def forward(self, input: Tensor) -> Tensor: return input diff --git a/flagai/model/layers/attentions.py b/flagai/model/layers/attentions.py index ea810054..ba0ea828 100755 --- a/flagai/model/layers/attentions.py +++ b/flagai/model/layers/attentions.py @@ -39,7 +39,6 @@ class GPT2Attention(nn.Module): - def __init__(self, nx, n_ctx, @@ -174,7 +173,6 @@ def forward( class T5Attention(nn.Module): - def __init__(self, config, has_relative_attention_bias=False): super().__init__() self.is_decoder = config['is_decoder'] @@ -415,7 +413,6 @@ def project(hidden_states, proj_layer, key_value_states, class T5LayerSelfAttention(nn.Module): - def __init__(self, config, has_relative_attention_bias=False): super().__init__() self.SelfAttention = T5Attention( @@ -451,7 +448,6 @@ def forward( class T5LayerCrossAttention(nn.Module): - def __init__(self, config): super().__init__() self.EncDecAttention = T5Attention(config, @@ -492,7 +488,6 @@ def forward( class ParallelCrossAttention(torch.nn.Module): """Parallel cross-attention layer for Transformer""" - def __init__(self, hidden_size, num_attention_heads, @@ -631,7 +626,6 @@ class ParallelSelfAttention(torch.nn.Module): b: batch size s: sequence length """ - def __init__(self, hidden_size, num_attention_heads, @@ -857,7 +851,6 @@ class BertParallelSelfAttention(torch.nn.Module): b: batch size s: sequence length """ - def __init__(self, hidden_size, num_attention_heads, @@ -960,7 +953,6 @@ def forward(self, hidden_states, attention_mask): class BertSelfOutput(torch.nn.Module): - def __init__(self, hidden_size, initializer_range, layernorm_epsilon, hidden_dropout_prob): super(BertSelfOutput, self).__init__() @@ -986,7 +978,6 @@ def forward(self, hidden_states, input_tensor): class BertAttention(torch.nn.Module): - def __init__(self, hidden_size, num_attention_heads, attention_probs_dropout_prob, initializer_range, layernorm_epsilon, hidden_dropout_prob): diff --git a/flagai/model/layers/crf.py b/flagai/model/layers/crf.py index 4eda9d13..412c43a9 100644 --- a/flagai/model/layers/crf.py +++ b/flagai/model/layers/crf.py @@ -6,7 +6,6 @@ class CRFLayer(nn.Module): """ """ - def __init__(self, output_dim): super(CRFLayer, self).__init__() diff --git a/flagai/model/layers/embeddings.py b/flagai/model/layers/embeddings.py index 9bac7eb8..875f2b44 100755 --- a/flagai/model/layers/embeddings.py +++ b/flagai/model/layers/embeddings.py @@ -39,14 +39,12 @@ class PositionalEmbedding(torch.nn.Module): - def __init__(self, hidden_size): super(PositionalEmbedding, self).__init__() self.hidden_size = hidden_size - inv_freq = 1 / (10000 - **(torch.arange(0.0, hidden_size, 2.0) / hidden_size)) + inv_freq = 1 / (10000**(torch.arange(0.0, hidden_size, 2.0) / hidden_size)) self.register_buffer('inv_freq', inv_freq) def forward(self, pos_seq, bsz=None): @@ -62,7 +60,6 @@ class WordEmbedding(nn.Module): """ input embeddin only has word embedding """ - def __init__(self, args, vocab_size): super(WordEmbedding, self).__init__() self.remove_embedding_layernorm = args.remove_embedding_layernorm @@ -134,7 +131,6 @@ class VocabParallelEmbedding(torch.nn.Module): embedding_dim: size of hidden state. init_method: method to initialize weights. 
""" - def __init__(self, num_embeddings, embedding_dim, @@ -206,7 +202,6 @@ class ParallelEmbedding(torch.nn.Module): embedding_dim: size of hidden state. init_method: method to initialize weights. """ - def __init__(self, num_embeddings, embedding_dim, @@ -266,7 +261,6 @@ def forward(self, input_): class BertEmbeddings(nn.Module): """Construct the embeddings from word, position and token_type embeddings. """ - def __init__(self, vocab_size, hidden_size, initializer_range, max_position_embeddings, type_vocab_size, layernorm_epsilon, hidden_dropout_prob): diff --git a/flagai/model/layers/feedforward.py b/flagai/model/layers/feedforward.py index ebbe3d06..1bae1f50 100644 --- a/flagai/model/layers/feedforward.py +++ b/flagai/model/layers/feedforward.py @@ -4,11 +4,9 @@ import torch.nn as nn from torch.nn.parameter import Parameter from torch.nn import init -from .layer_norm import T5LayerNorm, BertLayerNorm +from .layer_norm import T5LayerNorm from .activations import gelu_impl, relu, gelu_new, ACT2FN -from ..utils import normal_init_method import torch.nn.functional as F - from flagai.mpu.initialize import get_model_parallel_rank from flagai.mpu.initialize import get_model_parallel_world_size from flagai.mpu.mappings import copy_to_model_parallel_region @@ -16,9 +14,6 @@ from flagai.mpu.mappings import reduce_from_model_parallel_region from flagai.mpu.mappings import scatter_to_model_parallel_region from flagai.mpu.utils import divide -from flagai.mpu.utils import VocabUtility -from flagai.model.utils import normal_init_method -from .activations import ACT2FN def _initialize_affine_weight(weight, @@ -67,7 +62,6 @@ def _initialize_affine_weight(weight, class GPT2MLP(nn.Module): - def __init__(self, n_state, config, @@ -101,7 +95,6 @@ def forward(self, x): class T5DenseReluDense(nn.Module): - def __init__(self, config): super().__init__() if os.getenv('ENV_TYPE') == 'deepspeed+mpu': @@ -128,7 +121,6 @@ def forward(self, hidden_states): class T5DenseGatedGeluDense(nn.Module): - def __init__(self, config): super().__init__() if os.getenv('ENV_TYPE') == 'deepspeed+mpu': @@ -167,7 +159,6 @@ def forward(self, hidden_states): class T5LayerFF(nn.Module): - def __init__(self, config): super().__init__() if config['feed_forward_proj'] == "relu": @@ -191,7 +182,6 @@ def forward(self, hidden_states): class MLPForward(torch.nn.Module): - def __init__( self, hidden_size, @@ -236,7 +226,6 @@ def forward(self, hidden_states, **kw_args): class BertPooler(nn.Module): - def __init__(self, hidden_size): super(BertPooler, self).__init__() self.dense = nn.Linear(hidden_size, hidden_size) @@ -271,7 +260,6 @@ class ColumnParallelLinear(torch.nn.Module): set to False. It returns the master weights used for initialization. """ - def __init__(self, input_size, output_size, @@ -355,7 +343,6 @@ class RowParallelLinear(torch.nn.Module): set to False. It returns the master weights used for initialization. 
""" - def __init__(self, input_size, output_size, diff --git a/flagai/model/layers/global_pointer.py b/flagai/model/layers/global_pointer.py index df43a7f6..a42d0ed3 100644 --- a/flagai/model/layers/global_pointer.py +++ b/flagai/model/layers/global_pointer.py @@ -20,7 +20,6 @@ def multilabel_crossentropy(y_pred, y_true): class GlobalPointer(nn.Module): - def __init__(self, hidden_size, ent_type_size, inner_dim, RoPE=True): super().__init__() self.ent_type_size = ent_type_size diff --git a/flagai/model/layers/layer_norm.py b/flagai/model/layers/layer_norm.py index eab08f33..413f6550 100755 --- a/flagai/model/layers/layer_norm.py +++ b/flagai/model/layers/layer_norm.py @@ -5,7 +5,6 @@ class LayerNorm(nn.Module): - def __init__(self, hidden_size, eps=1e-6): super(LayerNorm, self).__init__() self.eps = eps @@ -22,7 +21,6 @@ def forward(self, x): class T5LayerNorm(nn.Module): - def __init__(self, hidden_size, eps=1e-6): """ Construct a layernorm module in the T5 style No bias and no subtraction of mean. @@ -45,7 +43,6 @@ def forward(self, hidden_states): class BertLayerNorm(nn.Module): - def __init__(self, hidden_size, eps=1e-12): """Construct a layernorm module in the TF style (epsilon inside the square root). """ diff --git a/flagai/model/predictor/predictor.py b/flagai/model/predictor/predictor.py index b1748c75..8ca1c826 100644 --- a/flagai/model/predictor/predictor.py +++ b/flagai/model/predictor/predictor.py @@ -1,8 +1,9 @@ import numpy as np import torch import torch.nn.functional as F -from flagai.model.predictor.utils import viterbi_decode, decode_labels, bert_beamsearch, t5_random_sample, gpt_random_sample, \ - t5_beamsearch, gpt_beamsearch, bert_random_sample,glm_beamsearch,glm_random_sample +from flagai.model.predictor.utils import viterbi_decode, decode_labels, bert_beamsearch,\ + t5_random_sample, gpt_random_sample, \ + t5_beamsearch, gpt_beamsearch, bert_random_sample, glm_beamsearch, glm_random_sample from typing import List, Union, Dict, Tuple, Any from flagai.model.bert_model import BertModel, BertForMaskLM, BertForSeq2seq, BertForSequenceLabeling, \ BertForSequenceLabelingGP, BertForSequenceLabelingCRF, BertForClsClassifier @@ -14,7 +15,6 @@ class Predictor: - def __init__(self, model: Union[BertModel, GPT2Model, BertForSequenceLabelingGP, BertForSequenceLabelingCRF, BertForClsClassifier, @@ -142,7 +142,7 @@ def predict_ner(self, trans = model.state_dict().get("crf_layer.trans", None) if trans is not None: - ## crf + # crf trans = trans.cpu() with torch.no_grad(): out = model(**{"input_ids": token_ids})["logits"][0].cpu() @@ -156,13 +156,13 @@ def predict_ner(self, with torch.no_grad(): scores = model( **{"input_ids": token_ids})["logits"].cpu().numpy()[0] - ## global pointer + # global pointer scores[:, [0, -1]] -= np.inf scores[:, :, [0, -1]] -= np.inf - for l, start, end in zip(*np.where(scores > 0)): + for pos_t, start, end in zip(*np.where(scores > 0)): if mapping[start] and mapping[end]: entities.append( - (mapping[start][0], mapping[end][-1], target[l])) + (mapping[start][0], mapping[end][-1], target[pos_t])) return entities else: with torch.no_grad(): diff --git a/flagai/model/predictor/utils.py b/flagai/model/predictor/utils.py index e320d0eb..3da1e506 100644 --- a/flagai/model/predictor/utils.py +++ b/flagai/model/predictor/utils.py @@ -17,7 +17,6 @@ def load_config(config_path): class LogitsProcessor: """Abstract base class for all logit processors that can be applied during generation.""" - def __call__(self, input_ids: torch.LongTensor, scores: 
torch.FloatTensor) -> torch.FloatTensor: """Torch method for processing logits.""" @@ -34,7 +33,6 @@ class RepetitionPenaltyLogitsProcessor(LogitsProcessor): The parameter for repetition penalty. 1.0 means no penalty. See `this paper `__ for more details. """ - def __init__(self, penalty: float): if not isinstance(penalty, float) or not (penalty > 0): raise ValueError( @@ -63,7 +61,6 @@ class TemperatureLogitsProcessor(LogitsProcessor): temperature (:obj:`float`): The value used to module the logits distribution. """ - def __init__(self, temperature: float): if not isinstance(temperature, float) or not (temperature > 0): raise ValueError( @@ -91,7 +88,6 @@ class TopPLogitsProcessor(LogitsProcessor): min_tokens_to_keep (:obj:`int`, `optional`, defaults to 1): Minimum number of tokens that cannot be filtered. """ - def __init__(self, top_p: float, filter_value: float = -float("Inf"), @@ -139,7 +135,6 @@ class TopKLogitsProcessor(LogitsProcessor): min_tokens_to_keep (:obj:`int`, `optional`, defaults to 1): Minimum number of tokens that cannot be filtered. """ - def __init__(self, top_k: int, filter_value: float = -float("Inf"), @@ -165,7 +160,6 @@ def __call__(self, input_ids: torch.LongTensor, class ListProcessor(LogitsProcessor): - def __init__(self, list_processor: List[LogitsProcessor]) -> None: super().__init__() self.list_processor = list_processor @@ -191,9 +185,9 @@ def viterbi_decode(nodes, trans): seq_len = nodes.shape[0] labels = torch.arange(0, target_size).view(1, -1) path = labels - for l in range(1, seq_len): + for pos_t in range(1, seq_len): scores = scores.view(-1, 1) - M = scores + trans + nodes[l].view(1, -1) + M = scores + trans + nodes[pos_t].view(1, -1) scores, ids = M.max(0) path = torch.cat((path[:, ids], labels), dim=0) @@ -259,8 +253,11 @@ def bert_beam_search(model, token_type_ids = np.zeros_like(token_ids).astype(np.int64) output_ids = None + with torch.no_grad(): output_scores = np.zeros([1]) + new_token_type_ids = token_type_ids + new_input_ids = token_ids for step in range(out_max_length): if step == 0: scores = bert_predict_generate(model, token_ids, @@ -332,6 +329,7 @@ def glm_beam_search(model, list_processor = ListProcessor(lp) with torch.no_grad(): output_scores = np.zeros([1]) + new_input_ids = input_ids for step in range(out_max_length): if step == 0: scores = model( diff --git a/flagai/model/t5_model.py b/flagai/model/t5_model.py index 06376149..d4a18949 100644 --- a/flagai/model/t5_model.py +++ b/flagai/model/t5_model.py @@ -15,28 +15,18 @@ """ PyTorch T5 model. 
""" import copy -import os import warnings - import torch from torch import nn from torch.nn import CrossEntropyLoss - +from torch import Tensor, device +from typing import Tuple, Optional from flagai.model.base_model import BaseModel from flagai.model.blocks.t5_block import T5Block from flagai.model.layers.layer_norm import T5LayerNorm -from flagai.model.layers.attentions import T5Attention, T5LayerSelfAttention, T5LayerCrossAttention -from flagai.model.layers.feedforward import T5DenseReluDense, T5DenseGatedGeluDense, T5LayerFF -from flagai.model.file_utils import _get_vocab_path -from torch import Tensor, device -from flagai.model.file_utils import _get_model_id, _get_config_path, _get_checkpoint_path -from typing import Tuple, Optional -import copy -import json -import numpy as np -from collections import defaultdict +from flagai.model.layers.attentions import T5Attention +from flagai.model.layers.feedforward import T5DenseReluDense, T5DenseGatedGeluDense from flagai.data.tokenizer.t5.t5_tokenizer import T5JiebaTokenizer -from flagai.model.layers.activations import ACT2FN # Warning message for FutureWarning: head_mask was separated into two input args - head_mask, decoder_head_mask __HEAD_MASK_WARNING_MSG = """ @@ -52,7 +42,6 @@ class T5PreTrainedModel(BaseModel): An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models. """ - def __init__(self, config, **kwargs): super(T5PreTrainedModel, self).__init__(config, **kwargs) @@ -142,7 +131,6 @@ def _shift_right(self, input_ids): class T5Stack(nn.Module): - def __init__(self, config, embed_tokens=None): super().__init__() self.config = config @@ -263,7 +251,6 @@ def forward( if checkpoint is not None: def create_custom_forward(module): - def custom_forward(*inputs): return module(*inputs) @@ -718,7 +705,7 @@ def load_weights(self, checkpoint_path): checkpoint = torch.load(checkpoint_path, map_location=torch.device("cpu")) if "module" in checkpoint: - ## ddp + # ddp checkpoint = checkpoint["module"] self.load_state_dict(checkpoint, strict=True) return checkpoint @@ -1098,7 +1085,6 @@ class T5Config(): use_cache (`bool`, *optional*, defaults to `True`): Whether or not the model should return the last key/values attentions (not used by all models). 
""" - def __init__(self, vocab_size=32128, d_model=512, @@ -1281,7 +1267,6 @@ def num_hidden_layers(self): class T5UERModel(nn.Module): - def __init__(self, word2idx, size="base"): super().__init__() self.device = torch.device("cpu") diff --git a/flagai/model/utils.py b/flagai/model/utils.py index 6e6464d1..3b9c2e0d 100644 --- a/flagai/model/utils.py +++ b/flagai/model/utils.py @@ -40,19 +40,18 @@ def split_tensor_along_last_dim(tensor, def unscaled_init_method(sigma): """Init method based on N(0, sigma).""" - def init_(tensor): return torch.nn.init.normal_(tensor, mean=0.0, std=sigma) return init_ -def scaled_init_method(sigma, num_layers): +def scaled_init_method(mean, sigma, num_layers): """Init method based on N(0, sigma/sqrt(2*num_layers).""" std = sigma / math.sqrt(2.0 * num_layers) def init_(tensor): - return torch.nn.init.normal_(tensor, mean=0.0, std=std) + return torch.nn.init.normal_(tensor, mean=mean, std=std) return init_ @@ -62,17 +61,6 @@ def sqrt(x): def normal_init_method(mean=0.0, std=0.02): - - def init_(tensor): - return torch.nn.init.normal_(tensor, mean=mean, std=std) - - return init_ - - -def scaled_init_method(mean, std, num_layers): - """Init method based on N(0, sigma/sqrt(2*num_layers).""" - std = std / math.sqrt(2.0 * num_layers) - def init_(tensor): return torch.nn.init.normal_(tensor, mean=mean, std=std) diff --git a/flagai/mpu/cross_entropy.py b/flagai/mpu/cross_entropy.py index a9041952..567d82e9 100644 --- a/flagai/mpu/cross_entropy.py +++ b/flagai/mpu/cross_entropy.py @@ -22,7 +22,6 @@ class _VocabParallelCrossEntropy(torch.autograd.Function): - @staticmethod def forward(ctx, vocab_parallel_logits, target): diff --git a/flagai/mpu/func_utils.py b/flagai/mpu/func_utils.py index 076c1ddb..f9eba243 100755 --- a/flagai/mpu/func_utils.py +++ b/flagai/mpu/func_utils.py @@ -30,7 +30,6 @@ def gelu(x): def unscaled_init_method(sigma): """Init method based on N(0, sigma).""" - def init_(tensor): return torch.nn.init.normal_(tensor, mean=0.0, std=sigma) diff --git a/flagai/mpu/local_attention_function.py b/flagai/mpu/local_attention_function.py index 10759274..50f9d6e7 100755 --- a/flagai/mpu/local_attention_function.py +++ b/flagai/mpu/local_attention_function.py @@ -10,7 +10,6 @@ class similarFunction(Function): - @staticmethod def forward(ctx, x_ori, x_loc, kH, kW, casual_mask=False): ctx.save_for_backward(x_ori, x_loc) @@ -36,7 +35,6 @@ def backward(ctx, grad_outputs): class weightingFunction(Function): - @staticmethod def forward(ctx, x_ori, x_weight, kH, kW, casual_mask=False): ctx.save_for_backward(x_ori, x_weight) diff --git a/flagai/mpu/mappings.py b/flagai/mpu/mappings.py index bdb74555..1f5fa9d9 100755 --- a/flagai/mpu/mappings.py +++ b/flagai/mpu/mappings.py @@ -78,7 +78,6 @@ def _gather(input_): class _CopyToModelParallelRegion(torch.autograd.Function): """Pass the input to the model parallel region.""" - @staticmethod def forward(ctx, input_): return input_ @@ -90,7 +89,6 @@ def backward(ctx, grad_output): class _ReduceFromModelParallelRegion(torch.autograd.Function): """All-redcue the input from the model parallel region.""" - @staticmethod def forward(ctx, input_): return _reduce(input_) @@ -102,7 +100,6 @@ def backward(ctx, grad_output): class _ScatterToModelParallelRegion(torch.autograd.Function): """Split the input and keep only the corresponding chuck to the rank.""" - @staticmethod def forward(ctx, input_): return _split(input_) @@ -114,7 +111,6 @@ def backward(ctx, grad_output): class _GatherFromModelParallelRegion(torch.autograd.Function): 
"""Gather the input from model parallel region and concatinate.""" - @staticmethod def forward(ctx, input_): return _gather(input_) diff --git a/flagai/mpu/random.py b/flagai/mpu/random.py index 8cd306c0..82aed5b0 100755 --- a/flagai/mpu/random.py +++ b/flagai/mpu/random.py @@ -26,8 +26,6 @@ from torch.cuda import _lazy_call, device as device_ctx_manager #from torch.utils.checkpoint import detach_variable -import torch.distributed as dist - PARTITION_ACTIVATIONS = False PA_CORRECTNESS_TEST = False @@ -134,7 +132,6 @@ class CudaRNGStatesTracker: rng state, we can perform operations and return to our starting cuda state. """ - def __init__(self): # Map from a string name to the cuda rng state. self.states_ = {} @@ -293,7 +290,6 @@ class CheckpointFunction(torch.autograd.Function): 2) the states in the model parallel tracker are also properly tracked/set/reset. """ - @staticmethod def forward(ctx, run_function, *args): ctx.run_function = run_function diff --git a/flagai/mpu/utils.py b/flagai/mpu/utils.py index 18038d63..88e9ae41 100755 --- a/flagai/mpu/utils.py +++ b/flagai/mpu/utils.py @@ -55,7 +55,6 @@ class VocabUtility: """Split the vocabulary into `world_size` chunks amd return the first and last index of the vocabulary belonging to the `rank` partition: Note that indecies in [fist, last)""" - @staticmethod def vocab_range_from_per_partition_vocab_size(per_partition_vocab_size, rank, world_size): diff --git a/flagai/test_utils.py b/flagai/test_utils.py index 4a101b92..2bdc495a 100644 --- a/flagai/test_utils.py +++ b/flagai/test_utils.py @@ -98,7 +98,6 @@ def build_input_from_ids(text_a_ids=None, class CollateArguments: - def __init__(self): self.cloze_eval = True self.pretrained_bert = False diff --git a/flagai/trainer.py b/flagai/trainer.py index 5b1e7f66..2a765806 100644 --- a/flagai/trainer.py +++ b/flagai/trainer.py @@ -6,7 +6,7 @@ pass try: from flagai import mpu -except Exception as e: +except Exception: pass from re import I import torch @@ -45,7 +45,7 @@ class Trainer(): deepspeed: single-/multi- node gpu deepspeed+mpu: single-/multi- node gpu train_dataset (`torch.utils.data.Dataset` or `torch.utils.data.DataLoader`, *optional*): - The dataset to use for training. + The dataset to use for training. If it is an `Dataset`, we will create a `DataLoader` with the provided `Dataset` and `collate_fn' for the selected `env_type`. `Dataset` is prefred to iterally return a sample as followings, >>> {'text': 'I like big model.', 'label': 'positive'} @@ -73,37 +73,36 @@ class Trainer(): Example settings for `pyrochDDP`: >>> trainer = Trainer(env_type='pytorchDDP', >>> master_ip='127.0.0.1', - >>> master_port=17750, + >>> master_port=17750, >>> num_nodes=1, # nodes >>> num_gpus=2, # gpus for each nodes - >>> hostfile='./hostfile', - >>> training_script=__file__) - >>> cat ./hostfile + >>> hostfile='./hostfile', + >>> training_script=__file__) + >>> cat ./hostfile >>> 127.0.0.1 slots=2 Example settings for `deepspeed`: >>> trainer = Trainer(env_type='pytorchDDP', >>> master_ip='127.0.0.1', - >>> master_port=17750, + >>> master_port=17750, >>> num_nodes=1, # nodes >>> num_gpus=2, # gpus for each nodes - >>> hostfile='./hostfile', - >>> deepspeed_config='./deepspeed.json', - >>> training_script=__file__) + >>> hostfile='./hostfile', + >>> deepspeed_config='./deepspeed.json', + >>> training_script=__file__) The detail settings for deepspeed.json refer to https://www.deepspeed.ai/docs/config-json/ Example settins for `deepspeed+mpu`: The model must be build by megatron-lm!!! 
>>> trainer = Trainer(env_type='pytorchDDP', >>> master_ip='127.0.0.1', - >>> master_port=17750, + >>> master_port=17750, >>> num_nodes=1, # nodes >>> num_gpus=2, # gpus for each nodes - >>> hostfile='./hostfile', - >>> deepspeed_config='./deepspeed.json', - >>> model_parallel_size=2, # mp_size ==1 - >>> training_script=__file__) + >>> hostfile='./hostfile', + >>> deepspeed_config='./deepspeed.json', + >>> model_parallel_size=2, # mp_size ==1 + >>> training_script=__file__) """ - def __init__( self, timers=None, @@ -151,8 +150,7 @@ def __init__( else: self.timers = Timers() self.env_type = env_type - if env_type not in set( - ["deepspeed", 'pytorch', 'pytorchDDP', 'deepspeed+mpu']): + if env_type not in set(["deepspeed", 'pytorch', 'pytorchDDP', 'deepspeed+mpu']): raise Exception("Not supported env_type!!!!") os.environ["ENV_TYPE"] = env_type self.experiment_name = experiment_name @@ -284,7 +282,7 @@ def initialize_distributed(self): def get_dataloader(self, dataset, collate_fn, shuffle=False): """ initilize the dataloader""" - if dataset == None: + if dataset is None: return None if self.env_type == 'pytorch': return torch.utils.data.DataLoader(dataset, @@ -332,7 +330,7 @@ def train(self, deepspeed: single-/multi- node gpu deepspeed+mpu: single-/multi- node gpu train_dataset (`torch.utils.data.Dataset` or `torch.utils.data.DataLoader`, *optional*): - The dataset to use for training. + The dataset to use for training. If it is an `Dataset`, we will create a `DataLoader` with the provided `Dataset` and `collate_fn' for the selected `env_type`. `Dataset` is prefred to iterally return a sample as followings, >>> {'text': 'I like big model.', 'label': 'positive'} @@ -349,7 +347,6 @@ def train(self, [`AdamW`] on your model. lr_scheduler (`torch.optim.lr_scheduler`, *optional*): A lr_scheduler to use. Will default to an instance of [`AnnealingLR`]. - """ if not isinstance(train_dataset, torch.utils.data.DataLoader): train_dataloader = self.get_dataloader(train_dataset, collate_fn, @@ -385,7 +382,7 @@ def train(self, # for T5 Model param_groups = param_groups[0]['params'] - if optimizer == None: + if optimizer is None: optimizer = get_optimizer(param_groups=param_groups, lr=self.lr, weight_decay=self.weight_decay, @@ -416,7 +413,7 @@ def train(self, # Tracking loss. total_lm_loss = 0.0 self.iteration = 0 - best_score, best_iteration = 0, None + best_iteration = 0 # For each remaining epoch self.timers('interval time').start() # self.eval_metrics = eval_metrics @@ -444,9 +441,6 @@ def train(self, # For all the batches in the dataset. for iteration_, batch in enumerate(train_dataloader): # Train for one step. - - meta = batch.get('meta', None) - if 'deepspeed' in self.env_type or self.env_type == 'pytorchDDP': batch = { x: batch[x].to(torch.device('cuda', self.local_rank)) @@ -468,7 +462,7 @@ def train(self, # Logging. if (self.iteration + 1) % self.log_interval == 0: - if optimizer != None: + if optimizer is not None: learning_rate = optimizer.param_groups[0]['lr'] else: learning_rate = model.optimizer.param_groups[0]['lr'] diff --git a/flagai/utils.py b/flagai/utils.py index 33dff5e1..2af09f3f 100644 --- a/flagai/utils.py +++ b/flagai/utils.py @@ -14,7 +14,7 @@ # limitations under the License. 
"""Utilities for logging and serialization""" -from cmath import log + from distutils.command.config import config import os import random @@ -112,10 +112,8 @@ def print_params_min_max_norm(optimizer, iteration): class Timers: """Group of timers.""" - class Timer: """Timer.""" - def __init__(self, name): self.name_ = name self.elapsed_ = 0.0 diff --git a/prepare_test.sh b/prepare_test.sh index ae1f7668..5941f419 100644 --- a/prepare_test.sh +++ b/prepare_test.sh @@ -3,5 +3,5 @@ cd checkpoints unzip checkpoints.zip mv checkpoints/* . rm checkpoints -r -rm checkpoints.zip +rm checkpoints.zip diff --git a/quickstart/sm_ch.py b/quickstart/sm_ch.py index 84bb9128..925b8c4a 100644 --- a/quickstart/sm_ch.py +++ b/quickstart/sm_ch.py @@ -17,4 +17,4 @@ ["我觉得你很聪明", "你聪明我是这么觉得"]] for text_pair in test_data: - print(text_pair, "相似" if predictor.predict_cls_classifier(text_pair) ==1 else "不相似") + print(text_pair, "相似" if predictor.predict_cls_classifier(text_pair) == 1 else "不相似") diff --git a/quickstart/title_ch.py b/quickstart/title_ch.py index a7033107..98eb0d36 100644 --- a/quickstart/title_ch.py +++ b/quickstart/title_ch.py @@ -25,4 +25,3 @@ predictor.predict_generate_beamsearch(text, out_max_length=50, beam_size=3)) - \ No newline at end of file diff --git a/quickstart/title_en.py b/quickstart/title_en.py index e2cd01b2..712c5c45 100644 --- a/quickstart/title_en.py +++ b/quickstart/title_en.py @@ -19,8 +19,4 @@ ] for text in test_data: - print( - predictor.predict_generate_beamsearch(text, - out_max_length=50, - beam_size=3)) - \ No newline at end of file + print(predictor.predict_generate_beamsearch(text, out_max_length=50, beam_size=3)) diff --git a/quickstart/writing_ch.py b/quickstart/writing_ch.py index ec3d45d3..527ce5c3 100644 --- a/quickstart/writing_ch.py +++ b/quickstart/writing_ch.py @@ -12,6 +12,6 @@ text = "今天天气不错" output = predictor.predict_generate_randomsample(text, input_max_length=512, out_max_length=100, - repetition_penalty=1.5, top_k=20, top_p=0.8) + repetition_penalty=1.5, top_k=20, top_p=0.8) - print(f"out is {output}") \ No newline at end of file + print(f"out is {output}") diff --git a/tests/test_autoloader.py b/tests/bak_test_autoloader.py similarity index 100% rename from tests/test_autoloader.py rename to tests/bak_test_autoloader.py diff --git a/tests/test_glm_seq2seq.py b/tests/bak_test_glm_seq2seq.py similarity index 100% rename from tests/test_glm_seq2seq.py rename to tests/bak_test_glm_seq2seq.py diff --git a/tests/test_glm_superclue.py b/tests/bak_test_glm_superclue.py similarity index 99% rename from tests/test_glm_superclue.py rename to tests/bak_test_glm_superclue.py index 142de551..9777d854 100644 --- a/tests/test_glm_superclue.py +++ b/tests/bak_test_glm_superclue.py @@ -79,7 +79,7 @@ def test_init_trainer_pytorch(self): target_ids torch.Size([16, 2, 256]) logit_mask torch.Size([16, 2, 256]) loss_mask torch.Size([16, 2]) - + ''' for key, value in dct.items(): try: diff --git a/tests/test_bert.py b/tests/test_bert.py new file mode 100644 index 00000000..6ee9c488 --- /dev/null +++ b/tests/test_bert.py @@ -0,0 +1,56 @@ +from flagai.auto_model.auto_loader import AutoLoader +from flagai.model.predictor.predictor import Predictor +import torch +from flagai.model.bert_model import BertModel, BertForSeq2seq, \ + BertForSequenceLabeling, \ + BertForSequenceLabelingGP, \ + BertForClsClassifier, \ + BertForSequenceLabelingCRF +from flagai.data.tokenizer.bert.bert_tokenizer import BertTokenizer +import unittest + +class BertTestCase(unittest.TestCase): + + def 
setUp(self) -> None: + + self.models = [BertForClsClassifier, + BertForSeq2seq, + BertForSequenceLabeling, + BertForSequenceLabelingGP, + BertForSequenceLabelingCRF] + self.model_name = "RoBERTa-wwm-ext" + self.bert_path = "./checkpoints/RoBERTa-wwm-ext/config.json" + self.tokenizer = BertTokenizer("./checkpoints/RoBERTa-wwm-ext/vocab.txt") + + print("loading bert model successfully!") + + def test_model_predict(self): + + for model in self.models: + model = model.init_from_json(self.bert_path, class_num=3, inner_dim=64) + class_name = type(model).__name__.lower() + device = torch.device("cuda" if torch.cuda.is_available() else "cpu") + model.to(device) + predictor = Predictor(model, self.tokenizer) + text = "今天吃饭吃了肯德基" + if "seq2seq" in class_name: + output = predictor.predict_generate_beamsearch(text, out_max_length=20) + output = predictor.predict_generate_randomsample(text, out_max_length=20) + elif "cls" in class_name: + output = predictor.predict_cls_classifier(text) + elif "sequencelabeling" in class_name: + output = predictor.predict_ner(text, target=["0", "1", "2"]) + else: + output = None + print(f"model_name is {class_name}, output is {output}") + + +def suite(): + suite = unittest.TestSuite() + suite.addTest(BertTestCase('test_model_predict')) + return suite + + +if __name__ == '__main__': + runner = unittest.TextTestRunner() + runner.run(suite()) diff --git a/tests/test_glm_large_ch.py b/tests/test_glm_large_ch.py new file mode 100644 index 00000000..cc3334f4 --- /dev/null +++ b/tests/test_glm_large_ch.py @@ -0,0 +1,38 @@ +from flagai.model.predictor.predictor import Predictor +import torch +from flagai.model.glm_model import GLMForSeq2Seq +from flagai.data.tokenizer.glm_large_ch.glm_large_ch_tokenizer import GLMLargeChTokenizer +import unittest + +class GLMLargeChTestCase(unittest.TestCase): + + def setUp(self) -> None: + + self.model = GLMForSeq2Seq.init_from_json("./checkpoints/glm_large_ch/config.json") + self.tokenizer = GLMLargeChTokenizer("./checkpoints/glm_large_ch/cog-pretrain.model") + print("loading glm model successfully!") + + def test_model_predict(self): + model = self.model + tokenizer = self.tokenizer + device = torch.device("cuda" if torch.cuda.is_available() else "cpu") + model.to(device) + + predictor = Predictor(model, tokenizer) + text = "今天吃饭吃了肯德基" + + output_beam_search = predictor.predict_generate_beamsearch(text, out_max_length=20) + output_randomsample = predictor.predict_generate_randomsample(text, out_max_length=20) + + print(f"output_beamsearch is {output_beam_search}") + print(f"output_randomsample is {output_randomsample}") + +def suite(): + suite = unittest.TestSuite() + suite.addTest(GLMLargeChTestCase('test_model_predict')) + return suite + + +if __name__ == '__main__': + runner = unittest.TextTestRunner() + runner.run(suite()) diff --git a/tests/test_gpt2_ch.py b/tests/test_gpt2_ch.py new file mode 100644 index 00000000..1557c506 --- /dev/null +++ b/tests/test_gpt2_ch.py @@ -0,0 +1,32 @@ +from flagai.model.gpt2_model import GPT2Model +from flagai.data.tokenizer.bert.bert_tokenizer import BertTokenizer +import torch +import unittest +import os + +class GPT2TestCase(unittest.TestCase): + def setUp(self) -> None: + + self.model = GPT2Model.init_from_json("./checkpoints/gpt2_base_chinese/config.json") + self.tokenizer = BertTokenizer("./checkpoints/gpt2_base_chinese/vocab.txt") + + print("loading model successfully!") + + def test_model_predict(self): + input_ids = self.tokenizer.encode_plus("今天吃饭吃了肯德基")["input_ids"] + + input_ids = 
torch.LongTensor([input_ids]) + + output = self.model(input_ids=input_ids) + print(output) + + +def suite(): + suite = unittest.TestSuite() + suite.addTest(GPT2TestCase('test_model_predict')) + return suite + + +if __name__ == '__main__': + runner = unittest.TextTestRunner() + runner.run(suite()) \ No newline at end of file diff --git a/tests/test_t5_ch.py b/tests/test_t5_ch.py new file mode 100644 index 00000000..dcbe943c --- /dev/null +++ b/tests/test_t5_ch.py @@ -0,0 +1,32 @@ +from flagai.model.t5_model import T5Model +from flagai.data.tokenizer.t5.t5_pegasus_tokenizer import T5PegasusTokenizer +from flagai.model.predictor.predictor import Predictor +import unittest + +class T5TestCase(unittest.TestCase): + def setUp(self) -> None: + + self.model = T5Model.init_from_json("./checkpoints/t5_base_ch/config.json") + self.tokenizer = T5PegasusTokenizer("./checkpoints/t5_base_ch/vocab.txt") + + print("loading model successfully!") + + def test_model_predict(self): + text = "一辆小轿车一名女司机竟造成9死24伤日前深圳市交警局对事故进行通报:" \ + "从目前证据看事故系司机超速行驶且操作不当导致目前24名伤员已有6名治愈出院其余正接受治疗预计事故赔偿费或超一千万元" + predictor = Predictor(self.model, self.tokenizer) + out = predictor.predict_generate_randomsample(text, top_k=30, top_p=0.8, repetition_penalty=1.5, temperature=1.0, out_max_length=10) + out_2 = predictor.predict_generate_beamsearch(text, beam_size=2, out_max_length=10) + print(out) + print(out_2) + + +def suite(): + suite = unittest.TestSuite() + suite.addTest(T5TestCase('test_model_predict')) + return suite + + +if __name__ == '__main__': + runner = unittest.TextTestRunner() + runner.run(suite()) \ No newline at end of file diff --git a/tests/test_tokenizer.py b/tests/test_tokenizer.py index 9f92f0c0..63b2c4b1 100644 --- a/tests/test_tokenizer.py +++ b/tests/test_tokenizer.py @@ -9,8 +9,9 @@ class TokenizerTestCase(unittest.TestCase): - def test_tokenizer_glm_chinesesp(self): - tokenizer = GLMLargeChTokenizer(fix_command_token=True, + def test_tokenizer_glm_large_ch(self): + tokenizer = GLMLargeChTokenizer("./checkpoints/glm_large_ch/cog-pretrain.model", + fix_command_token=True, add_block_symbols=True, add_task_mask=False, add_decoder_mask=False) @@ -21,16 +22,16 @@ def test_tokenizer_glm_chinesesp(self): self.assertEqual(tokenizer.DecodeIds([3378, 1567, 2613, 20282]), '今天吃饭吃了肯德基', 'DecodeIds Error') - def test_tokenizer_glm_bertwordpiece(self): - tokenizer = GLMLargeEnWordPieceTokenizer() - print(tokenizer.EncodeAsIds("today is a nice day and")) - self.assertEqual(tokenizer.TokenToId("day"), 2154, '') - self.assertEqual(tokenizer.EncodeAsIds("fried chicken makes me happy"), - [13017, 7975, 3084, 2033, 3407], '') - self.assertEqual(tokenizer.DecodeIds([13017, 7975, 3084, 2033, 3407]), - 'fried chicken makes me happy', 'DecodeIds Error') + # def test_tokenizer_glm_large_en(self): + # tokenizer = GLMLargeEnWordPieceTokenizer() + # print(tokenizer.EncodeAsIds("today is a nice day and")) + # self.assertEqual(tokenizer.TokenToId("day"), 2154, '') + # self.assertEqual(tokenizer.EncodeAsIds("fried chicken makes me happy"), + # [13017, 7975, 3084, 2033, 3407], '') + # self.assertEqual(tokenizer.DecodeIds([13017, 7975, 3084, 2033, 3407]), + # 'fried chicken makes me happy', 'DecodeIds Error') - def test_tokenizer_glm_gpt2(self): + def test_tokenizer_glm_10b_en(self): tokenizer = GLM10bENBPETokenizer() self.assertEqual(tokenizer.TokenToId("day"), 820, '') self.assertEqual(tokenizer.EncodeAsIds("fried chicken makes me happy"), @@ -66,9 +67,9 @@ def test_tokenizer_bert(self): def suite(): suite = unittest.TestSuite() - 
suite.addTest(TokenizerTestCase('test_tokenizer_glm_chinesesp')) - suite.addTest(TokenizerTestCase('test_tokenizer_glm_bertwordpiece')) - suite.addTest(TokenizerTestCase('test_tokenizer_glm_gpt2')) + suite.addTest(TokenizerTestCase('test_tokenizer_glm_large_ch')) + #suite.addTest(TokenizerTestCase('test_tokenizer_glm_large_en')) + suite.addTest(TokenizerTestCase('test_tokenizer_glm_10b_en')) suite.addTest(TokenizerTestCase('test_tokenizer_t5')) suite.addTest(TokenizerTestCase('test_tokenizer_roberta')) suite.addTest(TokenizerTestCase('test_tokenizer_bert'))