fix bugs in code style with flake8, fix bugs in testing #1

Merged · 4 commits · May 16, 2022
26 changes: 26 additions & 0 deletions .gitignore
@@ -0,0 +1,26 @@
__pycache__
.idea/
logs/
test_tokenizer.py
samples_text2image/
generate_contexts/
venv/
*__pycache__
.DS_Store
.vscode
*.swo
*.swp
*log
build
dist
eazybigmodel.egg-info
flagai.egg-info
test_report
/data/
/tests/*/data
checkpoints
state_dict
checkpoints*
vocabs
tensorboard*
datasets
12 changes: 6 additions & 6 deletions CONTRIBUTING.md
@@ -1,21 +1,21 @@
# Contributing to FlagAI

We are happy to accept your contributions to make `FlagAI` better and more awesome! To avoid unnecessary work on either
side, please stick to the following process:

1. Check if there is already [an issue](https://github.com/BAAI-WuDao/Sailing/issues) for your concern.
2. If there is not, open a new one to start a discussion. We hate to close finished PRs!
3. If we decide your concern needs code changes, we would be happy to accept a pull request. Please consider the
commit guidelines below.


## Git Commit Guidelines

If there is already a ticket, use this number at the start of your commit message.
Use meaningful commit messages that describe what you did.

**Example:** `GH-42: Added new type of embeddings: DocumentEmbedding.`
**Example:** `ISSUE#123: Fix typo in README.`


## Developing locally
@@ -42,7 +42,7 @@ To run all basic tests execute:
```
python test.py
```

To check the test results, look in
```
tests/test_report
```
8 changes: 4 additions & 4 deletions README.md
@@ -1,5 +1,5 @@
![FlagAI](logo.png)
[![CII Best Practices](https://bestpractices.coreinfrastructure.org/projects/6052/badge)](https://bestpractices.coreinfrastructure.org/projects/6052)
[简体中文](README_zh.md)

--------------------------------------------------------------------------------
@@ -8,7 +8,7 @@
FlagAI aims to help researchers and developers to freely train and test large-scale models for NLP tasks.
<br><br>

- * Now it supports GLM, Bert, RoBerta, GPT2, T5 models and models from Huggingface Transformers.
+ * Now it supports GLM, Bert, RoBerta, GPT2, T5 and models from Huggingface Transformers.

* It provides APIs to quickly download and use those pretrained models on a given text, fine-tune them on your own datasets and then share them with the community on our model hub.

@@ -76,7 +76,7 @@ Host 127.0.0.1

## Quick Start
We provide many models which are trained to perform different tasks. You can load these models with AutoLoader to make predictions.
## Load model and tokenizer
We provide the AutoLoader class to load the model and tokenizer quickly, for example:
```python
from flagai.auto_model.auto_loader import AutoLoader
# @@ -93,7 +93,7 @@ (lines collapsed in the diff view)
tokenizer = auto_loader.get_tokenizer()
```
This example is for the `title_generation` task; you can run other tasks by modifying the `task_name`. You can then use the model and tokenizer to fine-tune or test. A fuller sketch of the loading flow follows below.
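Here is a minimal sketch of the full loading flow; the `model_name` value and the `get_model()` call are assumptions for illustration, not confirmed by this diff:

```python
from flagai.auto_model.auto_loader import AutoLoader

# task_name follows the example above; model_name is an assumed value
auto_loader = AutoLoader(
    task_name="title_generation",
    model_name="RoBERTa-wwm-ext",  # assumption: any supported encoder model
)
model = auto_loader.get_model()          # assumed by analogy with get_tokenizer()
tokenizer = auto_loader.get_tokenizer()  # shown in the diff above
```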

## Predictor
We provide the `Predictor` class to predict for different tasks, for example:

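The example code is collapsed in this diff view; below is a minimal sketch of what such a call could look like. The import path and the `predict_generation` method name are assumptions, not confirmed by this diff:

```python
# Both the import path and predict_generation are assumptions for illustration.
from flagai.model.predictor.predictor import Predictor

# model and tokenizer come from the AutoLoader example above
predictor = Predictor(model, tokenizer)
title = predictor.predict_generation("FlagAI aims to help researchers train large-scale models.")
print(title)
```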
8 changes: 4 additions & 4 deletions README_zh.md
@@ -1,8 +1,8 @@
![FlagAI](logo.png)
[![CII Best Practices](https://bestpractices.coreinfrastructure.org/projects/6052/badge)](https://bestpractices.coreinfrastructure.org/projects/6052)
[English](README.md)

--------------------------------------------------------------------------------

FlagAI aims to help researchers and developers freely train and test large-scale models for NLP tasks.
<br><br>
@@ -34,7 +34,7 @@ FlagAI aims to help researchers and developers freely train and test large-scale models for NLP
- [License](#许可证)

<!-- tocstop -->
# Installation
* PyTorch version >= 1.8.0
* Python version >= 3.8
* For training and testing on GPUs, you also need to install CUDA and NCCL (a quick environment check is sketched below)
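
As a quick sanity check of these requirements, something like the following could be run (a sketch; assumes `torch` and `packaging` are already installed):

```python
import sys

import torch
from packaging import version  # commonly installed alongside pip/setuptools

assert sys.version_info >= (3, 8), "Python >= 3.8 is required"
assert version.parse(torch.__version__) >= version.parse("1.8.0"), \
    "PyTorch >= 1.8.0 is required"
print("CUDA available:", torch.cuda.is_available())  # needed for GPU training
```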
@@ -248,7 +248,7 @@ for text in test_data:
* [All supported tasks](docs/AllSupportedTasks.md)


This section explains how the basic NLP classes in this project work, how to load pretrained models to tokenize your text, how to obtain representations with different word or document embeddings, and how to train your own language models, sequence-labeling models, and text-classification models.


# Tutorials
2 changes: 1 addition & 1 deletion SUPPORT.md
@@ -3,7 +3,7 @@ _Use this file to detail how to get support for the project. The structure b
# Getting Help

There are a few ways to connect with the PROJECT NAME project:
* https://github.com/BAAI-WuDao/Sailing/issues

## How to Ask for Help

8 changes: 4 additions & 4 deletions doc_zh/APPENDIX_GLM_IO.md
@@ -2,16 +2,16 @@

a) As shown in the figure below, the original text contains 6 tokens, and two spans are masked: the first span contains the 3rd token, and the second span contains the 5th and 6th tokens.

![results1](img/glm_io_1.png)

b) The input is split into two parts: Part A (the text with the masked spans covered) and Part B (the masked spans). Note that the order of all masked spans is shuffled.

![results1](img/glm_io_2.png)

c) GLM's input and output; the input consists of the tokens and two positional encodings.

![results1](img/glm_io_3.png)

d) The self-attention mechanism shown below achieves autoencoding by masking the text, and autoregression while predicting the text within the masked spans.

![results1](img/glm_io_4.png)
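
To make the span-masking scheme in (a)-(d) concrete, here is a small illustrative sketch (not FlagAI code; the `[MASK]` and `[S]` markers are assumptions for illustration):

```python
import random

tokens = ["x1", "x2", "x3", "x4", "x5", "x6"]
spans = [(2, 3), (4, 6)]  # half-open index ranges: x3, and x5+x6

# Part A: the original text with each masked span replaced by a [MASK] token
part_a, prev = [], 0
for start, end in spans:
    part_a += tokens[prev:start] + ["[MASK]"]
    prev = end
part_a += tokens[prev:]

# Part B: the masked spans in shuffled order, each led by a start marker [S]
shuffled = random.sample(spans, len(spans))
part_b = [tok for start, end in shuffled for tok in ["[S]"] + tokens[start:end]]

print(part_a)  # ['x1', 'x2', '[MASK]', 'x4', '[MASK]']
print(part_b)  # e.g. ['[S]', 'x5', 'x6', '[S]', 'x3']
```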
4 changes: 2 additions & 2 deletions doc_zh/AllSupportedTasks.md
@@ -30,11 +30,11 @@ tokenizer = auto_loader.get_tokenizer()
All supported models can be found in the **model hub**.
Different models are suited to different tasks.

#### Transformer encoder:

例如 model_name="bert-base-chinese" or "RoBERTa-wwm-ext"时, 这些模型支持上一节中提到的所有任务

#### Transformer decoder:

例如 model_name="gpt2-chinese"时, 模型支持 "seq2seq" 任务.

48 changes: 24 additions & 24 deletions doc_zh/EnvironmentSetup.md
@@ -1,6 +1,6 @@
# Setting up a distributed environment for pretraining

# I. Docker

## 1. Install Docker

@@ -117,7 +117,7 @@ systemctl restart docker

```dockerfile
#pull base image
FROM nvidia/cuda:10.2-devel-ubuntu18.04
#maintainer
MAINTAINER deepspeed <gqwang@baai.ac.cn>

# @@ -135,7 +135,7 @@ (lines collapsed in the diff view)
RUN mkdir -p ${STAGE_DIR}
#Installation/Basic Utilities
##############################################################################
RUN sed -i s@/archive.ubuntu.com/@/mirrors.tuna.tsinghua.edu.cn/@g /etc/apt/sources.list
RUN sed -i s@/security.ubuntu.com/@/mirrors.tuna.tsinghua.edu.cn/@g /etc/apt/sources.list
RUN apt-get update && \
DEBIAN_FRONTEND="noninteractive" apt-get install -y --no-install-recommends \
software-properties-common build-essential autotools-dev \
Expand All @@ -149,8 +149,8 @@ RUN apt-get update && \
libjpeg-dev \
libpng-dev \
screen jq psmisc dnsutils lsof musl-dev systemd
```

### c. Install the latest git (used to clone packages when building the image)

```dockerfile
# @@ -178,7 +178,7 @@ (lines collapsed in the diff view)
RUN cd ${STAGE_DIR} && \
PATH=/usr/bin:$PATH ./mlnxofedinstall --user-space-only --without-fw-update --umad-dev-rw --all -q && \
cd ${STAGE_DIR} && \
rm -rf ${STAGE_DIR}/MLNX_OFED_LINUX-${MLNX_OFED_VERSION}-ubuntu18.04-x86_64*
```

### e. Install nv_peer_mem

```dockerfile
# @@ -200,7 +200,7 @@ (lines collapsed in the diff view)
RUN cd ${STAGE_DIR}/nv_peer_memory && \
apt-get update && \
apt-get install -y dkms && \
dpkg-buildpackage -us -uc && \
dpkg -i ${STAGE_DIR}/nvidia-peer-memory_${NV_PEER_MEM_TAG}_all.deb
```

### f. Install OpenMPI (install the libevent dependency first)
```dockerfile
# @@ -237,7 +237,7 @@ (lines collapsed in the diff view)
RUN mv /usr/local/mpi/bin/mpirun /usr/local/mpi/bin/mpirun.real && \
echo 'mpirun.real --allow-run-as-root --prefix /usr/local/mpi "$@"' >> /usr/local/mpi/bin/mpirun && \
chmod a+x /usr/local/mpi/bin/mpirun
```

### g. Install Python

```dockerfile
# @@ -249,24 +249,24 @@ (lines collapsed in the diff view)
RUN curl -o ~/miniconda.sh https://mirrors.tuna.tsinghua.edu.cn/anaconda/minicon
chmod +x ~/miniconda.sh && \
~/miniconda.sh -b -p /opt/conda && \
rm ~/miniconda.sh && \
/opt/conda/bin/conda install -y python=$PYTHON_VERSION numpy pyyaml scipy ipython mkl mkl-include ninja cython typing
```

### h. Install magma-cuda

```dockerfile
###########################################################################
#Install magma-cuda
##############################################################################
COPY magma-cuda102-2.5.2-1.tar.bz2 ${STAGE_DIR}
RUN cd ${STAGE_DIR} && \
/opt/conda/bin/conda install -y -c pytorch --use-local magma-cuda102-2.5.2-1.tar.bz2 && \
/opt/conda/bin/conda clean -ya
####optional#####
#RUN /opt/conda/bin/conda install -y -c pytorch magma-cuda102 && \
#/opt/conda/bin/conda clean -ya
```

### i. Configure paths

```dockerfile
# @@ -278,7 +278,7 @@ (lines collapsed in the diff view)
RUN echo "export PATH=/opt/conda/bin:\$PATH" >> /root/.bashrc
RUN pip install --upgrade pip setuptools
RUN wget https://tuna.moe/oh-my-tuna/oh-my-tuna.py && python oh-my-tuna.py
```

### j. Install pip packages

```dockerfile
# @@ -314,8 +314,8 @@ (lines collapsed in the diff view)
RUN pip install psutil \
rouge_score \
cupy-cuda102\
setuptools==60.0.3
```

### k. Install mpi4py (download and install locally; pip installation may fail due to version incompatibilities)

```dockerfile
# @@ -327,7 +327,7 @@ (lines collapsed in the diff view)
RUN apt-get update && \
COPY mpi4py-3.1.3.tar.gz ${STAGE_DIR}
RUN cd ${STAGE_DIR} && tar zxvf mpi4py-3.1.3.tar.gz && \
cd mpi4py-3.1.3 &&\
python setup.py build && python setup.py install
```

### l. Install PyTorch (the version can be substituted). Download the project locally first: in mainland China, installation is often interrupted by slow network speeds, and some PyTorch submodules may fail to download during git clone; re-running git clone a few times usually helps.
```dockerfile
# @@ -347,8 +347,8 @@ (lines collapsed in the diff view)
ENV NCCL_LIBRARY=/usr/lib/x86_64-linux-gnu
ENV NCCL_INCLUDE_DIR=/usr/include
RUN cd /opt/pytorch && TORCH_NVCC_FLAGS="-Xfatbin -compress-all" \
CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" USE_SYSTEM_NCCL=1 \
pip install -v . && rm -rf /opt/pytorch



##############################################################################
Expand Down Expand Up @@ -380,7 +380,7 @@ RUN cd ${STAGE_DIR}/apex && pip install -v --no-cache-dir --global-option="--cpp
#Install deepSpeed
#############################################################################
RUN pip install py-cpuinfo
RUN apt-get install -y libaio-dev
ENV TORCH_CUDA_ARCH_LIST="6.0 6.1 7.0+PTX"
RUN git clone https://github.com/microsoft/DeepSpeed.git ${STAGE_DIR}/DeepSpeed
#COPY DeepSpeed ${STAGE_DIR}/DeepSpeed
# @@ -406,7 +406,7 @@ (lines collapsed in the diff view)
RUN python -c "import deepspeed; print(deepspeed.__version__)"
#ENV NCCL_VERSION=2.9.6
```

### p. Configure network ports, public keys, and SSH

```dockerfile
#############################################################################
# @@ -447,7 +447,7 @@ docker load --input deepspeed-cuda102.tar.gz
```
```shell
docker build -f cuda102.dockerfile -t deepspeed/cuda102:1221 .
#cuda102.dockerfile follows the dockerfile build process above
```
# II. Build containers on every machine node

```shell
# @@ -501,9 +501,9 @@ docker rm <container-name>/<container-id>
```

## 2. Sync public keys across nodes

Collect the contents of `~/.ssh/id_rsa.pub` generated in each node's container, and sync them into `~/.ssh/authorized_keys` on every machine.

## 3. Passwordless login
