From 6ccb04789265d2f8eec7409fe85a0038dc5c391c Mon Sep 17 00:00:00 2001
From: gongel <gongel@qq.com>
Date: Tue, 27 Apr 2021 21:23:06 +0800
Subject: [PATCH] Fix bug and update docs (#315)

* feat: add text_simultaneous_translation: STACL

* refactor: update requirement and readme, rename function/class name

* feat: add tokenizer

* refactor: refactor vocab,data_reader

* refactor: rename TransformerModel, fix src_slf_attn_bias

* docs: update README and requirements

* fix: avoid getting the whole source in advance

* docs: update docs

* refactor: refactor tokenizer

* docs: update pretrained model BLEU

* docs: update docs

* style: change code style(line too long)

* docs: update information_extraction link

* docs: add pretrained models download link

* docs: add STACL to example_readme;enrich parameters description

* fix: fix the wrong shape of validation label

* docs: set default parameters to yaml

Co-authored-by: Guo Sheng <whucsgs@163.com>
---
 examples/README.md                            |  3 +-
 .../text_simultaneous_translation/README.md   | 29 +++++++++++++++++--
 .../config/transformer.yaml                   |  4 +--
 .../text_simultaneous_translation/train.py    |  1 -
 4 files changed, 30 insertions(+), 7 deletions(-)
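
For reviewers unfamiliar with the `waitk` parameter documented in this patch, here is a minimal sketch of how a wait-k policy bounds the source prefix visible at each decoding step. It follows the schedule g(t) = min(k + t - 1, |x|) from the STACL paper. The function name `visible_source_len` is hypothetical and does not exist in this repository. Treating `waitk=-1` as the full-sentence model follows the README text in this patch.

```python
def visible_source_len(waitk: int, step: int, src_len: int) -> int:
    """Return how many source tokens are visible when emitting target token `step`.

    `step` is 1-indexed. `waitk == -1` is assumed to mean the full-sentence
    model (the whole source is visible), as the README describes it.
    """
    if step < 1:
        raise ValueError("step is 1-indexed and must be >= 1")
    if src_len < 0:
        raise ValueError("src_len must be non-negative")
    if waitk == -1:
        # Full-sentence model: the decoder always sees the entire source.
        return src_len
    if waitk < 1:
        raise ValueError("waitk must be -1 or a positive integer")
    # Wait-k: stay k tokens behind the source, capped at the source length.
    return min(waitk + step - 1, src_len)


# Illustrative schedule for a wait-3 model on a 5-token source sentence.
schedule = [visible_source_len(3, t, 5) for t in range(1, 6)]
print(schedule)  # [3, 4, 5, 5, 5]
```

With `waitk=3`, the first target token is produced after reading three source tokens, and each later target token sees one more source token until the source is exhausted.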

diff --git a/examples/README.md b/examples/README.md
index 02c67bcf905316..340cd76903b2ea 100644
--- a/examples/README.md
+++ b/examples/README.md
@@ -106,12 +106,13 @@ PaddleNLP 提供了多种成熟的预训练模型技术，适用于自然语言
 ## NLP System Applications
 
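 ### Machine Translation
-Machine translation is a branch of computational linguistics and one of the ultimate goals of artificial intelligence, and it has significant scientific research value. For machine translation, two families of models are provided: the traditional Sequence to Sequence (Seq2Seq) models, which encode and decode with RNN-style networks, and Transformer models, which use Self-Attention to improve the Encoder and Decoder. For details on the Transformer, see the paper [Attention Is All You Need](https://arxiv.org/abs/1706.03762). The specific models are listed below.
+Machine translation is a branch of computational linguistics and one of the ultimate goals of artificial intelligence, and it has significant scientific research value. For machine translation, two families of models are provided: the traditional Sequence to Sequence (Seq2Seq) models, which encode and decode with RNN-style networks, and Transformer models, which use Self-Attention to improve the Encoder and Decoder. For details on the Transformer, see the paper [Attention Is All You Need](https://arxiv.org/abs/1706.03762). Simultaneous Translation is also a form of machine translation; it requires translation to begin before the source sentence is complete. STACL is a model proposed for the simultaneous-translation setting, and its Prefix-to-Prefix architecture and Wait-k policy can overcome word-order differences while delivering high translation quality. The specific models are listed below.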
 
 | Model | Description |
 | ------ | ------- |
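 | [Seq2Seq](./machine_translation/seq2seq) | Uses an Encoder-Decoder structure together with an Attention mechanism that strengthens the information exchange between the Decoder and the Encoder. Seq2Seq is widely used in machine translation, dialogue bots, automatic document summarization, image captioning, and similar tasks. |
 | [Transformer](./machine_translation/transformer) | A machine translation model built on the Transformer architecture in the PaddlePaddle framework. Transformer is highly parallelizable and can learn long-range dependencies. The model framework integrates training, validation, and prediction, and delivers strong results. |
+| [STACL](./machine_translation/text_simultaneous_translation) | A PaddlePaddle implementation of the STACL simultaneous translation model, based on the Transformer network architecture. STACL is a translation model proposed for simultaneous translation and applicable to all simultaneous-translation scenarios; it can produce a target word before the corresponding source word has been seen while maintaining high translation quality. |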
 
 
 ### Machine Reading Comprehension
diff --git a/examples/machine_translation/text_simultaneous_translation/README.md b/examples/machine_translation/text_simultaneous_translation/README.md
index 338a382f96e7f7..a6f54645b9e93f 100644
--- a/examples/machine_translation/text_simultaneous_translation/README.md
+++ b/examples/machine_translation/text_simultaneous_translation/README.md
@@ -79,7 +79,25 @@ python -m paddle.distributed.launch --gpus "0" train.py --config ./config/transf
 
 You can set the relevant parameters in `config/transformer.yaml`. If `--config` is not provided, the program uses `config/transformer.yaml` by default.
 
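-Tip: for better results, first pretrain a full-sentence model (i.e. waitk=-1), then fine-tune from it with different waitk values to train the corresponding wait-k models.
+Tip: for better results, first pretrain a full-sentence model (i.e. `waitk=-1`), then fine-tune from it with different waitk values to train the corresponding wait-k models. The training command is the same as above in both cases; the workflow and the main parameter settings are as follows:
+- Pretrain
+  Pretrain trains the full-sentence model (i.e. `waitk=-1`). Configure the following parameters in `config/transformer.yaml`:
+  - `waitk`: the wait-k policy; set to -1 here
+  - `training_file`: the training set, in the data format described above
+  - `validation_file`: the validation set, in the data format described above
+  - `init_from_checkpoint`: the model directory; training resumes from this checkpoint; left empty here
+  - `init_from_pretrain_model`: the model directory; fine-tuning of the downstream task starts from this checkpoint; left empty here
+  - `use_cuda`: whether to use the GPU; set to True in this example
+  - `use_amp`: whether to use mixed-precision training; set to False in this example
+- Finetune
+  Finetune trains the wait-k models (i.e. `waitk=1,2,3,4...`). Configure the following parameters in `config/transformer.yaml`:
+  - `waitk`: the wait-k policy; set to 3 here (taking the wait-3 model as an example)
+  - `training_file`: the training set, in the data format described above
+  - `validation_file`: the validation set, in the data format described above
+  - `init_from_checkpoint`: the model directory; training resumes from this checkpoint; set to the checkpoint of the `waitk=-1` model here
+  - `init_from_pretrain_model`: the model directory; fine-tuning of the downstream task starts from this checkpoint; left empty here
+  - `use_cuda`: whether to use the GPU; set to True in this example
+  - `use_amp`: whether to use mixed-precision training; set to False in this example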
 ## Model Inference
 
 After training completes, run the following command to translate the text in the specified file:
@@ -89,8 +107,13 @@ python -m paddle.distributed.launch --gpus "0" train.py --config ./config/transf
 export CUDA_VISIBLE_DEVICES=0
 python predict.py --config ./config/transformer.yaml
 ```
-
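-Translation results are written to the file specified by `output_file`. When running prediction, set `init_from_params` to the directory containing the model. For more parameters, see the comments in `config/transformer.yaml` and adjust the settings as needed. If `--config` is not provided, the program uses `config/transformer.yaml` by default.
+- Predict
+  Translation follows the chosen wait-k policy. Configure the parameters in `config/transformer.yaml`; the prediction command is the same as above. The main parameters are:
+  - `waitk`: the wait-k policy; set to 3 here (taking the wait-3 model as an example)
+  - `predict_file`: the test set; the data is source-language text after BPE tokenization (Jieba + BPE for Chinese), one entry per line
+  - `output_file`: the output file; translation results are written to the file specified by this parameter
+  - `init_from_params`: the directory containing the model, chosen according to the `waitk` value; set to the `waitk=3` model directory here
+  - For more parameters, see the comments in `config/transformer.yaml` and adjust the settings as needed. If `--config` is not provided, the program uses `config/transformer.yaml` by default.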
 
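 Note that prediction currently supports a single GPU only. The model evaluation that follows translation depends on the order in which predictions are written to the file, and writing results in a specified order is not yet supported in the multi-GPU case.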
 
diff --git a/examples/machine_translation/text_simultaneous_translation/config/transformer.yaml b/examples/machine_translation/text_simultaneous_translation/config/transformer.yaml
index 465089d17fcee5..df1737e240bb8f 100644
--- a/examples/machine_translation/text_simultaneous_translation/config/transformer.yaml
+++ b/examples/machine_translation/text_simultaneous_translation/config/transformer.yaml
@@ -21,9 +21,9 @@ predict_file: "data/nist2m/testdata/test_08.zh"
 # The file to output the translation results of predict_file to.
 output_file: "predict.txt"
 # The path of vocabulary file of source language.
-src_vocab_fpath: "data/nist2m/vocab_zh.20k.bpe"
+src_vocab_fpath: "data/nist2m/nist.20k.zh.vocab"
 # The path of vocabulary file of target language.
-trg_vocab_fpath: "data/nist2m/vocab_en.10k.bpe"
+trg_vocab_fpath: "data/nist2m/nist.10k.en.vocab"
 # The <bos>, <eos> and <unk> tokens in the dictionary.
 special_token: ["<s>", "<e>", "<unk>"]
 
diff --git a/examples/machine_translation/text_simultaneous_translation/train.py b/examples/machine_translation/text_simultaneous_translation/train.py
index c17b6213201be9..8f6df28d592901 100644
--- a/examples/machine_translation/text_simultaneous_translation/train.py
+++ b/examples/machine_translation/text_simultaneous_translation/train.py
@@ -189,7 +189,6 @@ def do_train(args):
                 with paddle.no_grad():
                     for input_data in eval_loader:
                         (src_word, trg_word, lbl_word) = input_data
-                        lbl_word = paddle.reshape(lbl_word, shape=[-1, 1])
                         logits = transformer(
                             src_word=src_word, trg_word=trg_word)
                         sum_cost, avg_cost, token_num = criterion(logits,