Anticipation-free Training for Simultaneous Machine Translation

Implementation of the paper Anticipation-free Training for Simultaneous Machine Translation. It is based on fairseq.

Setup

  1. Install fairseq

WARNING: Stick to the specified checkout version to avoid compatibility issues.

git clone https://github.com/pytorch/fairseq.git
cd fairseq
git checkout 8b861be
python setup.py build_ext --inplace
pip install .
  2. (Optional) Install apex for faster mixed precision (fp16) training.
  3. Install dependencies
pip install -r requirements.txt
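
To confirm that fairseq was installed from the pinned checkout, a quick optional sanity check (not required by the scripts) is:

python -c "import fairseq; print(fairseq.__version__)"   # should print the installed fairseq version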

Data Preparation

This section introduces the data preparation for training and evaluation.

First, download the Moses tokenizer:

git clone https://github.com/moses-smt/mosesdecoder.git

CWMT En<->Zh

For CWMT, you need a Kaggle account and API credentials before downloading.

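If the Kaggle API is not yet configured on your machine, a typical setup looks like the following (the source path of the token file is just a placeholder; kaggle.json is downloaded from your Kaggle account page):

pip install kaggle                              # install the Kaggle CLI
mkdir -p ~/.kaggle
cp /path/to/kaggle.json ~/.kaggle/kaggle.json   # place your API token where the CLI expects it
chmod 600 ~/.kaggle/kaggle.json                 # restrict permissions as the CLI requires
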
  1. Set up paths in DATA/get_data_cwmt.sh
DATA_ROOT=/path/to/cwmt                     # set path to store raw and preprocessed data
FAIRSEQ=/path/to/fairseq                    # set path to fairseq root
export PYTHONPATH="$FAIRSEQ:$PYTHONPATH"
SCRIPTS=~/utility/mosesdecoder/scripts      # set path to moses tokenizer root
source ~/envs/apex/bin/activate             # activate your virtual environment if any
  2. Preprocess data with
cd DATA
bash get_data_cwmt.sh

WMT15 De<->En

  • Similarly, preprocess with get_data_wmt15.sh.

Training

The output binarized files should appear under ${DATA_ROOT}/${SRC}-${TGT}/data-bin.
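
A quick way to verify the preprocessing step (the exact file list depends on the dataset, but fairseq-preprocess typically emits dictionaries plus binarized .bin/.idx pairs per split) is:

ls ${DATA_ROOT}/${SRC}-${TGT}/data-bin
# expected, roughly: dict.${SRC}.txt, dict.${TGT}.txt,
#                    train.${SRC}-${TGT}.${SRC}.bin/.idx, train.${SRC}-${TGT}.${TGT}.bin/.idx,
#                    plus valid.* and test.* counterparts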

Configure environment and data path in exp*/data_path.sh before training, for instance:

export SRC=en
export TGT=zh
export DATA=/path/to/cwmt/en-zh/data-bin

FAIRSEQ=/path/to/fairseq                    # set path to fairseq root
USERDIR=`realpath ../simultaneous_translation`
export PYTHONPATH="$FAIRSEQ:$PYTHONPATH"

source ~/envs/fair/bin/activate             # activate your virtual environment if any

Go into either the expcwmt/ or expwmt15/ directory to start training models.

NOTE: We will use CWMT as the example for the rest of the instructions.

Sequence-Level KD

We need a full-sentence model as the teacher for sequence-level KD.

The following command will train the teacher model.

bash 0-distill.sh

To distill the training set, run

bash 0a-decode-distill.sh # generate prediction at ./mt.results/generate-test.txt
bash 0b-create-distill.sh # generate distillation data as 'train_distill_${TGT}' split.

To use the distillation data as the training set, add the command line argument

--train-subset train_distill_${TGT}
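
As a rough illustration of where this argument goes (the remaining flags are placeholders for whatever the training script already passes), a fairseq-train call would look like:

fairseq-train ${DATA} \
    --user-dir ${USERDIR} \
    --train-subset train_distill_${TGT} \
    ...                                     # other arguments unchanged from the script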

Vanilla wait-k

We can now train a vanilla wait-k model as a baseline. To do this, run

DELAY=1
bash 1-vanilla_wait_k.sh ${DELAY}
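
If you want to train models for several delays, you can simply loop over the script (the values below are only an example sweep):

for DELAY in 1 3 5 7 9; do
    bash 1-vanilla_wait_k.sh ${DELAY}
done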

CTC

Train the CTC baseline:

DELAY=1
bash 3-causal_ctc.sh ${DELAY}

CTC + ASN

Train our proposed model:

NOTE: To train from scratch, remove the option --load-pretrained-encoder-from in 2-sinkhorn.sh.

DELAY=1
bash 2-sinkhorn.sh ${DELAY}

Pseudo Reference

The following command will generate pseudo references from the wait-9 model:

bash 5a-decode-monotonic.sh

The predictions will be at monotonic.results/generate-train.txt. Run the following to generate the pseudo reference dataset as the train_monotonic_${TGT} split:

NOTE: Remember to change the paths of DATA_ROOT, FAIRSEQ, and SCRIPTS in 5b-create-monotonic.sh to your own paths.

bash 5b-create-monotonic.sh

To train wait-k / CTC models with the pseudo reference, run

DELAY=1
bash 6a-vanilla_wait_k_monotonic.sh ${DELAY}
bash 6b-causal_ctc_monotonic.sh ${DELAY}

Reorder Baseline

The following command will generate word alignments and reordered targets for the distillation set.

NOTE: Remember to change the path of PREFIX in 7a-word-align.sh to your cwmt/zh-en/ready/distill_${TGT} path.

bash 7a-word-align.sh

The alignments will be at ./alignments.results/distill_${TGT}.${SRC}-${TGT}. Run the following to generate the reordered dataset as the train_reorder_${TGT} split:

NOTE: Remember to change the paths of DATA_ROOT and FAIRSEQ in 7b-create-reorder.sh to your own paths.

bash 7b-create-reorder.sh

To train wait-k / CTC models with the reordered dataset, run

DELAY=1
bash 8a-vanilla_wait_k_reorder.sh ${DELAY}
bash 6b-causal_ctc_reorder.sh ${DELAY}

Inference Stage

See the Inference Instructions.

Citation

If this repository helps you, please cite the paper as:

@inproceedings{chang-etal-2022-anticipation,
    title = "Anticipation-Free Training for Simultaneous Machine Translation",
    author = "Chang, Chih-Chiang  and
      Chuang, Shun-Po  and
      Lee, Hung-yi",
    booktitle = "Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland (in-person and online)",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.iwslt-1.5",
    pages = "43--61",
}