Implementation of the paper Anticipation-free Training for Simultaneous Machine Translation. It is based on fairseq.
- Install fairseq
WARNING: Stick to the specified checkout version to avoid compatibility issues.
git clone https://github.com/pytorch/fairseq.git
cd fairseq
git checkout 8b861be
python setup.py build_ext --inplace
pip install .
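A quick, optional sanity check that the install is visible from your environment:
python -c "import fairseq; print(fairseq.__version__)"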
- (Optional) Install apex for faster mixed precision (fp16) training.
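If you do install apex, a typical build-from-source installation looks like the following (a sketch assuming a CUDA toolchain that matches your PyTorch build; check the apex README for the flags that fit your setup):
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./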
- Install dependencies
pip install -r requirements.txt
This section describes data preparation for training and evaluation.
First, download the Moses tokenizer:
git clone https://github.com/moses-smt/mosesdecoder.git
For CWMT, you need a Kaggle account and API token before downloading.
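A minimal Kaggle CLI setup looks like this (a sketch assuming the download step uses the kaggle command; the kaggle.json token is created from your Kaggle account page):
pip install kaggle
mkdir -p ~/.kaggle
cp /path/to/kaggle.json ~/.kaggle/   # API token downloaded from your Kaggle account settings
chmod 600 ~/.kaggle/kaggle.json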
- Set up the paths in DATA/get_data_cwmt.sh:
DATA_ROOT=/path/to/cwmt # set path to store raw and preprocessed data
FAIRSEQ=/path/to/fairseq # set path to fairseq root
export PYTHONPATH="$FAIRSEQ:$PYTHONPATH"
SCRIPTS=~/utility/mosesdecoder/scripts # set path to moses tokenizer root
source ~/envs/apex/bin/activate # activate your virtual environment if any
- Preprocess the data with:
cd DATA
bash get_data_cwmt.sh
- Similarly, preprocess WMT15 with get_data_wmt15.sh.
The output binarized files should appear under ${DATA_ROOT}/${SRC}-${TGT}/data-bin.
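For reference, a standard fairseq-preprocess run leaves dictionaries and binarized splits in that directory; it should look roughly like the sketch below (file names depend on your SRC/TGT settings, here en-zh):
ls ${DATA_ROOT}/en-zh/data-bin
# dict.en.txt  dict.zh.txt  preprocess.log
# train.en-zh.en.bin  train.en-zh.en.idx  train.en-zh.zh.bin  train.en-zh.zh.idx
# valid.en-zh.*       test.en-zh.*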
Configure the environment and data paths in exp*/data_path.sh before training, for instance:
export SRC=en
export TGT=zh
export DATA=/path/to/cwmt/en-zh/data-bin
FAIRSEQ=/path/to/fairseq # set path to fairseq root
USERDIR=`realpath ../simultaneous_translation`
export PYTHONPATH="$FAIRSEQ:$PYTHONPATH"
source ~/envs/fair/bin/activate # activate your virtual environment if any
Go into either the expcwmt/ or expwmt15/ directory to start training models.
NOTE: We will use CWMT as the example for the rest of the instructions.
We need a full-sentence model as the teacher for sequence-level knowledge distillation (sequence-KD).
The following command trains the teacher model:
bash 0-distill.sh
To distill the training set, run
bash 0a-decode-distill.sh # generate prediction at ./mt.results/generate-test.txt
bash 0b-create-distill.sh # generate distillation data as 'train_distill_${TGT}' split.
To use the distillation data as the training set, add the command line argument --train-subset train_distill_${TGT}.
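As an illustration only (the exp*/ scripts pass many more arguments, so treat this as a sketch rather than a complete command), the flag plugs into fairseq-train like so:
fairseq-train ${DATA} \
    --user-dir ${USERDIR} \
    --train-subset train_distill_${TGT} \
    --valid-subset valid
    # ...plus the model/optimization arguments from the corresponding training script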
We can now train a vanilla wait-k model as a baseline. To do this, run:
DELAY=1
bash 1-vanilla_wait_k.sh ${DELAY}
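If you want baselines at several delays, the script can simply be looped, for example:
for DELAY in 1 3 5 7 9; do    # delay values here are illustrative
    bash 1-vanilla_wait_k.sh ${DELAY}
done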
Train the CTC baseline:
DELAY=1
bash 3-causal_ctc.sh ${DELAY}
Train our proposed model:
NOTE: To train from scratch, remove the option --load-pretrained-encoder-from in 2-sinkhorn.sh.
DELAY=1
bash 2-sinkhorn.sh ${DELAY}
The following command generates pseudo references from the wait-9 model:
bash 5a-decode-monotonic.sh
The predictions will be at monotonic.results/generate-train.txt.
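If you want to inspect the pseudo references directly, the hypotheses in a fairseq generate log are the lines prefixed with H-<id>; a common way to pull them out in sentence order is the one-liner below (the output file name is arbitrary; 5b-create-monotonic.sh below already handles this for the actual dataset):
grep ^H monotonic.results/generate-train.txt | LC_ALL=C sort -V | cut -f3- > monotonic.results/train.pseudo.${TGT}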
Run the following to generate the pseudo reference dataset as the train_monotonic_${TGT} split.
NOTE: Remember to change the paths of DATA_ROOT, FAIRSEQ, and SCRIPTS in 5b-create-monotonic.sh to your own paths.
bash 5b-create-monotonic.sh
To train wait-k / CTC models with the pseudo references, run:
DELAY=1
bash 6a-vanilla_wait_k_monotonic.sh ${DELAY}
bash 6b-causal_ctc_monotonic.sh ${DELAY}
The following command generates word alignments and reordered targets for the distill set.
NOTE: Remember to change the path of PREFIX in 7a-word-align.sh to your cwmt/zh-en/ready/distill_${TGT} path.
bash 7a-word-align.sh
The alignments will be at ./alignments.results/distill_${TGT}.${SRC}-${TGT}.
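Each line of the alignment file corresponds to one sentence pair; assuming the aligner emits the common Pharaoh i-j format (an assumption, check the script's actual output), each i-j pair links 0-based source token i to target token j:
head -1 ./alignments.results/distill_${TGT}.${SRC}-${TGT}
# e.g. 0-0 1-2 2-1 3-3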
Run the following to generate the reorder dataset as the train_reorder_${TGT} split.
NOTE: Remember to change the paths of DATA_ROOT and FAIRSEQ in 7b-create-reorder.sh to your own paths.
bash 7b-create-reorder.sh
To train wait-k / CTC models with the reorder dataset, run:
DELAY=1
bash 8a-vanilla_wait_k_reorder.sh ${DELAY}
bash 6b-causal_ctc_reorder.sh ${DELAY}
If this repository helps you, please cite the paper as:
@inproceedings{chang-etal-2022-anticipation,
title = "Anticipation-Free Training for Simultaneous Machine Translation",
author = "Chang, Chih-Chiang and
Chuang, Shun-Po and
Lee, Hung-yi",
booktitle = "Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022)",
month = may,
year = "2022",
address = "Dublin, Ireland (in-person and online)",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.iwslt-1.5",
pages = "43--61",
}