Contextual Code-Switching for Pretraining in mNMT

This repository contains the code for our EACL 2023 (Findings) publication "Exploring Contextual Code-Switching for Pretraining in Multilingual Neural Machine Translation".

In this work, we show how to leverage massive multilingual NMT models (like mBART50) to pretrain small, high-performing models with minimal data and compute requirements. Pretraining uses a code-switched noising algorithm called Contextual Code-Switching (CCS), which uses these massive models to generate contextual, many-to-many substitutions for constructing synthetic code-switched pretraining corpora. We observe that our models perform comparably to or better than the massive models, depending on the amount of data provided.

Installation

Run the following command:

conda env create -f environment.yml

The mCOLT repository

We use the mCOLT fairseq module from https://github.com/PANXiao1994/mRASP2/ with some minor changes to make it suitable for our work. We include the modified module in this repository for reproducibility. Thanks to Pan et al. for providing the original module!

Pipeline

Preprocessing

The first step is to generate translations and alignments using a base NMT model and a word aligner, respectively. We primarily use mBART50 and awesome-align in our work. We also use a from-scratch model as an alternative base model. The scripts we use for all of these models are in the preprocessing directory.
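Downstream steps consume the aligner output one sentence pair at a time. As a minimal sketch, assuming the alignments are stored in the Pharaoh `i-j` format that awesome-align writes (one line per sentence pair, with space-separated source-target index pairs), a line could be parsed as follows. The function name `parse_alignment_line` is hypothetical and is not taken from this repository.

```python
from typing import List, Tuple


def parse_alignment_line(line: str) -> List[Tuple[int, int]]:
    """Parse one Pharaoh-format line such as "0-0 1-2 2-1" into index pairs.

    Each pair is (source_token_index, target_token_index), both zero-based.
    An empty line yields an empty list.
    """
    pairs = []
    for token in line.split():
        src, tgt = token.split("-")
        pairs.append((int(src), int(tgt)))
    return pairs


if __name__ == "__main__":
    # Hypothetical example line; real lines come from the aligner's output file.
    print(parse_alignment_line("0-0 1-2 2-1"))
```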

CCS + Training

Once the translations and alignments are generated, we run:

bash preprocess.sh

preprocess.sh calls ccs.py, which code-switches the corpus. On completion, it calls train.sh to train the model on the generated code-switched corpus.

train.sh also contains commands to evaluate the trained models on the test sets; these run once training is complete.
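To illustrate the general idea of alignment-based code-switching, here is a simplified sketch. It is not the repository's ccs.py: it assumes the translation and word alignment for a sentence are already available as token lists and index pairs, and it swaps a random subset of source tokens for their aligned translation tokens. The function name `code_switch`, the `ratio` and `seed` parameters, and the restriction to one-to-one links are all assumptions made for this example.

```python
import random
from typing import Dict, List, Tuple


def code_switch(
    source: List[str],
    translation: List[str],
    alignment: List[Tuple[int, int]],
    ratio: float = 0.3,
    seed: int = 0,
) -> List[str]:
    """Replace a random subset of aligned source tokens with translation tokens.

    `alignment` holds (source_index, translation_index) pairs. Only one-to-one
    links are used here so each substitution is unambiguous (a simplification).
    """
    rng = random.Random(seed)

    # Count how often each source and translation index appears in the alignment.
    src_counts: Dict[int, int] = {}
    tgt_counts: Dict[int, int] = {}
    for s, t in alignment:
        src_counts[s] = src_counts.get(s, 0) + 1
        tgt_counts[t] = tgt_counts.get(t, 0) + 1

    one_to_one = [
        (s, t) for s, t in alignment if src_counts[s] == 1 and tgt_counts[t] == 1
    ]

    # Number of tokens to switch, capped by the number of usable links.
    k = min(len(one_to_one), round(ratio * len(source)))

    switched = list(source)
    for s, t in rng.sample(one_to_one, k):
        switched[s] = translation[t]
    return switched


if __name__ == "__main__":
    src = ["the", "black", "cat"]
    tgt = ["le", "chat", "noir"]
    links = [(0, 0), (1, 2), (2, 1)]
    # With ratio=1.0 every aligned token is switched.
    print(code_switch(src, tgt, links, ratio=1.0))  # ['le', 'noir', 'chat']
```

In this sketch the substitutions come from a single translation of the sentence, so each replacement word was produced in the context of the whole sentence rather than looked up in a dictionary.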

Note for SLURM users

SLURM users can run:

sbatch preprocess.sh

It is recommended to use CPU nodes for preprocess.sh and GPU nodes for train.sh.

Fine-Tuning

The scripts to run the fine-tuning experiments are in finetuning. Run bash preprocess.sh, just as before. After code-switching, it will call pretrain.sh to pretrain the model on monolingual code-switched corpora. Then, pretrain.sh calls mlft.sh (for Multilingual Fine-Tuning) and blft.sh (for Bilingual Fine-Tuning) to fine-tune on real (unnoised) corpora.

Baselines

We include scripts for all baselines mentioned in the paper in baselines. Namely, it includes scripts to train and evaluate Aligned Augmentation (Pan et al., 2021) and Knowledge Distillation (Hinton et al., 2015), as well as the massively multilingual models mBART50 (Tang et al., 2021) and mRASP2 (Pan et al., 2021).

Acknowledgements

The Aligned Augmentation scripts were built using the instructions and the scripts provided in the mRASP2 repository. For mBART50 and mRASP2, we use the publicly available checkpoints. All due credit to the respective authors for their excellent work.

Remorax/CCS-Pretraining-NMT
