PEACH: Pre-Training Sequence-to-Sequence Multilingual Models for Translation with Semi-Supervised Pseudo-Parallel Document Generation

PEACH is a new multilingual sequence-to-sequence transformer model trained with semi-supervised pseudo-parallel document generation (SPDG), our proposed pre-training objective for multilingual models.

Abstract

Multilingual pre-training significantly improves many multilingual NLP tasks, including machine translation. Most existing methods are based on some variants of masked language modeling and text-denoising objectives on monolingual data. Multilingual pre-training on monolingual data ignores the availability of parallel data in many language pairs. Also, some other works integrate the available human-generated parallel translation data in their pre-training. This kind of parallel data is definitely helpful, but it is limited even in high-resource language pairs. This paper introduces a novel semi-supervised method, SPDG, that generates high-quality pseudo-parallel data for multilingual pre-training. First, a denoising model is pre-trained on monolingual data to reorder, add, remove, and substitute words, enhancing the pre-training documents' quality. Then, we generate different pseudo-translations for each pre-training document using dictionaries for word-by-word translation and applying the pre-trained denoising model. The resulting pseudo-parallel data is then used to pre-train our multilingual sequence-to-sequence model, PEACH. Our experiments show that PEACH outperforms existing approaches used in training mT5 and mBART on various translation tasks, including supervised, zero- and few-shot scenarios. Moreover, PEACH's ability to transfer knowledge between similar languages makes it particularly useful for low-resource languages. Our results demonstrate that with high-quality dictionaries for generating accurate pseudo-parallel, PEACH can be valuable for low-resource languages.
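The four word-level operations named in the abstract (reorder, add, remove, substitute) can be pictured as a noising function that produces corrupted inputs for a denoising model to restore. The following is only a minimal sketch under our own assumptions: the function name `add_noise`, the probability `p`, and the local-shuffle `window` are hypothetical and are not taken from this repository's implementation.

```python
import random


def add_noise(tokens, vocab, rng, p=0.15, window=3):
    """Return a corrupted copy of `tokens`.

    Hypothetical helper for illustration; `p` and `window` are assumed
    hyperparameters, not values used by PEACH.
    """
    noisy = []
    for tok in tokens:
        r = rng.random()
        if r < p:
            # remove: drop the word
            continue
        elif r < 2 * p:
            # substitute: replace the word with a random vocabulary word
            noisy.append(rng.choice(vocab))
        elif r < 3 * p:
            # add: keep the word and insert a random word after it
            noisy.extend([tok, rng.choice(vocab)])
        else:
            noisy.append(tok)

    # reorder: shuffle words locally inside fixed-size windows
    out = []
    for i in range(0, len(noisy), window):
        chunk = noisy[i:i + window]
        rng.shuffle(chunk)
        out.extend(chunk)
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    clean = "the cat sat on the mat".split()
    noisy = add_noise(clean, vocab=["dog", "tree", "ran"], rng=rng)
    # A denoising model would be trained to map `noisy` back to `clean`.
    print(noisy, "->", clean)
```

A seeded `random.Random` is passed in explicitly so that the corruption is reproducible across runs.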

File organization

The files are organized as follows:

|
|__ models
   |
   |__ peach
      |
      |__ bin
      |__ data
      |__ datasets
      |__ eval
      |__ layers
      |__ models
      |__ ops
      |__ params
   |__ requirements.txt
   |__ setup.py
|
|__ requirements.txt
|__ T5
|__ mBART
|__ peach
   |
   |__ denoising
   |__ translation

The models directory contains the TensorFlow code for creating the models and their parameters. A README in the repository explains how the code works and how it can be used.

The pretrain directory contains the implementation of our objective, as well as the mT5 and mBART objectives.

For our objective, we have two pre-training methods:

  • word-by-word translation, which can be found in the translation directory
  • denoising, which can be found in the denoising directory
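As a rough illustration of how the two steps fit together, the sketch below translates a document word by word with a bilingual dictionary and then passes the result to a denoiser. It is a minimal sketch, not the code in the translation or denoising directories: the function names, the toy English-to-German dictionary, and the `denoise` callable (an identity function here, standing in for a trained denoising model) are all assumptions made for this example.

```python
def word_by_word_translate(tokens, dictionary):
    """Translate each token by dictionary lookup.

    Tokens missing from the dictionary are kept unchanged (an assumed
    fallback, chosen only for this sketch).
    """
    return [dictionary.get(tok.lower(), tok) for tok in tokens]


def make_pseudo_parallel_pair(document, dictionary, denoise):
    """Return (original document, pseudo-translation).

    `denoise` is any callable mapping a rough translation string to a
    cleaned-up string; in practice it would wrap a pre-trained denoising
    model for the target language.
    """
    rough = word_by_word_translate(document.split(), dictionary)
    return document, denoise(" ".join(rough))


if __name__ == "__main__":
    # Toy dictionary, for illustration only.
    en_de = {"the": "die", "cat": "Katze", "sleeps": "schläft"}
    identity = lambda text: text  # placeholder for a trained denoising model
    pair = make_pseudo_parallel_pair("The cat sleeps quietly", en_de, identity)
    print(pair)
```

Keeping the denoiser as a plain callable lets the dictionary step be tested on its own before a trained model is plugged in.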

To learn how to change the hyperparameters and parameters of the models, read the README files in the models directory.

The peach_training_finetuning.ipynb notebook shows how to generate data for the different models (pre-training), how to train the models, and how to fine-tune them. You can use the following checkpoints to avoid training the models from scratch.

Link to models

Denoising models

Here are the links to the denoising models.

language model vocab
German(de) download download
English(en) download download
French(fr) download download
Macedonian(mk) download download

Masked Language Modeling objective models

Pre-trained and fine-tuned models for MLM objective:

Pre-trained model links:

language model vocab
en, fr, and de download download
en, fr, and de(xlni) download download
en and mk download download

Fine-tuned models:

language pairs model vocab
de-en download download
de-fr download download
en-de download download
en-fr download download
fr-de download download
fr-en download download
en-mk download download
mk-en download download

Masked Language Modeling with Reordering objective models

Pre-trained and fine-tuned models for MLM with Reordering objective:

Pre-trained model links:

language model vocab
en, fr, and de download download
en and mk download download

XLNI for MLM with Reordering is available here:

model vocab
download download

Fine-tuned models:

language pairs model vocab
de-en download download
de-fr download download
en-de download download
en-fr download download
fr-de download download
fr-en download download
en-mk download download
mk-en download download

SPDG objective models

checkpoint model vocab
checkpoint-100000 download download
checkpoint-200000 download download
checkpoint-300000 download download
checkpoint-400000 download download
checkpoint-500000 download download

XLNI for SPDG is available here:

model vocab
download download

Pre-trained pair-language models:

pair-language model vocab
en and de download download
en and fr download download
en and mk download download
fr and de download download

Fine-tuned pair-language models:

language pairs model vocab
en-de download download
de-en download download
en-fr download download
fr-en download download
de-fr download download
fr-de download download
en-mk download download
mk-en download download

Transformer models:

languages model vocab
de-en download download
de-fr download download
en-de download download
en-fr download download
fr-de download download
fr-en download download

Translation models:

checkpoint languages model vocab
checkpoint-100000 de-en download download
checkpoint-100000 de-fr download download
checkpoint-100000 en-de download download
checkpoint-100000 en-fr download download
checkpoint-100000 fr-de download download
checkpoint-100000 fr-en download download
checkpoint-200000 de-en download download
checkpoint-200000 de-fr download download
checkpoint-200000 en-de download download
checkpoint-200000 en-fr download download
checkpoint-200000 fr-de download download
checkpoint-200000 fr-en download download
checkpoint-300000 de-en download download
checkpoint-300000 de-fr download download
checkpoint-300000 en-de download download
checkpoint-300000 en-fr download download
checkpoint-300000 fr-de download download
checkpoint-300000 fr-en download download
checkpoint-400000 de-en download download
checkpoint-400000 de-fr download download
checkpoint-400000 en-de download download
checkpoint-400000 en-fr download download
checkpoint-400000 fr-de download download
checkpoint-400000 fr-en download download
checkpoint-500000 de-en download download
checkpoint-500000 de-fr download download
checkpoint-500000 en-de download download
checkpoint-500000 en-fr download download
checkpoint-500000 fr-de download download
checkpoint-500000 fr-en download download

Citation

@inproceedings{salemi-etal-2023-peach,
    title = "{PEACH}: Pre-Training Sequence-to-Sequence Multilingual Models for Translation with Semi-Supervised Pseudo-Parallel Document Generation",
    author = "Salemi, Alireza  and
      Abaskohi, Amirhossein  and
      Tavakoli, Sara  and
      Shakery, Azadeh  and
      Yaghoobzadeh, Yadollah",
    booktitle = "Proceedings of the Sixth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2023)",
    month = may,
    year = "2023",
    address = "Dubrovnik, Croatia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.loresmt-1.3",
    pages = "32--46",
    abstract = "Multilingual pre-training significantly improves many multilingual NLP tasks, including machine translation. Most existing methods are based on some variants of masked language modeling and text-denoising objectives on monolingual data. Multilingual pre-training on monolingual data ignores the availability of parallel data in many language pairs. Also, some other works integrate the available human-generated parallel translation data in their pre-training. This kind of parallel data is definitely helpful, but it is limited even in high-resource language pairs. This paper introduces a novel semi-supervised method, SPDG, that generates high-quality pseudo-parallel data for multilingual pre-training. First, a denoising model is pre-trained on monolingual data to reorder, add, remove, and substitute words, enhancing the pre-training documents{'} quality. Then, we generate different pseudo-translations for each pre-training document using dictionaries for word-by-word translation and applying the pre-trained denoising model. The resulting pseudo-parallel data is then used to pre-train our multilingual sequence-to-sequence model, PEACH. Our experiments show that PEACH outperforms existing approaches used in training mT5 and mBART on various translation tasks, including supervised, zero- and few-shot scenarios. Moreover, PEACH{'}s ability to transfer knowledge between similar languages makes it particularly useful for low-resource languages. Our results demonstrate that with high-quality dictionaries for generating accurate pseudo-parallel, PEACH can be valuable for low-resource languages.",
}