
Support training a student model from an already existing teacher model #180

Closed

marco-c opened this issue Aug 30, 2023 · 5 comments

Labels: cost & perf (Speeding up and lowering cost for the pipeline), language-coverage (Issues related to covering specific languages)

@marco-c
Collaborator

marco-c commented Aug 30, 2023

This will greatly speed us up in covering more languages (by using existing open source models as teacher models, depending on results from #179) and in improving languages we have already trained (by finetuning them).

See also #117.
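The request amounts to sequence-level knowledge distillation: an existing teacher model translates monolingual source data, and the student is trained on the resulting synthetic pairs. A minimal sketch of that data flow, with a stub standing in for the real teacher (`teacher_translate` is a placeholder, not a pipeline function; the actual pipeline beam-decodes sharded mono data with Marian):

```python
from typing import Callable, Iterable

def teacher_translate(src: str) -> str:
    # Placeholder for beam decoding with a trained teacher model;
    # it upper-cases the input only so the sketch is runnable.
    return src.upper()

def build_distillation_corpus(
    mono_src: Iterable[str],
    translate: Callable[[str], str] = teacher_translate,
) -> list[tuple[str, str]]:
    """Pair each monolingual source sentence with the teacher's
    translation; the student trains on these synthetic pairs."""
    return [(s, translate(s)) for s in mono_src]

pairs = build_distillation_corpus(["hyvää huomenta", "kiitos"])
```

Swapping the teacher for a pre-trained OPUS-MT model leaves this structure unchanged, which is why reusing existing teachers mainly saves the (expensive) teacher-training step.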

@marco-c marco-c added quality Improving robustness and translation quality language-coverage Issues related to covering specific languages labels Aug 30, 2023
@eu9ene
Collaborator

eu9ene commented Aug 30, 2023

I'm already testing the GreenNLP fork. My concern here is that the quality of OPUS models might not be good enough. According to their website, they don't use backtranslations. If we can train better quality models from scratch, the pre-trained OPUS models would be useful only to expand coverage faster or maybe as backward models.

@eu9ene eu9ene self-assigned this Aug 30, 2023
@TommiNieminen

Hi, the most current OPUS models are actually in a different repository (I know this is confusing, sorry about that): https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/models.

These newer models generally include backtranslated data (if the model name contains +bt), and there are also transformer-big models available. You can see comparisons of different models (with model links) here: https://opus.nlpl.eu/leaderboard/
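The naming convention described above can be used to shortlist model releases programmatically. A small illustrative filter (the selection heuristic is an assumption on my part; the file names come from the URLs in this thread):

```python
# Candidate release names taken from the model URLs quoted in this issue.
candidates = [
    "opusTCv20210807+bt-2021-09-01.zip",
    "opusTCv20210807+news+bt_transformer-big_2023-04-13.zip",
]

def prefer(models: list[str]) -> list[str]:
    """Prefer releases trained with backtranslated data ("+bt" in the
    name), and among those, transformer-big architectures."""
    bt = [m for m in models if "+bt" in m]
    big = [m for m in bt if "transformer-big" in m]
    return big or bt or models
```

In practice the leaderboard at https://opus.nlpl.eu/leaderboard/ is the authoritative comparison; a name filter like this is only a first pass.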

@eu9ene
Collaborator

eu9ene commented Aug 30, 2023

This is a great insight, thank you @TommiNieminen!

@eu9ene
Collaborator

eu9ene commented Sep 12, 2023

I trained a student model with this config:

###
# Example of a production config
# Change language pair, experiment name, datasets and other settings if needed
# Training low resource languages might require more tuning of pipeline/training/configs
###


experiment:
  name: opusprod
  src: fi
  trg: en
  src_three_letter: fin
  trg_three_letter: eng

  # OPUS models are not ensembled; they have different vocabs anyway
  teacher-ensemble: 1

  # URL to the OPUS-MT model to use as the teacher
  # opusmt-teacher: "https://object.pouta.csc.fi/Tatoeba-MT-models/eng-fin/opusTCv20210807+bt-2021-09-01.zip"
  opusmt-teacher: "https://object.pouta.csc.fi/Tatoeba-MT-models/fin-eng/opusTCv20210807+news+bt_transformer-big_2023-04-13.zip"
  # URL to the OPUS-MT model to use as the backward model
  # opusmt-backward: "https://object.pouta.csc.fi/Tatoeba-MT-models/fin-eng/opusTCv20210807+bt-2021-08-25.zip"
  opusmt-backward: "https://object.pouta.csc.fi/Tatoeba-MT-models/eng-fin/opusTCv20210807+news+bt_transformer-big_2023-04-13.zip"

  # path to a pretrained backward model (optional)
  backward-model: ""
  
  # limits per downloaded dataset
  mono-max-sentences-src: 100000000
  mono-max-sentences-trg: 20000000

  # split corpus to parallelize translation
  split-length: 2000000
  spm-sample-size: 10000000
  
  best-model: perplexity
  bicleaner:
    default-threshold: 0
    dataset-thresholds: []

marian-args:
  decoding-teacher:
    # 2080ti or newer
    precision: float16
    mini-batch-words: 12000

# TODO: extract this info straight from the OPUS model yml info file
datasets:
  # parallel training corpus
  train:
    - opus_Books/v1
    - opus_CCAligned/v1
    - opus_CCMatrix/v1
    - opus_DGT/v2019
    - opus_ECB/v1
    - opus_ECDC/v2016-03-16
    - opus_ELITR-ECA/v1
    - opus_ELRA-W0217/v1
    - opus_ELRA-W0220/v1
    - opus_ELRA-W0305/v1
    - opus_ELRC-1127-www.vtv.fi/v1
    - opus_ELRC-1128-www.visitestonia.com/v1
    - opus_ELRC-1769-valtioneuvosto.fi/v1
    - opus_ELRC-1771-vnk.fi/v1
    - opus_ELRC-2017-EUIPO_2017/v1
    - opus_ELRC-2032-www.turku.fi/v1
    - opus_ELRC-2036-www.vero.fi/v1
    - opus_ELRC-2708-EMEA/v1
    - opus_ELRC-2739-vaccination/v1
    - opus_ELRC-2869-EU_publications_medi/v1
    - opus_ELRC-3045-wikipedia_health/v1
    - opus_ELRC-3196-antibiotic/v1
    - opus_ELRC-3287-EUROPARL_covid/v1
    - opus_ELRC-3458-EC_EUROPA_covid/v1
    - opus_ELRC-3559-EUR_LEX_covid/v1
    - opus_ELRC-3600-presscorner_covid/v1
    - opus_ELRC-401-Swedish_Labour_Part2/v1
    - opus_ELRC-406-Swedish_Labour_Part1/v1
    - opus_ELRC-416-Swedish_Social_Secur/v1
    - opus_ELRC-4239-NTEU_TierA/v1
    - opus_ELRC-436-Swedish_Food/v1
    - opus_ELRC-4995-Finnish_Financial_MT/v1
    - opus_ELRC-5067-SciPar/v1
    - opus_ELRC-716-Finnish_Information_/v1
    - opus_ELRC-724-Hallituskausi_2007_2/v1
    - opus_ELRC-725-Hallituskausi_2011_2/v1
    - opus_ELRC-735-www.norden.org/v1
    - opus_ELRC-EC_EUROPA/v1
    - opus_ELRC-EMEA/v1
    - opus_ELRC-EUIPO_2017/v1
    - opus_ELRC-EUROPARL_covid/v1
    - opus_ELRC-EUR_LEX/v1
    - opus_ELRC-EU_publications/v1
    - opus_ELRC-Finnish_Information/v1
    - opus_ELRC-Swedish_Labour/v1
    - opus_ELRC-antibiotic/v1
    - opus_ELRC-presscorner_covid/v1
    - opus_ELRC-vaccination/v1
    - opus_ELRC-valtioneuvosto.fi/v1
    - opus_ELRC-vnk.fi/v1
    - opus_ELRC-wikipedia_health/v1
    - opus_ELRC-www.norden.org/v1
    - opus_ELRC-www.turku.fi/v1
    - opus_ELRC-www.vero.fi/v1
    - opus_ELRC-www.visitestonia.com/v1
    - opus_ELRC-www.vtv.fi/v1
    - opus_ELRC_2922/v1
    - opus_ELRC_2923/v1
    - opus_ELRC_3382/v1
    - opus_EMEA/v3
    - opus_EUbookshop/v2
    - opus_EUconst/v1
    - opus_Europarl/v8
    - opus_GNOME/v1
    - opus_JRC-Acquis/v3.0
    - opus_KDE4/v2
    - opus_LinguaTools-WikiTitles/v2014
    - opus_NeuLab-TedTalks/v1
    - opus_OpenSubtitles/v2018
    - opus_PHP/v1
    - opus_ParaCrawl/v9
    - opus_QED/v2.0a
    - opus_TED2020/v1
    - opus_Tatoeba/v2023-04-12
    - opus_TildeMODEL/v2018
    - opus_Ubuntu/v14.10
    - opus_WikiMatrix/v1
    - opus_XLEnt/v1.2
    - opus_bible-uedin/v1
    - opus_infopankki/v1
    - opus_wikimedia/v20230407
    - mtdata_EU-dcep-1-eng-fin
    - mtdata_EU-eac_forms-1-eng-fin
    - mtdata_EU-eac_reference-1-eng-fin
    - mtdata_EU-ecdc-1-eng-fin
    - mtdata_Statmt-europarl-10-fin-eng
    - mtdata_Statmt-europarl-7-fin-eng
    - mtdata_Statmt-europarl-9-fin-eng
    - mtdata_Statmt-newsdev_enfi-2015-eng-fin
    - mtdata_Statmt-newsdev_fien-2015-fin-eng
    - mtdata_Statmt-wiki_titles-1-fin-eng
    - mtdata_Tilde-airbaltic-1-eng-fin
    - mtdata_Tilde-ecb-2017-eng-fin
    - mtdata_Tilde-eesc-2017-eng-fin
    - mtdata_Tilde-ema-2016-eng-fin
    - mtdata_Tilde-rapid-2016-eng-fin
  # datasets to merge for validation while training
  devtest:
    - flores_dev
    - sacrebleu_wmt16
    - sacrebleu_wmt18
  # datasets for evaluation
  test:
    - flores_devtest
    - sacrebleu_wmt15
    - sacrebleu_wmt17
    - sacrebleu_wmt19
  mono-src:
    - news-crawl_news.2022
    - news-crawl_news.2021
    - news-crawl_news.2020
    - news-crawl_news.2019
    - news-crawl_news.2018
    - news-crawl_news.2017
    - news-crawl_news.2016
    - news-crawl_news.2015
    - news-crawl_news.2014

and got the following BLEU scores for flores-devtest:
student: 32.3
finetuned-student: 31.2
speed (quantized): 30.5

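The split settings in the config above determine how much work the teacher-decoding step fans out into. Assuming each shard holds at most `split-length` sentences (an assumption about the splitting behavior, not something stated in the config), the limits imply:

```python
import math

# Values taken from the config above.
mono_max_src = 100_000_000  # mono-max-sentences-src
mono_max_trg = 20_000_000   # mono-max-sentences-trg
split_length = 2_000_000    # split-length

# Upper bound on the number of shards the teacher decodes per side,
# assuming each shard holds at most split_length sentences.
src_shards = math.ceil(mono_max_src / split_length)  # 50
trg_shards = math.ceil(mono_max_trg / split_length)  # 10
```

So decoding speed of the teacher (hence the float16 / mini-batch-words settings) dominates the cost of this stage.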
I don't know if it's directly comparable with the model metrics on the OPUS dashboard, because the evaluation procedure and BLEU settings might be different.
The 5-point difference between the teacher and the final model looks larger than what we typically see when training from scratch.
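Part of why such scores are hard to compare: BLEU depends on tokenization, casing, smoothing, and reference handling, and different toolkits make different default choices. A minimal unsmoothed corpus-BLEU sketch that makes those choices explicit (illustrative only; this is not the pipeline's evaluator):

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus BLEU with uniform weights, one reference per segment,
    whitespace tokenization, and no smoothing."""
    match = [0] * max_n
    total = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            hc, rc = ngram_counts(h, n), ngram_counts(r, n)
            match[n - 1] += sum(min(c, rc[g]) for g, c in hc.items())
            total[n - 1] += max(len(h) - n + 1, 0)
    if min(match) == 0:  # unsmoothed BLEU is 0 if any n-gram order has no match
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(match, total)) / max_n
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity * math.exp(log_precision)
```

Changing any of these knobs (tokenizer, smoothing, casing) shifts the score, which is why sacrebleu reports a signature string; comparing numbers is only safe when the signatures match.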

The comparison with the cloud APIs is here, and the quality looks OK for a dev model but might not be sufficient to release in prod.

I'll proceed with merging the GreenNLP fork to a separate branch to address the issues important for us and then we can continue experimenting with this.

@TommiNieminen do you think I picked the right base model for Finnish? Maybe you have other ideas on how to get better quality? I'm also planning to train the opposite direction and Swedish.

@eu9ene eu9ene added cost & perf Speeding up and lowering cost for the pipeline and removed quality Improving robustness and translation quality labels Sep 22, 2023
@eu9ene
Collaborator

eu9ene commented Jul 16, 2024

We now have the functionality to use pre-trained models or fine-tune them, though they have to be compatible with our architecture. I don't think we're planning on using OPUS-MT models at this point, since we generally train higher-quality models from scratch.

@eu9ene eu9ene closed this as completed Jul 16, 2024