
Support training a student model from an already existing teacher model #180

Closed

marco-c opened this issue Aug 30, 2023 · 5 comments

Labels: cost & perf (Speeding up and lowering cost for the pipeline), language-coverage (Issues related to covering specific languages)

@marco-c
Collaborator

marco-c commented Aug 30, 2023

This will greatly speed us up in covering more languages (by using existing open source models as teacher models, depending on results from #179) and in improving languages we have already trained (by finetuning them).

See also #117.
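The request amounts to sequence-level knowledge distillation: an existing teacher model translates monolingual source data, and the student is trained on the resulting synthetic pairs. A minimal sketch of that data flow, with a stub standing in for the real teacher (`teacher_translate` is a placeholder, not a pipeline function; the actual pipeline beam-decodes sharded mono data with Marian):

```python
from typing import Callable, Iterable

def teacher_translate(src: str) -> str:
    # Placeholder for beam decoding with a trained teacher model;
    # it upper-cases the input only so the sketch is runnable.
    return src.upper()

def build_distillation_corpus(
    mono_src: Iterable[str],
    translate: Callable[[str], str] = teacher_translate,
) -> list[tuple[str, str]]:
    """Pair each monolingual source sentence with the teacher's
    translation; the student trains on these synthetic pairs."""
    return [(s, translate(s)) for s in mono_src]

pairs = build_distillation_corpus(["hyvää huomenta", "kiitos"])
```

Swapping the teacher for a pre-trained OPUS-MT model leaves this structure unchanged, which is why reusing existing teachers mainly saves the (expensive) teacher-training step.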

@marco-c marco-c added quality Improving robustness and translation quality language-coverage Issues related to covering specific languages labels Aug 30, 2023
@eu9ene
Collaborator

eu9ene commented Aug 30, 2023

I'm already testing the GreenNLP fork. My concern here is that the quality of OPUS models might not be good enough. According to their website, they don't use backtranslations. If we can train better quality models from scratch, the pre-trained OPUS models would be useful only to expand coverage faster or maybe as backward models.

@eu9ene eu9ene self-assigned this Aug 30, 2023
@TommiNieminen

Hi, the most current OPUS models are actually in a different repository (I know this is confusing, sorry about that): https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/models.

These newer models generally include backtranslated data (if the model name contains +bt), and there are also transformer-big models available. You can see comparisons of different models (with model links) here: https://opus.nlpl.eu/leaderboard/
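The naming convention described above can be used to shortlist model releases programmatically. A small illustrative filter (the selection heuristic is an assumption on my part; the file names come from the URLs in this thread):

```python
# Candidate release names taken from the model URLs quoted in this issue.
candidates = [
    "opusTCv20210807+bt-2021-09-01.zip",
    "opusTCv20210807+news+bt_transformer-big_2023-04-13.zip",
]

def prefer(models: list[str]) -> list[str]:
    """Prefer releases trained with backtranslated data ("+bt" in the
    name), and among those, transformer-big architectures."""
    bt = [m for m in models if "+bt" in m]
    big = [m for m in bt if "transformer-big" in m]
    return big or bt or models
```

In practice the leaderboard at https://opus.nlpl.eu/leaderboard/ is the authoritative comparison; a name filter like this is only a first pass.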

@eu9ene
Collaborator

eu9ene commented Aug 30, 2023

This is a great insight, thank you @TommiNieminen!

@eu9ene
Collaborator

eu9ene commented Sep 12, 2023

I trained a student model with this config:

###
# Example of a production config
# Change language pair, experiment name, datasets and other settings if needed
# Training low resource languages might require more tuning of pipeline/training/configs
###


experiment:
  name: opusprod
  src: fi
  trg: en
  src_three_letter: fin
  trg_three_letter: eng

  # OPUS models are not ensembled; they have different vocabs anyway
  teacher-ensemble: 1

  # URL to the OPUS-MT model to use as the teacher
  # opusmt-teacher: "https://object.pouta.csc.fi/Tatoeba-MT-models/eng-fin/opusTCv20210807+bt-2021-09-01.zip"
  opusmt-teacher: "https://object.pouta.csc.fi/Tatoeba-MT-models/fin-eng/opusTCv20210807+news+bt_transformer-big_2023-04-13.zip"
  # URL to the OPUS-MT model to use as the backward model
  # opusmt-backward: "https://object.pouta.csc.fi/Tatoeba-MT-models/fin-eng/opusTCv20210807+bt-2021-08-25.zip"
  opusmt-backward: "https://object.pouta.csc.fi/Tatoeba-MT-models/eng-fin/opusTCv20210807+news+bt_transformer-big_2023-04-13.zip"

  # path to a pretrained backward model (optional)
  backward-model: ""
  
  # limits per downloaded dataset
  mono-max-sentences-src: 100000000
  mono-max-sentences-trg: 20000000

  # split corpus to parallelize translation
  split-length: 2000000
  spm-sample-size: 10000000
  
  best-model: perplexity
  bicleaner:
    default-threshold: 0
    dataset-thresholds: []

marian-args:
  decoding-teacher:
    # 2080ti or newer
    precision: float16
    mini-batch-words: 12000

# TODO: extract this info straight from the OPUS model yml info file
datasets:
  # parallel training corpus
  train:
    - opus_Books/v1
    - opus_CCAligned/v1
    - opus_CCMatrix/v1
    - opus_DGT/v2019
    - opus_ECB/v1
    - opus_ECDC/v2016-03-16
    - opus_ELITR-ECA/v1
    - opus_ELRA-W0217/v1
    - opus_ELRA-W0220/v1
    - opus_ELRA-W0305/v1
    - opus_ELRC-1127-www.vtv.fi/v1
    - opus_ELRC-1128-www.visitestonia.com/v1
    - opus_ELRC-1769-valtioneuvosto.fi/v1
    - opus_ELRC-1771-vnk.fi/v1
    - opus_ELRC-2017-EUIPO_2017/v1
    - opus_ELRC-2032-www.turku.fi/v1
    - opus_ELRC-2036-www.vero.fi/v1
    - opus_ELRC-2708-EMEA/v1
    - opus_ELRC-2739-vaccination/v1
    - opus_ELRC-2869-EU_publications_medi/v1
    - opus_ELRC-3045-wikipedia_health/v1
    - opus_ELRC-3196-antibiotic/v1
    - opus_ELRC-3287-EUROPARL_covid/v1
    - opus_ELRC-3458-EC_EUROPA_covid/v1
    - opus_ELRC-3559-EUR_LEX_covid/v1
    - opus_ELRC-3600-presscorner_covid/v1
    - opus_ELRC-401-Swedish_Labour_Part2/v1
    - opus_ELRC-406-Swedish_Labour_Part1/v1
    - opus_ELRC-416-Swedish_Social_Secur/v1
    - opus_ELRC-4239-NTEU_TierA/v1
    - opus_ELRC-436-Swedish_Food/v1
    - opus_ELRC-4995-Finnish_Financial_MT/v1
    - opus_ELRC-5067-SciPar/v1
    - opus_ELRC-716-Finnish_Information_/v1
    - opus_ELRC-724-Hallituskausi_2007_2/v1
    - opus_ELRC-725-Hallituskausi_2011_2/v1
    - opus_ELRC-735-www.norden.org/v1
    - opus_ELRC-EC_EUROPA/v1
    - opus_ELRC-EMEA/v1
    - opus_ELRC-EUIPO_2017/v1
    - opus_ELRC-EUROPARL_covid/v1
    - opus_ELRC-EUR_LEX/v1
    - opus_ELRC-EU_publications/v1
    - opus_ELRC-Finnish_Information/v1
    - opus_ELRC-Swedish_Labour/v1
    - opus_ELRC-antibiotic/v1
    - opus_ELRC-presscorner_covid/v1
    - opus_ELRC-vaccination/v1
    - opus_ELRC-valtioneuvosto.fi/v1
    - opus_ELRC-vnk.fi/v1
    - opus_ELRC-wikipedia_health/v1
    - opus_ELRC-www.norden.org/v1
    - opus_ELRC-www.turku.fi/v1
    - opus_ELRC-www.vero.fi/v1
    - opus_ELRC-www.visitestonia.com/v1
    - opus_ELRC-www.vtv.fi/v1
    - opus_ELRC_2922/v1
    - opus_ELRC_2923/v1
    - opus_ELRC_3382/v1
    - opus_EMEA/v3
    - opus_EUbookshop/v2
    - opus_EUconst/v1
    - opus_Europarl/v8
    - opus_GNOME/v1
    - opus_JRC-Acquis/v3.0
    - opus_KDE4/v2
    - opus_LinguaTools-WikiTitles/v2014
    - opus_NeuLab-TedTalks/v1
    - opus_OpenSubtitles/v2018
    - opus_PHP/v1
    - opus_ParaCrawl/v9
    - opus_QED/v2.0a
    - opus_TED2020/v1
    - opus_Tatoeba/v2023-04-12
    - opus_TildeMODEL/v2018
    - opus_Ubuntu/v14.10
    - opus_WikiMatrix/v1
    - opus_XLEnt/v1.2
    - opus_bible-uedin/v1
    - opus_infopankki/v1
    - opus_wikimedia/v20230407
    - mtdata_EU-dcep-1-eng-fin
    - mtdata_EU-eac_forms-1-eng-fin
    - mtdata_EU-eac_reference-1-eng-fin
    - mtdata_EU-ecdc-1-eng-fin
    - mtdata_Statmt-europarl-10-fin-eng
    - mtdata_Statmt-europarl-7-fin-eng
    - mtdata_Statmt-europarl-9-fin-eng
    - mtdata_Statmt-newsdev_enfi-2015-eng-fin
    - mtdata_Statmt-newsdev_fien-2015-fin-eng
    - mtdata_Statmt-wiki_titles-1-fin-eng
    - mtdata_Tilde-airbaltic-1-eng-fin
    - mtdata_Tilde-ecb-2017-eng-fin
    - mtdata_Tilde-eesc-2017-eng-fin
    - mtdata_Tilde-ema-2016-eng-fin
    - mtdata_Tilde-rapid-2016-eng-fin
  # datasets to merge for validation while training
  devtest:
    - flores_dev
    - sacrebleu_wmt16
    - sacrebleu_wmt18
  # datasets for evaluation
  test:
    - flores_devtest
    - sacrebleu_wmt15
    - sacrebleu_wmt17
    - sacrebleu_wmt19
  mono-src:
    - news-crawl_news.2022
    - news-crawl_news.2021
    - news-crawl_news.2020
    - news-crawl_news.2019
    - news-crawl_news.2018
    - news-crawl_news.2017
    - news-crawl_news.2016
    - news-crawl_news.2015
    - news-crawl_news.2014

and got the following BLEU scores for flores-devtest:
student: 32.3
finetuned-student: 31.2
speed (quantized): 30.5

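The split settings in the config above determine how much work the teacher-decoding step fans out into. Assuming each shard holds at most `split-length` sentences (an assumption about the splitting behavior, not something stated in the config), the limits imply:

```python
import math

# Values taken from the config above.
mono_max_src = 100_000_000  # mono-max-sentences-src
mono_max_trg = 20_000_000   # mono-max-sentences-trg
split_length = 2_000_000    # split-length

# Upper bound on the number of shards the teacher decodes per side,
# assuming each shard holds at most split_length sentences.
src_shards = math.ceil(mono_max_src / split_length)  # 50
trg_shards = math.ceil(mono_max_trg / split_length)  # 10
```

So decoding speed of the teacher (hence the float16 / mini-batch-words settings) dominates the cost of this stage.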
I don't know if it's directly comparable with the model metrics on the OPUS dashboard, because the evaluation procedure and BLEU settings might be different.
The 5-point difference between the teacher and the final model looks larger than what we typically see when training from scratch.
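Part of why such scores are hard to compare: BLEU depends on tokenization, casing, smoothing, and reference handling, and different toolkits make different default choices. A minimal unsmoothed corpus-BLEU sketch that makes those choices explicit (illustrative only; this is not the pipeline's evaluator):

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus BLEU with uniform weights, one reference per segment,
    whitespace tokenization, and no smoothing."""
    match = [0] * max_n
    total = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            hc, rc = ngram_counts(h, n), ngram_counts(r, n)
            match[n - 1] += sum(min(c, rc[g]) for g, c in hc.items())
            total[n - 1] += max(len(h) - n + 1, 0)
    if min(match) == 0:  # unsmoothed BLEU is 0 if any n-gram order has no match
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(match, total)) / max_n
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity * math.exp(log_precision)
```

Changing any of these knobs (tokenizer, smoothing, casing) shifts the score, which is why sacrebleu reports a signature string; comparing numbers is only safe when the signatures match.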

The comparison with the cloud APIs is here, and the quality looks OK for a dev model but might not be sufficient to release in prod.

I'll proceed with merging the GreenNLP fork to a separate branch to address the issues important for us and then we can continue experimenting with this.

@TommiNieminen do you think I picked the right base model for Finnish? Maybe you have other ideas on how to get better quality? I'm also planning to train the opposite direction and Swedish.

@eu9ene eu9ene added cost & perf Speeding up and lowering cost for the pipeline and removed quality Improving robustness and translation quality labels Sep 22, 2023
@eu9ene
Collaborator

eu9ene commented Jul 16, 2024

We now have the functionality to use pre-trained models or fine-tune them, though they have to be compatible with our architecture. I don't think we're planning on using OPUS-MT models at this point, since we generally train higher-quality models from scratch.

@eu9ene eu9ene closed this as completed Jul 16, 2024