
Lack of transparency about the training data used - does fine-tuning make sense? #92

Open
Thybo-D opened this issue Apr 5, 2023 · 1 comment


@Thybo-D

Thybo-D commented Apr 5, 2023

I'm aware that the translation models are trained on data from the OPUS corpus.
But it's unclear to me exactly how much data these models were trained on, and whether all available OPUS data for a given language direction was used.

Does it make sense to download OPUS data and further finetune these models?

Does it make sense to find other data sources and fine-tune the models? If so, how many sentence pairs (approximately) do I need to see an improvement?

I'm particularly interested in fine-tuning "Helsinki-NLP/opus-mt-nl-en" and "Helsinki-NLP/opus-mt-en-nl".

@jorgtied
Member

Yes, more or less all the data that was in OPUS at training time. I'm not sure about fine-tuning; the model may also forget previously learned information (catastrophic forgetting). You could continue training with some larger data set, but then you may also need a longer warm-up period to get the optimizer back on track.
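The warm-up mentioned above is usually a linear ramp of the learning rate followed by an inverse-square-root decay, as in standard Transformer training. A small sketch of that schedule (the base rate and warm-up length are illustrative, not the values used for the OPUS-MT models):

```python
import math

def inverse_sqrt_lr(step: int, base_lr: float = 3e-4, warmup: int = 4000) -> float:
    """Learning rate at a given optimizer step (counting from 1).

    Ramps linearly from 0 to base_lr over `warmup` steps, then decays
    proportionally to 1/sqrt(step), peaking exactly at step == warmup.
    """
    if step < 1:
        raise ValueError("step counts from 1")
    if step <= warmup:
        return base_lr * step / warmup      # linear warm-up
    return base_lr * math.sqrt(warmup / step)  # inverse-sqrt decay
```

When resuming training on new data, restarting this schedule from step 1 (rather than continuing at a large decayed step count) is one way to give the optimizer the gentle ramp-up the comment refers to.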
