
Lack of transparency about the training data used - does fine-tuning make sense? #92

Open
Thybo-D opened this issue Apr 5, 2023 · 1 comment


@Thybo-D

Thybo-D commented Apr 5, 2023

I'm aware that the translation models are trained on data from the OPUS corpus.
But it's unclear to me exactly how much data these models were trained on, and whether all available OPUS data for a given language direction was used.

Does it make sense to download OPUS data and further finetune these models?

Does it make sense to find other data sources and fine-tune the models? If so, how many sentence pairs (approximately) do I need to see an improvement?

I'm particularly interested in fine-tuning "Helsinki-NLP/opus-mt-nl-en" and "Helsinki-NLP/opus-mt-en-nl".

@jorgtied
Member

Yes, more or less all the data that was in OPUS at training time. I'm not sure about fine-tuning; the model may also forget previously learned information (catastrophic forgetting). You could continue training with some larger data set, but then you may also need a longer warm-up period to get the optimizer back on track.
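The warm-up mentioned above is usually a linear ramp of the learning rate followed by an inverse-square-root decay, as in standard Transformer training. A small sketch of that schedule (the base rate and warm-up length are illustrative, not the values used for the OPUS-MT models):

```python
import math

def inverse_sqrt_lr(step: int, base_lr: float = 3e-4, warmup: int = 4000) -> float:
    """Learning rate at a given optimizer step (counting from 1).

    Ramps linearly from 0 to base_lr over `warmup` steps, then decays
    proportionally to 1/sqrt(step), peaking exactly at step == warmup.
    """
    if step < 1:
        raise ValueError("step counts from 1")
    if step <= warmup:
        return base_lr * step / warmup      # linear warm-up
    return base_lr * math.sqrt(warmup / step)  # inverse-sqrt decay
```

When resuming training on new data, restarting this schedule from step 1 (rather than continuing at a large decayed step count) is one way to give the optimizer the gentle ramp-up the comment refers to.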
