Added Norwegian Bokmål-Nynorsk bitext mining task #202

Merged
merged 1 commit into from
Jan 15, 2024

Conversation

x-tabdeveloping
Collaborator

This is mainly important for the Scandinavian Embedding Benchmark, but generally speaking, all multilingual embeddings should aim to encode sentences in Bokmål and Nynorsk relatively close to each other, as they are two written variants of the same language.
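
For readers unfamiliar with bitext mining, here is a rough sketch of the idea, not the benchmark's actual evaluation code: each Bokmål sentence embedding is matched to its nearest Nynorsk sentence embedding by cosine similarity, and the match is checked against the known translation pair. The vectors below are hand-made toy values standing in for real model embeddings, and `cosine` and `mine_bitext` are hypothetical helper names used only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mine_bitext(source_vecs, target_vecs):
    """For each source vector, return the index of the most similar target vector."""
    matches = []
    for u in source_vecs:
        scores = [cosine(u, v) for v in target_vecs]
        matches.append(max(range(len(scores)), key=scores.__getitem__))
    return matches


# Toy vectors (assumption): row i in each list represents the same sentence,
# once in Bokmål and once in Nynorsk.
bokmaal_vecs = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
nynorsk_vecs = [[0.9, 0.2, 0.1], [0.1, 0.9, 0.3], [0.2, 0.1, 0.9]]

predicted = mine_bitext(bokmaal_vecs, nynorsk_vecs)
gold = list(range(len(bokmaal_vecs)))
accuracy = sum(p == g for p, g in zip(predicted, gold)) / len(gold)
print(predicted, accuracy)
```

A model that places the two written variants close together should recover most of the gold pairs; one that treats them as unrelated will match poorly.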

Contributor

@Muennighoff Muennighoff left a comment

LGTM, cool! If it looks good to @KennethEnevoldsen too, fine to merge with me!

Contributor

@KennethEnevoldsen KennethEnevoldsen left a comment

> they are two written variants of the same language.

They are actually two distinct languages. E.g., Danes can generally understand Bokmål (if spoken slowly) but can't understand Nynorsk.

Otherwise it looks good.

@Muennighoff thanks for tagging me. Feel free to just add me as a reviewer on any Scandinavian PRs.

@Muennighoff Muennighoff merged commit c3fb742 into embeddings-benchmark:main Jan 15, 2024
3 checks passed
@x-tabdeveloping
Collaborator Author

I feel like calling them two different languages is a bit of a stretch. Spoken Norwegian has different and much fuzzier standards than written Norwegian, and to my knowledge it is debated whether a standard spoken Norwegian exists.
The variant normally considered to be the standard is Standard Østnorsk.
Nynorsk and Bokmål are the two official written languages, and are taught in school and used in media, but there isn't necessarily a direct connection between which Norwegian you write and how you speak.

@KennethEnevoldsen
Contributor

I am not an expert, so let us not get too much into it. Subjectively, though, I can't understand Norwegian Nynorsk, but at least the southern dialects of Bokmål go just fine. However, reading up on it more, it seems you are right that they are indeed more on a continuum.

3 participants