Feature/nn and fo language extensions #13116

Merged
merged 15 commits into from
Nov 20, 2023

Conversation

lise-brinck
Contributor

Description

Added language extensions for Faroese and Norwegian Nynorsk.

Types of change

This is an enhancement: it extends spaCy's language support with Faroese and Norwegian Nynorsk.

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@svlandeg added the labels "lang / fo" (Faroese language data and models) and "lang / nn" (Norwegian Nynorsk language data and models) on Nov 8, 2023
@adrianeboyd
Contributor

Thanks for the nice PR, especially for including all the citations and tests!

I'm a little concerned about the duplication between the nb and nn tokenizer settings. Are there many practical differences, e.g. with removing 's as a suffix for one or the other? (nb doesn't remove the kind of English-specific suffixes, but nn does.)

And do you mind if I push directly to this branch to add these languages to the website docs?
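
To make the suffix question concrete, here is a minimal sketch of what including or omitting an 's suffix rule changes for a single token. It uses a plain regex rather than spaCy's actual suffix patterns. The names `split_suffix`, `SUFFIXES_WITH_S` and `SUFFIXES_WITHOUT_S` are hypothetical, and the pattern lists are illustrative only, not the real nb or nn settings.

```python
import re

# Illustrative suffix patterns only; these are NOT the real nb/nn settings.
SUFFIXES_WITHOUT_S = [r"\.", r",", r"!", r"\?"]
SUFFIXES_WITH_S = SUFFIXES_WITHOUT_S + [r"'s", r"’s"]


def split_suffix(token, suffix_patterns):
    """Repeatedly split trailing suffixes off a token."""
    suffix_re = re.compile("(?:" + "|".join(suffix_patterns) + ")$")
    pieces = []
    while token:
        match = suffix_re.search(token)
        # Stop when nothing matches or the whole token would be consumed.
        if match is None or match.start() == 0:
            break
        pieces.insert(0, match.group())
        token = token[: match.start()]
    return [token] + pieces


print(split_suffix("Oslo's", SUFFIXES_WITH_S))     # ['Oslo', "'s"]
print(split_suffix("Oslo's", SUFFIXES_WITHOUT_S))  # ["Oslo's"]
```

With the 's rule the possessive marker becomes its own token; without it the token stays whole, which is the practical difference asked about here.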

{ORTH: "feb.", NORM: "februar"},
{ORTH: "mar.", NORM: "mars"},
{ORTH: "apr.", NORM: "april"},
{ORTH: "jun.", NORM: "juni"},
Contributor

Is "juli" missing on purpose?

Contributor Author

Yes. The abbreviation "jul(.)" for "juli" also means "Christmas" in Norwegian Nynorsk.
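
For context, a minimal sketch of how an abbreviation table like the one under review is commonly turned into tokenizer exceptions, with "jul." deliberately left out. `ORTH` and `NORM` are plain-string stand-ins for the `spacy.symbols` constants so the snippet runs without spaCy installed. `MONTH_ABBREVIATIONS`, `build_exceptions` and `normalize` are hypothetical names, not code from this PR, and only the four entries visible in the excerpt are used.

```python
# Stand-ins for spacy.symbols.ORTH and spacy.symbols.NORM (assumption: plain
# strings are used here only so the sketch runs without spaCy installed).
ORTH = "orth"
NORM = "norm"

# The four entries visible in the review excerpt. "jul." is intentionally
# absent because it is ambiguous between "juli" and "jul" (Christmas).
MONTH_ABBREVIATIONS = [
    {ORTH: "feb.", NORM: "februar"},
    {ORTH: "mar.", NORM: "mars"},
    {ORTH: "apr.", NORM: "april"},
    {ORTH: "jun.", NORM: "juni"},
]


def build_exceptions(entries):
    """Map each surface form to a one-token exception, keyed by its ORTH."""
    exceptions = {}
    for entry in entries:
        exceptions[entry[ORTH]] = [entry]
    return exceptions


def normalize(token, exceptions):
    """Return the NORM for a known abbreviation, else the token unchanged."""
    if token in exceptions:
        return exceptions[token][0][NORM]
    return token


exceptions = build_exceptions(MONTH_ABBREVIATIONS)
print(normalize("jun.", exceptions))  # juni
print(normalize("jul.", exceptions))  # jul. (no exception defined, left as-is)
```

Leaving the ambiguous form out of the table means it is never normalized to "juli", so a later component can decide between the month and the holiday from context.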

@lise-brinck
Contributor Author

> Thanks for the nice PR, especially for including all the citations and tests!
>
> I'm a little concerned about the duplication between the nb and nn tokenizer settings. Are there many practical differences, e.g. with removing 's as a suffix for one or the other? (nb doesn't remove the kind of English-specific suffixes, but nn does.)
>
> And do you mind if I push directly to this branch to add these languages to the website docs?

The Norwegian Nynorsk tokenizer is a mix of the Norwegian Bokmål and Danish tokenizers, with some language-specific abbreviations added. I'm honestly not certain about the exact differences between the two variants of Norwegian, and I'm sure there's room for improvement by someone who knows Norwegian Nynorsk.

And sure, go ahead and push to this branch :)

@adrianeboyd merged commit b6e0223 into explosion:master on Nov 20, 2023
13 checks passed
Labels
lang / fo: Faroese language data and models
lang / nn: Norwegian Nynorsk language data and models

3 participants