Under-splitting on "set." and "ago." #22
Labels: bug (Something isn't working)
Comments
leitneratselerity changed the title from 'Under-splitting on "You"' to 'Under-splitting on "You" and "Now"' on Feb 28, 2022
leitneratselerity changed the title from 'Under-splitting on "You" and "Now"' to 'Under-splitting on "set." and "ago."' on Feb 28, 2022
peter-lang-dealogic added a commit to peter-lang-dealogic/syntok that referenced this issue on Feb 28, 2022
Fixes fnl#22 Some month abbreviations are also valid English words (e.g., "set", "ago"), which causes false positives if month abbreviations are treated as generic abbreviations, because they would consume such sentence endings. The code already contains logic to check whether a month abbreviation is preceded by a number; see the test case "Am 13. Jän. 2006 war es regnerisch."
fnl added a commit that referenced this issue on Feb 28, 2022
Fixes #22 Some month abbreviations are also valid English words (e.g., "set", "ago"), which causes false positives if month abbreviations are treated generically, because syntok does not split after abbreviations. Therefore, this change removes the months from the list of abbreviations. However, it must not introduce over-splitting after month abbreviations that are followed by a numeric token: once the months are removed from the list of official abbreviations, syntok would only refrain from splitting at a sentence terminal marker if it was followed by a token with 3+ digits (due to the next_is_a_large_number rule). Therefore, this change also avoids splitting after a month abbreviation if it is followed by even a short number (i.e., days or 2-digit years). Co-authored-by: fnl <me@fnl.es>
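For readers who want to see the rule from the commit message in isolation, here is a minimal, self-contained sketch. It is not syntok's code and does not use syntok's API: the whitespace tokenizer, the `MONTH_ABBREVIATIONS` set, and the function names are all hypothetical and chosen only for illustration. It shows the idea that a month abbreviation is not treated as a generic abbreviation, but a split is still suppressed when the next token starts with a digit.

```python
from typing import List

# Hypothetical subset of month abbreviations (NOT syntok's actual list).
MONTH_ABBREVIATIONS = {"jan", "jän", "feb", "ago", "set", "sep", "oct", "nov", "dec"}


def is_month_abbreviation(token: str) -> bool:
    """True if the token looks like a month abbreviation followed by a dot."""
    return token.endswith(".") and token[:-1].lower() in MONTH_ABBREVIATIONS


def split_sentences(text: str) -> List[str]:
    """Toy splitter over whitespace tokens (an assumption, far simpler than syntok).

    A token ending in '.', '!' or '?' closes the sentence, unless it is a
    month abbreviation and the next token starts with a digit (a day or year).
    """
    tokens = text.split()
    sentences: List[str] = []
    current: List[str] = []
    for i, token in enumerate(tokens):
        current.append(token)
        if not token.endswith((".", "!", "?")):
            continue
        next_token = tokens[i + 1] if i + 1 < len(tokens) else ""
        # Month abbreviation followed by any number, even a short one: no split.
        if is_month_abbreviation(token) and next_token[:1].isdigit():
            continue
        sentences.append(" ".join(current))
        current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


if __name__ == "__main__":
    # "set." and "ago." end sentences here, so the toy splitter splits after them.
    print(split_sentences("The alarm was set. It rang a week ago. Nobody heard it."))
    # A month abbreviation followed by a 2-digit number is kept together.
    print(split_sentences("Im Jän. 06 war es regnerisch."))
```

Because the check looks only at the token after the dot, "set." followed by a capitalized word is split as an ordinary sentence end, while "Jän. 06" and "Jän. 2006" stay in one sentence. The sketch deliberately ignores ordinals such as "13." and every other rule a real segmenter needs.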
fnl added a commit that referenced this issue on Feb 28, 2022
Syntok did not split:
and:
This happens because "set." and "ago." are the official abbreviations for two Spanish months (septiembre and agosto).