Fix tokenising when using more than just a-zA-Z #37

Open
wants to merge 1 commit into
base: master
from

Conversation


@robotdana robotdana commented Nov 30, 2018

Previously: `Händler` would be tokenized as `ndler` or `ändler`, depending on the Python version, rather than the expected `händler`.

Solution: use `regex` rather than `re`.
This gives us the ability to use Unicode character classes such as `[[:upper:]]` and `[[:lower:]]`.

Fixes #35
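
To illustrate the problem, here is a minimal sketch of how an ASCII-only pattern splits a word at a diacritic. It is not the code from this PR: the pattern names and the `tokenize` helper are hypothetical, and it uses a standard-library workaround (`[^\W\d_]`) instead of the `regex` module, so it runs without extra dependencies. The real tokenizer does more than this, so its exact output differed.

```python
import re

# ASCII-only word pattern: an assumption about what the old tokenizer resembled.
ASCII_WORD = re.compile(r"[a-zA-Z]+")

# Unicode-aware pattern using only the standard library: in Python 3, \w
# matches Unicode word characters, so [^\W\d_] means "any letter".
UNICODE_WORD = re.compile(r"[^\W\d_]+")


def tokenize(pattern, text):
    """Return the lowercased word tokens that `pattern` finds in `text`."""
    return [match.group(0).lower() for match in pattern.finditer(text)]


word = "H\u00e4ndler"  # "Händler" with a precomposed ä

print(tokenize(ASCII_WORD, word))    # ['h', 'ndler']
print(tokenize(UNICODE_WORD, word))  # ['händler']
```

The `regex` module's `[[:upper:]]` and `[[:lower:]]` classes additionally distinguish case, which the stdlib workaround above does not.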

I'm usually a Ruby developer, not a Python developer. I don't know how to get the regex library working on 2.7, or how to compare the test strings in a Unicode-aware way (they're different on my Mac vs. on Travis; if one passes, the other fails).

But it mostly works
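
One possible cause of the Mac vs. Travis difference, and this is only a guess, is that the test strings are in different Unicode normalization forms (precomposed `ä` versus `a` plus a combining diaeresis). A rough sketch of a normalization-insensitive comparison, where `same_text` is a hypothetical helper and not part of this PR:

```python
import unicodedata


def same_text(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


composed = "h\u00e4ndler"     # ä as a single code point
decomposed = "ha\u0308ndler"  # a followed by a combining diaeresis

print(composed == decomposed)           # False
print(same_text(composed, decomposed))  # True
```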

@robotdana robotdana force-pushed the diacritics branch 3 times, most recently from 57da098 to c8bd64d Compare November 30, 2018 02:52
@myint
Owner

myint commented Dec 23, 2018

Thanks! I haven't tried the regex module before. I'll take a look when I have more time.

@robotdana
Author

robotdana commented Sep 22, 2019

If you're interested, I took the really long way round to fixing this by creating my own spell checker: https://github.com/robotdana/spellr

Successfully merging this pull request may close these issues.

scspell splits words tokens with diacritics inside words
2 participants