Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

better remove_ordinals regex pattern #96

Closed
wants to merge 4 commits into from
Closed

Conversation

missinglink
Copy link
Member

see #94 for more info.

@orangejulius could you review this plz?

closes #94

[edit] so after much yak shaving I have come up with the perfect megaregex replace( "(?i)((^| )((1)st?|(2)nd?|(3)rd?|([4-9])th?)|(([0-9]*)(1[0-9])th?)|(([0-9]*[02-9])((1)st?|(2)nd?|(3)rd?|([04-9])th?))($| ))", "$2$4$5$6$7$9$10$12$14$15$16$17$18"

...which makes my eyes hurt but makes all the tests pass.

@trescube suggested we should maybe leave it a bit more relaxed so that users who are not perfect at english can still find what they need.

so there are 2 options, either we go with the super complex one above or we modify the existing one to support autocomplete with the simpler: replace( "(([0-9])(st?|nd?|rd?|th?))", "$2" ) or similar

@@ -148,8 +148,8 @@ function generate(){
},
"remove_ordinals" : {
"type" : "pattern_replace",
"pattern": "(([0-9])(st|nd|rd|th))",
"replacement": "$2"
"pattern": "(([0-9]*1)st?|([0-9]*2)nd?|([0-9]*3)rd?|([0-9]*[456789])th?|([0-9]+0)th?|([0-9]*1[0-9])th?)",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tim and Eric mind blown gif against the stars

This is quite the regex. Looks like it does what you want though

@orangejulius
Copy link
Member

This looks like an improvement over the old regex for sure. I'd be very interested to do some testing of this on dev.

Somewhat related: have we tested not removing ordinals? There's definitely a big signal difference between 5th and 5. Just like in #93 it's not possible to find places named "whole food market" with kstemming because places named "whole foods" evaluate to the same tokens, having 5 and 5th indistinguishable seems less than ideal.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

possible regex issue
2 participants