-
Notifications
You must be signed in to change notification settings - Fork 4.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Chosen for non-ASCII languages. #2877
Conversation
@vandrijevik could you perhaps lend your skills in verifying this fix? I forgot about your talents 🙈 . |
Sure thing! I just tested the JSbin with Macedonian words (using the Cyrillic alphabet), and the single-select and multi-select fields worked as I would expect them to (this goes for both the text field, and the list of options). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My review consists fundamentally of a single note: the non-capturing group could be a capturing group and save us some work in the loop.
Additionally, I have a suggestion that moves the responsibility for knowing where the match actually starts into search_string_match
and adds an additional check.
The capture-vs-noncapture is the main idea. The rest is optional, even if I think it's a good idea.
coffee/lib/abstract-chosen.coffee
Outdated
@@ -217,7 +218,7 @@ class AbstractChosen | |||
this.winnow_results_set_highlight() | |||
|
|||
get_search_regex: (escaped_search_string) -> | |||
regex_string = if @search_contains then escaped_search_string else "\\b#{escaped_search_string}\\w*\\b" | |||
regex_string = if @search_contains then escaped_search_string else "(?:^|\\s)#{escaped_search_string}[^\\s]*" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One of the things we can do here is just s/?://
– make the the non-capturing group a capturing group instead.
This would allow the check in winnow_results
to be a simple:
startpos += 1 if search_match[1]
Which is nice, because it keeps us from having to do another regex test
in the loop for potentially numerous matches.
And my further suggestion would be to delete the line from winnow_results
altogether, and have get_search_text
and search_string_match
(which are related to each other both by relevance and also in-file proximity) look like:
get_search_regex: (escaped_search_string) ->
regex_string = if @search_contains then escaped_search_string else "(^|\\s)#{escaped_search_string}[^\\s]*"
regex_string = "^#{regex_string}" unless @enable_split_word_search or @search_contains
regex_flag = if @case_sensitive_search then "" else "i"
new RegExp(regex_string, regex_flag)
search_string_match: (search_string, regex) ->
match = regex.exec(search_string)
match.index += 1 if !@search_contains && match?[1] # <--- do the potential munging here
match
(I've implemented this change locally, and it passes all the tests.)
This way, the consuming code doesn't have to care about how the match is made, and making this change locally also helped me see that we weren't checking @search_contains
for the match, which is unnecessary at the moment, but satisfies my OCD "completeness" sense.
(The reason it's unnecessary at the moment is that get_search_text
for both the jquery and prototype implementations strips leading and trailing whitespace. If that ever changed, though, adding the parallel @search_contains
check would be robust against that. Granted, that's probably not going to change, but like I said: OCD completeness.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Huzzah!
@adunkman Do you mind squashing these and rewriting your commit message before merging this? Thanks! |
coffee/lib/abstract-chosen.coffee
Outdated
@@ -217,13 +217,15 @@ class AbstractChosen | |||
this.winnow_results_set_highlight() | |||
|
|||
get_search_regex: (escaped_search_string) -> | |||
regex_string = if @search_contains then escaped_search_string else "\\b#{escaped_search_string}\\w*\\b" | |||
regex_string = if @search_contains then escaped_search_string else "(^|\\s)#{escaped_search_string}[^\\s]*" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe word boundary matches more than just spaces, so we probably should extend this list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly which characters are word characters depends on the regex flavor you're working with. In most flavors, characters that are matched by the short-hand character class \w are the characters that are treated as word characters by word boundaries. Java is an exception. Java supports Unicode for \b but not for \w.
http://regular-expressions.mobi/wordboundaries.html?wlr=1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So probably [^\w]
would work
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh wait, like you said, \w
is ascii only, but still a list would be nice then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can easily be demonstrated when adding a test for options with parentheses, “Cocos (Keeling) Islands” for example. Currently “keel” does give a match, but with this change it wouldn’t.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@koenpunt good point!
First, @adunkman mentioned trying to use a list but switching back to a "good enough" approach when that didn't perform well.
With that i mind, I can think of a couple of approaches:
-
Stick with the "good enough" approach, but explicitly not try to cover all of the possible unicode line breaks. Maybe something like just combining the
\s
approach with a\b
approach (e.g.(^|\\s|\\b)
). This should preserve all the ASCII word boundaries we expect, and catch whitespace boundaries which, while not perfect, still allows for better results for non-ASCII languages than what we have today -
Actually try to adhere to all the unicode word boundary possibilities. Wikimedia has a unicodejs library that could be instructive. Specifically, its
isbreak
method. This would increase code size by quite a bit, and probably decrease performance, since we wouldn't be inre.exec
-land anymore.
I'd recommend (1), because (2) is a much bigger endeavor than (1) is. And (1) is still better than what we've got now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I paired with @adunkman on implementing idea (1) above and added this commit.
Let us know what you think!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that approach is good enough, although I dont believe the tests around it are very thorough; only testing it when the special character is at the begin of the value. I think it would be good to have some more test cases where those characters appear in the middle and the end.
Restores old word-boundary matching behavior while also preserving the new whitespace-based word-boundary matching for non-ASCII languages. Adds a test for the common word-starters that we think are especially important.
Great changes! |
I’m going to take “some test cases” as better than “no test cases” and run with this — it’s better than what’s currently have in master, and it’s an active problem for us in Harvest. We can follow-up with additional tests as needed! |
I figured that adding a few additional testcases isn't that much effort, but hey ¯_(ツ)_/¯ |
@jacob8000 This has now been released as part of version 1.8.1. |
It works perfectly. Thank you very much! |
@harvesthq/chosen-developers to fix #2821.
This tweaks the search regular expression for use in non-ASCII languages. The
\w
(word character) and\b
(word boundary) characters are ASCII-only in JavaScript (even with the unicode flag set), so I attempted a “good enough” solution using\s
(whitespace characters) to achieve a similar effect. This clearly isn’t the best, but for the test cases I imagined it seemed to work well enough.Is there a test case you can think of which this doesn’t work well? If so, I’ll write up a test (or you can) and we can adjust the solution.
I originally attempted to use unicode character ranges to better represent “word characters” — but that quickly got out of hand, resulting in a crazy regular expression which performed quite poorly — which resulted in switching to a “good enough” whitespace approach.
One caveat to the whitespace approach is that we now occasionally match a whitespace character as the first character of our match. I compensated in the highlighter (and wrote a test for it) to adjust the start index when this is the case.
Would it be possible for those with experience in non-ASCII languages verify that this branch works as expected for you? I wrote the tests in Chinese, but… I can only count to 3 in Chinese, so y’all are definitely more qualified to test this.
Here’s this code on jsbin for quick testing — you should be able to quickly edit the HTML to adjust to your language and see if it displays/searches correctly.