Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Chosen for non-ASCII languages. #2877

Merged
merged 4 commits into from
Sep 5, 2017
Merged

Fix Chosen for non-ASCII languages. #2877

merged 4 commits into from
Sep 5, 2017

Conversation

adunkman
Copy link
Contributor

@harvesthq/chosen-developers to fix #2821.

This tweaks the search regular expression for use in non-ASCII languages. The \w (word character) and \b (word boundary) characters are ASCII-only in JavaScript (even with the unicode flag set), so I attempted a “good enough” solution using \s (whitespace characters) to achieve a similar effect. This clearly isn’t the best, but for the test cases I imagined it seemed to work well enough.

Is there a test case you can think of which this doesn’t work well? If so, I’ll write up a test (or you can) and we can adjust the solution.

I originally attempted to use unicode character ranges to better represent “word characters” — but that quickly got out of hand, resulting in a crazy regular expression which performed quite poorly — which resulted in switching to a “good enough” whitespace approach.
[A-Za-zªµºÀ-ÖØ-öø-ˁˆ-ˑˠ-ˤˬˮͰ-ʹͶ-ͷͺ-ͽΆΈ-ΊΌΎ-ΡΣ-ϵϷ-ҁҊ-ԣԱ-Ֆՙա-ևא-תװ-ײء-يٮ-ٯٱ-ۓەۥ-ۦۮ-ۯۺ-ۼۿܐܒ-ܯݍ-ޥޱߊ-ߪߴ-ߵߺऄ-हऽॐक़-ॡॱ-ॲॻ-ॿঅ-ঌএ-ঐও-নপ-রলশ-হঽৎড়-ঢ়য়-ৡৰ-ৱਅ-ਊਏ-ਐਓ-ਨਪ-ਰਲ-ਲ਼ਵ-ਸ਼ਸ-ਹਖ਼-ੜਫ਼ੲ-ੴઅ-ઍએ-ઑઓ-નપ-રલ-ળવ-હઽૐૠ-ૡଅ-ଌଏ-ଐଓ-ନପ-ରଲ-ଳଵ-ହଽଡ଼-ଢ଼ୟ-ୡୱஃஅ-ஊஎ-ஐஒ-கங-சஜஞ-டண-தந-பம-ஹௐఅ-ఌఎ-ఐఒ-నప-ళవ-హఽౘ-ౙౠ-ౡಅ-ಌಎ-ಐಒ-ನಪ-ಳವ-ಹಽೞೠ-ೡഅ-ഌഎ-ഐഒ-നപ-ഹഽൠ-ൡൺ-ൿඅ-ඖක-නඳ-රලව-ෆก-ะา-ำเ-ๆກ-ຂຄງ-ຈຊຍດ-ທນ-ຟມ-ຣລວສ-ຫອ-ະາ-ຳຽເ-ໄໆໜ-ໝༀཀ-ཇཉ-ཬྈ-ྋက-ဪဿၐ-ၕၚ-ၝၡၥ-ၦၮ-ၰၵ-ႁႎႠ-Ⴥა-ჺჼᄀ-ᅙᅟ-ᆢᆨ-ᇹሀ-ቈቊ-ቍቐ-ቖቘቚ-ቝበ-ኈኊ-ኍነ-ኰኲ-ኵኸ-ኾዀዂ-ዅወ-ዖዘ-ጐጒ-ጕጘ-ፚᎀ-ᎏᎠ-Ᏼᐁ-ᙬᙯ-ᙶᚁ-ᚚᚠ-ᛪᜀ-ᜌᜎ-ᜑᜠ-ᜱᝀ-ᝑᝠ-ᝬᝮ-ᝰក-ឳៗៜᠠ-ᡷᢀ-ᢨᢪᤀ-ᤜᥐ-ᥭᥰ-ᥴᦀ-ᦩᧁ-ᧇᨀ-ᨖᬅ-ᬳᭅ-ᭋᮃ-ᮠᮮ-ᮯᰀ-ᰣᱍ-ᱏᱚ-ᱽᴀ-ᶿḀ-ἕἘ-Ἕἠ-ὅὈ-Ὅὐ-ὗὙὛὝὟ-ώᾀ-ᾴᾶ-ᾼιῂ-ῄῆ-ῌῐ-ΐῖ-Ίῠ-Ῥῲ-ῴῶ-ῼⁱⁿₐ-ₔℂℇℊ-ℓℕℙ-ℝℤΩℨK-ℭℯ-ℹℼ-ℿⅅ-ⅉⅎↃ-ↄⰀ-Ⱞⰰ-ⱞⱠ-Ɐⱱ-ⱽⲀ-ⳤⴀ-ⴥⴰ-ⵥⵯⶀ-ⶖⶠ-ⶦⶨ-ⶮⶰ-ⶶⶸ-ⶾⷀ-ⷆⷈ-ⷎⷐ-ⷖⷘ-ⷞⸯ々-〆〱-〵〻-〼ぁ-ゖゝ-ゟァ-ヺー-ヿㄅ-ㄭㄱ-ㆎㆠ-ㆷㇰ-ㇿ㐀-䶵一-鿃ꀀ-ꒌꔀ-ꘌꘐ-ꘟꘪ-ꘫꙀ-ꙟꙢ-ꙮꙿ-ꚗꜗ-ꜟꜢ-ꞈꞋ-ꞌꟻ-ꠁꠃ-ꠅꠇ-ꠊꠌ-ꠢꡀ-ꡳꢂ-ꢳꤊ-ꤥꤰ-ꥆꨀ-ꨨꩀ-ꩂꩄ-ꩋ가-힣豈-鶴侮-頻並-龎ff-stﬓ-ﬗיִײַ-ﬨשׁ-זּטּ-לּמּנּ-סּףּ-פּצּ-ﮱﯓ-ﴽﵐ-ﶏﶒ-ﷇﷰ-ﷻﹰ-ﹴﹶ-ﻼA-Za-zヲ-하-ᅦᅧ-ᅬᅭ-ᅲᅳ-ᅵ]|[\ud840-\ud868][\udc00-\udfff]|\ud800[\udc00-\udc0b\udc0d-\udc26\udc28-\udc3a\udc3c-\udc3d\udc3f-\udc4d\udc50-\udc5d\udc80-\udcfa\ude80-\ude9c\udea0-\uded0\udf00-\udf1e\udf30-\udf40\udf42-\udf49\udf80-\udf9d\udfa0-\udfc3\udfc8-\udfcf]|\ud801[\udc00-\udc9d]|\ud802[\udc00-\udc05\udc08\udc0a-\udc35\udc37-\udc38\udc3c\udc3f\udd00-\udd15\udd20-\udd39\ude00\ude10-\ude13\ude15-\ude17\ude19-\ude33]|\ud808[\udc00-\udf6e]|\ud835[\udc00-\udc54\udc56-\udc9c\udc9e-\udc9f\udca2\udca5-\udca6\udca9-\udcac\udcae-\udcb9\udcbb\udcbd-\udcc3\udcc5-\udd05\udd07-\udd0a\udd0d-\udd14\udd16-\udd1c\udd1e-\udd39\udd3b-\udd3e\udd40-\udd44\udd46\udd4a-\udd50\udd52-\udea5\udea8-\udec0\udec2-\udeda\udedc-\udefa\udefc-\udf14\udf16-\udf34\udf36-\udf4e\udf50-\udf6e\udf70-\udf88\udf8a-\udfa8\udfaa-\udfc2\udfc4-\udfcb]|\ud869[\udc00-\uded6]|\ud87e[\udc00-\ude1d]

One caveat to the whitespace approach is that we now occasionally match a whitespace character as the first character of our match. I compensated in the highlighter (and wrote a test for it) to adjust the start index when this is the case.

Would it be possible for those with experience in non-ASCII languages verify that this branch works as expected for you? I wrote the tests in Chinese, but… I can only count to 3 in Chinese, so y’all are definitely more qualified to test this.

Here’s this code on jsbin for quick testing — you should be able to quickly edit the HTML to adjust to your language and see if it displays/searches correctly.

@adunkman
Copy link
Contributor Author

@vandrijevik could you perhaps lend your skills in verifying this fix? I forgot about your talents 🙈 .

@vandrijevik
Copy link

vandrijevik commented Aug 31, 2017

Sure thing! I just tested the JSbin with Macedonian words (using the Cyrillic alphabet), and the single-select and multi-select fields worked as I would expect them to (this goes for both the text field, and the list of options).

Copy link
Contributor

@satchmorun satchmorun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My review consists fundamentally of a single note: the non-capturing group could be a capturing group and save us some work in the loop.

Additionally, I have a suggestion that moves the responsibility for knowing where the match actually starts into search_string_match and adds an additional check.

The capture-vs-noncapture is the main idea. The rest is optional, even if I think it's a good idea.

@@ -217,7 +218,7 @@ class AbstractChosen
this.winnow_results_set_highlight()

get_search_regex: (escaped_search_string) ->
regex_string = if @search_contains then escaped_search_string else "\\b#{escaped_search_string}\\w*\\b"
regex_string = if @search_contains then escaped_search_string else "(?:^|\\s)#{escaped_search_string}[^\\s]*"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the things we can do here is just s/?:// – make the the non-capturing group a capturing group instead.

This would allow the check in winnow_results to be a simple:

startpos += 1 if search_match[1]

Which is nice, because it keeps us from having to do another regex test in the loop for potentially numerous matches.

And my further suggestion would be to delete the line from winnow_results altogether, and have get_search_text and search_string_match (which are related to each other both by relevance and also in-file proximity) look like:

  get_search_regex: (escaped_search_string) ->
    regex_string = if @search_contains then escaped_search_string else "(^|\\s)#{escaped_search_string}[^\\s]*"
    regex_string = "^#{regex_string}" unless @enable_split_word_search or @search_contains
    regex_flag = if @case_sensitive_search then "" else "i"
    new RegExp(regex_string, regex_flag)

  search_string_match: (search_string, regex) ->
    match = regex.exec(search_string)
    match.index += 1 if !@search_contains && match?[1] # <--- do the potential munging here
    match

(I've implemented this change locally, and it passes all the tests.)

This way, the consuming code doesn't have to care about how the match is made, and making this change locally also helped me see that we weren't checking @search_contains for the match, which is unnecessary at the moment, but satisfies my OCD "completeness" sense.

(The reason it's unnecessary at the moment is that get_search_text for both the jquery and prototype implementations strips leading and trailing whitespace. If that ever changed, though, adding the parallel @search_contains check would be robust against that. Granted, that's probably not going to change, but like I said: OCD completeness.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love it! Love all of it. ❤️

Updated in ffd5919 and 9d613cf (kept the history so everyone can follow the discussion if need be later on).

Copy link
Contributor

@satchmorun satchmorun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huzzah!

@tjschuck
Copy link
Member

@adunkman Do you mind squashing these and rewriting your commit message before merging this? Thanks!

@@ -217,13 +217,15 @@ class AbstractChosen
this.winnow_results_set_highlight()

get_search_regex: (escaped_search_string) ->
regex_string = if @search_contains then escaped_search_string else "\\b#{escaped_search_string}\\w*\\b"
regex_string = if @search_contains then escaped_search_string else "(^|\\s)#{escaped_search_string}[^\\s]*"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe word boundary matches more than just spaces, so we probably should extend this list.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly which characters are word characters depends on the regex flavor you're working with. In most flavors, characters that are matched by the short-hand character class \w are the characters that are treated as word characters by word boundaries. Java is an exception. Java supports Unicode for \b but not for \w.
http://regular-expressions.mobi/wordboundaries.html?wlr=1

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So probably [^\w] would work

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wait, like you said, \w is ascii only, but still a list would be nice then.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can easily be demonstrated when adding a test for options with parentheses, “Cocos (Keeling) Islands” for example. Currently “keel” does give a match, but with this change it wouldn’t.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@koenpunt good point!

First, @adunkman mentioned trying to use a list but switching back to a "good enough" approach when that didn't perform well.

With that i mind, I can think of a couple of approaches:

  1. Stick with the "good enough" approach, but explicitly not try to cover all of the possible unicode line breaks. Maybe something like just combining the \s approach with a \b approach (e.g. (^|\\s|\\b)). This should preserve all the ASCII word boundaries we expect, and catch whitespace boundaries which, while not perfect, still allows for better results for non-ASCII languages than what we have today

  2. Actually try to adhere to all the unicode word boundary possibilities. Wikimedia has a unicodejs library that could be instructive. Specifically, its isbreak method. This would increase code size by quite a bit, and probably decrease performance, since we wouldn't be in re.exec-land anymore.

I'd recommend (1), because (2) is a much bigger endeavor than (1) is. And (1) is still better than what we've got now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@koenpunt

I paired with @adunkman on implementing idea (1) above and added this commit.

Let us know what you think!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that approach is good enough, although I dont believe the tests around it are very thorough; only testing it when the special character is at the begin of the value. I think it would be good to have some more test cases where those characters appear in the middle and the end.

Restores old word-boundary matching behavior while also preserving the
new whitespace-based word-boundary matching for non-ASCII languages.

Adds a test for the common word-starters that we think are especially
important.
@jacob8000
Copy link

Great changes!
When will it be released?

@adunkman
Copy link
Contributor Author

adunkman commented Sep 5, 2017

I think that approach is good enough, although I dont believe the tests around it are very thorough; only testing it when the special character is at the begin of the value. I think it would be good to have some more test cases where those characters appear in the middle and the end.

I’m going to take “some test cases” as better than “no test cases” and run with this — it’s better than what’s currently have in master, and it’s an active problem for us in Harvest. We can follow-up with additional tests as needed!

@adunkman adunkman merged commit fb6e071 into master Sep 5, 2017
@adunkman adunkman deleted the multibyte-characters branch September 5, 2017 16:41
@koenpunt
Copy link
Collaborator

koenpunt commented Sep 5, 2017

I’m going to take “some test cases” as better than “no test cases” and run with this

I figured that adding a few additional testcases isn't that much effort, but hey ¯_(ツ)_/¯

@tjschuck
Copy link
Member

tjschuck commented Sep 5, 2017

@jacob8000 This has now been released as part of version 1.8.1.

@jacob8000
Copy link

It works perfectly. Thank you very much!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug when searching non-ASCII languages
6 participants