Unify handling of RegExp CharacterClassEscapes \w and \W and Word Asserts \b and \B #525

msaboff · 2016-04-06T22:31:03Z

The current specification of CharacterClassEscape's in Regular Expressions introduces surprising behavior of \W when both the unicode and ignoreCase flags are provided.

There are 6 CharacterClassEscapes defined:
\d is digit character and \D is not digit character
\s is space character and \S is not space character
\w is word character and \W is not word character
Furthermore, !\d matches the same as \D, !\s matches the same as \S, and !\w matches the same as \W with the exception of /\W/.ui and the characters 'k', 'K', 's' and 'S'.

The proposal here is increase the characters in the set returned by \w CharacterClassEscape when unicode is specified. Given the current CaseFolding.txt contents, this will add \u017f (lower case long s) and \u212a (Kelvin symbol) to the returned set. It has the correspond effect that those two characters will not be in the set returned by the \W CharacterClassEscape rule.

This change will make \u017f (lower case long s) and \u212a (Kelvin symbol) word characters for unicode regular expressions. It also means that \w == !\W and !\w == \W regardless of the flags provided to the regular expression.

This is an improvement to pull request 516. It resolves the inconsistency without creating an issue with explicit characters classes ([] syntax) that include CharacterClassEscapes that exist in that pull request.

Note that much of the spec diff is due to indentation changes of the character table.

claudepache · 2016-04-07T18:08:31Z

spec.html

+              </tbody>
+            </table>
+          </figure>
+          1. If  _Unicode_ is *true*, then


We don't need to restrict to "Unicode is true", since the following steps will add no character when Unicode is false.

I thought about eliminating the Unicode check, however it shows that the change is there for Unicode Word character matching.

I think that this should be expressed by the "Assert" I'm proposing below, instead of by adding an explicit check. The relevant distinction between "Unicode" and "non-Unicode" is already done in the Canonicalize() abstract operation; following the DRY principle, it is better not to repeat that distinction.

claudepache · 2016-04-07T18:50:16Z

(You should have said: "Fixes #512" in your comment, in order to autolink that Pull Request with Issue 512 (the original Issue).)

msaboff · 2016-04-14T17:52:38Z

I'm fine with eliminating the If Unicode check and adding the Assert.

goyakin · 2016-04-19T19:10:59Z

This is going to cause a difference between what \w and \W match versus \b and \B. Semantically \b matches a word boundary, i.e. between \w and \W. However, this isn't going to hold anymore with this change.

Rather than expanding \w only with the additional characters that canonicalize to the 63 characters specifically called out, why don't we generalize this and have \w match the alphanumeric characters from the Unicode Character Database plus _ when the Unicode flag is present? There is already precedent in other languages, e.g. Python, .NET, and Perl.

We can then redefine \b as the boundary between \w and \W in the spec so that they are in accord.

mathiasbynens · 2016-04-19T20:17:15Z

@goyakin There was a proposal to make \w more Unicode-aware, but it was decided against in favor of \p{…}.

goyakin · 2016-04-20T17:53:37Z

@mathiasbynens I wasn't aware of that. Seems like the reason was backward compatibility. Since this PR is a breaking change in that regard, if we're going down this route, I think we should go ahead and accept all alphanumeric characters instead.

If we want to keep backward compatibility though, it might be better to disallow canonicalizing characters > 127 to <= 127, as mentioned in the email thread you linked to.

hashseed · 2016-04-26T19:47:54Z

While I'm not entirely happy with this proposal, I don't see a backward compatibility issue if this is gated by /u, given that /u has not been around for very long yet anyways.

claudepache · 2016-04-27T16:41:28Z

This is going to cause a difference between what \w and \W match versus \b and \B. Semantically \b matches a word boundary, i.e. between \w and \W. However, this isn't going to hold anymore with this change.

To be clear, there was already an issue before this change, i.e.:

without the change, /^\W\w$/ui will match "ss", but /^.\b.$/ui won’t;
with the change, /^\W\w$/ui won’t match "ſs", but /^.\b.$/ui will.

Which gotcha is the worst one, I don’t know. It is possible to also modify \b so that the equivalence holds again. (Although, at this point, I'd be more inclined to let \w just match all Unicode letters.)

bterlson · 2016-04-29T18:56:01Z

I am feeling like modifying \b would be the right choice here - it's a vastly smaller change than allowing all Unicode characters right?

msaboff · 2016-05-18T01:18:10Z

I'm fine with also modifying \b so that the equivalence holds once again. I'll try to get the spec changes ready for the May meeting.

bterlson · 2016-05-23T13:17:21Z

We have consensus on this PR with the addition aligning \b and \B. @msaboff can you update this PR with the appropriate changes to \b?

bterlson · 2016-05-23T13:18:01Z

@mathiasbynens thanks for raising this issue!

hashseed · 2016-05-24T06:00:10Z

I gave the matching of \b and \B a read and it seems to me that they are currently not actually affected by case-insensitive canonicalization at all:

The production Assertion::\b evaluates by returning an internal AssertionTester closure that takes a State argument x and performs the following steps when evaluated:
Let e be x's endIndex.
Call IsWordChar(e-1) and let a be the Boolean result.
Call IsWordChar(e) and let b be the Boolean result.
If a is true and b is false, return true.
If a is false and b is true, return true.
Return false.

And

The abstract operation IsWordChar takes an integer parameter e and performs the following steps:

If e is -1 or e is InputLength, return false.
Let c be the character Input[e].
If c is one of the sixty-three characters below, return true.
a b c d e f g h i j k l m n o p q r s t u v w x y z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
0 1 2 3 4 5 6 7 8 9 _

So contrary to what was believed in yesterday's meeting, \b and \B actually work fine right now, even though /\b/ui is not equivalent to /(?:(?<=\w)(?=\W))|(?:(?<=\w)(?=\W))/ui (if we actually had lookbehind assertions). The Kelvin sign and sharp-S are not considered word characters wrt \b and \B.

At the least the existing definition of \b and \B would have to be rewritten to be coupled to \w and \W in the first place. Or do you simply want to add sharp-S and Kelvin sign to the list for IsWordChar for /ui regexps?

ljharb · 2016-05-24T07:00:33Z

@hashseed imo the goal is to make the following true, in all modes:

/W match all the characters not in /w (currently false in /ui)
\B match all the characters not in \b (already the case)
(if I understand correctly) /B to match all the characters in \w (currently false in /ui)

Rephrased, if Kelvin and sharp-S are matched by \w, then they are word characters, and \B should match them.

(someone please correct me if I've misstated the relationship between \B and \w)

claudepache · 2016-05-24T07:07:52Z

@ljharb \b and \B are zero-length assertions, they don’t match characters.

claudepache · 2016-05-24T07:13:41Z

So contrary to what was believed in yesterday's meeting, \b and \B actually work fine right now, even though /\b/ui is not equivalent to /(?:(?<=\w)(?=\W))|(?:(?<=\w)(?=\W))/ui

I’m confused, because I thought the the issue was precisely that /\b/ui is not equivalent to /(?:(?<=\w)(?=\W))|(?:(?<=\w)(?=\W))/ui. If not, what the issue was supposed to be?

ljharb · 2016-05-24T07:33:30Z

@claudepache lol thank you, i was sure i was going to get that wrong. hopefully someone can come in and state it better than i.

Suffice to say, within the same mode, \b, \B, \w, and \W should certainly be logically self-consistent.

msaboff · 2016-05-24T08:04:16Z

So contrary to what was believed in yesterday's meeting, \b and \B actually work fine right now, even though /\b/ui is not equivalent to /(?:(?<=\w)(?=\W))|(?:(?<=\w)(?=\W))/ui (if we actually had lookbehind assertions). The Kelvin sign and sharp-S are not considered word characters wrt \b and \B.

From our discussion yesterday, for /ui patterns, we agreed that small sharp-S and kelvin ARE ward characters. For all other flags, they are not word characters.

hashseed · 2016-05-24T20:05:24Z

@claudepache that's what I meant. \w and \b are not consistent currently for /ui.
@msaboff thanks for clearing up.

msaboff · 2016-06-08T00:23:17Z

I updated the spec to reflect what we agreed on at the May meeting.

I created a common abstract operation, WordCharacters and changed so that both the word asserts and \w and \W CharacterClassEscapes use the new abstract operation.

bterlson · 2016-06-08T00:31:59Z

ping @littledan, @efaust. I'll take a look tomorrow!

bterlson · 2016-06-08T00:34:18Z

@msaboff can you also rebase?

Created a new abstract operation "WordCharacters()" that is used by both IsWordChar() for word assertions and \w/\W CharacterClassEscapes.

msaboff · 2016-06-08T18:33:16Z

Rebased.

msaboff · 2016-06-20T17:43:55Z

@bterlson @littledan @efaust Can I get closure on this change?

bterlson · 2016-06-20T17:55:17Z

I'll try to review this today (still out on a wedding break but festivities are now over :))

bterlson · 2016-06-22T23:04:31Z

Sorry for the delay. Looks good to me (other than some minor ecmarkup/spec convention things which I'll just fix up in a subsequent commit).

Per ES6, `/\W/iu` matched U+017F, U+212A, and, surprisingly, `K` and `S`. This is no longer the case now that tc39/ecma262#525 is merged. Ref. mathiasbynens/regexpu-core#8.

Per ES6, `/\W/iu` matched U+017F, U+212A, and, surprisingly, `K` and `S`. This is no longer the case now that tc39/ecma262#525 is merged. Ref. #8. Ref. mathiasbynens/regexpu-fixtures@81eeb14.

mathiasbynens · 2016-06-23T10:58:06Z

Bug tickets:

V8: https://bugs.chromium.org/p/v8/issues/detail?id=5080
JavaScriptCore: https://bugs.webkit.org/show_bug.cgi?id=159055
SpiderMonkey: https://bugzilla.mozilla.org/show_bug.cgi?id=1281739
ChakraCore: [RegExp] Match updated spec for /\w/iu and /\W/iu chakra-core/ChakraCore#1181

(Also listed here: https://mathiasbynens.be/notes/es6-unicode-regex#support)

msaboff mentioned this pull request Apr 6, 2016

RegExp Processing of CharacterClassEscape for \W should be the same as !\w with unicode and ignoreCase flags. #516

Closed

claudepache reviewed Apr 7, 2016
View reviewed changes

bterlson added the needs consensus This needs committee consensus before it can be eligible to be merged. label May 6, 2016

msaboff changed the title ~~Unify handling of RegExp CharacterClassEscapes \w and \W~~ Unify handling of RegExp CharacterClassEscapes \w, \W, \b and \B Jun 8, 2016

msaboff changed the title ~~Unify handling of RegExp CharacterClassEscapes \w, \W, \b and \B~~ Unify handling of RegExp CharacterClassEscapes \w and \W and Word Asserts \b and \B Jun 8, 2016

Michael Saboff and others added 2 commits June 8, 2016 11:31

Proposed RegExp CharacterClassEscape changes for \w

0d3d71b

Updated per what was agreed at the May 2016 meeting.

ab3380d

Created a new abstract operation "WordCharacters()" that is used by both IsWordChar() for word assertions and \w/\W CharacterClassEscapes.

msaboff force-pushed the regexp-changes branch from bc68fa8 to ab3380d Compare June 8, 2016 18:32

bterlson merged commit 996af87 into tc39:master Jun 22, 2016

This was referenced Jun 23, 2016

[RegExp] Match updated spec for /\w/iu and /\W/iu chakra-core/ChakraCore#1181

Closed

Match updated spec for /\w/iu and /\W/iu mathiasbynens/regexpu-core#8

Closed

mathiasbynens mentioned this pull request Jun 23, 2016

[RegExp] case-insensitive matching misses characters chakra-core/ChakraCore#517

Closed

This was referenced Jun 27, 2016

Update RegExp "u" flag, case folding test compat-table/compat-table#826

Closed

Update RegExp tests for \w, \W, \b & \B tc39/test262#705

Open

mathiasbynens mentioned this pull request Apr 12, 2017

Should the optimizer avoid rewriting to \W? DmitrySoshnikov/regexp-tree#53

Open

mdjermanovic mentioned this pull request Jul 9, 2022

fix: no-warning-comments rule escapes special RegEx characters in terms eslint/eslint#16090

Merged

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unify handling of RegExp CharacterClassEscapes \w and \W and Word Asserts \b and \B #525

Unify handling of RegExp CharacterClassEscapes \w and \W and Word Asserts \b and \B #525

msaboff commented Apr 6, 2016

claudepache Apr 7, 2016

msaboff Apr 8, 2016

claudepache Apr 8, 2016

claudepache commented Apr 7, 2016

msaboff commented Apr 14, 2016

goyakin commented Apr 19, 2016 •

edited

Loading

mathiasbynens commented Apr 19, 2016

goyakin commented Apr 20, 2016

hashseed commented Apr 26, 2016

claudepache commented Apr 27, 2016 •

edited

Loading

bterlson commented Apr 29, 2016

msaboff commented May 18, 2016

bterlson commented May 23, 2016

bterlson commented May 23, 2016

hashseed commented May 24, 2016 •

edited

Loading

ljharb commented May 24, 2016

claudepache commented May 24, 2016

claudepache commented May 24, 2016

ljharb commented May 24, 2016

msaboff commented May 24, 2016

hashseed commented May 24, 2016

msaboff commented Jun 8, 2016

bterlson commented Jun 8, 2016

bterlson commented Jun 8, 2016

msaboff commented Jun 8, 2016

msaboff commented Jun 20, 2016

bterlson commented Jun 20, 2016

bterlson commented Jun 22, 2016

mathiasbynens commented Jun 23, 2016 •

edited

Loading

Unify handling of RegExp CharacterClassEscapes \w and \W and Word Asserts \b and \B #525

Unify handling of RegExp CharacterClassEscapes \w and \W and Word Asserts \b and \B #525

Conversation

msaboff commented Apr 6, 2016

claudepache Apr 7, 2016

Choose a reason for hiding this comment

msaboff Apr 8, 2016

Choose a reason for hiding this comment

claudepache Apr 8, 2016

Choose a reason for hiding this comment

claudepache commented Apr 7, 2016

msaboff commented Apr 14, 2016

goyakin commented Apr 19, 2016 • edited Loading

mathiasbynens commented Apr 19, 2016

goyakin commented Apr 20, 2016

hashseed commented Apr 26, 2016

claudepache commented Apr 27, 2016 • edited Loading

bterlson commented Apr 29, 2016

msaboff commented May 18, 2016

bterlson commented May 23, 2016

bterlson commented May 23, 2016

hashseed commented May 24, 2016 • edited Loading

ljharb commented May 24, 2016

claudepache commented May 24, 2016

claudepache commented May 24, 2016

ljharb commented May 24, 2016

msaboff commented May 24, 2016

hashseed commented May 24, 2016

msaboff commented Jun 8, 2016

bterlson commented Jun 8, 2016

bterlson commented Jun 8, 2016

msaboff commented Jun 8, 2016

msaboff commented Jun 20, 2016

bterlson commented Jun 20, 2016

bterlson commented Jun 22, 2016

mathiasbynens commented Jun 23, 2016 • edited Loading

goyakin commented Apr 19, 2016 •

edited

Loading

claudepache commented Apr 27, 2016 •

edited

Loading

hashseed commented May 24, 2016 •

edited

Loading

mathiasbynens commented Jun 23, 2016 •

edited

Loading