Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unify handling of RegExp CharacterClassEscapes \w and \W and Word Asserts \b and \B #525

Merged
merged 2 commits into from
Jun 22, 2016

Conversation

msaboff
Copy link
Contributor

@msaboff msaboff commented Apr 6, 2016

The current specification of CharacterClassEscape's in Regular Expressions introduces surprising behavior of \W when both the unicode and ignoreCase flags are provided.

There are 6 CharacterClassEscapes defined:
\d is digit character and \D is not digit character
\s is space character and \S is not space character
\w is word character and \W is not word character
Furthermore, !\d matches the same as \D, !\s matches the same as \S, and !\w matches the same as \W with the exception of /\W/.ui and the characters 'k', 'K', 's' and 'S'.

The proposal here is increase the characters in the set returned by \w CharacterClassEscape when unicode is specified. Given the current CaseFolding.txt contents, this will add \u017f (lower case long s) and \u212a (Kelvin symbol) to the returned set. It has the correspond effect that those two characters will not be in the set returned by the \W CharacterClassEscape rule.

This change will make \u017f (lower case long s) and \u212a (Kelvin symbol) word characters for unicode regular expressions. It also means that \w == !\W and !\w == \W regardless of the flags provided to the regular expression.

This is an improvement to pull request 516. It resolves the inconsistency without creating an issue with explicit characters classes ([] syntax) that include CharacterClassEscapes that exist in that pull request.

Note that much of the spec diff is due to indentation changes of the character table.

</tbody>
</table>
</figure>
1. If _Unicode_ is *true*, then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to restrict to "Unicode is true", since the following steps will add no character when Unicode is false.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about eliminating the Unicode check, however it shows that the change is there for Unicode Word character matching.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that this should be expressed by the "Assert" I'm proposing below, instead of by adding an explicit check. The relevant distinction between "Unicode" and "non-Unicode" is already done in the Canonicalize() abstract operation; following the DRY principle, it is better not to repeat that distinction.

@claudepache
Copy link
Contributor

(You should have said: "Fixes #512" in your comment, in order to autolink that Pull Request with Issue 512 (the original Issue).)

@msaboff
Copy link
Contributor Author

msaboff commented Apr 14, 2016

I'm fine with eliminating the If Unicode check and adding the Assert.

@goyakin
Copy link
Member

goyakin commented Apr 19, 2016

This is going to cause a difference between what \w and \W match versus \b and \B. Semantically \b matches a word boundary, i.e. between \w and \W. However, this isn't going to hold anymore with this change.

Rather than expanding \w only with the additional characters that canonicalize to the 63 characters specifically called out, why don't we generalize this and have \w match the alphanumeric characters from the Unicode Character Database plus _ when the Unicode flag is present? There is already precedent in other languages, e.g. Python, .NET, and Perl.

We can then redefine \b as the boundary between \w and \W in the spec so that they are in accord.

@mathiasbynens
Copy link
Member

@goyakin There was a proposal to make \w more Unicode-aware, but it was decided against in favor of \p{…}.

@goyakin
Copy link
Member

goyakin commented Apr 20, 2016

@mathiasbynens I wasn't aware of that. Seems like the reason was backward compatibility. Since this PR is a breaking change in that regard, if we're going down this route, I think we should go ahead and accept all alphanumeric characters instead.

If we want to keep backward compatibility though, it might be better to disallow canonicalizing characters > 127 to <= 127, as mentioned in the email thread you linked to.

@hashseed
Copy link

While I'm not entirely happy with this proposal, I don't see a backward compatibility issue if this is gated by /u, given that /u has not been around for very long yet anyways.

@claudepache
Copy link
Contributor

claudepache commented Apr 27, 2016

This is going to cause a difference between what \w and \W match versus \b and \B. Semantically \b matches a word boundary, i.e. between \w and \W. However, this isn't going to hold anymore with this change.

To be clear, there was already an issue before this change, i.e.:

  • without the change, /^\W\w$/ui will match "ss", but /^.\b.$/ui won’t;
  • with the change, /^\W\w$/ui won’t match "ſs", but /^.\b.$/ui will.

Which gotcha is the worst one, I don’t know. It is possible to also modify \b so that the equivalence holds again. (Although, at this point, I'd be more inclined to let \w just match all Unicode letters.)

@bterlson
Copy link
Member

I am feeling like modifying \b would be the right choice here - it's a vastly smaller change than allowing all Unicode characters right?

@bterlson bterlson added the needs consensus This needs committee consensus before it can be eligible to be merged. label May 6, 2016
@msaboff
Copy link
Contributor Author

msaboff commented May 18, 2016

I'm fine with also modifying \b so that the equivalence holds once again. I'll try to get the spec changes ready for the May meeting.

@bterlson
Copy link
Member

We have consensus on this PR with the addition aligning \b and \B. @msaboff can you update this PR with the appropriate changes to \b?

@bterlson
Copy link
Member

@mathiasbynens thanks for raising this issue!

@hashseed
Copy link

hashseed commented May 24, 2016

I gave the matching of \b and \B a read and it seems to me that they are currently not actually affected by case-insensitive canonicalization at all:

The production Assertion::\b evaluates by returning an internal AssertionTester closure that takes a State argument x and performs the following steps when evaluated:
Let e be x's endIndex.
Call IsWordChar(e-1) and let a be the Boolean result.
Call IsWordChar(e) and let b be the Boolean result.
If a is true and b is false, return true.
If a is false and b is true, return true.
Return false.

And

The abstract operation IsWordChar takes an integer parameter e and performs the following steps:

If e is -1 or e is InputLength, return false.
Let c be the character Input[e].
If c is one of the sixty-three characters below, return true.
a b c d e f g h i j k l m n o p q r s t u v w x y z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
0 1 2 3 4 5 6 7 8 9 _

So contrary to what was believed in yesterday's meeting, \b and \B actually work fine right now, even though /\b/ui is not equivalent to /(?:(?<=\w)(?=\W))|(?:(?<=\w)(?=\W))/ui (if we actually had lookbehind assertions). The Kelvin sign and sharp-S are not considered word characters wrt \b and \B.

At the least the existing definition of \b and \B would have to be rewritten to be coupled to \w and \W in the first place. Or do you simply want to add sharp-S and Kelvin sign to the list for IsWordChar for /ui regexps?

@ljharb
Copy link
Member

ljharb commented May 24, 2016

@hashseed imo the goal is to make the following true, in all modes:

  • /W match all the characters not in /w (currently false in /ui)
  • \B match all the characters not in \b (already the case)
  • (if I understand correctly) /B to match all the characters in \w (currently false in /ui)

Rephrased, if Kelvin and sharp-S are matched by \w, then they are word characters, and \B should match them.

(someone please correct me if I've misstated the relationship between \B and \w)

@claudepache
Copy link
Contributor

@ljharb \b and \B are zero-length assertions, they don’t match characters.

@claudepache
Copy link
Contributor

So contrary to what was believed in yesterday's meeting, \b and \B actually work fine right now, even though /\b/ui is not equivalent to /(?:(?<=\w)(?=\W))|(?:(?<=\w)(?=\W))/ui

I’m confused, because I thought the the issue was precisely that /\b/ui is not equivalent to /(?:(?<=\w)(?=\W))|(?:(?<=\w)(?=\W))/ui. If not, what the issue was supposed to be?

@ljharb
Copy link
Member

ljharb commented May 24, 2016

@claudepache lol thank you, i was sure i was going to get that wrong. hopefully someone can come in and state it better than i.

Suffice to say, within the same mode, \b, \B, \w, and \W should certainly be logically self-consistent.

@msaboff
Copy link
Contributor Author

msaboff commented May 24, 2016

So contrary to what was believed in yesterday's meeting, \b and \B actually work fine right now, even though /\b/ui is not equivalent to /(?:(?<=\w)(?=\W))|(?:(?<=\w)(?=\W))/ui (if we actually had lookbehind assertions). The Kelvin sign and sharp-S are not considered word characters wrt \b and \B.

From our discussion yesterday, for /ui patterns, we agreed that small sharp-S and kelvin ARE ward characters. For all other flags, they are not word characters.

@hashseed
Copy link

@claudepache that's what I meant. \w and \b are not consistent currently for /ui.
@msaboff thanks for clearing up.

@msaboff
Copy link
Contributor Author

msaboff commented Jun 8, 2016

I updated the spec to reflect what we agreed on at the May meeting.

I created a common abstract operation, WordCharacters and changed so that both the word asserts and \w and \W CharacterClassEscapes use the new abstract operation.

@msaboff msaboff changed the title Unify handling of RegExp CharacterClassEscapes \w and \W Unify handling of RegExp CharacterClassEscapes \w, \W, \b and \B Jun 8, 2016
@msaboff msaboff changed the title Unify handling of RegExp CharacterClassEscapes \w, \W, \b and \B Unify handling of RegExp CharacterClassEscapes \w and \W and Word Asserts \b and \B Jun 8, 2016
@bterlson
Copy link
Member

bterlson commented Jun 8, 2016

ping @littledan, @efaust. I'll take a look tomorrow!

@bterlson
Copy link
Member

bterlson commented Jun 8, 2016

@msaboff can you also rebase?

Michael Saboff and others added 2 commits June 8, 2016 11:31
Created a new abstract operation "WordCharacters()" that is used by
both IsWordChar() for word assertions and \w/\W CharacterClassEscapes.
@msaboff
Copy link
Contributor Author

msaboff commented Jun 8, 2016

Rebased.

@msaboff
Copy link
Contributor Author

msaboff commented Jun 20, 2016

@bterlson @littledan @efaust Can I get closure on this change?

@bterlson
Copy link
Member

I'll try to review this today (still out on a wedding break but festivities are now over :))

@bterlson
Copy link
Member

Sorry for the delay. Looks good to me (other than some minor ecmarkup/spec convention things which I'll just fix up in a subsequent commit).

@bterlson bterlson merged commit 996af87 into tc39:master Jun 22, 2016
mathiasbynens added a commit to mathiasbynens/regexpu-fixtures that referenced this pull request Jun 23, 2016
Per ES6, `/\W/iu` matched U+017F, U+212A, and, surprisingly, `K` and `S`.

This is no longer the case now that tc39/ecma262#525 is merged.

Ref. mathiasbynens/regexpu-core#8.
mathiasbynens added a commit to mathiasbynens/regexpu-core that referenced this pull request Jun 23, 2016
Per ES6, `/\W/iu` matched U+017F, U+212A, and, surprisingly, `K` and `S`.

This is no longer the case now that tc39/ecma262#525 is merged.

Ref. #8.
Ref. mathiasbynens/regexpu-fixtures@81eeb14.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs consensus This needs committee consensus before it can be eligible to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants