Backwards-compatible syntax #2
Comments
Nice! |
During the November TC39 meeting, @waldemarhorwat expressed concerns w.r.t. backwards-incompatible syntax, and @michaelficarra expressed concerns with introducing a new flag. I believe this solution might address both concerns. Waldemar, Michael, did I get that right? |
Correct, this is possible in Unicode regexps using currently unused backslash escapes. Still not totally convinced it's common enough to warrant space in our already quite complex Pattern grammar. |
My main concern which I expressed at the meeting is that compatibility concerns with trying to retrofit these into existing character classes might drive us to make syntax that has seriously confusing gotchas, trapdoors, or special cases. I'd prefer syntax be simple and regular. There are several possible ways of getting there. A flag or something like the Let me repost my position in general:
|
Some observations.
- Any of the syntax options can be limited to just the character
classes, so their usage is bounded by that (and are not huge in scope).
- Many people have experience with set operations in other regex
engines, and these are not the features that have gotchas (unlike, say, the
order of alternates being significant).
- Character properties are vital to good internationalization, so that
people don't make the many mistakes associated with hard-coded character
classes.
Mark
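As a small illustration of the last observation, the sketch below compares a hard-coded ASCII range with a property-based class. It uses only the `u` flag and `\p{…}` escapes that ECMAScript already supports; the sample words are arbitrary choices for the example.

```javascript
// Hard-coded ASCII range: misses lowercase letters outside a-z.
const asciiLower = /^[a-z]+$/;
// Property-based class: matches lowercase letters in any script.
const unicodeLower = /^\p{Lowercase_Letter}+$/u;

// "cafe", "café" (precomposed é), and the Greek word "γράμμα".
const samples = ["cafe", "caf\u00E9", "\u03B3\u03C1\u03AC\u03BC\u03BC\u03B1"];

const results = samples.map((word) => ({
  word,
  ascii: asciiLower.test(word),
  unicode: unicodeLower.test(word),
}));
console.log(results);
```

Only the first sample passes the ASCII check, while all three pass the property-based one.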
|
Summarizing discussions we had in the meantime; major bike-shedding here.

character class prefix

We have been assuming a However, we should not actually use "UnicodeSet" because the proposal we are working towards is noticeably different from the syntax that ICU class UnicodeSet uses, so that term would be confusing. I suggested I also suggested that the term "set" does not quite fit because in regular expressions these things are usually called "character classes". So we could use something like

nested classes

Regardless of the top-level syntax, we propose that nested classes use conventional, simple Example:

curly braces vs. square brackets

I suggested using square brackets at top level as well, to make the new type of character class look more like the existing one, just with a distinct prefix. For example, This would also avoid having to treat curly braces (at least However, most of us feel that the pattern of

stateful modifier

@macchiati pointed out that some regex engines support stateful modifiers inside the pattern string that change the behavior of the whole expression, or of the part of an expression between an "on" flag and an "off" flag. For example, in some engines, He suggested that we could use such a modifier to change the syntax and semantics of affected character classes, instead of a per-class prefix. We would not use Example: This seems intriguing, but ECMAScript does not currently appear to support any such modifiers, and |
The important thing to determine (re the stateful modifier) is whether the construct (e.g. /(?p).../) currently causes a syntax error. If it does, that would clear the way for adding it. Note that the primary use case would have the stateful modifier as the very first thing in the regex, not embedded partway through the string. That is, I think it should be a non-goal to support part of the regex in "USet mode" and part not in USet mode; there is no real advantage to allowing either syntax to cover only part of the regex.
/abc \USet{\p{Decimal_Number}--[0-9]} xyz/u |
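One way to check this in a given engine is to try compiling the candidate patterns and see whether a `SyntaxError` is thrown. This is a minimal sketch: the helper name `isSyntaxError` and the two candidate sources are made up for illustration, and the result for one engine says nothing certain about others.

```javascript
// Returns true if `source` is rejected as a pattern under the given flags.
function isSyntaxError(source, flags) {
  try {
    new RegExp(source, flags);
    return false;
  } catch (err) {
    return err instanceof SyntaxError;
  }
}

// Candidate in-pattern modifiers discussed in this thread.
const candidates = ["(?p)abc", "(?U)abc"];

const report = candidates.map((source) => ({
  source,
  legacy: isSyntaxError(source, ""),   // without the u flag
  unicode: isSyntaxError(source, "u"), // with the u flag
}));
console.log(report);
```

If every entry reports `true`, the construct is currently unused in that engine and could in principle be given a meaning without changing the behavior of patterns that compile today.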
Unless you disallow unescaped |
So would this be an example of what you are concerned about?
/\USet{[/u]}/u
|
Yes, you have mentioned problems with literal Is it correct that an escaped slash, as in So far we have “exclude / for JS regex literals (TODO: confirm restrictions)” -- are there other things to watch out for? |
An example of a problem case might be
Both are consistent. You can't tell whether the regular expression is in Unicode mode until you find its end, but where it ends depends on whether it's in Unicode mode. The problem goes away if you either disallow nested character classes or disallow unescaped |
Nice example!
We do propose to disallow unescaped |
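For context, this sketch shows how today's grammar already treats these characters inside a character class; it does not reproduce the problem case from the comments above. A `/` inside `[...]` does not end a regex literal, and classes do not nest, so the first `]` closes the class. How a nested-class mode should scan for the end of the literal is the open question here.

```javascript
// In today's grammar, "/" inside [...] does not terminate the literal.
const slashInClass = /[/]/;

// Classes do not nest today: this class contains "[" and a-z,
// and is closed by the first "]".
const flatClass = /[[a-z]/;

console.log(slashInClass.test("a/b")); // the class matches the slash
console.log(flatClass.test("["), flatClass.test("q"), flatClass.test("]"));
```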
When I test with the demo on
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/replace,
your example fails to compile. Is there some more context to your example
or am I missing something else?
const regex2 = /USet{[[a-z]/+"]}"/u ;
|
I just gave the core of the example. In one of the chameleon modes it interprets the In Chrome or Firefox:
produces the output |
Thanks for the example. |
The more I think about it, the more I think for backwards compatibility /(?U)…/ is a better choice to signal the new syntax than /\USet{…}/:
Here's a little comparison: OUTPUT:
CODE:
|
I'm concerned that /(?U)…/ (or any other choice of letter) would be confusing to those who are familiar with the syntax from other regular expression implementations, including Java, Perl, PCRE or Python. In these, the letter(s) can indicate any of the supported regex flags, and setting a flag in this way has the same effect as other ways of setting it. With this proposal, U is not a normal flag, U cannot be specified as a flag, and other flags cannot be specified with the /(?U).../ style syntax. It just feels inconsistent. Going this way would also complicate any future effort to extend ES regexp to support /(?x).../ for flags in general. |
I don't share those concerns.
|
I also prefer a flag-like If the |
There was some objection in the TC39 meeting against using the flags as
flags. (It is unclear why, but regular participants may be able to say.)
IMO it would be cleanest and most useful to do both:
a) allow U as a flag, e.g. /[\p{abcd}--\p{defg}]/U
b) add the syntax to allow all the flags to be used inside the expression,
e.g. /(?U)[\p{abcd}--\p{defg}]/
But if that isn't feasible, then the best fallback would be
(b2) allow (?U)
(b2) does not preclude or require future extension to either (a) or (b) or
both.
|
About flags: I think the comment from @erights at TC39 hit the nail on the head. We added the Personal preference: I prefer the
|
@macchiati As noted in #2 (comment) |
Note that this is not just because of “new ideas” but also because of ECMAScript's desire to be 100% backwards compatible, even for what seems like unlikely or unnecessary usage (such as double punctuation). That is a good and understandable goal, but where syntax has not been defined with suitable extension options, it requires some sort of versioning. ECMAScript also seems to prefer incremental, limited proposals. If one step does not provide for extensions sufficient for what happens to be the next step, then we need to partially start over. We are lucky that the

I don't know about a roadmap for regular expressions because I am not a general regex expert. UTS #18 has a number of things that are useful but not yet supported in ECMAScript; it seems like our current proposals, and the reserved syntax they provide, reasonably cover what's in there. In particular, we are overcoming major hurdles with strings, nested classes, and reserved double punctuation (as well as simplifying the handling of dash and slash). If this were to “hold” for, say, ten years, that seems pretty good. If someone here knows a lot about regex syntaxes across many engines, it would be useful to list whatever other not-yet-supported syntactic features we might leave open.

Personally, I find the use of an in-pattern modifier as attractive as Mark does. If that would want to be equivalent with an external flag, then that seems ok assuming we cover foreseeable syntax needs. The |
From discussion with Waldemar:
A flag or modifier is a possibility, and may be less likely to be forgotten when the character classes otherwise just use If we want to avoid letters that are modifiers somewhere (see above) or flags in ES regex, then we could pick one from |
I also like the idea of using |
If we were to add a new flag, I like |
"twice as unicode" versus "sharper unicode" |
If we were to add a flag, I think the flag should not only add the Unicode set notation stuff. The new flag should also do some subset of:
If, however, we are only adding the Unicode set notation, then, IMO, it should not be its own flag. |
I think “Tokenize based on grapheme clusters” goes too far. |
Tokenizing based on grapheme clusters is an interesting idea, and I brought it up as well. The things that give me pause are its stability over time — some of the official emoji being added are fairly long strings of various Unicode characters. Some of them are also not self-synchronizing, in that you can't tell what's a grapheme if you start searching from the middle of a string. In other words, there exist Unicode characters A, B, C, D, E, etc. such that graphemes (which I denote by the constituent characters of a grapheme enclosed by «») can be constructed either as: «AB»«CD»«EF»«GH»«IJ»«K» or: «BC»«DE»«FG»«HI»«JK» Country flags are one example of this phenomenon. |
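The country-flag case can be reproduced with `Intl.Segmenter`, which segments text into grapheme clusters per UAX #29. This is a sketch, not part of any proposal: the helper names `ri` and `graphemes` are invented for the example, and the letters A through F are arbitrary (the segmentation rule pairs regional indicators regardless of whether a pair is a real country code).

```javascript
// Build a regional-indicator symbol from an ASCII capital letter.
const ri = (letter) =>
  String.fromCodePoint(0x1f1e6 + letter.charCodeAt(0) - 65);

// Split a string into extended grapheme clusters via Intl.Segmenter.
const graphemes = (text) =>
  Array.from(
    new Intl.Segmenter("en", { granularity: "grapheme" }).segment(text),
    (s) => s.segment
  );

const text = [..."ABCDEF"].map(ri).join("");

// Starting at the beginning, the symbols pair up as AB, CD, EF.
const fromStart = graphemes(text);
// Starting one symbol later (2 UTF-16 code units), they pair up as BC, DE, F.
const fromMiddle = graphemes(text.slice(2));

console.log(fromStart.length, fromMiddle.length);
```

The same code points are grouped differently depending on where scanning starts, which is the non-self-synchronizing behavior described above.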
Correct; the
Tokenizing by grapheme clusters goes too far, but not making code point classes into string classes? :) What I'm trying to say is that if we add a flag, we should try to do everything right and not leave any unplugged holes. |
Tokenizing based on grapheme clusters has all sorts of weird complications, where separate pieces of a pattern expression could combine during matching to match a composed grapheme cluster. I started to look at this as a way to implement Java's canonical equivalence mode in ICU, but it was messy enough that I set it aside. At the time I was looking at it, some years back, Java's canonical equivalence mode seemed pretty broken if you started poking at edge cases. Something probably could be done in this area, but it needs to be optional - people still want to be able to match on code points, finding combining marks for example. And it wants some experimentation before saying the ideas are ready for any sort of standardization. |
|
v mode is for nested bracket expressions with -- and && operators, which are very useful with traditional code point based matching. Grapheme cluster based parsing and/or matching would be a whole separate feature, and would need to be independently settable. It would change how matching works so fundamentally that I don't think it's anywhere near ready. |
I want us to be able to move forward with properties-of-strings and set operators, for which we are converging on a proposal. I assume that "grapheme cluster tokenization" means things like In particular, it has to be possible to opt into the features we know we need without also opting into grapheme cluster tokenization. Is anyone really asking for grapheme cluster tokenization? Is anyone working on it? Sounds like it could take years to figure it out. Stability: It feels like there is a difference in the degree of destabilization, between the set of characters/strings changing for which some property is true (which affects what that property matches) vs. grapheme cluster/token boundaries shifting when UAX #29 or its CLDR tailoring changes (which changes all matching when based on grapheme clusters). |
About grapheme cluster tokenization, we may be talking past one another
here.
1. Someone could mean that [a\u0308] == [{a\u0308}], that is, as you
parse literals in a regex pattern you treat a grapheme cluster as a
sequence. This can be tricky to implement, and I wouldn't want to put the
rest at risk by including it in this proposal. And sometimes you probably
would want to disable that but not the other features.
2. Someone could mean that you support \X matching any grapheme cluster
(a grapheme cluster "."). That is not problematic to implement, and could
be added.
What I think the flag-v* should subsume is:
1. flag-u
2. set operations — eg [\p{scx=Grek}&&\p{Letter}]
3. properties of strings — eg \p{RGI_emoji}
4. literal-strings — eg [ab{cz}]
* or w, I lean very slightly towards v, but think w is fine also.
Mark
|
Right, by "grapheme cluster tokenization", I meant |
I think operating on grapheme clusters instead of code points is an interesting idea, but it's too unstable and too big a change to include it in this proposal. It feels like everything else we'd want to enable through this new flag ( |
One other thing we could include in this new flag is Unicode-aware |
Note that in addition to the flag letter bikeshed ( For example, the Perhaps the name corresponding to our new |
Moving the flag/getter discussion to #14. |
As I mentioned during the meeting, Grapheme matching is about more than just deciding what With grapheme matching (enabled, say, with a different flag such as |
Has the option of expanding the definition of
|
Closing this issue. I think we have firmly settled on using a new flag. There are other issues for bike-shedding on the exact flag and getter, and further details. |
This can be implemented with today's engines in a way that is completely backwards-compatible.
import {charSet} from 'compose-regexp'
const LcGrekLetter = charSet.intersection(/\p{Lowercase}/u, /\p{Script=Greek}/u)
LcGrekLetter.test("Γ") // false
LcGrekLetter.test("γ") // true
LcGrekLetter.test("x") // false
console.log(LcGrekLetter) // /(?!(?!\p{Script=Greek})\p{Lowercase})\p{Lowercase}/u
Engines could easily detect patterns like this and do character range operations under the hood. The core pattern is
// notAhead is negative lookAhead
function csDiff(a, b) {return sequence(notAhead(b), a)}
Inter can be optimized with the same logic
function csInter(a, b) {return sequence(notAhead(csDiff(a, b)), a)}
union is just |
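The same lookahead trick can be sketched without any library, using plain pattern source strings. The helper names `diff` and `inter` below are made up for this example and are not the compose-regexp API; they assume that both operands match exactly one code point.

```javascript
// a and b are pattern sources that each match exactly one code point.
// Difference: refuse positions where b would match, then match a.
const diff = (a, b) => `(?!${b})${a}`;
// Intersection: refuse positions where "a but not b" would match, then match a.
const inter = (a, b) => `(?!${diff(a, b)})${a}`;

const lower = "\\p{Lowercase}";
const greek = "\\p{Script=Greek}";

const lowerGreek = new RegExp(`^${inter(lower, greek)}$`, "u");
const lowerNonGreek = new RegExp(`^${diff(lower, greek)}$`, "u");

console.log(lowerGreek.source);
console.log(lowerGreek.test("\u03B3"), lowerNonGreek.test("x"));
```

The intersection built this way has the same shape as the pattern printed in the comment above: a negative lookahead wrapping the difference, followed by the first operand.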
We could require the u flag (which we would do anyway) and then use \UnicodeSet{…} to introduce new syntax in a backwards-compatible manner, since \U throws. (We made sure of that here: https://web.archive.org/web/20141214085510/https://bugs.ecmascript.org/show_bug.cgi?id=3157)
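The claim that `\U` throws under the `u` flag can be checked directly. In this minimal sketch the helper name `compiles` is invented for the example; `\UnicodeSet{…}` itself is only the syntax floated in this issue, not something any engine implements.

```javascript
// Returns true if the pattern source compiles under the given flags.
function compiles(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch (err) {
    return false;
  }
}

// Without the u flag, \U is a legacy identity escape that matches "U".
const legacyOk = compiles("\\U", "");
// With the u flag, unknown letter escapes such as \U are syntax errors,
// so the sequence is free to be given a new meaning.
const unicodeOk = compiles("\\U", "u");

console.log(legacyOk, unicodeOk);
```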