Are properties of strings atomic? #50

bakkot · 2021-12-15T05:36:14Z

Concretely: does /\p{RGI_Emoji}.../u (note the .s) match '\u{1F468}\u{1F3FE}\u200D\u2695\uFE0F'? That sequence of five points is in RGI_Emoji (as an emoji ZWJ sequence), but so is its prefix \u{1F468}\u{1F3FE} (as an emoji sequence).

I would assume it doesn't match (it says "an emoji followed by three more code points", and the string is not that), and that's what the discussion earlier seems to me to imply, but I'm not sure I'm reading it correctly. And the spec text doesn't seem to actually specify sequence properties (unless I've missed it).

The text was updated successfully, but these errors were encountered:

mathiasbynens · 2021-12-15T07:33:50Z

https://github.com/tc39/proposal-regexp-set-notation#whats-the-match-order-for-character-classes-containing-strings

bakkot · 2021-12-15T07:50:43Z

That doesn't actually answer the question I asked.

mathiasbynens · 2021-12-15T08:01:55Z

I think it does — the FAQ entry gives a concrete example of a hypothetical property of strings that “expands to” the strings 'a', 'b', 'c', 'W', 'xy', and 'xyz'. It says:

This pattern behaves like xyz|xy|a|b|c|W.

Similarly, a hypothetical property of strings that matches only the two strings in your example: '\u{1F468}\u{1F3FE}\u200D\u2695\uFE0F' and '\u{1F468}\u{1F3FE}' behaves like \u{1F468}\u{1F3FE}\u200D\u2695\uFE0F|\u{1F468}\u{1F3FE}. This gives us some code we can test in existing engines:

const five = '\u{1F468}\u{1F3FE}\u200D\u2695\uFE0F';
const re = /(?:\u{1F468}\u{1F3FE}\u200D\u2695\uFE0F|\u{1F468}\u{1F3FE}).../u;
console.assert(five.match(re)[0] === five);

bakkot · 2021-12-15T08:06:02Z

Ah, ok. I assumed that "behaves like" meant "behaves similarly for the purposes of match order, the topic of this FAQ entry", not "is precisely identical in all situations".

Anyway, that behavior is, to put it lightly, surprising to me, and I expect it to be surprising to developers. Was this discussed anywhere except in tc39/proposal-regexp-unicode-sequence-properties#20? That issue seems to devolve into the question of matching grapheme clusters, rather than the question of whether you can backtrack across a \p match.

mathiasbynens · 2021-12-15T08:59:12Z

The following simplified snippet is equivalent:

const five = 'abcde';
const re = /(?:abcde|ab).../;
console.assert(five.match(re)[0] === five);

The pattern attempts to find one of (abcde or ab) followed by three more characters. There’s only one way for the pattern to find a match in the given string. It would be surprising to NOT find a match when a match is possible. Could you explain what is surprising?

jridgewell · 2021-12-15T09:44:55Z

I think @bakkot is expecting the match to be possessive: once the sequence matches the full '\u{1F468}\u{1F3FE}\u200D\u2695\uFE0F' it shouldn't backtrack to just '\u{1F468}\u{1F3FE}'.

When thinking through this, I initially agreed with @mathiasbynens, thinking it should backtrack. But then when I look at the result of the match from #50 (comment), all I see is 👨🏾‍⚕️, a single grapheme cluster. Reading the regex, I expect to see 3 trailing chars after my sequence, but they're getting folded into the default 👨🏾 emoji. That does surprise me. So now I agree with @bakkot.

mathiasbynens · 2021-12-15T10:00:15Z

I think @bakkot is expecting the match to be possessive: once the sequence matches the full '\u{1F468}\u{1F3FE}\u200D\u2695\uFE0F' it shouldn't backtrack to just '\u{1F468}\u{1F3FE}'.

When thinking through this, I initially agreed with @mathiasbynens, thinking it should backtrack. But then when I look at the result of the match from #50 (comment), all I see is 👨🏾‍⚕️, a single grapheme cluster. Reading the regex, I expect to see 3 trailing chars after my sequence, but they're getting folded into the default 👨🏾 emoji. That does surprise me. So now I agree with @bakkot.

Are you saying you’d expect different semantics between strings that form a single grapheme cluster (e.g. '\u{1F468}\u{1F3FE}\u200D\u2695\uFE0F') vs. strings that don’t (e.g. 'abcde')?

nicolo-ribaudo · 2021-12-15T10:19:03Z

I think @jridgewell is talking about this?

/\p{RGI_Emoji}.../.test("👨🏾‍⚕️")

Does it return true because RGI_Emoji matches 👨🏾, or false because it matches 👨🏾‍⚕️? The problem is visible when using emojis directly in strings, not when escaping them with multiple \u sequences.

bakkot · 2021-12-15T15:31:53Z

It would be surprising to NOT find a match when a match is possible.

Well, whether a match is possible in my example depends on the semantics you assume it has. Here I guess my intuition is basically that it's an atomic group containing a disjunction, rather than a plain disjunction.

For me it's not so much about grapheme clusters per se; it's just that I'm expecting \p{} to eat the longest substring available at that index. For me, the point of using \p{RGI_Emoji} would be so that I don't have to think about all the complexity of how emojis are actually encoded: that they are actually strings of various lengths and that some of them are prefixes of each other and so on. But if \p can stop in the middle of an emoji, exposing the rest of my regex to its remaining code points, now I do have to think about it again.

markusicu · 2021-12-15T17:20:49Z

As for the proposed spec, see https://arai-a.github.io/ecma262-compare/?pr=2418 section “22.2.2.7 Runtime Semantics: CompileAtom” / “Atom :: CharacterClass” -- it builds a plain disjunction.

I am not sure how to decide between the two possible behaviors (atomic group or not).

markusicu · 2021-12-15T17:42:00Z

Note: There is a question of whether Unicode line endings could be expressed using a \p{property of strings} rather than \R: tc39/proposal-regexp-r-escape#3

That would work only if a property of strings / character class with strings was matched as an atomic group, because we never want a line ending expression to match just the CR of a CR LF line ending.

macchiati · 2021-12-15T20:00:09Z

Let me choose a simpler example for clarity: /\p{RGI_Emoji}🏽/ I would expect that to match anything that was an emoji, followed by a 🏽. It would match currently 3,633 Code Points. Some of what it matches would be RGI_Emoji, but some wouldn't, like "🫐🏽". Anything else seems bizarre. Mark

…

On Wed, Dec 15, 2021 at 7:32 AM Kevin Gibbons ***@***.***> wrote: It would be surprising to NOT find a match when a match is possible. Well, whether a match is possible in my example depends on the semantics you assume it has. Here I guess my intuition is basically that it's an atomic group <https://www.regular-expressions.info/atomic.html> containing a disjunction, rather than a plain disjunction. For me it's not so much about grapheme clusters *per se*; it's just that I'm expecting \p{} to eat the longest substring available at that index. For me, the point of using \p{RGI_Emoji} would be so that I don't have to think about all the complexity of how emojis are actually encoded: that they are actually strings of various lengths and that some of them are prefixes of each other and so on. But if \p can stop in the middle of an emoji, exposing the rest of my regex to its remaining code points, now I do have to think about it again. — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#50 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ACJLEMCPH466OAOXJRMX7ADURCYHJANCNFSM5KCXAKEQ> . Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.

bakkot · 2021-12-15T21:11:07Z

I'm not convinced that the non-atomic behavior is more desirable in this case. I have a hard time seeing why people would commonly write \p{RGI_Emoji} together with a component of an emoji; most people aren't familiar with exactly how emoji are composed (hiding that complexity is much of the point of this proposal, as I understand it). On the other hand, I really do think that my original example is likely to come up, and to be unexpected; indeed, that behavior seems to me to be a major detriment to the usability of sequence properties.

On the other hand, if we do end up convinced that the non-atomic behavior is better, I note that one possible outcome could be to advance @rbuckton's regex atomic operators proposal, and then say that the behavior I expect must be opted into by wrapping the \p{} in (?>\p{}).

(That proposal did not advance when it was last presented, but most of the concerns [except my own] were about the expressed motivation for it, which was focused on performance; it's possible it could proceed given other motivation.)

I hesitate to bring this up, because it involves choosing what I think is the worse behavior for \p, but it is at least a possibility.

markusicu · 2021-12-15T21:31:54Z

Let me choose a simpler example for clarity: /\p{RGI_Emoji}🏽/
I would expect that to match anything that was an emoji, followed by a 🏽.
It would match currently 3,633 Code Points. Some of what it matches would be RGI_Emoji, but some wouldn't, like "🫐🏽". Anything else seems bizarre.

Well, the blueberries plus skintone would match either way, because that sequence is not an RGI_Emoji.

Where it makes a difference is a character that is in RGI_Emoji both with and without skintone modifier, such as 🎅 U+1F385 FATHER CHRISTMAS.

With a simple disjunction, your expression matches a string of 🎅+skintone. ("🎅🏽")
With an eager (atomic-group) match, \p{RGI_Emoji} alone matches a string of 🎅+skintone without backtracking, so the whole expression fails.

macchiati · 2021-12-16T17:49:09Z

Let me try a simpler example with ASCII characters. Suppose that we have a property of strings \p{exemplars=sk}, which is the characters and sequences used by Slovak. It contains (among other things) [\q{ch|c|h}].

If we have the expression /\p{exemplars=sk}h/, I would expect that to match "ch", because "c" is in the exemplars, just as I would expect /(ch|c|h)h/ to match "ch" because "c" is in the alternation.

I think the regex atomic operators would be great to have — completely independently of properties of strings. And that would let users have the choice: if they want the atomic behavior they could use /(?>\p{exemplars=sk})h/, and if they want the non-atomic behavior they use /\p{exemplars=sk}h/.

If we forced the properties of strings (and their equivalent literals) to be atomic, then we have a disconnect with how the existing non-atomic expressions (like alternation) work, and then to give people that choice we'd have to introduce the non-atomic regex operator — an unnecessary complication when this would really be the only situation that would need that.

nicolo-ribaudo · 2021-12-16T18:36:23Z

That's a good point: it's "easy" (= we can copy from other languages) to make a non-atomic pattern atomic, while I don't think there are many languages allowing the opposite?

Issue: #50

mathiasbynens · 2021-12-27T13:28:31Z

I’m proposing to resolve this issue with #53.

Issue: #50

mathiasbynens · 2022-01-03T06:33:03Z

https://github.com/tc39/proposal-regexp-set-notation#are-properties-of-strings-eager--atomic

michaelficarra · 2022-01-11T00:59:07Z

@nicolo-ribaudo That is true, and also typically the right kind of design. But for this, I feel the non-atomic behaviour is surprising and probably not nearly as useful. What's an actual use case for the non-atomic match?

markusicu · 2022-01-11T18:45:52Z

@michaelficarra

I feel the non-atomic behaviour is surprising and probably not nearly as useful. What's an actual use case for the non-atomic match?

See the discussion and examples above. Seemingly reasonable practitioners have different intuitions about atomic vs. non-atomic matching, so we decided to leave properties of strings as proposed, exactly equivalent to a simple disjunction, and leave the choice to the regex author who may (in the future) choose to add an operator for atomic matching, or one for a grapheme cluster boundary. (There are precedents in other regex implementations for both of these.)

Making properties of strings atomic means that those practitioners who want non-atomic matching would have to unravel the contents of the property and hardcode it as a disjunction.

michaelficarra · 2022-01-12T19:25:47Z

@markusicu Which examples above should I see, specifically? I have read the above thread and don't know what you're referring to. What are these practitioners trying to do that would be easier or clearer with a non-atomic match?

To repeat, I understand that it is easy (and follows what would typically be a good design practice) to make a non-atomic match atomic. I just have not been convinced that it's even useful at all to have a non-atomic match, which would make it an especially bad default.

markusicu · 2022-01-12T20:33:46Z

@markusicu Which examples above should I see, specifically?

Personally, I don't know how to decide which behavior is "right". I was persuaded by the argument that different people have different expectations (and find the respective "other behavior" surprising). By defaulting to a simple disjunction together with the atomic modifier (proposed elsewhere) each behavior is available --> #50 (comment)

michaelficarra · 2022-01-12T22:15:35Z

@markusicu Each of those examples is just describing the difference between atomic/non-atomic matches. None advocate for non-atomic behaviour being the right default or even being practically useful at all. Surely if the non-atomic match was useful we could come up with some contrived scenario where someone might want that behaviour, right?

As a side note, I value ease of reading code (especially regexps) more highly than ease of writing code. Do you think we should run a user study to see if the choice of matching behaviour here would trip up readers who have only a passing familiarity with this feature? I personally suspect that readers will expect atomic matching behaviour (though whether they'll understand its consequences on backtracking is also unknown).

markusicu · 2022-01-12T22:43:35Z

@markusicu Each of those examples is just describing the difference between atomic/non-atomic matches. None advocate for non-atomic behaviour being the right default or even being practically useful at all.

I think it is quite clear that at least Mathias and Mark expect non-atomic matching behavior. All we have so far is people's expectations and intuitions, except that the line-ending matchers want to be atomic.

Do you think we should run a user study to see if the choice of matching behaviour here would trip up readers who have only a passing familiarity with this feature? I personally suspect that readers will expect atomic matching behaviour (though whether they'll understand its consequences on backtracking is also unknown).

I don't know how well these kinds of surveys work. It seems like the trick is to design the question so that people really understand the choices. Otherwise the results are not useful. Ideally, online demos or sample code that people can try would give people a better feel, but that would be a lot of work.

jridgewell · 2022-01-12T23:39:35Z

Surely if the non-atomic match was useful we could come up with some contrived scenario where someone might want that behaviour, right?

This doesn't seem productive. We could easily flip this and ask for some scenario where atomic behavior is practically useful at all. Any hypothetical can go both ways.

Only one choice here allows both outcomes, so it seems the best choice to make.

bakkot · 2022-01-12T23:59:34Z

We could easily flip this and ask for some scenario where atomic behavior is practically useful at all

We can and should do this if there's any doubt about the question, yes. Here is such a scenario: I want to match an emoji followed by another character. So I write /\p{RGI_Emoji}./u. Seems like a pretty common thing to want.

Now we should do the same exercise with the non-atomic behavior.

macchiati · 2022-01-13T04:22:40Z

If we have the expression /\p{exemplars=sk}h/, I would expect that to match "ch", because "c" is in the exemplars, just as I would expect /(ch|c|h)h/ to match "ch" because "c" is in the alternation.

But more importantly, all of the arguments for atomic vs non-atomic property matches seem just as applicable to atomic vs non-atomic alternation.

There is a (useful proposal) for adding a general atomic operation. I see no particular reason to have a mixed model that uses a different semantic and essentially forces the addition of a general non-atomic operation.

jridgewell · 2022-01-13T04:23:52Z

Here's a regex to remove skin tones: input.replace(/(\p{RGI_Emoji})[\u{1F3FB}-\u{1F3FF}]/vg, '$1'), which depends on non-atomic behavior.

rbuckton · 2022-05-27T00:04:27Z

Has this been conclusively decided on the part of this proposal? There has been no activity on this thread in over four months, and while the issue is closed I am not clear on whether there has been a definitive conclusion.

mathiasbynens · 2022-05-27T07:09:32Z

@rbuckton The conclusion was https://github.com/tc39/proposal-regexp-v-flag#are-properties-of-strings-eager--atomic. I see you’ve kicked off a new proposal for atomic operators, yay! Do you want to send a PR linking to it from that FAQ entry?

mathiasbynens · 2022-05-27T21:43:52Z

@rbuckton The conclusion was https://github.com/tc39/proposal-regexp-v-flag#are-properties-of-strings-eager--atomic. I see you’ve kicked off a new proposal for atomic operators, yay! Do you want to send a PR linking to it from that FAQ entry?

Done as part of #66.

rbuckton · 2022-05-27T22:45:08Z

[...] I see you’ve kicked off a new proposal for atomic operators, yay! [...]

I originally presented this October of last year but it failed to advance.

mathiasbynens closed this as completed Dec 15, 2021

mathiasbynens reopened this Dec 15, 2021

mathiasbynens changed the title ~~Are sequence properties eager?~~ Are properties of strings eager? Dec 15, 2021

mathiasbynens added a commit that referenced this issue Dec 27, 2021

Add FAQ entry on atomic vs. non-atomic grouping

c74d3d0

Issue: #50

mathiasbynens mentioned this issue Dec 27, 2021

Add FAQ entry on atomic vs. non-atomic grouping #53

Merged

mathiasbynens added a commit that referenced this issue Jan 3, 2022

Add FAQ entry on atomic vs. non-atomic grouping (#53)

fbee25d

Issue: #50

mathiasbynens closed this as completed Jan 3, 2022

rbuckton mentioned this issue Jan 10, 2022

is there a Unicode sequence property that can be used instead? tc39/proposal-regexp-r-escape#3

Open

bakkot changed the title ~~Are properties of strings eager?~~ Are properties of strings atomic? Jan 11, 2022

mathiasbynens mentioned this issue May 27, 2022

Point to @rbuckton’s atomic operators proposal #66

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Are properties of strings atomic? #50

Are properties of strings atomic? #50

bakkot commented Dec 15, 2021

mathiasbynens commented Dec 15, 2021

bakkot commented Dec 15, 2021

mathiasbynens commented Dec 15, 2021 •

edited

Loading

bakkot commented Dec 15, 2021

mathiasbynens commented Dec 15, 2021

jridgewell commented Dec 15, 2021 •

edited

Loading

mathiasbynens commented Dec 15, 2021

nicolo-ribaudo commented Dec 15, 2021 •

edited

Loading

bakkot commented Dec 15, 2021

markusicu commented Dec 15, 2021

markusicu commented Dec 15, 2021

macchiati commented Dec 15, 2021 via email

bakkot commented Dec 15, 2021 •

edited

Loading

markusicu commented Dec 15, 2021

macchiati commented Dec 16, 2021 •

edited by mathiasbynens

Loading

nicolo-ribaudo commented Dec 16, 2021

mathiasbynens commented Dec 27, 2021

mathiasbynens commented Jan 3, 2022

michaelficarra commented Jan 11, 2022

markusicu commented Jan 11, 2022

michaelficarra commented Jan 12, 2022

markusicu commented Jan 12, 2022

michaelficarra commented Jan 12, 2022

markusicu commented Jan 12, 2022

jridgewell commented Jan 12, 2022 •

edited

Loading

bakkot commented Jan 12, 2022

macchiati commented Jan 13, 2022

jridgewell commented Jan 13, 2022 •

edited by mathiasbynens

Loading

rbuckton commented May 27, 2022

mathiasbynens commented May 27, 2022

mathiasbynens commented May 27, 2022

rbuckton commented May 27, 2022

Are properties of strings atomic? #50

Are properties of strings atomic? #50

Comments

bakkot commented Dec 15, 2021

mathiasbynens commented Dec 15, 2021

bakkot commented Dec 15, 2021

mathiasbynens commented Dec 15, 2021 • edited Loading

bakkot commented Dec 15, 2021

mathiasbynens commented Dec 15, 2021

jridgewell commented Dec 15, 2021 • edited Loading

mathiasbynens commented Dec 15, 2021

nicolo-ribaudo commented Dec 15, 2021 • edited Loading

bakkot commented Dec 15, 2021

markusicu commented Dec 15, 2021

markusicu commented Dec 15, 2021

macchiati commented Dec 15, 2021 via email

bakkot commented Dec 15, 2021 • edited Loading

markusicu commented Dec 15, 2021

macchiati commented Dec 16, 2021 • edited by mathiasbynens Loading

nicolo-ribaudo commented Dec 16, 2021

mathiasbynens commented Dec 27, 2021

mathiasbynens commented Jan 3, 2022

michaelficarra commented Jan 11, 2022

markusicu commented Jan 11, 2022

michaelficarra commented Jan 12, 2022

markusicu commented Jan 12, 2022

michaelficarra commented Jan 12, 2022

markusicu commented Jan 12, 2022

jridgewell commented Jan 12, 2022 • edited Loading

bakkot commented Jan 12, 2022

macchiati commented Jan 13, 2022

jridgewell commented Jan 13, 2022 • edited by mathiasbynens Loading

rbuckton commented May 27, 2022

mathiasbynens commented May 27, 2022

mathiasbynens commented May 27, 2022

rbuckton commented May 27, 2022

mathiasbynens commented Dec 15, 2021 •

edited

Loading

jridgewell commented Dec 15, 2021 •

edited

Loading

nicolo-ribaudo commented Dec 15, 2021 •

edited

Loading

bakkot commented Dec 15, 2021 •

edited

Loading

macchiati commented Dec 16, 2021 •

edited by mathiasbynens

Loading

jridgewell commented Jan 12, 2022 •

edited

Loading

jridgewell commented Jan 13, 2022 •

edited by mathiasbynens

Loading