Skip to content
This repository has been archived by the owner on Feb 16, 2024. It is now read-only.

Are properties of strings atomic? #50

Closed
bakkot opened this issue Dec 15, 2021 · 32 comments
Closed

Are properties of strings atomic? #50

bakkot opened this issue Dec 15, 2021 · 32 comments

Comments

@bakkot
Copy link

bakkot commented Dec 15, 2021

Concretely: does /\p{RGI_Emoji}.../u (note the .s) match '\u{1F468}\u{1F3FE}\u200D\u2695\uFE0F'? That sequence of five points is in RGI_Emoji (as an emoji ZWJ sequence), but so is its prefix \u{1F468}\u{1F3FE} (as an emoji sequence).

I would assume it doesn't match (it says "an emoji followed by three more code points", and the string is not that), and that's what the discussion earlier seems to me to imply, but I'm not sure I'm reading it correctly. And the spec text doesn't seem to actually specify sequence properties (unless I've missed it).

@bakkot
Copy link
Author

bakkot commented Dec 15, 2021

That doesn't actually answer the question I asked.

@mathiasbynens mathiasbynens reopened this Dec 15, 2021
@mathiasbynens
Copy link
Member

mathiasbynens commented Dec 15, 2021

I think it does — the FAQ entry gives a concrete example of a hypothetical property of strings that “expands to” the strings 'a', 'b', 'c', 'W', 'xy', and 'xyz'. It says:

This pattern behaves like xyz|xy|a|b|c|W.

Similarly, a hypothetical property of strings that matches only the two strings in your example: '\u{1F468}\u{1F3FE}\u200D\u2695\uFE0F' and '\u{1F468}\u{1F3FE}' behaves like \u{1F468}\u{1F3FE}\u200D\u2695\uFE0F|\u{1F468}\u{1F3FE}. This gives us some code we can test in existing engines:

const five = '\u{1F468}\u{1F3FE}\u200D\u2695\uFE0F';
const re = /(?:\u{1F468}\u{1F3FE}\u200D\u2695\uFE0F|\u{1F468}\u{1F3FE}).../u;
console.assert(five.match(re)[0] === five);

@bakkot
Copy link
Author

bakkot commented Dec 15, 2021

Ah, ok. I assumed that "behaves like" meant "behaves similarly for the purposes of match order, the topic of this FAQ entry", not "is precisely identical in all situations".

Anyway, that behavior is, to put it lightly, surprising to me, and I expect it to be surprising to developers. Was this discussed anywhere except in tc39/proposal-regexp-unicode-sequence-properties#20? That issue seems to devolve into the question of matching grapheme clusters, rather than the question of whether you can backtrack across a \p match.

@mathiasbynens
Copy link
Member

The following simplified snippet is equivalent:

const five = 'abcde';
const re = /(?:abcde|ab).../;
console.assert(five.match(re)[0] === five);

The pattern attempts to find one of (abcde or ab) followed by three more characters. There’s only one way for the pattern to find a match in the given string. It would be surprising to NOT find a match when a match is possible. Could you explain what is surprising?

@jridgewell
Copy link
Member

jridgewell commented Dec 15, 2021

I think @bakkot is expecting the match to be possessive: once the sequence matches the full '\u{1F468}\u{1F3FE}\u200D\u2695\uFE0F' it shouldn't backtrack to just '\u{1F468}\u{1F3FE}'.

When thinking through this, I initially agreed with @mathiasbynens, thinking it should backtrack. But then when I look at the result of the match from #50 (comment), all I see is 👨🏾‍⚕️, a single grapheme cluster. Reading the regex, I expect to see 3 trailing chars after my sequence, but they're getting folded into the default 👨🏾 emoji. That does surprise me. So now I agree with @bakkot.

@mathiasbynens mathiasbynens changed the title Are sequence properties eager? Are properties of strings eager? Dec 15, 2021
@mathiasbynens
Copy link
Member

I think @bakkot is expecting the match to be possessive: once the sequence matches the full '\u{1F468}\u{1F3FE}\u200D\u2695\uFE0F' it shouldn't backtrack to just '\u{1F468}\u{1F3FE}'.

When thinking through this, I initially agreed with @mathiasbynens, thinking it should backtrack. But then when I look at the result of the match from #50 (comment), all I see is 👨🏾‍⚕️, a single grapheme cluster. Reading the regex, I expect to see 3 trailing chars after my sequence, but they're getting folded into the default 👨🏾 emoji. That does surprise me. So now I agree with @bakkot.

Are you saying you’d expect different semantics between strings that form a single grapheme cluster (e.g. '\u{1F468}\u{1F3FE}\u200D\u2695\uFE0F') vs. strings that don’t (e.g. 'abcde')?

@nicolo-ribaudo
Copy link
Member

nicolo-ribaudo commented Dec 15, 2021

I think @jridgewell is talking about this?

/\p{RGI_Emoji}.../.test("👨🏾‍⚕️")

Does it return true because RGI_Emoji matches 👨🏾, or false because it matches 👨🏾‍⚕️? The problem is visible when using emojis directly in strings, not when escaping them with multiple \u sequences.

@bakkot
Copy link
Author

bakkot commented Dec 15, 2021

It would be surprising to NOT find a match when a match is possible.

Well, whether a match is possible in my example depends on the semantics you assume it has. Here I guess my intuition is basically that it's an atomic group containing a disjunction, rather than a plain disjunction.

For me it's not so much about grapheme clusters per se; it's just that I'm expecting \p{} to eat the longest substring available at that index. For me, the point of using \p{RGI_Emoji} would be so that I don't have to think about all the complexity of how emojis are actually encoded: that they are actually strings of various lengths and that some of them are prefixes of each other and so on. But if \p can stop in the middle of an emoji, exposing the rest of my regex to its remaining code points, now I do have to think about it again.

@markusicu
Copy link
Collaborator

As for the proposed spec, see https://arai-a.github.io/ecma262-compare/?pr=2418 section “22.2.2.7 Runtime Semantics: CompileAtom” / “Atom :: CharacterClass” -- it builds a plain disjunction.

I am not sure how to decide between the two possible behaviors (atomic group or not).

@markusicu
Copy link
Collaborator

Note: There is a question of whether Unicode line endings could be expressed using a \p{property of strings} rather than \R: tc39/proposal-regexp-r-escape#3

That would work only if a property of strings / character class with strings was matched as an atomic group, because we never want a line ending expression to match just the CR of a CR LF line ending.

@macchiati
Copy link
Collaborator

macchiati commented Dec 15, 2021 via email

@bakkot
Copy link
Author

bakkot commented Dec 15, 2021

I'm not convinced that the non-atomic behavior is more desirable in this case. I have a hard time seeing why people would commonly write \p{RGI_Emoji} together with a component of an emoji; most people aren't familiar with exactly how emoji are composed (hiding that complexity is much of the point of this proposal, as I understand it). On the other hand, I really do think that my original example is likely to come up, and to be unexpected; indeed, that behavior seems to me to be a major detriment to the usability of sequence properties.


On the other hand, if we do end up convinced that the non-atomic behavior is better, I note that one possible outcome could be to advance @rbuckton's regex atomic operators proposal, and then say that the behavior I expect must be opted into by wrapping the \p{} in (?>\p{}).

(That proposal did not advance when it was last presented, but most of the concerns [except my own] were about the expressed motivation for it, which was focused on performance; it's possible it could proceed given other motivation.)

I hesitate to bring this up, because it involves choosing what I think is the worse behavior for \p, but it is at least a possibility.

@markusicu
Copy link
Collaborator

Let me choose a simpler example for clarity: /\p{RGI_Emoji}🏽/
I would expect that to match anything that was an emoji, followed by a 🏽.
It would match currently 3,633 Code Points. Some of what it matches would be RGI_Emoji, but some wouldn't, like "🫐🏽". Anything else seems bizarre.

Well, the blueberries plus skintone would match either way, because that sequence is not an RGI_Emoji.

Where it makes a difference is a character that is in RGI_Emoji both with and without skintone modifier, such as 🎅 U+1F385 FATHER CHRISTMAS.

  • With a simple disjunction, your expression matches a string of 🎅+skintone. ("🎅🏽")
  • With an eager (atomic-group) match, \p{RGI_Emoji} alone matches a string of 🎅+skintone without backtracking, so the whole expression fails.

@macchiati
Copy link
Collaborator

macchiati commented Dec 16, 2021

Let me try a simpler example with ASCII characters. Suppose that we have a property of strings \p{exemplars=sk}, which is the characters and sequences used by Slovak. It contains (among other things) [\q{ch|c|h}].

If we have the expression /\p{exemplars=sk}h/, I would expect that to match "ch", because "c" is in the exemplars, just as I would expect /(ch|c|h)h/ to match "ch" because "c" is in the alternation.

I think the regex atomic operators would be great to have — completely independently of properties of strings. And that would let users have the choice: if they want the atomic behavior they could use /(?>\p{exemplars=sk})h/, and if they want the non-atomic behavior they use /\p{exemplars=sk}h/.

If we forced the properties of strings (and their equivalent literals) to be atomic, then we have a disconnect with how the existing non-atomic expressions (like alternation) work, and then to give people that choice we'd have to introduce the non-atomic regex operator — an unnecessary complication when this would really be the only situation that would need that.

@nicolo-ribaudo
Copy link
Member

That's a good point: it's "easy" (= we can copy from other languages) to make a non-atomic pattern atomic, while I don't think there are many languages allowing the opposite?

@mathiasbynens
Copy link
Member

I’m proposing to resolve this issue with #53.

@michaelficarra
Copy link
Member

@nicolo-ribaudo That is true, and also typically the right kind of design. But for this, I feel the non-atomic behaviour is surprising and probably not nearly as useful. What's an actual use case for the non-atomic match?

@bakkot bakkot changed the title Are properties of strings eager? Are properties of strings atomic? Jan 11, 2022
@markusicu
Copy link
Collaborator

@michaelficarra

I feel the non-atomic behaviour is surprising and probably not nearly as useful. What's an actual use case for the non-atomic match?

See the discussion and examples above. Seemingly reasonable practitioners have different intuitions about atomic vs. non-atomic matching, so we decided to leave properties of strings as proposed, exactly equivalent to a simple disjunction, and leave the choice to the regex author who may (in the future) choose to add an operator for atomic matching, or one for a grapheme cluster boundary. (There are precedents in other regex implementations for both of these.)

Making properties of strings atomic means that those practitioners who want non-atomic matching would have to unravel the contents of the property and hardcode it as a disjunction.

@michaelficarra
Copy link
Member

@markusicu Which examples above should I see, specifically? I have read the above thread and don't know what you're referring to. What are these practitioners trying to do that would be easier or clearer with a non-atomic match?

To repeat, I understand that it is easy (and follows what would typically be a good design practice) to make a non-atomic match atomic. I just have not been convinced that it's even useful at all to have a non-atomic match, which would make it an especially bad default.

@markusicu
Copy link
Collaborator

@markusicu Which examples above should I see, specifically?

Personally, I don't know how to decide which behavior is "right". I was persuaded by the argument that different people have different expectations (and find the respective "other behavior" surprising). By defaulting to a simple disjunction together with the atomic modifier (proposed elsewhere) each behavior is available --> #50 (comment)

@michaelficarra
Copy link
Member

@markusicu Each of those examples is just describing the difference between atomic/non-atomic matches. None advocate for non-atomic behaviour being the right default or even being practically useful at all. Surely if the non-atomic match was useful we could come up with some contrived scenario where someone might want that behaviour, right?

As a side note, I value ease of reading code (especially regexps) more highly than ease of writing code. Do you think we should run a user study to see if the choice of matching behaviour here would trip up readers who have only a passing familiarity with this feature? I personally suspect that readers will expect atomic matching behaviour (though whether they'll understand its consequences on backtracking is also unknown).

@markusicu
Copy link
Collaborator

@markusicu Each of those examples is just describing the difference between atomic/non-atomic matches. None advocate for non-atomic behaviour being the right default or even being practically useful at all.

I think it is quite clear that at least Mathias and Mark expect non-atomic matching behavior. All we have so far is people's expectations and intuitions, except that the line-ending matchers want to be atomic.

Do you think we should run a user study to see if the choice of matching behaviour here would trip up readers who have only a passing familiarity with this feature? I personally suspect that readers will expect atomic matching behaviour (though whether they'll understand its consequences on backtracking is also unknown).

I don't know how well these kinds of surveys work. It seems like the trick is to design the question so that people really understand the choices. Otherwise the results are not useful. Ideally, online demos or sample code that people can try would give people a better feel, but that would be a lot of work.

@jridgewell
Copy link
Member

jridgewell commented Jan 12, 2022

Surely if the non-atomic match was useful we could come up with some contrived scenario where someone might want that behaviour, right?

This doesn't seem productive. We could easily flip this and ask for some scenario where atomic behavior is practically useful at all. Any hypothetical can go both ways.

Only one choice here allows both outcomes, so it seems the best choice to make.

@bakkot
Copy link
Author

bakkot commented Jan 12, 2022

We could easily flip this and ask for some scenario where atomic behavior is practically useful at all

We can and should do this if there's any doubt about the question, yes. Here is such a scenario: I want to match an emoji followed by another character. So I write /\p{RGI_Emoji}./u. Seems like a pretty common thing to want.

Now we should do the same exercise with the non-atomic behavior.

@macchiati
Copy link
Collaborator

If we have the expression /\p{exemplars=sk}h/, I would expect that to match "ch", because "c" is in the exemplars, just as I would expect /(ch|c|h)h/ to match "ch" because "c" is in the alternation.

But more importantly, all of the arguments for atomic vs non-atomic property matches seem just as applicable to atomic vs non-atomic alternation.

There is a (useful proposal) for adding a general atomic operation. I see no particular reason to have a mixed model that uses a different semantic and essentially forces the addition of a general non-atomic operation.

@jridgewell
Copy link
Member

jridgewell commented Jan 13, 2022

Here's a regex to remove skin tones: input.replace(/(\p{RGI_Emoji})[\u{1F3FB}-\u{1F3FF}]/vg, '$1'), which depends on non-atomic behavior.

@rbuckton
Copy link

Has this been conclusively decided on the part of this proposal? There has been no activity on this thread in over four months, and while the issue is closed I am not clear on whether there has been a definitive conclusion.

@mathiasbynens
Copy link
Member

@rbuckton The conclusion was https://github.com/tc39/proposal-regexp-v-flag#are-properties-of-strings-eager--atomic. I see you’ve kicked off a new proposal for atomic operators, yay! Do you want to send a PR linking to it from that FAQ entry?

@mathiasbynens
Copy link
Member

@rbuckton The conclusion was https://github.com/tc39/proposal-regexp-v-flag#are-properties-of-strings-eager--atomic. I see you’ve kicked off a new proposal for atomic operators, yay! Do you want to send a PR linking to it from that FAQ entry?

Done as part of #66.

@rbuckton
Copy link

[...] I see you’ve kicked off a new proposal for atomic operators, yay! [...]

I originally presented this October of last year but it failed to advance.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants