-
Notifications
You must be signed in to change notification settings - Fork 12
Are properties of strings atomic? #50
Comments
That doesn't actually answer the question I asked. |
I think it does — the FAQ entry gives a concrete example of a hypothetical property of strings that “expands to” the strings
Similarly, a hypothetical property of strings that matches only the two strings in your example: const five = '\u{1F468}\u{1F3FE}\u200D\u2695\uFE0F';
const re = /(?:\u{1F468}\u{1F3FE}\u200D\u2695\uFE0F|\u{1F468}\u{1F3FE}).../u;
console.assert(five.match(re)[0] === five); |
Ah, ok. I assumed that "behaves like" meant "behaves similarly for the purposes of match order, the topic of this FAQ entry", not "is precisely identical in all situations". Anyway, that behavior is, to put it lightly, surprising to me, and I expect it to be surprising to developers. Was this discussed anywhere except in tc39/proposal-regexp-unicode-sequence-properties#20? That issue seems to devolve into the question of matching grapheme clusters, rather than the question of whether you can backtrack across a |
The following simplified snippet is equivalent: const five = 'abcde';
const re = /(?:abcde|ab).../;
console.assert(five.match(re)[0] === five); The pattern attempts to find one of ( |
I think @bakkot is expecting the match to be possessive: once the sequence matches the full When thinking through this, I initially agreed with @mathiasbynens, thinking it should backtrack. But then when I look at the result of the match from #50 (comment), all I see is 👨🏾⚕️, a single grapheme cluster. Reading the regex, I expect to see 3 trailing chars after my sequence, but they're getting folded into the default 👨🏾 emoji. That does surprise me. So now I agree with @bakkot. |
Are you saying you’d expect different semantics between strings that form a single grapheme cluster (e.g. |
I think @jridgewell is talking about this? /\p{RGI_Emoji}.../.test("👨🏾⚕️") Does it return true because |
Well, whether a match is possible in my example depends on the semantics you assume it has. Here I guess my intuition is basically that it's an atomic group containing a disjunction, rather than a plain disjunction. For me it's not so much about grapheme clusters per se; it's just that I'm expecting |
As for the proposed spec, see https://arai-a.github.io/ecma262-compare/?pr=2418 section “22.2.2.7 Runtime Semantics: CompileAtom” / “Atom :: CharacterClass” -- it builds a plain disjunction. I am not sure how to decide between the two possible behaviors (atomic group or not). |
Note: There is a question of whether Unicode line endings could be expressed using a That would work only if a property of strings / character class with strings was matched as an atomic group, because we never want a line ending expression to match just the CR of a CR LF line ending. |
Let me choose a simpler example for clarity: /\p{RGI_Emoji}🏽/
I would expect that to match anything that was an emoji, followed by a 🏽.
It would match currently 3,633 Code Points. Some of what it matches would
be RGI_Emoji, but some wouldn't, like "🫐🏽". Anything else seems bizarre.
Mark
…On Wed, Dec 15, 2021 at 7:32 AM Kevin Gibbons ***@***.***> wrote:
It would be surprising to NOT find a match when a match is possible.
Well, whether a match is possible in my example depends on the semantics
you assume it has. Here I guess my intuition is basically that it's an atomic
group <https://www.regular-expressions.info/atomic.html> containing a
disjunction, rather than a plain disjunction.
For me it's not so much about grapheme clusters *per se*; it's just that
I'm expecting \p{} to eat the longest substring available at that index.
For me, the point of using \p{RGI_Emoji} would be so that I don't have to
think about all the complexity of how emojis are actually encoded: that
they are actually strings of various lengths and that some of them are
prefixes of each other and so on. But if \p can stop in the middle of an
emoji, exposing the rest of my regex to its remaining code points, now I do
have to think about it again.
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#50 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ACJLEMCPH466OAOXJRMX7ADURCYHJANCNFSM5KCXAKEQ>
.
Triage notifications on the go with GitHub Mobile for iOS
<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
or Android
<https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
|
I'm not convinced that the non-atomic behavior is more desirable in this case. I have a hard time seeing why people would commonly write On the other hand, if we do end up convinced that the non-atomic behavior is better, I note that one possible outcome could be to advance @rbuckton's regex atomic operators proposal, and then say that the behavior I expect must be opted into by wrapping the (That proposal did not advance when it was last presented, but most of the concerns [except my own] were about the expressed motivation for it, which was focused on performance; it's possible it could proceed given other motivation.) I hesitate to bring this up, because it involves choosing what I think is the worse behavior for |
Well, the blueberries plus skintone would match either way, because that sequence is not an RGI_Emoji. Where it makes a difference is a character that is in RGI_Emoji both with and without skintone modifier, such as 🎅 U+1F385 FATHER CHRISTMAS.
|
Let me try a simpler example with ASCII characters. Suppose that we have a property of strings If we have the expression I think the regex atomic operators would be great to have — completely independently of properties of strings. And that would let users have the choice: if they want the atomic behavior they could use If we forced the properties of strings (and their equivalent literals) to be atomic, then we have a disconnect with how the existing non-atomic expressions (like alternation) work, and then to give people that choice we'd have to introduce the non-atomic regex operator — an unnecessary complication when this would really be the only situation that would need that. |
That's a good point: it's "easy" (= we can copy from other languages) to make a non-atomic pattern atomic, while I don't think there are many languages allowing the opposite? |
I’m proposing to resolve this issue with #53. |
@nicolo-ribaudo That is true, and also typically the right kind of design. But for this, I feel the non-atomic behaviour is surprising and probably not nearly as useful. What's an actual use case for the non-atomic match? |
See the discussion and examples above. Seemingly reasonable practitioners have different intuitions about atomic vs. non-atomic matching, so we decided to leave properties of strings as proposed, exactly equivalent to a simple disjunction, and leave the choice to the regex author who may (in the future) choose to add an operator for atomic matching, or one for a grapheme cluster boundary. (There are precedents in other regex implementations for both of these.) Making properties of strings atomic means that those practitioners who want non-atomic matching would have to unravel the contents of the property and hardcode it as a disjunction. |
@markusicu Which examples above should I see, specifically? I have read the above thread and don't know what you're referring to. What are these practitioners trying to do that would be easier or clearer with a non-atomic match? To repeat, I understand that it is easy (and follows what would typically be a good design practice) to make a non-atomic match atomic. I just have not been convinced that it's even useful at all to have a non-atomic match, which would make it an especially bad default. |
Personally, I don't know how to decide which behavior is "right". I was persuaded by the argument that different people have different expectations (and find the respective "other behavior" surprising). By defaulting to a simple disjunction together with the atomic modifier (proposed elsewhere) each behavior is available --> #50 (comment) |
@markusicu Each of those examples is just describing the difference between atomic/non-atomic matches. None advocate for non-atomic behaviour being the right default or even being practically useful at all. Surely if the non-atomic match was useful we could come up with some contrived scenario where someone might want that behaviour, right? As a side note, I value ease of reading code (especially regexps) more highly than ease of writing code. Do you think we should run a user study to see if the choice of matching behaviour here would trip up readers who have only a passing familiarity with this feature? I personally suspect that readers will expect atomic matching behaviour (though whether they'll understand its consequences on backtracking is also unknown). |
I think it is quite clear that at least Mathias and Mark expect non-atomic matching behavior. All we have so far is people's expectations and intuitions, except that the line-ending matchers want to be atomic.
I don't know how well these kinds of surveys work. It seems like the trick is to design the question so that people really understand the choices. Otherwise the results are not useful. Ideally, online demos or sample code that people can try would give people a better feel, but that would be a lot of work. |
This doesn't seem productive. We could easily flip this and ask for some scenario where atomic behavior is practically useful at all. Any hypothetical can go both ways. Only one choice here allows both outcomes, so it seems the best choice to make. |
We can and should do this if there's any doubt about the question, yes. Here is such a scenario: I want to match an emoji followed by another character. So I write Now we should do the same exercise with the non-atomic behavior. |
If we have the expression /\p{exemplars=sk}h/, I would expect that to match "ch", because "c" is in the exemplars, just as I would expect /(ch|c|h)h/ to match "ch" because "c" is in the alternation. But more importantly, all of the arguments for atomic vs non-atomic property matches seem just as applicable to atomic vs non-atomic alternation. There is a (useful proposal) for adding a general atomic operation. I see no particular reason to have a mixed model that uses a different semantic and essentially forces the addition of a general non-atomic operation. |
Here's a regex to remove skin tones: |
Has this been conclusively decided on the part of this proposal? There has been no activity on this thread in over four months, and while the issue is closed I am not clear on whether there has been a definitive conclusion. |
@rbuckton The conclusion was https://github.com/tc39/proposal-regexp-v-flag#are-properties-of-strings-eager--atomic. I see you’ve kicked off a new proposal for atomic operators, yay! Do you want to send a PR linking to it from that FAQ entry? |
Done as part of #66. |
I originally presented this October of last year but it failed to advance. |
Concretely: does
/\p{RGI_Emoji}.../u
(note the.
s) match'\u{1F468}\u{1F3FE}\u200D\u2695\uFE0F'
? That sequence of five points is inRGI_Emoji
(as an emoji ZWJ sequence), but so is its prefix\u{1F468}\u{1F3FE}
(as an emoji sequence).I would assume it doesn't match (it says "an emoji followed by three more code points", and the string is not that), and that's what the discussion earlier seems to me to imply, but I'm not sure I'm reading it correctly. And the spec text doesn't seem to actually specify sequence properties (unless I've missed it).
The text was updated successfully, but these errors were encountered: