Editorial: Incorporate GetSubstitution's table into the algorithm #2484

jmdyck · 2021-08-07T05:18:17Z

Resolves #2479.

I tried various ways to express the table as algorithm steps; I think this is the clearest.

The PR consists of:

3 2 commits of preliminary refactoring,
1 commit where I fix a type-assertion,
6 squashable commits that gradually transfer stuff over, and
1 more refactoring commit (that could be preliminary).

bakkot

Some comments on details, but otherwise looks great.

spec.html

bakkot · 2021-08-07T07:14:36Z

spec.html

-            1. Let _m_ be the number of elements in _captures_.
-            1. Let _result_ be the String value derived from _replacement_ by copying code unit elements from _replacement_ to _result_ while performing replacements as specified in <emu-xref href="#table-replacement-text-symbol-substitutions"></emu-xref>. These `$` replacements are done left-to-right, and, once such a replacement is performed, the new replacement text is not subject to further replacements.
-            1. Return _result_.
+            1. If _tailPos_ &ge; _stringLength_, then


Not related to your PR, but this is a dumb case. Assuming I understand this correctly, it only happens when code has overridden the default behavior of exec in such a way that it claims to have find a matching substring at a position where that substring is longer than the number of characters left following that position in the input string.

That might warrant a note (though not necessarily in this PR).

Yeah, it struck me as odd too. I think that's the only way that _tailPos_ > _stringLength_ can happen.

I believe have that branch covered in es-abstract's GetSubstitution with this invocation:

GetSubstitution('abcdef', 'abcdefghi', 0, [], '>$`<'),

does that help?

@ljharb With that invocation, you should have matchLength = 'abcdef'.length = 6 and stringLength = 'abcdefghi'.length = 9, and hence tailPos = position + matchLength = 0 + 6 = 6, and hence tailPos < stringLength. So I don't think that invocation should cover the _tailPos_ > _stringLength_ case I'm referring to here.

ah, i may have grabbed the wrong test case.

GetSubstitution('def', 'abcdefghi', 3, [], '>$`<'),

?

If that's still wrong, I'd love help triggering it :-)

Same deal with that one:

position = 3, matchLength = 'def'.length = 3, stringLength = 'abcdefghi'.length = 9, and hence tailPos = position + matchLength = 3 + 3 = 6, and hence tailPos < stringLength. (Oh, and in both cases the replacement-template-string needs to contain $', not $`, to be in the relevant row/branch at all.)

The cases which actually exercise it look like GetSubstitution("1234567", "abc", 0, [], "$'") or GetSubstitution("x", "abc", 3, [], "$'") - in the former case by having the match be longer than the input string, in the latter by having the match be nonempty and occur at the end of the input string.

Obviously that's a very strange thing to do, but you can trigger it by overriding .exec, as in

let evil = new RegExp; evil.exec = () => ({ 0: '1234567', length: 1, index: 0 }); 'abc'.replace(evil, `$'`);

or

let evil = new RegExp; evil.exec = () => ({ 0: 'x', length: 1, index: 3 }); 'abc'.replace(evil, `$'`);

(Fun fact: I just tested these snippets and found bugs in four different JS engines: SpiderMonkey, Chakra, XS, and GraalJS.)

(I haven't added a Note yet.)

bakkot · 2021-08-07T07:18:49Z

spec.html

+            1. If _tailPos_ &ge; _stringLength_, then
+              1. Let _following_ be the empty String.
+            1. Else,
+              1. Let _following_ be the substring of _str_ from _tailPos_.


Instead of an if-else, I'd probably write this as

Suggested change

1. Let _following_ be the substring of _str_ from _tailPos_.

1. Let _following_ be the substring of _str_ from min(_tailPos_, _stringLength_).

Well, except, if we did want to insert a Note, we might want to do it like so:

1. If _tailPos_ > _stringLength_, then 1. NOTE: This can only happen when ... 1. Let _following_ be the empty String. 1. Else, 1. Let _following_ be the substring of _str_ from _tailPos_.

(Note that this shifts the "=" case from the "then" to the "else" so that the Note doesn't apply, but that's valid, because it's an empty String either way.)

If we go with the min phrasing, the note can read NOTE: _tailPos_ can exceed _stringLength_ only when [...].

I put in the min formulation, but I think it makes the step a bit hard to read (too much happening in one step). I think I'd prefer something like:

1. Let _tailPos_ be _position_ + _matchLength_. 1. If _tailPos_ >_stringLength_, then 1. NOTE: This can happen if ... 1. Set _tailPos_ to _stringLength_. 1. Let _refReplacement_ be the substring of _str_ from _tailPos_.

I like it because it gives you a space to consider the weirdness of _tailPos_ > _stringLength_ separately from the definition of _refReplacement_ for this case.

spec.html

gibson042 · 2021-08-08T01:48:50Z

spec.html

+                1. NOTE: No replacement is done.
+                1. Return _ref_.
+              1. Else,
+                1. Let _capture_ be the _index_<sup>th</sup> element of _captures_.


This seems potentially ambiguous, especially with captures being a 0-indexed List while capture references are 1-indexed.

Suggested change

1. Let _capture_ be the _index_<sup>th</sup> element of _captures_.

1. Let _capture_ be _captures_[_index_ - 1].

Technically, I don't think it's ambiguous, since the definition of Lists sets up the equivalence between the _n_<sup>th</sup> element of _L_ and _L_[_n_ - 1]. However, that's a fairly obscure point, so the suggested change is clearer.

It's a pre-existing condition, so I'll deal with it in a separate commit. I'm thinking I might also add a Note in RegExp.prototype [ @@replace ], maybe after 14.i.iii.

I think it's plenty appropriate to change it in this PR.

Yup, I meant a separate commit in this PR.

I've added that commit.

gibson042 · 2021-08-08T02:02:12Z

spec.html

+            1. Else,
+              1. Let _following_ be the substring of _str_ from _tailPos_.
+            1. Let _interpretNumericRef_ be a new Abstract Closure with parameters (_ref_) that captures _captures_ and performs the following steps when called:
+              1. Assert: _ref_ starts with *"$"*.


There is no definition for "starts with"; would more precise language be better here (and in step 15)?

Suggested change

1. Assert: _ref_ starts with *"$"*.

1. Assert: The code unit at index 0 within _ref_ is 0x0024 (DOLLAR SIGN).

There is no definition for "starts with"

That's true, but it's also true of most other phrases that the spec uses to handle String values. (The only ones for which we define specific wording are "the string-concatenation of ..." and "the substring of ...", but phrases involving length and indexing could probably also be considered 'defined'.)

would more precise language be better here (and in step 15)?
The code unit at index 0 within _ref_ is 0x0024 (DOLLAR SIGN)

We could do that. (That would also take care of the kerning problem for $'.) The longer refs in step 15 could get pretty wordy, but rather than (e.g.)

If the code unit at index 0 within _replacerRemainder_ is 0x0024 (DOLLAR SIGN) and the code unit at index 1 within _replacerRemainder_ is 0x0026 (AMPERSAND), then

we could say:

If the first two code units of _replacerRemainder_ are (respectively) 0x0024 (DOLLAR SIGN) and 0x0026 (AMPERSAND), then

(And again, there isn't a definition for "the first N code units of S", but it's used in parseInt, and we could define it if we felt it was necessary.)

[Edit to add: And actually, both of these have a problem: the code unit at index 1 within _X_ and the first two code units of _X_ aren't well-defined if _X_ is of length 1. Ditto if we tried to use (e.g.) the substring of _replacerRemainder_ from 0 to 2. That's the nice thing about "starts with": if the left string is shorter than the right string, the result is simply false.]

We could also consider defining "starts with". Defining "A starts with B" where A and B are Strings would be fairly easy, but it's debatable whether that would handle step 15.e and 15.f:

15.e Else if _replacerRemainder_ starts with *"$"* followed by 2 (or more) decimal digits, then 15.f Else if _replacerRemainder_ starts with *"$"* followed by 1 decimal digit

because (e.g.) *"$"* followed by 1 decimal digit doesn't denote a String, it denotes a set of Strings. Or you can see it as an implicit existential quantification:

If there exists a String _x_ consisting of *"$"* followed by 1 decimal digit, such that _replacerRemainder_ starts with _x_, then

I don't have a conclusion, it's just stuff that comes up when you pull on that thread.

For my own part, I don't think we need to define "starts with", given that strings are defined to be sequences and I think readers should be able to understand what it means for one sequence to start with another.

All of this is true, and there are many variations. I admit that I did have the $' problem on my mind as well, and that I don't have a better answer to 15.e than something like

1. Repeat, while _replacerRemainder_ is not the empty String, 1. Let _len_ be the length of _replacerRemainder_. … 1. Else if _len_ ≥ 3 and …, then 1. Let _ref_ be the substring of _replacerRemainder_ from 0 to 3. 1. Let _refReplacement_ be _interpretNumericRef_(_ref_).

(I've continued to use "starts with".)

spec.html

From tc39/ecma262#2484 (comment)

jmdyck · 2021-08-17T00:54:11Z

Reminder that GetSubstitution is also the subject of issue #1426. I've started thinking about how to handle that.

(See also PR #1732, which proposes a fix for that issue, but won't combine well with this PR.)

michaelficarra

LGTM aside from open comments.

michaelficarra · 2021-08-18T22:03:58Z

@jmdyck From editor call today, editor consensus either aligns with @bakkot's suggestions up to this point or is neutral to them.

jmdyck · 2021-09-14T13:21:39Z

Force-pushed to rebase to master and resolve merge conflicts.

(I should be getting back to this soon.)

jmdyck · 2021-09-16T13:10:40Z

force-pushed to rebase to master and make changes to address most of the comments (see individual replies)

jmdyck · 2021-09-16T13:47:43Z

Reminder that GetSubstitution is also the subject of issue #1426. I've started thinking about how to handle that.

My current plan for resolving #1426 is to change step 5.f.ii.1.h.i to:

1. If _d_ = 2, then
  1. Set _found_ to *false*.
  1. NOTE: "Fall back" to trying a 1-digit reference.
1. Else,
  1. Let _refReplacement_ be _ref_.

(Note that that's a normative change, so I won't be adding it to this PR.)

stale

bakkot

Still LGTM other than the kerning thing (for which I have a proposed workaround). Changes seem good.

spec.html

bakkot · 2021-09-29T02:50:46Z

spec.html

+                1. Let _ref_ be *"$'"*.
+                1. Let _matchLength_ be the number of code units in _matched_.
+                1. Let _tailPos_ be _position_ + _matchLength_.
+                1. Let _refReplacement_ be the substring of _str_ from min(_tailPos_, _stringLength_).


@jmdyck I know you said you think this step has too much happening now, but it seems fine to me.

[shrug] I'll live with it.

But that reminds me: we were talking about maybe having a Note to explain how the weirdness could occur. Should we do that now or leave it to another PR?

Happy either way. If you don't add it here I'll do it in a followup.

I'll leave it to you then.

jmdyck · 2021-09-30T14:05:14Z

force-pushed to:

rebase to master,
squash the squashable commits, and
add a fixup commit re the string with bad kerning.

spec.html

syg · 2021-11-24T22:49:16Z

I'll defer to @bakkot and @michaelficarra's review since they have both reviewed.

_preserved_ is the substring before the matched substring, but the name _preserved_ is a bit odd, because the substring *after* the matched substring is also preserved. So rename _preserved_ as _preceding_, and define _following_ to complement it.

Every caller of GetSubstitution refers to its returned value as _replacement_, so it's somewhat confusing for GetSubstitution to use _replacement_ as one of its parameter names. Rename the parameter from _replacement_ to _replacementTemplate_.

…c39#2484) The current assertion ignores the possibility that an element of _captures_ could be *undefined*.

…39#2484)

@replace

…c39#2484) ... to `_captures_[_index_ - 1]` in `GetSubstitution`. Also, in `RegExp.prototype [ @@replace ]` (the only place that passes a non-empty `_captures_` to `GetSubstitution`), add a NOTE to underline the fact that `_captures_[_n_ - 1]` is the `_n_`th capture.

jmdyck added the editorial change label Aug 7, 2021

bakkot approved these changes Aug 7, 2021

View reviewed changes

bakkot added the editor call to be discussed in the next editor call label Aug 7, 2021

gibson042 reviewed Aug 8, 2021

View reviewed changes

ljharb added a commit to ljharb/es-abstract that referenced this pull request Aug 8, 2021

[Tests] GetSubstitution: add cases

bcf0fd2

From tc39/ecma262#2484 (comment)

michaelficarra previously approved these changes Aug 18, 2021

View reviewed changes

michaelficarra removed the editor call to be discussed in the next editor call label Aug 18, 2021

jmdyck force-pushed the GetSubstitution branch from dc1358a to d89851f Compare September 14, 2021 13:20

jmdyck force-pushed the GetSubstitution branch from d89851f to 8f2e8c8 Compare September 16, 2021 13:08

michaelficarra requested a review from bakkot September 16, 2021 16:09

bakkot approved these changes Sep 29, 2021

View reviewed changes

bakkot added the editor call to be discussed in the next editor call label Sep 29, 2021

jmdyck force-pushed the GetSubstitution branch from d7ca379 to 5ca0e59 Compare September 30, 2021 14:04

michaelficarra removed the editor call to be discussed in the next editor call label Sep 30, 2021

jmdyck force-pushed the GetSubstitution branch from 5ca0e59 to 2fceaeb Compare November 12, 2021 01:35

michaelficarra reviewed Nov 12, 2021

View reviewed changes

spec.html Show resolved Hide resolved

michaelficarra approved these changes Nov 12, 2021

View reviewed changes

michaelficarra requested a review from syg November 12, 2021 02:13

jmdyck force-pushed the GetSubstitution branch from 2fceaeb to eb7ec68 Compare November 12, 2021 04:59

bakkot added the ready to merge Editors believe this PR needs no further reviews, and is ready to land. label Nov 24, 2021

jmdyck added 2 commits December 7, 2021 07:39

jmdyck added 3 commits December 7, 2021 07:40

Editorial: fix the type-assertion for _captures_ in GetSubstitution (t…

9070c04

…c39#2484) The current assertion ignores the possibility that an element of _captures_ could be *undefined*.

Editorial: Incorporate GetSubstitution's table into the algorithm (tc…

99c0311

…39#2484)

ljharb force-pushed the GetSubstitution branch from eb7ec68 to f2671ca Compare December 7, 2021 15:40

ljharb merged commit f2671ca into tc39:main Dec 7, 2021

jmdyck deleted the GetSubstitution branch December 8, 2021 02:53

bakkot mentioned this pull request Dec 30, 2021

Editorial: Note how edge case in GetSubstitution can arise #2603

Merged

jmdyck mentioned this pull request Apr 14, 2022

Editorial: Simplify GetSubstitution #2739

Merged

	1. Let _following_ be the substring of _str_ from _tailPos_.
	1. Let _following_ be the substring of _str_ from min(_tailPos_, _stringLength_).

	1. Let _capture_ be the _index_<sup>th</sup> element of _captures_.
	1. Let _capture_ be _captures_[_index_ - 1].

	1. Assert: _ref_ starts with "$".
	1. Assert: The code unit at index 0 within _ref_ is 0x0024 (DOLLAR SIGN).

Editorial: Incorporate GetSubstitution's table into the algorithm #2484

Editorial: Incorporate GetSubstitution's table into the algorithm #2484

Conversation

jmdyck commented Aug 7, 2021 • edited Loading

bakkot left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jmdyck Sep 16, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jmdyck Aug 8, 2021 • edited Loading

Choose a reason for hiding this comment

bakkot Aug 8, 2021 • edited Loading

Choose a reason for hiding this comment

gibson042 Aug 8, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jmdyck commented Aug 17, 2021

michaelficarra left a comment • edited Loading

Choose a reason for hiding this comment

michaelficarra commented Aug 18, 2021

jmdyck commented Sep 14, 2021

jmdyck commented Sep 16, 2021

jmdyck commented Sep 16, 2021

bakkot left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jmdyck commented Sep 30, 2021

syg commented Nov 24, 2021

jmdyck commented Aug 7, 2021 •

edited

Loading

jmdyck Sep 16, 2021 •

edited

Loading

jmdyck Aug 8, 2021 •

edited

Loading

bakkot Aug 8, 2021 •

edited

Loading

gibson042 Aug 8, 2021 •

edited

Loading

michaelficarra left a comment •

edited

Loading