Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Editorial: Incorporate GetSubstitution's table into the algorithm #2484

Merged
merged 5 commits into from
Dec 7, 2021

Conversation

jmdyck
Copy link
Collaborator

@jmdyck jmdyck commented Aug 7, 2021

Resolves #2479.

I tried various ways to express the table as algorithm steps; I think this is the clearest.

The PR consists of:

  • 3 2 commits of preliminary refactoring,
  • 1 commit where I fix a type-assertion,
  • 6 squashable commits that gradually transfer stuff over, and
  • 1 more refactoring commit (that could be preliminary).

Copy link
Contributor

@bakkot bakkot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments on details, but otherwise looks great.

spec.html Outdated Show resolved Hide resolved
spec.html Outdated Show resolved Hide resolved
spec.html Outdated Show resolved Hide resolved
spec.html Outdated Show resolved Hide resolved
spec.html Outdated Show resolved Hide resolved
spec.html Outdated
1. Let _m_ be the number of elements in _captures_.
1. Let _result_ be the String value derived from _replacement_ by copying code unit elements from _replacement_ to _result_ while performing replacements as specified in <emu-xref href="#table-replacement-text-symbol-substitutions"></emu-xref>. These `$` replacements are done left-to-right, and, once such a replacement is performed, the new replacement text is not subject to further replacements.
1. Return _result_.
1. If _tailPos_ &ge; _stringLength_, then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not related to your PR, but this is a dumb case. Assuming I understand this correctly, it only happens when code has overridden the default behavior of exec in such a way that it claims to have find a matching substring at a position where that substring is longer than the number of characters left following that position in the input string.

That might warrant a note (though not necessarily in this PR).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it struck me as odd too. I think that's the only way that _tailPos_ > _stringLength_ can happen.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe have that branch covered in es-abstract's GetSubstitution with this invocation:

GetSubstitution('abcdef', 'abcdefghi', 0, [], '>$`<'),

does that help?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ljharb With that invocation, you should have matchLength = 'abcdef'.length = 6 and stringLength = 'abcdefghi'.length = 9, and hence tailPos = position + matchLength = 0 + 6 = 6, and hence tailPos < stringLength. So I don't think that invocation should cover the _tailPos_ > _stringLength_ case I'm referring to here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, i may have grabbed the wrong test case.

GetSubstitution('def', 'abcdefghi', 3, [], '>$`<'),

?

If that's still wrong, I'd love help triggering it :-)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same deal with that one:

position = 3, matchLength = 'def'.length = 3, stringLength = 'abcdefghi'.length = 9, and hence tailPos = position + matchLength = 3 + 3 = 6, and hence tailPos < stringLength. (Oh, and in both cases the replacement-template-string needs to contain $', not $`, to be in the relevant row/branch at all.)

The cases which actually exercise it look like GetSubstitution("1234567", "abc", 0, [], "$'") or GetSubstitution("x", "abc", 3, [], "$'") - in the former case by having the match be longer than the input string, in the latter by having the match be nonempty and occur at the end of the input string.

Obviously that's a very strange thing to do, but you can trigger it by overriding .exec, as in

let evil = new RegExp;
evil.exec = () => ({ 0: '1234567', length: 1, index: 0 });
'abc'.replace(evil, `$'`);

or

let evil = new RegExp;
evil.exec = () => ({ 0: 'x', length: 1, index: 3 });
'abc'.replace(evil, `$'`);

(Fun fact: I just tested these snippets and found bugs in four different JS engines: SpiderMonkey, Chakra, XS, and GraalJS.)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I haven't added a Note yet.)

spec.html Outdated
1. If _tailPos_ &ge; _stringLength_, then
1. Let _following_ be the empty String.
1. Else,
1. Let _following_ be the substring of _str_ from _tailPos_.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of an if-else, I'd probably write this as

Suggested change
1. Let _following_ be the substring of _str_ from _tailPos_.
1. Let _following_ be the substring of _str_ from min(_tailPos_, _stringLength_).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, except, if we did want to insert a Note, we might want to do it like so:

1. If _tailPos_ &gt; _stringLength_, then
  1. NOTE: This can only happen when ...
  1. Let _following_ be the empty String.
1. Else,
  1. Let _following_ be the substring of _str_ from _tailPos_.

(Note that this shifts the "=" case from the "then" to the "else" so that the Note doesn't apply, but that's valid, because it's an empty String either way.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we go with the min phrasing, the note can read NOTE: _tailPos_ can exceed _stringLength_ only when [...].

Copy link
Collaborator Author

@jmdyck jmdyck Sep 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put in the min formulation, but I think it makes the step a bit hard to read (too much happening in one step). I think I'd prefer something like:

1. Let _tailPos_ be _position_ + _matchLength_.
1. If _tailPos_ >_stringLength_, then
  1. NOTE: This can happen if ...
  1. Set _tailPos_ to _stringLength_.
1. Let _refReplacement_ be the substring of _str_ from _tailPos_.

I like it because it gives you a space to consider the weirdness of _tailPos_ > _stringLength_ separately from the definition of _refReplacement_ for this case.

spec.html Outdated Show resolved Hide resolved
spec.html Outdated Show resolved Hide resolved
@bakkot bakkot added the editor call to be discussed in the next editor call label Aug 7, 2021
spec.html Outdated
1. NOTE: No replacement is done.
1. Return _ref_.
1. Else,
1. Let _capture_ be the _index_<sup>th</sup> element of _captures_.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems potentially ambiguous, especially with captures being a 0-indexed List while capture references are 1-indexed.

Suggested change
1. Let _capture_ be the _index_<sup>th</sup> element of _captures_.
1. Let _capture_ be _captures_[_index_ - 1].

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically, I don't think it's ambiguous, since the definition of Lists sets up the equivalence between the _n_<sup>th</sup> element of _L_ and _L_[_n_ - 1]. However, that's a fairly obscure point, so the suggested change is clearer.

It's a pre-existing condition, so I'll deal with it in a separate commit. I'm thinking I might also add a Note in RegExp.prototype [ @@replace ], maybe after 14.i.iii.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's plenty appropriate to change it in this PR.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, I meant a separate commit in this PR.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added that commit.

spec.html Outdated
1. Else,
1. Let _following_ be the substring of _str_ from _tailPos_.
1. Let _interpretNumericRef_ be a new Abstract Closure with parameters (_ref_) that captures _captures_ and performs the following steps when called:
1. Assert: _ref_ starts with *"$"*.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no definition for "starts with"; would more precise language be better here (and in step 15)?

Suggested change
1. Assert: _ref_ starts with *"$"*.
1. Assert: The code unit at index 0 within _ref_ is 0x0024 (DOLLAR SIGN).

Copy link
Collaborator Author

@jmdyck jmdyck Aug 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no definition for "starts with"

That's true, but it's also true of most other phrases that the spec uses to handle String values. (The only ones for which we define specific wording are "the string-concatenation of ..." and "the substring of ...", but phrases involving length and indexing could probably also be considered 'defined'.)

would more precise language be better here (and in step 15)?
The code unit at index 0 within _ref_ is 0x0024 (DOLLAR SIGN)

We could do that. (That would also take care of the kerning problem for $'.) The longer refs in step 15 could get pretty wordy, but rather than (e.g.)

If the code unit at index 0 within _replacerRemainder_ is 0x0024 (DOLLAR SIGN)
and the code unit at index 1 within _replacerRemainder_ is 0x0026 (AMPERSAND), then

we could say:

If the first two code units of _replacerRemainder_ are (respectively)
0x0024 (DOLLAR SIGN) and 0x0026 (AMPERSAND), then

(And again, there isn't a definition for "the first N code units of S", but it's used in parseInt, and we could define it if we felt it was necessary.)

[Edit to add: And actually, both of these have a problem: the code unit at index 1 within _X_ and the first two code units of _X_ aren't well-defined if _X_ is of length 1. Ditto if we tried to use (e.g.) the substring of _replacerRemainder_ from 0 to 2. That's the nice thing about "starts with": if the left string is shorter than the right string, the result is simply false.]


We could also consider defining "starts with". Defining "A starts with B" where A and B are Strings would be fairly easy, but it's debatable whether that would handle step 15.e and 15.f:

15.e Else if _replacerRemainder_ starts with *"$"* followed by 2 (or more) decimal digits, then
15.f Else if _replacerRemainder_ starts with *"$"* followed by 1 decimal digit

because (e.g.) *"$"* followed by 1 decimal digit doesn't denote a String, it denotes a set of Strings. Or you can see it as an implicit existential quantification:

If there exists a String _x_ consisting of *"$"* followed by 1 decimal digit,
such that _replacerRemainder_ starts with _x_, then

I don't have a conclusion, it's just stuff that comes up when you pull on that thread.

Copy link
Contributor

@bakkot bakkot Aug 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my own part, I don't think we need to define "starts with", given that strings are defined to be sequences and I think readers should be able to understand what it means for one sequence to start with another.

Copy link
Contributor

@gibson042 gibson042 Aug 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of this is true, and there are many variations. I admit that I did have the $' problem on my mind as well, and that I don't have a better answer to 15.e than something like

            1. Repeat, while _replacerRemainder_ is not the empty String,
              1. Let _len_ be the length of _replacerRemainder_.
              …
              1. Else if _len_ &ge; 3 and …, then
                1. Let _ref_ be the substring of _replacerRemainder_ from 0 to 3.
                1. Let _refReplacement_ be _interpretNumericRef_(_ref_).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I've continued to use "starts with".)

spec.html Outdated Show resolved Hide resolved
ljharb added a commit to ljharb/es-abstract that referenced this pull request Aug 8, 2021
@jmdyck
Copy link
Collaborator Author

jmdyck commented Aug 17, 2021

Reminder that GetSubstitution is also the subject of issue #1426. I've started thinking about how to handle that.

(See also PR #1732, which proposes a fix for that issue, but won't combine well with this PR.)

michaelficarra
michaelficarra previously approved these changes Aug 18, 2021
Copy link
Member

@michaelficarra michaelficarra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM aside from open comments.

@michaelficarra
Copy link
Member

@jmdyck From editor call today, editor consensus either aligns with @bakkot's suggestions up to this point or is neutral to them.

@michaelficarra michaelficarra removed the editor call to be discussed in the next editor call label Aug 18, 2021
@jmdyck
Copy link
Collaborator Author

jmdyck commented Sep 14, 2021

Force-pushed to rebase to master and resolve merge conflicts.

(I should be getting back to this soon.)

@jmdyck
Copy link
Collaborator Author

jmdyck commented Sep 16, 2021

force-pushed to rebase to master and make changes to address most of the comments (see individual replies)

@jmdyck
Copy link
Collaborator Author

jmdyck commented Sep 16, 2021

Reminder that GetSubstitution is also the subject of issue #1426. I've started thinking about how to handle that.

My current plan for resolving #1426 is to change step 5.f.ii.1.h.i to:

1. If _d_ = 2, then
  1. Set _found_ to *false*.
  1. NOTE: "Fall back" to trying a 1-digit reference.
1. Else,
  1. Let _refReplacement_ be _ref_.

(Note that that's a normative change, so I won't be adding it to this PR.)

Copy link
Contributor

@bakkot bakkot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still LGTM other than the kerning thing (for which I have a proposed workaround). Changes seem good.

spec.html Outdated Show resolved Hide resolved
spec.html Outdated
1. Let _ref_ be *"$'"*.
1. Let _matchLength_ be the number of code units in _matched_.
1. Let _tailPos_ be _position_ + _matchLength_.
1. Let _refReplacement_ be the substring of _str_ from min(_tailPos_, _stringLength_).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jmdyck I know you said you think this step has too much happening now, but it seems fine to me.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[shrug] I'll live with it.

But that reminds me: we were talking about maybe having a Note to explain how the weirdness could occur. Should we do that now or leave it to another PR?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy either way. If you don't add it here I'll do it in a followup.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll leave it to you then.

@bakkot bakkot added the editor call to be discussed in the next editor call label Sep 29, 2021
@jmdyck
Copy link
Collaborator Author

jmdyck commented Sep 30, 2021

force-pushed to:

  • rebase to master,
  • squash the squashable commits, and
  • add a fixup commit re the string with bad kerning.

@michaelficarra michaelficarra removed the editor call to be discussed in the next editor call label Sep 30, 2021
@syg
Copy link
Contributor

syg commented Nov 24, 2021

I'll defer to @bakkot and @michaelficarra's review since they have both reviewed.

@bakkot bakkot added the ready to merge Editors believe this PR needs no further reviews, and is ready to land. label Nov 24, 2021
_preserved_ is the substring before the matched substring,
but the name _preserved_ is a bit odd,
because the substring *after* the matched substring is also preserved.
So rename _preserved_ as _preceding_,
and define _following_ to complement it.
Every caller of GetSubstitution refers to its returned value as _replacement_,
so it's somewhat confusing for GetSubstitution
to use _replacement_ as one of its parameter names.

Rename the parameter from _replacement_ to _replacementTemplate_.
…c39#2484)

The current assertion ignores the possibility
that an element of _captures_ could be *undefined*.
…c39#2484)

... to `_captures_[_index_ - 1]` in `GetSubstitution`.

Also, in `RegExp.prototype [ @@replace ]`
(the only place that passes a non-empty `_captures_` to `GetSubstitution`),
add a NOTE to underline the fact that
`_captures_[_n_ - 1]` is the `_n_`th capture.
@ljharb ljharb merged commit f2671ca into tc39:main Dec 7, 2021
@jmdyck jmdyck deleted the GetSubstitution branch December 8, 2021 02:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
editorial change ready to merge Editors believe this PR needs no further reviews, and is ready to land.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

inline Table 57: Replacement Text Symbol Substitutions
6 participants