Editorial: Incorporate GetSubstitution's table into the algorithm #2484
Some comments on details, but otherwise looks great.
spec.html
Outdated
1. Let _m_ be the number of elements in _captures_.
1. Let _result_ be the String value derived from _replacement_ by copying code unit elements from _replacement_ to _result_ while performing replacements as specified in <emu-xref href="#table-replacement-text-symbol-substitutions"></emu-xref>. These `$` replacements are done left-to-right, and, once such a replacement is performed, the new replacement text is not subject to further replacements.
1. Return _result_.
1. If _tailPos_ ≥ _stringLength_, then
Not related to your PR, but this is a dumb case. Assuming I understand this correctly, it only happens when code has overridden the default behavior of exec in such a way that it claims to have found a matching substring at a position where that substring is longer than the number of characters left following that position in the input string.
That might warrant a note (though not necessarily in this PR).
Yeah, it struck me as odd too. I think that's the only way that _tailPos_ > _stringLength_ can happen.
I believe I have that branch covered in es-abstract's GetSubstitution with this invocation:
GetSubstitution('abcdef', 'abcdefghi', 0, [], '>$`<'),
does that help?
@ljharb With that invocation, you should have matchLength = 'abcdef'.length = 6 and stringLength = 'abcdefghi'.length = 9, and hence tailPos = position + matchLength = 0 + 6 = 6, and hence tailPos < stringLength. So I don't think that invocation should cover the _tailPos_ > _stringLength_ case I'm referring to here.
ah, i may have grabbed the wrong test case.
GetSubstitution('def', 'abcdefghi', 3, [], '>$`<'),
?
If that's still wrong, I'd love help triggering it :-)
Same deal with that one: position = 3, matchLength = 'def'.length = 3, stringLength = 'abcdefghi'.length = 9, and hence tailPos = position + matchLength = 3 + 3 = 6, and hence tailPos < stringLength. (Oh, and in both cases the replacement-template-string needs to contain $', not $`, to be in the relevant row/branch at all.)
The cases which actually exercise it look like GetSubstitution("1234567", "abc", 0, [], "$'") or GetSubstitution("x", "abc", 3, [], "$'") - in the former case by having the match be longer than the input string, in the latter by having the match be nonempty and occur at the end of the input string.
Obviously that's a very strange thing to do, but you can trigger it by overriding .exec, as in
let evil = new RegExp;
evil.exec = () => ({ 0: '1234567', length: 1, index: 0 });
'abc'.replace(evil, `$'`);
or
let evil = new RegExp;
evil.exec = () => ({ 0: 'x', length: 1, index: 3 });
'abc'.replace(evil, `$'`);
(Fun fact: I just tested these snippets and found bugs in four different JS engines: SpiderMonkey, Chakra, XS, and GraalJS.)
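To make the arithmetic in this exchange concrete, here is a minimal sketch of just the $' computation for the four invocations discussed above. The helper name followingText and its return shape are hypothetical; it is not part of es-abstract or the spec, and it only mirrors the matched, str, and position arguments.

```javascript
// Hypothetical helper (not from the spec or es-abstract): computes the text
// that $' would expand to, plus whether the claimed match overruns str.
function followingText(matched, str, position) {
  const matchLength = matched.length;
  const stringLength = str.length;
  const tailPos = position + matchLength;
  // The branch under discussion: the claimed match runs past the end of str.
  const overruns = tailPos > stringLength;
  const following = tailPos >= stringLength ? '' : str.slice(tailPos);
  return { tailPos, overruns, following };
}

console.log(followingText('abcdef', 'abcdefghi', 0)); // tailPos 6, no overrun
console.log(followingText('def', 'abcdefghi', 3));    // tailPos 6, no overrun
console.log(followingText('1234567', 'abc', 0));      // tailPos 7, overruns
console.log(followingText('x', 'abc', 3));            // tailPos 4, overruns
```

Only the last two calls reach the _tailPos_ > _stringLength_ case, which matches the analysis above.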
(I haven't added a Note yet.)
spec.html
Outdated
1. If _tailPos_ ≥ _stringLength_, then
  1. Let _following_ be the empty String.
1. Else,
  1. Let _following_ be the substring of _str_ from _tailPos_.
Instead of an if-else, I'd probably write this as
1. Let _following_ be the substring of _str_ from min(_tailPos_, _stringLength_).
Well, except, if we did want to insert a Note, we might want to do it like so:
1. If _tailPos_ > _stringLength_, then
  1. NOTE: This can only happen when ...
  1. Let _following_ be the empty String.
1. Else,
  1. Let _following_ be the substring of _str_ from _tailPos_.
(Note that this shifts the "=" case from the "then" to the "else" so that the Note doesn't apply, but that's valid, because it's an empty String either way.)
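As a rough check that the phrasings discussed so far agree, the sketch below models each one in JavaScript. All function names are made up for this comparison. The helper specSubstring is an assumption that stands in for the spec's "substring of ... from ..." and throws when the start index exceeds the length, since plain String.prototype.slice would silently clamp and hide the difference.

```javascript
// Stand-in for the spec's substring operation: unlike slice, it rejects a
// start index beyond the end of the String.
function specSubstring(str, from) {
  if (from < 0 || from > str.length) {
    throw new RangeError('substring start out of range');
  }
  return str.slice(from);
}

// Original if-else, with the "=" case in the "then" branch.
function followingGe(str, tailPos) {
  if (tailPos >= str.length) return '';
  return specSubstring(str, tailPos);
}

// Variant with ">" so that a Note could sit in the "then" branch.
function followingGt(str, tailPos) {
  if (tailPos > str.length) return '';
  return specSubstring(str, tailPos); // yields '' when tailPos === str.length
}

// Single-step variant using min.
function followingMin(str, tailPos) {
  return specSubstring(str, Math.min(tailPos, str.length));
}

// Compare the three for every tailPos from 0 to maxTailPos.
function allAgree(str, maxTailPos) {
  for (let tailPos = 0; tailPos <= maxTailPos; tailPos++) {
    const expected = followingGe(str, tailPos);
    if (followingGt(str, tailPos) !== expected) return false;
    if (followingMin(str, tailPos) !== expected) return false;
  }
  return true;
}

console.log(allAgree('abc', 10)); // true
```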
If we go with the min phrasing, the note can read NOTE: _tailPos_ can exceed _stringLength_ only when [...].
I put in the min formulation, but I think it makes the step a bit hard to read (too much happening in one step). I think I'd prefer something like:
1. Let _tailPos_ be _position_ + _matchLength_.
1. If _tailPos_ > _stringLength_, then
  1. NOTE: This can happen if ...
  1. Set _tailPos_ to _stringLength_.
1. Let _refReplacement_ be the substring of _str_ from _tailPos_.
I like it because it gives you a space to consider the weirdness of _tailPos_ > _stringLength_ separately from the definition of _refReplacement_ for this case.
spec.html
Outdated
  1. NOTE: No replacement is done.
  1. Return _ref_.
1. Else,
  1. Let _capture_ be the _index_<sup>th</sup> element of _captures_.
This seems potentially ambiguous, especially with captures being a 0-indexed List while capture references are 1-indexed.
Suggested change:
1. Let _capture_ be _captures_[_index_ - 1].
Technically, I don't think it's ambiguous, since the definition of Lists sets up the equivalence between the _n_<sup>th</sup> element of _L_ and _L_[_n_ - 1]. However, that's a fairly obscure point, so the suggested change is clearer.
It's a pre-existing condition, so I'll deal with it in a separate commit. I'm thinking I might also add a Note in RegExp.prototype [ @@replace ], maybe after 14.i.iii.
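The off-by-one between 1-indexed capture references and the 0-indexed List can be illustrated with a small sketch of the closure. This is not the spec text: the range check (index 0 or beyond the number of captures means no replacement) is an assumption based on the visible hunk and the $n semantics discussed in this thread, and the JavaScript function only borrows the name _interpretNumericRef_ for readability.

```javascript
// Sketch only; the out-of-range condition is an assumption, not a quote
// from the spec.
function interpretNumericRef(ref, captures) {
  if (!ref.startsWith('$')) {
    throw new Error('ref must start with "$"');
  }
  // "$1" -> 1, "$01" -> 1
  const index = Number(ref.slice(1));
  if (index === 0 || index > captures.length) {
    // No replacement is done: the reference text is kept as-is.
    return ref;
  }
  // Capture references are 1-indexed, the List is 0-indexed.
  const capture = captures[index - 1];
  // An undefined capture contributes the empty String.
  return capture === undefined ? '' : capture;
}

console.log(interpretNumericRef('$1', ['a', 'b']));       // 'a'
console.log(interpretNumericRef('$2', ['a', undefined])); // ''
console.log(interpretNumericRef('$3', ['a', 'b']));       // '$3'
```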
I think it's plenty appropriate to change it in this PR.
Yup, I meant a separate commit in this PR.
I've added that commit.
spec.html
Outdated
1. Else,
  1. Let _following_ be the substring of _str_ from _tailPos_.
1. Let _interpretNumericRef_ be a new Abstract Closure with parameters (_ref_) that captures _captures_ and performs the following steps when called:
  1. Assert: _ref_ starts with *"$"*.
There is no definition for "starts with"; would more precise language be better here (and in step 15)?
Suggested change:
1. Assert: The code unit at index 0 within _ref_ is 0x0024 (DOLLAR SIGN).
There is no definition for "starts with"
That's true, but it's also true of most other phrases that the spec uses to handle String values. (The only ones for which we define specific wording are "the string-concatenation of ..." and "the substring of ...", but phrases involving length and indexing could probably also be considered 'defined'.)
would more precise language be better here (and in step 15)?
The code unit at index 0 within _ref_ is 0x0024 (DOLLAR SIGN)
We could do that. (That would also take care of the kerning problem for $'.) The longer refs in step 15 could get pretty wordy, but rather than (e.g.)
If the code unit at index 0 within _replacerRemainder_ is 0x0024 (DOLLAR SIGN)
and the code unit at index 1 within _replacerRemainder_ is 0x0026 (AMPERSAND), then
we could say:
If the first two code units of _replacerRemainder_ are (respectively)
0x0024 (DOLLAR SIGN) and 0x0026 (AMPERSAND), then
(And again, there isn't a definition for "the first N code units of S", but it's used in parseInt, and we could define it if we felt it was necessary.)
[Edit to add: And actually, both of these have a problem: the code unit at index 1 within _X_ and the first two code units of _X_ aren't well-defined if _X_ is of length 1. Ditto if we tried to use (e.g.) the substring of _replacerRemainder_ from 0 to 2. That's the nice thing about "starts with": if the left string is shorter than the right string, the result is simply false.]
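The length-1 problem has a rough analogue in JavaScript itself, sketched below. The helper startsWithCodeUnits is hypothetical and only shows what an explicitly length-guarded code-unit comparison might look like.

```javascript
const remainder = '$'; // a template that ends with a lone dollar sign

// Index-based phrasing: index 1 does not exist in a length-1 String.
console.log(remainder.charCodeAt(1)); // NaN
console.log(remainder[1]);            // undefined

// "Starts with" phrasing: a shorter left-hand String simply gives false.
console.log(remainder.startsWith('$&')); // false

// Hypothetical code-unit version that guards the length explicitly.
function startsWithCodeUnits(s, ...units) {
  if (s.length < units.length) return false;
  return units.every((unit, i) => s.charCodeAt(i) === unit);
}

console.log(startsWithCodeUnits(remainder, 0x0024, 0x0026)); // false
console.log(startsWithCodeUnits('$&x', 0x0024, 0x0026));     // true
```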
We could also consider defining "starts with". Defining "A starts with B" where A and B are Strings would be fairly easy, but it's debatable whether that would handle step 15.e and 15.f:
15.e Else if _replacerRemainder_ starts with *"$"* followed by 2 (or more) decimal digits, then
15.f Else if _replacerRemainder_ starts with *"$"* followed by 1 decimal digit
because (e.g.) *"$"* followed by 1 decimal digit doesn't denote a String, it denotes a set of Strings. Or you can see it as an implicit existential quantification:
If there exists a String _x_ consisting of *"$"* followed by 1 decimal digit,
such that _replacerRemainder_ starts with _x_, then
I don't have a conclusion, it's just stuff that comes up when you pull on that thread.
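The "set of Strings" reading can be spelled out in a few lines. This is only an illustration of the existential quantification described above. The names dollarDigit, startsWithAny, and startsWithDollarDigit are invented for the example.

```javascript
// The ten Strings denoted by '"$" followed by 1 decimal digit'.
const dollarDigit = Array.from('0123456789', (d) => '$' + d);

// Hypothetical helper: true if s starts with some member of the set.
function startsWithAny(s, candidates) {
  return candidates.some((x) => s.startsWith(x));
}

// The same test written as a pattern instead of an explicit set.
function startsWithDollarDigit(s) {
  return /^\$[0-9]/.test(s);
}

console.log(startsWithAny('$1abc', dollarDigit)); // true
console.log(startsWithAny('$x', dollarDigit));    // false
console.log(startsWithDollarDigit('$'));          // false
```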
For my own part, I don't think we need to define "starts with", given that strings are defined to be sequences and I think readers should be able to understand what it means for one sequence to start with another.
All of this is true, and there are many variations. I admit that I did have the $' problem on my mind as well, and that I don't have a better answer to 15.e than something like
1. Repeat, while _replacerRemainder_ is not the empty String,
  1. Let _len_ be the length of _replacerRemainder_.
  …
  1. Else if _len_ ≥ 3 and …, then
    1. Let _ref_ be the substring of _replacerRemainder_ from 0 to 3.
    1. Let _refReplacement_ be _interpretNumericRef_(_ref_).
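For comparison, the length-checking style of loop might look roughly like this in JavaScript. It is a sketch under assumptions rather than the algorithm from the PR: splitTemplate is a hypothetical name, the branch order is my own, named references ($<) are left out, and the function only splits the template into references and literal code units without substituting anything.

```javascript
// Splits a replacement template into "$" references and literal code units,
// checking the remaining length before looking at a prefix.
function splitTemplate(template) {
  const isDigit = (c) => c >= '0' && c <= '9';
  const parts = [];
  let remainder = template;
  while (remainder !== '') {
    const len = remainder.length;
    let ref;
    if (len >= 2 && remainder[0] === '$' && "$&`'".includes(remainder[1])) {
      // $$, $&, $` or $'
      ref = remainder.slice(0, 2);
    } else if (len >= 3 && remainder[0] === '$' &&
               isDigit(remainder[1]) && isDigit(remainder[2])) {
      // "$" followed by 2 (or more) decimal digits
      ref = remainder.slice(0, 3);
    } else if (len >= 2 && remainder[0] === '$' && isDigit(remainder[1])) {
      // "$" followed by 1 decimal digit
      ref = remainder.slice(0, 2);
    } else {
      // A literal code unit, including a lone trailing "$".
      ref = remainder.slice(0, 1);
    }
    parts.push(ref);
    remainder = remainder.slice(ref.length);
  }
  return parts;
}

console.log(splitTemplate("a$12$&$")); // [ 'a', '$12', '$&', '$' ]
```

Each branch needs its own length guard, which is the wordiness that "starts with" avoids.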
(I've continued to use "starts with".)
LGTM aside from open comments.
dc1358a to d89851f
Force-pushed to rebase to master and resolve merge conflicts. (I should be getting back to this soon.)
d89851f to 8f2e8c8
force-pushed to rebase to master and make changes to address most of the comments (see individual replies)
My current plan for resolving #1426 is to change step 5.f.ii.1.h.i to:
(Note that that's a normative change, so I won't be adding it to this PR.)
Still LGTM other than the kerning thing (for which I have a proposed workaround). Changes seem good.
spec.html
Outdated
1. Let _ref_ be *"$'"*.
1. Let _matchLength_ be the number of code units in _matched_.
1. Let _tailPos_ be _position_ + _matchLength_.
1. Let _refReplacement_ be the substring of _str_ from min(_tailPos_, _stringLength_).
@jmdyck I know you said you think this step has too much happening now, but it seems fine to me.
[shrug] I'll live with it.
But that reminds me: we were talking about maybe having a Note to explain how the weirdness could occur. Should we do that now or leave it to another PR?
Happy either way. If you don't add it here I'll do it in a followup.
I'll leave it to you then.
d7ca379 to 5ca0e59
force-pushed to:
5ca0e59 to 2fceaeb
2fceaeb to eb7ec68
I'll defer to @bakkot and @michaelficarra's review since they have both reviewed.
_preserved_ is the substring before the matched substring, but the name _preserved_ is a bit odd, because the substring *after* the matched substring is also preserved. So rename _preserved_ as _preceding_, and define _following_ to complement it.
Every caller of GetSubstitution refers to its returned value as _replacement_, so it's somewhat confusing for GetSubstitution to use _replacement_ as one of its parameter names. Rename the parameter from _replacement_ to _replacementTemplate_.
…c39#2484) The current assertion ignores the possibility that an element of _captures_ could be *undefined*.
eb7ec68 to f2671ca
Resolves #2479.
I tried various ways to express the table as algorithm steps; I think this is the clearest.
The PR consists of:
32 commits of preliminary refactoring,