Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Normative: Make B.1 "Additional Syntax" normative #1651

Closed
wants to merge 12 commits into from

Conversation

jmdyck
Copy link
Collaborator

@jmdyck jmdyck commented Jul 31, 2019

This PR was originally created to make all of B.1 "Additional Syntax" normative (incorporating it into the body of the spec, as part of the reform described in PR #1595).

But then I split off PR #1867 to handle:

  • B.1.1 "Numeric Literals" and
  • B.1.2 "String Literals"

So now this PR only handles:

  • B.1.3 "HTML-like comments" and
  • B.1.4 "Regular Expressions Patterns"

Notes:

  • I've added "allowed but deprecated" notes. (should maybe be phrased in terms of "legacy")
  • I dropped the entries in 16.2 "Forbidden Extensions", because the syntax is no longer an extension (although it's still disallowed in strict mode code, of course). [Later: See Normative: Make B.1.{1,2} normative #1867 for some discussion of this topic.]
  • I left placeholder summaries in B.1.* to keep section numbers in sync, and in case a tombstone is wanted. [Later: Not wanted in Normative: Make B.1.{1,2} normative #1867, so probably not wanted here either.]
  • It's possible that rebasing has done some damage, so I'll have to re-check everything if this PR goes forward.

@jmdyck jmdyck mentioned this pull request Jul 31, 2019
22 tasks
Copy link

@erights erights left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you make available a rendered form? For my purposes, I'd rather review by reading that.
Thanks.

spec.html Outdated Show resolved Hide resolved
spec.html Outdated
LegacyOctalLikeDecimalIntegerLiteral ::
`0` OctalDigit
LegacyOctalLikeDecimalIntegerLiteral OctalDigit
<p>The following syntax from <emu-xref href="#sec-literals-numeric-literals"></emu-xref>, and its associated semantics, used to be normative optional:</p>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it matter what it used to be?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not my call. Like I said:

I left a placeholder summary in B.1.1 to keep section numbers in sync, and in case a tombstone is wanted.

See this comment and following.

Did the July meeting make a decision on what (if anything) is wanted here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was not discussed, to my knowledge.

@jmdyck
Copy link
Collaborator Author

jmdyck commented Jul 31, 2019

Could you make available a rendered form?

Is it possible to opt into the auto-rendering that happens for tc39/ecma262?

@ljharb
Copy link
Member

ljharb commented Jul 31, 2019

@jmdyck not at this time; we do plan on making all PRs have auto-render behavior at some point, but for now you'll have to do it manually :-/

@jmdyck jmdyck force-pushed the one-grammar branch 2 times, most recently from 0768ad3 to e79b676 Compare August 1, 2019 02:49
@jmdyck
Copy link
Collaborator Author

jmdyck commented Aug 1, 2019

Could you make available a rendered form?

Here you go

[Edit:] Now here

@jmdyck jmdyck force-pushed the one-grammar branch 4 times, most recently from 39bea71 to f7db111 Compare August 4, 2019 21:43
@waldemarhorwat
Copy link

I'm extremely uncomfortable with pulling B.1 into the main body of the spec. This was decided at a time when I wasn't present, and it wasn't clearly on the agenda prior to the meeting, so I had no notice of it.

There are significant problems with integrating B.1 into the spec. This breaks the separation between things we wish weren't in the spec from ones we do. Worse, some of the things in B.1 do not parse unambiguously — the order in which the grammar rules are written matters — and it would do terrible violence to the main grammar's readability to make its rules order-dependent.

@bakkot
Copy link
Contributor

bakkot commented Aug 22, 2019

@waldemarhorwat, I can tell you from personal experience that B.1 not being in the main body of the spec causes implementors (of tooling, at the very least) to waste significant amounts of time implementing a grammar which is not in practice used by anyone anywhere, which they then have to go back and replace once they discover that they've implemented the wrong thing. (Specifically, a coworker of mine did exactly this earlier this year.)

I don't see the value in keeping the non-annex-B grammar. It's true that it is simpler, but it is also irrelevant.

@jmdyck
Copy link
Collaborator Author

jmdyck commented Aug 22, 2019

This breaks the separation between things we wish weren't in the spec from ones we do.

The spec can still make that distinction. (E.g., @littledan says "In my opinion, it would be fine to include a note that we are somehow sad about these parts of the grammar, ...") I haven't included such notes in my commits so far because I'm hoping for some consensus on wording from the committee or the editors. Maybe I'll go back and put in stubs. [Later: Done.]

You're saying there's a benefit to having all that deprecated stuff in one place, separate from the "core" of the language. Maybe there is. But does it outweigh the costs? If you want to convince the committee of that, issue #1595 might be the more appropriate spot. (This PR is just the first among presumably many.)

Worse, some of the things in B.1 do not parse unambiguously — the order in which the grammar rules are written matters

Yeah, that's true. And it's icky from a theoretical standpoint. But given that that's what implementers have to implement, what's the benefit in hiding the ick in an annex?

— and it would do terrible violence to the main grammar's readability to make its rules order-dependent.

So we mostly won't do that. There's only 6 productions (all from B.1.4 RegExps) that actually rely on order-disambiguation, so those will be the only ones that call for it. (I'll be suggesting ::! as the relevant connector, but that's up for bikeshedding.)

@waldemarhorwat
Copy link

Unlike the main body grammar, the Annex B grammar is not a valid LR(1) grammar; in fact, it's not a valid LR(n) grammar for any n. This means that it cannot be validated.

The bigger problem is that integrating the Annex B grammar into the main grammar is infectious and makes it impossible to validate the rest of the main grammar. Claiming that only a few productions introduce order dependence that cause the grammar to be invalid is true but irrelevant — validity is a property of the entire grammar, and the entire grammar becomes invalid. With the loss of validity we'd lose crucial invariants for the entire grammar, such as:

  • If you want to see how some text parses, all you need is a tree of nonterminal productions from the grammar that recursively expand into your text. You can then look up the semantic rules for each production and then read off the semantics, any additional static semantic rules, etc.

This invariant is so obvious that most people take it for granted. If we proceed with this integration, we'll lose it for the main grammar.

Losing this invariant, in addition to impacting readability, will make the grammar unmaintainable. At that point there is no way to contain the damage to the main grammar and prevent it from spreading. When adding new syntax extensions, it's a bad idea for a number of practical reasons to introduce additional order dependencies, but we would have no way of knowing that if we can't validate the grammar.

I agree with the comments above that not realizing that some grammar productions are modified by Annex B can be an issue, but a much less dangerous solution to that is to just put notes next to the affected main grammar productions.

On the other hand, there will likely be future instances in which the Annex B grammar does get updated but is no longer a superset of the main grammar. That's easy to do unintentionally and results in actual bugs, and making this comparison of the two grammars is a crucial check of figuring out that the grammar someone is proposing for some new extension isn't doing what they think it's doing.

@waldemarhorwat
Copy link

I added a topic to the next meeting to discuss this. For now I'd appreciate it if the Annex B grammars were not merged into the main grammar.

@bakkot
Copy link
Contributor

bakkot commented Aug 22, 2019

the Annex B grammar is not a valid LR(1) grammar

I believe this is only true of the modified Pattern grammar, which is a distinct grammar and not a part of the main grammar. Any invariants which are true of the main grammar would remain true, and it should remain as easy to validate them as it currently is.

(The Pattern grammar is already significantly complicated by the throw-away-and-reparse behavior required to parse named capturing groups, so it already has fewer invariants than the main grammar.)

there will likely be future instances in which the Annex B grammar does get updated but is no longer a superset of the main grammar. That's easy to do unintentionally and results in actual bugs

I think the reverse problem, where the grammars get out of sync because they are updated incorrectly, is more common and more severe: see #1102, etc. If there is only a single grammar to maintain, all that needs to be checked is that each update to the grammar does not render any previously valid source text invalid - a check which already needs to occur for any grammar change. I don't see the advantage of maintaining two copies for specifically this subset of the grammar.

@erights
Copy link

erights commented Aug 23, 2019

I disagree with @waldemarhorwat 's conclusion, i.e., I continue to support moving all the grammar into the main text. However, I agree with @waldemarhorwat 's procedural request. The only content I made clear on the agenda prior to the Berlin meeting was "Annex B Reform". I was aware that Waldemar would likely argue and had prepared to counter-argue.

However, I was not prepared for Waldemar's absence. Given that absence, it did not occur to me at the time to explicitly postpone declaration of consensus until checking with Waldemar. I would expect such a courtesy were others to discuss, without warning, an issue known to be central to my concerns. I should have extended this same courtesy to Waldemar, and wish to do so now, retroactively.

Let us consider this issue reopened until the next meeting.

@jmdyck
Copy link
Collaborator Author

jmdyck commented Aug 23, 2019

Unlike the main body grammar, the Annex B grammar is not a valid LR(1) grammar;

I'd dispute that the main-body grammar is LR(1), because it isn't technically a CFG to begin with. But I'll grant that it has one less non-CFG-ism than the Annex B grammar.

The bigger problem is that integrating the Annex B grammar into the main grammar is infectious and makes it impossible to validate the rest of the main grammar.

As @bakkot points out, order-disambiguation is confined to B.1.4's extensions to the Pattern grammar. Are you saying that this infection can cross the gap between Pattern and RegularExpressionBody? And then cross the gap between the lexical and syntactic grammars?

With the loss of validity we'd lose crucial invariants for the entire grammar, such as:

If you want to see how some text parses, all you need is a tree of nonterminal productions from the grammar that recursively expand into your text. You can then look up the semantic rules for each production and then read off the semantics, any additional static semantic rules, etc.

ES lost that invariant ages ago. (Probably never had it.) For example:

  • Any text that relies on ASI has no such tree of productions, by definition.
  • Any text that relies on a cover grammar has a tree, but you can't read off the semantics from that tree, because the semantics aren't associated with the cover grammar.
  • Any text that has a 'dangling else' has multiple such trees, most of which are wrong semantically.

Losing this invariant, in addition to impacting readability, will make the grammar unmaintainable.

As @bakkot said, the separation of main-body from Annex B is already the source of maintenance bugs.

When adding new syntax extensions, it's a bad idea for a number of practical reasons to introduce additional order dependencies, but we would have no way of knowing that if we can't validate the grammar.

The ambiguity of IfStatements doesn't prevent us from detecting a new ambiguity. It seems to me that similarly, existing order dependencies don't prevent us from detecting a new one.

On the other hand, there will likely be future instances in which the Annex B grammar does get updated but is no longer a superset of the main grammar. That's easy to do unintentionally and results in actual bugs, and making this comparison of the two grammars is a crucial check of figuring out that the grammar someone is proposing for some new extension isn't doing what they think it's doing.

But apparently, that crucial check isn't happening.


Ultimately, I think my point is this: What difference does it make if we can validate the main-body Pattern grammar if that's not the grammar that people have to implement?

If we can't validate the "real" grammar, then keeping Annex B separate isn't going to change that.

@waldemarhorwat
Copy link

ES lost that invariant ages ago. (Probably never had it.) For example:
* Any text that relies on ASI has no such tree of productions, by definition.

And due to that ASI continues to be an ongoing source of compatibility problems.

* Any text that relies on a cover grammar has a tree, but you can't read off the semantics from that tree, because the semantics aren't associated with the cover grammar.

You can; the semantics of the cover grammar tell you what to reparse it as.

* Any text that has a 'dangling else' has multiple such trees, most of which are wrong semantically.

The grammar could have been worded without the dangling else problem, but for historical reasons we decided not to; it's a straightforward and well-understood transformation and the version of the grammar I've been validating does include that transformation. Note that in more complex situations such as nesting function calls and new, we did write the grammar to avoid the "dangling ()" problem.

If we can't validate the "real" grammar, then keeping Annex B separate isn't going to change that.

We can validate the grammar in the main body of the spec. It's not possible to validate the Annex B grammar because invalid behavior is correct by definition. If we integrate it into the main body of the spec, then we won't be able to validate future changes to the main body of the spec.

@jmdyck
Copy link
Collaborator Author

jmdyck commented Aug 23, 2019

  • Any text that relies on ASI has no such tree of productions, by definition.

And due to that ASI continues to be an ongoing source of compatibility problems.

Indeed. I'm not saying ASI is good, I'm saying it refutes the invariant you claimed.

  • Any text that relies on a cover grammar has a tree, but you can't read off the semantics from that tree, because the semantics aren't associated with the cover grammar.

You can; the semantics of the cover grammar tell you what to reparse it as.

That seems a bit of a stretch, but I'll take it. So you're validating the covered grammars in addition to the covering grammars?

  • Any text that has a 'dangling else' has multiple such trees, most of which are wrong semantically.

The grammar could have been worded without the dangling else problem, but for historical reasons we decided not to; it's a straightforward and well-understood transformation and the version of the grammar I've been validating does include that transformation.

Okay, so you took the spec's grammar, transformed away an ambiguity, and validated the result. Are you sure you can't apply a similar process to the order dependencies of B.1.4?

If we can't validate the "real" grammar, then keeping Annex B separate isn't going to change that.

We can validate the grammar in the main body of the spec. It's not possible to validate the Annex B grammar because invalid behavior is correct by definition.

Are you saying that the Annex B grammar doesn't distinguish valid behavior from invalid? That seems patently false, so I have no idea what you mean.

If we integrate it into the main body of the spec, then we won't be able to validate future changes to the main body of the spec.

But if the main-body Pattern grammar isn't the grammar that people have to implement, what benefit is there to validating it?

Also, I'm still curious about these questions:

Are you saying that this infection can cross the gap between Pattern and RegularExpressionBody? And then cross the gap between the lexical and syntactic grammars?

@waldemarhorwat
Copy link

  • Any text that relies on ASI has no such tree of productions, by definition.

And due to that ASI continues to be an ongoing source of compatibility problems.

Indeed. I'm not saying ASI is good, I'm saying it refutes the invariant you claimed.

ASI is a separable concern. Just because the invariant fails for insertion of implicit semicolons doesn't mean that it doesn't apply to the grammar.

We can validate the grammar in the main body of the spec. It's not possible to validate the Annex B grammar because invalid behavior is correct by definition.

Are you saying that the Annex B grammar doesn't distinguish valid behavior from invalid? That seems patently false, so I have no idea what you mean.

I'm talking about validating a grammar to ensure that it has no ambiguities.

If we integrate it into the main body of the spec, then we won't be able to validate future changes to the main body of the spec.

But if the main-body Pattern grammar isn't the grammar that people have to implement, what benefit is there to validating it?

I already answered that. See my responses above.

@jmdyck
Copy link
Collaborator Author

jmdyck commented Aug 24, 2019

But if the main-body Pattern grammar isn't the grammar that people have to implement, what benefit is there to validating it?

I already answered that. See my responses above.

Well, I've re-read all your responses, but I didn't discern an answer to that particular question. Oh well.

@jmdyck
Copy link
Collaborator Author

jmdyck commented Aug 27, 2019

This is pretty much the sequence of commits I intend for the final PR. If reviewers would like a different breakdown, let me know.

  • 3 normative commits, one for each of B.1.{1,2,3}. 1 normative commit for B.1.3.

  • Then an interlude of 7 editorials:

    • 4 commits to reduce inconsistencies between the main body and Annex A (Grammar Summary), making it easier to keep them in synch going forward.

    • 3 commits to slightly reorganize 21.2 RegExp Objects to ease the subsequent merge of B.1.4. (At one point, I had more radical reorg ideas, but I scaled back.)

  • 1 normative commit to merge B.1.4.

  • 7 editorial commits to clean up the Pattern grammar (and associated semantics) after the merge.

I don't advise looking at the total diff; you can make sense of it for the first three merges, but after that it's too messy. I advise either looking at each commit individually, or just at the final result.


I've gone back and added syntax deprecation notes. For the Pattern grammar, the note is incomplete, because the difference between the two versions of the grammar is more that just a set of productions. I could flesh it out, but I don't want to bother until it's been decided that we're going to have such notes.


The HTML rendering is now here

@littledan
Copy link
Member

Great to see this PR. Thanks for your diligent work on it, @jmdyck .

I agree with @waldemarhorwat that the ECMAScript specification is for more than just browsers--it's also for tools, like the ones maintained by @erights and @bakkot , which suffer from the obscurity of the Annex B split. It's unfortunate that the ecosystem-compatible ECMAScript grammar is in the state as it is (I'm sure everyone would prefer to implement a simpler grammar), but pushing it off into Annex B only adds confusion.

I look forward to discussing this at the upcoming TC39 meeting. Probably close reviews of this PR (and detailed editorial decisions, like whether to mark formerly Annex B things as "sad") should wait until that discussion concludes.

@jmdyck jmdyck added the normative change Affects behavior required to correctly evaluate some ECMAScript source text label Sep 17, 2019
@jmdyck
Copy link
Collaborator Author

jmdyck commented Sep 17, 2019

I look forward to discussing this at the upcoming TC39 meeting.

And I look forward to reading the notes!

Probably close reviews of this PR (and detailed editorial decisions, like whether to mark formerly Annex B things as "sad") should wait until that discussion concludes.

Yup, that's what I figured after the comment above from @erights.

@jmdyck
Copy link
Collaborator Author

jmdyck commented Jun 15, 2021

force-pushed to:

spec.html Outdated
<ul>
<li>It is a Syntax Error if a |Module| contains the source code matching this production.</li>
</ul>
<emu-note>In a |Script|, this syntax is allowed, but deprecated.</emu-note>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to clarify in https://tc39.es/ecma262/#sec-syntactic-and-lexical-grammars that "deprecated/legacy" means "optional for implementation".

However, I am in line with @jmdyck about putting deprecated behavior under something like if HostImplementsOptional() instead, but without "units" for now.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to clarify in https://tc39.es/ecma262/#sec-syntactic-and-lexical-grammars that "deprecated/legacy" means "optional for implementation".

It doesn't mean that. See here for a pending definition of "legacy".

However, I am in line with @jmdyck about putting deprecated behavior under something like if HostImplementsOptional() instead, but without "units" for now.

That suggestion is for when you take stuff from Annex B and incorporate it into the main spec, keeping it normative-optional. This PR (and #1867) take stuff from Annex B and incorporate it into the main spec, but make it normative.

... namely
    Syntax for Patterns
and
    Static Semantics for Patterns

And for consistency, rename
    Pattern Semantics
to
    Runtime Semantics for Patterns

... so that clause-headers more precisely convey what content is where.
... namely:
 - Patterns
 - Group Specifiers
 - Character Classes
 - Escapes

(This moves productions around, but doesn't alter them at all.)

Also, rearrange Early Errors rules and runtime semantics rules
to reflect the same order of productions.
(Merge its Syntax, Static Semantics, and Runtime Semantics into the main body.)

(Part of Annex B reform, see PR tc39#1595.)
... by using 'QuantifiableAssertion' under [+U] too.
(This involves adding the [U] parameter to QuantifiableAssertion.)
... down to its proper place in production order.

(This commit's diff probably shows a complicated combination of tweaks,
but it's really just taking a block of 22 lines
and shifting it down the file.)
Note that:

ExtendedAtom was only ever 'invoked' under [~U],
and when the merged production is invoked with [~U],
it exactly reproduces the RHSs of ExtendedAtom.

Atom was only ever invoked under [+U],
and when the merged production is invoked with [+U],
it reproduces the RHSs of former Atom
except for the placement of the PatternCharacter RHS,
but that's okay, because former Atom wasn't an order-disambiguated production.
... by merging three pairs of RHSs.

(Preserves the order of alternatives under [~U], but not under [+U],
but that's okay, because the [+U] sides aren't order-disambiguated.)
... by merging the two capturing-group alternatives.

(This may be affected by the outcome of issue tc39#1673.)
Specifically, add [N] parameter to
    CharacterClass
    ClassRanges
    NonemptyClassRanges
    NonemptyClassRangesNoDash
    ClassAtom

These were implied when commit 95ec0c6 (of PR tc39#1027)...

- added [?N] to RHS occurrences of CharacterClass
  without explicitly adding [N] to the LHS occurrence CharacterClass; and

- added [N] to the LHS occurrence of ClassAtomNoDash (in Annex B)
  without adding [?N] to any RHS occurrence.

This commit propagates [N] across that gap.

(See issue tc39#1081.)
@jmdyck
Copy link
Collaborator Author

jmdyck commented Nov 7, 2022

I'm closing this PR, because:
(a) it seems unlikely to get committee approval;
(b) even if it does, I'd probably have to start from scratch; and
(c) PR #2445 is probably a better approach.

@jmdyck jmdyck closed this Nov 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
normative change Affects behavior required to correctly evaluate some ECMAScript source text
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants