src/grammar for the canonical grammar of the Rust language #1331

nagisa · 2015-10-21T21:48:32Z

steveklabnik · 2015-10-21T22:42:30Z

👍 😍 💯

bstrie · 2015-10-27T03:17:41Z

The semi-canonical step may be unnecessary. I don't expect there to be very many discrepancies between what we think the grammar is and what the grammar actually is, so we could just switch to the fully canonical stage directly. Even if a discrepancy is discovered after that point, all it would take would be an RFC to determine whether it should result in a change to the grammar or a change to the parser (obviously taking into account how much breakage would result from a change to the parser).

matklad · 2015-11-03T15:09:41Z

I'm hugely interested in this, many thanks!

One question that bothers me is how the consistency between libsyntax and src/grammar is to be enforced? I am afraid that if automatic comparison of parsers is not a part of the test suite, then the grammar will never become canonical (or the rust parser will never become conforming, if you prefer this half of the glass).

And if we speak of test suite, it would be ideal if it is implementation independent and included in something like ctrs.

steveklabnik · 2015-11-03T15:21:56Z

We have vestigial grammar tests still in the makefiles, I think.

Aatch · 2015-11-06T03:59:10Z

I share @matklad's concerns. It'd be awesome to have a canonical grammar, but maintaining consistency seems like a major challenge.

I'm not sure how we could actually go about testing it in a rigorous manner. Since the compiler's parser doesn't have to parse in the same way a generated parser (from the grammar) would, figuring out whether or not they are consistent with each other seems difficult.

That said, less-rigorous testing could be done with a combination of fuzzing and regression tests. The grammar could be used to generate syntactically-correct (if nonsensical) code which is run through the compiler's parser. If the sample fails to parse, we can look at it and decide if it's a "real" failure or not. If it is, we can fix the bug and use the sample (or a reduced version) as the regression test. The downside here is that it doesn't enforce the same parse. The compiler could parse a code fragment without error, but not actually be correct. I can't see this being a problem though.

With all that said, some discussion around testing the grammar and parser would be nice to see in the RFC proper.

steveklabnik · 2015-11-06T12:51:37Z

For reference https://github.com/rust-lang/rust/blob/master/src/grammar/check.sh and https://github.com/rust-lang/rust/blob/master/src/grammar/verify.rs and other stuff in src/grammar is how we used to do the testing.

matklad · 2015-11-06T17:01:20Z

@steveklabnik I thought those are only lexer tests

steveklabnik · 2015-11-06T17:48:15Z

They at least used to be referred to as "grammar tests", though maybe I cleared some of them out and they're not there anymore. We were like, using antlr and such...

nikomatsakis · 2015-11-06T20:31:52Z

text/0000-grammar-is-canonical.md

+
+This process is sure to become ambiguous over time as syntax is increasingly adjusted (it is harder
+to “blame” syntax changes compared to syntax additions), therefore the resolution process of
+discrepancies will also depend more on a decision from the Rust team.


Nit: this should be the final decision of the language subteam, I think. Seems like the most logical decision making body for this sort of question. (And it's already what we do in practice.)

nikomatsakis · 2015-11-06T20:36:46Z

@nagisa I am very positive on this idea. For some time now I have been wanting to have a grammar changes written up more rigorously. I think your plan is pretty coherent -- basically there's going to be a messy period where we have to make ad-hoc judgement calls on what is the 'canonical' grammar for a bit of time, but eventually we'll get to a clean state.

I do think it's important that we add automatic checking of the Rust grammar against the in-tree sources (at least). Ideally we'd maybe even do crater runs or something, but you gotta run before you can walk.

One question that I am wondering is how the in-tree grammar compares to e.g. the rust-grammar project in terms of completeness? (Also, is one project based on the other?) I know that @rprichard has been filing various issues about discrepancies between what rustc does and what rust-grammar does, so they must have some automated means of testing it.

nikomatsakis · 2015-11-06T20:37:48Z

cc @rprichard @bleibig @dgrunwald as people who seem likely to have an opinion on this RFC and related matters.

nagisa · 2015-11-06T20:45:47Z

One question that I am wondering is how the in-tree grammar compares to e.g. the rust-grammar project in terms of completeness?

As far as I can see our in-tree LARL grammar is pretty much a copy of this project (sans the changes that have been made to the grammar). I suspect, without much digging, that it was imported from the rust-grammar project itself.

nagisa · 2015-11-06T20:54:41Z

I’m aware it is usually possible to generate valid inputs from a given grammar (or its state machine), but there’s no existing attempts to do this as far as I know. While that sounds like a nice way to test the parser conformance, I’m not even sure how complex the implementation of such tester would be.

Answers to similar question are also pretty hand-wavy.

nikomatsakis · 2015-11-11T14:36:11Z

@nagisa

I’m aware it is usually possible to generate valid inputs from a given grammar (or its state machine), but there’s no existing attempts to do this as far as I know. While that sounds like a nice way to test the parser conformance, I’m not even sure how complex the implementation of such tester would be.

ISTM that the rust compiler, libs, and crates.io provide plenty of sample inputs :)

dgrunwald · 2015-11-11T14:59:58Z

ISTM that the rust compiler, libs, and crates.io provide plenty of sample inputs :)

Different grammar productions can interact in surprising ways.

For example, the operator as allows invoking a lambda expression without parenthesizing the lambda expression:

fn main() {
    || println!("Hello World") as ()()
}

It's not obvious that this is even possible (and it probably shouldn't be), so I'd be surprised if you can find such code in the wild.

matklad · 2015-11-11T15:09:59Z

@nikomatsakis sample inputs will allow to verify that the formal grammar accepts what rustc accepts. But even the trivial grammar can do this, namely, the grammar of all token sequences.

The hard problem is to make sure that the formal grammar does not accept invalid Rust. As I understand @Aatch suggestion was about this direction of verification.

And the second hard problem is to make sure that both the grammar and the rustc produce similar parse trees: you can easily mix up operator precedence, for a start.

nikomatsakis · 2015-11-11T15:24:35Z

@dgrunwald @matklad yes, ok, right, forgot about @Aatch's question. Nonetheless, I think that comparing the two against all the Rust code we can get our hands on (and yes, comparing the parse trees, which seems eminently doable) provides a fairly high degree of confidence that we are parsing what we want to, despite the theoretical disparities. Certainly generating inputs would let us go further (though we'd presumably be filtering these inputs against the rustc impl, which still means that there may be cases where both do the wrong thing in the same way).

graydon · 2015-11-12T14:39:21Z

I'd be interested in taking this a step further and experimenting with a grammar-generated frontend. I honestly assumed that was part of Niko's plan with LALRPOP.

nikomatsakis · 2015-11-12T14:41:26Z

@graydon

I'd be interested in taking this a step further and experimenting with a grammar-generated frontend. I honestly assumed that was part of Niko's plan with LALRPOP.

It is, of course. =)

nagisa · 2015-11-12T15:05:52Z

I'd be interested in taking this a step further and experimenting with a grammar-generated frontend. I honestly assumed that was part of Niko's plan with LALRPOP.

To me it sounds like generating parser is a natural step to take after we have a proper (canonical and verified to be correct) grammar. We need some way to ensure a grammar matches current implementation (and/or becomes the canonical representation of the Rust’s grammar) before we migrate rustc to a parser generator of any sort, don’t we?

However, that sounds like something that would need to be discussed once we have a generator itself ready, right?

dgrunwald · 2015-11-12T17:30:36Z

Why would we migrate rustc to a generated parser?

Good messages for syntax errors are so much easier in a hand-written recursive descend parser.

Also, LR parsers are especially problematic for use in IDEs: if you need a (partial) syntax tree for code completion, you'll have to puzzle it together from the partially matched productions remaining on the parser's stack. Or (equivalently) introduce a whole bunch of error productions, and you'll always forget some and have cases where nodes aren't integrated into the syntax tree and code completion suffers. This stuff will be much much easier with the existing hand-written parser (since it's LL-style, there's always an obvious choice for connecting the current node to the root of the tree when you encounter an error).

nikomatsakis · 2015-11-12T21:00:25Z

@dgrunwald Note that we did not necessarily say an LR parser. I was hoping
for LL(k) or maybe at worst GLL. But I agree that the benefits of an
automatic parser do not necessarily outweigh the downsides.

On Thu, Nov 12, 2015 at 12:30 PM, Daniel Grunwald notifications@github.com
wrote:

Why would we migrate rustc to a generated parser?

Good messages for syntax errors are so much easier in a hand-written
recursive descend parser.

Also, LR parsers are especially problematic for use in IDEs: if you need a
(partial) syntax tree for code completion, you'll have to puzzle it
together from the partially matched productions remaining on the parser's
stack. Or (equivalently) introduce a whole bunch of error productions, and
you'll always forget some and have cases where nodes aren't integrated into
the syntax tree and code completion suffers. This stuff will be much much
easier with the existing hand-written parser (since it's LL-style, there's
always an obvious choice for connecting the current node to the root of the
tree when you encounter an error).

—
Reply to this email directly or view it on GitHub
#1331 (comment).

dgrunwald · 2015-11-13T00:37:32Z

comparing the parse trees, which seems eminently doable

Careful there; you seem to be assuming that it will be possible to specify an unambiguous formal grammar that matches rustc's behavior (modulo rustc bugs that we can fix without breaking the world).

I'm not sure if this is possible. Some things are easy to specify in English (and in a hand-written parser), but hard in a formal grammar.

For example, take this small grammar, which looks like a tiny subset of rust's expression grammar (allowing integer literals, the "as" operator, and less than operator):

type ::= identifier | identifier "<" type ">"
expr ::= int_literal | expr "as" type | expr "<" expr

But this grammar already accepts some expressions that rustc does not: e.g. 1 as i32 < 2.

It's easy to say: "A '<' token after a type name is always considered to be part of the type." But doing this in the formal grammar is not so easy:

type ::= identifier | identifier "<" type ">"
type_LT_follow ::= identifier "<" type ">"
expr ::= int_literal
    | expr "as" type
    | expr_LT_follow "<" expr
expr_LT_follow ::= int_literal
    | expr "as" type_LT_follow
    | expr_LT_follow "<" expr_LT_follow

Note that we just duplicated the number of productions! A few more such disambiguation rules, and our grammar increases exponentially in size. It becomes unreadable and unmaintainable.

Keeping the disambiguation rule in plain English is much better. This is also the option taken in the C# language specification: the formal grammar in the appendix is just a rough approximation, and sections like "7.6.4.2 Grammar ambiguities" explain how the parser should handle specific tricky cases.

But if the formal grammar is just an approximation, that makes automated consistency checks rather tricky...

nikomatsakis · 2015-11-13T00:55:23Z

@dgrunwald

Note that we just duplicated the number of productions! A few more such
disambiguation rules, and our grammar increases exponentially in size. It
becomes unreadable and unmaintainable.

We'll have to see how it goes. I'll note though that LALRPOP's macros
include an explicit feature for just this case, but I've not tried using it
"at scale" yet. This is the kind of thing I was thinking of when designing
it, though. (In particular, they let you specify flags that can filter
which productions are used, etc.) I seem to recall that JavaScript uses a
similar mechanism for cases like this in their spec, but I don't recall for
certain.

On Thu, Nov 12, 2015 at 7:37 PM, Daniel Grunwald notifications@github.com
wrote:

comparing the parse trees, which seems eminently doable

Careful there; you seem to be assuming that it will be possible to specify
an unambiguous formal grammar that matches rustc's behavior (modulo rustc
bugs that we can fix without breaking the world).

I'm not sure if this is possible. Some things are easy to specify in
English (and in a hand-written parser), but hard in a formal grammar.

For example, take this small grammar, which looks like a tiny subset of
rust's expression grammar (allowing integer literals, the "as" operator,
and less than operator):

type ::= identifier | identifier "<" type ">"
expr ::= int_literal | expr "as" type | expr "<" expr

But this grammar already accepts some expressions that rustc does not:
e.g. 1 as i32 < 2.

It's easy to say: "A '<' token after a type name is always considered to
be part of the type." But doing this in the formal grammar is not so easy:

type ::= identifier | identifier "<" type ">"
type_LT_follow ::= identifier "<" type ">"
expr ::= int_literal
| expr "as" type
| expr_LT_follow "<" expr
expr_LT_follow ::= int_literal
| expr "as" type_LT_follow
| expr_LT_follow "<" expr_LT_follow

Note that we just duplicated the number of productions! A few more such
disambiguation rules, and our grammar increases exponentially in size. It
becomes unreadable and unmaintainable.

Keeping the disambiguation rule in plain English is much better. This is
also the option taken in the C# language specification: the formal grammar
in the appendix is just a rough approximation, and sections like "7.6.4.2
Grammar ambiguities" explain how the parser should handle specific tricky
cases.

But if the formal grammar is just an approximation, that makes automated
consistency checks rather tricky...

—
Reply to this email directly or view it on GitHub
#1331 (comment).

rprichard · 2015-11-13T08:39:02Z

@dgrunwald

It's easy to say: "A '<' token after a type name is always considered to be part of the type." But doing this in the formal grammar is not so easy:

FWIW, the parser-lalr grammar does seem to handle this ambiguity correctly using precedence directives rather than duplicating the expression non-terminal. I'm guessing a precedence directive is also the reason this parser-lalr bug isn't caught at compile-time.

glaebhoerl · 2015-11-18T10:56:22Z

This seems relevant.

nikomatsakis · 2015-12-21T10:06:03Z

(Nominating for final comment period discussion, since it seems like we've reached a fixed point here in this comment thread.)

nikomatsakis · 2016-01-08T17:20:40Z

Hear ye, hear ye! This RFC is now entering final comment period.

jarcane · 2016-01-12T10:03:08Z

This is potentially very far off in future times, but would having a canonical grammar like this make it easier for people to develop alternate implementations?

nikomatsakis · 2016-01-15T09:48:11Z

@jarcane

This is potentially very far off in future times, but would having a canonical grammar like this make it easier for people to develop alternate implementations?

It would certainly help, though probably not by THAT much.

nikomatsakis · 2016-01-15T19:34:41Z

Huzzah! This RFC has been accepted by the language design subteam.

nagisa changed the title ~~src/grammar for the canonical gramar of the Rust language~~ src/grammar for the canonical grammar of the Rust language Oct 21, 2015

nagisa force-pushed the grammar branch from 6e25ec4 to 5e0efc2 Compare October 21, 2015 21:50

First draft on making src/grammar the grammar

44e88d3

nagisa force-pushed the grammar branch from 5e0efc2 to 44e88d3 Compare October 21, 2015 22:02

nrc added the T-lang Relevant to the language team, which will review and decide on the RFC. label Oct 22, 2015

nikomatsakis self-assigned this Oct 22, 2015

nikomatsakis reviewed Nov 6, 2015
View reviewed changes

nikomatsakis added the I-nominated label Dec 21, 2015

nikomatsakis added final-comment-period Will be merged/postponed/closed in ~10 calendar days unless new substational objections are raised. and removed I-nominated labels Jan 8, 2016

nikomatsakis mentioned this pull request Jan 15, 2016

Canonical Rust grammar distinct from parser (tracking issue for RFC #1331) rust-lang/rust#30942

Closed

nikomatsakis added a commit to nikomatsakis/rfcs that referenced this pull request Jan 15, 2016

merge and rename RFC rust-lang#1331

a2968d4

nikomatsakis merged commit 44e88d3 into rust-lang:master Jan 15, 2016

chriskrycho mentioned this pull request Mar 29, 2017

Document all features rust-lang/reference#9

Closed

48 tasks

ehuss mentioned this pull request Oct 11, 2018

Post a list of useful/interesting links? rust-lang/wg-grammar#8

Closed

Centril added the A-syntax Syntax related proposals & ideas label Nov 23, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

src/grammar for the canonical grammar of the Rust language #1331

src/grammar for the canonical grammar of the Rust language #1331

nagisa commented Oct 21, 2015

steveklabnik commented Oct 21, 2015

bstrie commented Oct 27, 2015

matklad commented Nov 3, 2015

steveklabnik commented Nov 3, 2015

Aatch commented Nov 6, 2015

steveklabnik commented Nov 6, 2015

matklad commented Nov 6, 2015

steveklabnik commented Nov 6, 2015

nikomatsakis Nov 6, 2015

nikomatsakis commented Nov 6, 2015

nikomatsakis commented Nov 6, 2015

nagisa commented Nov 6, 2015

nagisa commented Nov 6, 2015

nikomatsakis commented Nov 11, 2015

dgrunwald commented Nov 11, 2015

matklad commented Nov 11, 2015

nikomatsakis commented Nov 11, 2015

graydon commented Nov 12, 2015

nikomatsakis commented Nov 12, 2015

nagisa commented Nov 12, 2015

dgrunwald commented Nov 12, 2015

nikomatsakis commented Nov 12, 2015

dgrunwald commented Nov 13, 2015

nikomatsakis commented Nov 13, 2015

rprichard commented Nov 13, 2015

glaebhoerl commented Nov 18, 2015

nikomatsakis commented Dec 21, 2015

nikomatsakis commented Jan 8, 2016

jarcane commented Jan 12, 2016

nikomatsakis commented Jan 15, 2016

nikomatsakis commented Jan 15, 2016

src/grammar for the canonical grammar of the Rust language #1331

src/grammar for the canonical grammar of the Rust language #1331

Conversation

nagisa commented Oct 21, 2015

steveklabnik commented Oct 21, 2015

bstrie commented Oct 27, 2015

matklad commented Nov 3, 2015

steveklabnik commented Nov 3, 2015

Aatch commented Nov 6, 2015

steveklabnik commented Nov 6, 2015

matklad commented Nov 6, 2015

steveklabnik commented Nov 6, 2015

nikomatsakis Nov 6, 2015

Choose a reason for hiding this comment

nikomatsakis commented Nov 6, 2015

nikomatsakis commented Nov 6, 2015

nagisa commented Nov 6, 2015

nagisa commented Nov 6, 2015

nikomatsakis commented Nov 11, 2015

dgrunwald commented Nov 11, 2015

matklad commented Nov 11, 2015

nikomatsakis commented Nov 11, 2015

graydon commented Nov 12, 2015

nikomatsakis commented Nov 12, 2015

nagisa commented Nov 12, 2015

dgrunwald commented Nov 12, 2015

nikomatsakis commented Nov 12, 2015

dgrunwald commented Nov 13, 2015

nikomatsakis commented Nov 13, 2015

rprichard commented Nov 13, 2015

glaebhoerl commented Nov 18, 2015

nikomatsakis commented Dec 21, 2015

nikomatsakis commented Jan 8, 2016

jarcane commented Jan 12, 2016

nikomatsakis commented Jan 15, 2016

nikomatsakis commented Jan 15, 2016