New parser #346
Conversation
Hey folks. I'm not directly involved in this project lately, but I'm curious what about the current parser makes it so hard to emit diagnostic messages. I did think about that a bit when I was originally writing it and I didn't foresee too much difficulty. Is there any issue ticket somewhere that has some more discussion on this topic?
Also, were there any other issues, besides the diagnostic message thing, that are becoming a problem with the current parser design? I had foreseen some issues, which I documented in #3, but I'm curious if any other problems have become apparent since then. Also, I took some cues from the Rust project parser itself:
Some of the code in those modules is surprisingly readable. They take a different approach entirely to their parser architecture (no use of combinators, more traditional), but it's still a good place for inspiration or to answer some questions about how you might go about parsing things. Of course, Rust is very straightforward to lex, so it's kind of a different animal. Also of note, at least the last time I read through those modules, there's no dependence on external packages in the Rust parser. It's all custom parser code. That made me think that it might not be entirely unreasonable to avoid external dependencies in the parser.

At first, I was using the nom library to write the Fe parser. Eventually, it seemed like the generality and flexibility of nom (through its use of generics and parseable type traits and such) actually made it a pain to use. It necessitated adding one or two layers of abstraction to the parsing result types, plus lots of additional type variables and trait constraints on every parser signature. So I just copied their API and re-implemented their combinators in a way that was suited very specifically to our use of tokens coming out of the lexer. It meant that the result and cursor types and parser code could all be a lot simpler, and it didn't feel like a ton of code to incorporate directly into the Fe parser lib.

Anyway, I mention all this because I wonder if it's necessary to bring in too many external dependencies. Python is a weird language to lex, and I initially had a lot of trouble finding parsing or lexing libraries that seemed like they would accommodate the requirements of Python lexing easily. So I wonder: has the use of an external lib such as logos to do the lexing really ended up simplifying the parser rewrite? I suppose I could take a closer look at this PR and answer my own question, but I thought I'd pose it directly to the author to spur some discussion.

By the way, love that this project is getting so much attention. It's kind of a thrill to see so much of that initial work I did getting used for things. Maybe that's also why I have some of these questions about the parser rewrite: the original parser was one of my main contributions to this project!
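A minimal sketch of what that token-specific combinator approach could look like, i.e. a nom-style `many1` written directly against a concrete token cursor instead of nom's generic input traits. All names here (Cursor, Token, ParseResult, many1) are hypothetical and are not the actual Fe parser API; this is only an illustration of the design trade-off described above.

```rust
// Sketch only: a combinator specialized to one concrete cursor type, which
// keeps the result and cursor types simple compared to fully generic nom.

#[derive(Debug, Clone, PartialEq)]
enum Token {
    Name(String),
    Comma,
    ParenClose,
}

#[derive(Clone, Copy)]
struct Cursor<'a> {
    tokens: &'a [Token],
    pos: usize,
}

// The result type can stay simple because it's tied to a single cursor type.
type ParseResult<'a, T> = Result<(Cursor<'a>, T), String>;

impl<'a> Cursor<'a> {
    fn next(self) -> ParseResult<'a, Token> {
        match self.tokens.get(self.pos) {
            Some(tok) => Ok((Cursor { pos: self.pos + 1, ..self }, tok.clone())),
            None => Err("unexpected end of input".into()),
        }
    }
}

/// Apply `parser` one or more times, collecting the results (nom-like `many1`).
fn many1<'a, T>(
    parser: impl Fn(Cursor<'a>) -> ParseResult<'a, T>,
) -> impl Fn(Cursor<'a>) -> ParseResult<'a, Vec<T>> {
    move |start| {
        let (mut cursor, first) = parser(start)?;
        let mut items = vec![first];
        while let Ok((next, item)) = parser(cursor) {
            items.push(item);
            cursor = next;
        }
        Ok((cursor, items))
    }
}
```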
I should also mention that one goal I had with minimizing external deps was to ease the distribution of Fe as a WASM binary. Also, I had this sort of crazy idea that eventually Fe could even be a smart contract itself, since Ethereum had a plan to move towards WASM as the generally accepted binary format. If that ever happened, minimizing binary size (and thus external dependencies) would be nice. Of course, that's a pretty far-flung scenario at this point, but the question of binary size still seems like a relevant one for other reasons.
Hi @davesque, I've tasked myself with improving the error/diagnostic messages across the parser/analyzer/compiler. The goal is to (eventually) have rust/elm-quality diagnostics with nice underlines and hints. For the parser, we'd like to attempt to recover from some amount of bad syntax, continue parsing, and emit diagnostic messages for several issues at once, rather than stopping on the first. Longer term, I'd like this error recovery to be robust enough for an lsp-server/IDE use case, where the parser runs as the user enters code and always produces a syntax tree, perhaps with some Error nodes (this also requires the parser to be very fast).

I've made a couple of small changes to Fe's current parser, and I found it to be a fine experience; I can't say whether there have been any major pains. FWIW, my suspicion is that a more traditional recursive descent parser may be easier for newcomers to the codebase to understand and contribute to, because it's such boring code (or should be; I'm not claiming that my current code is necessarily super approachable yet), but I think the combinator style is fine to work with.

Despite my glib description in this PR, I spent a decent amount of time considering whether a rewrite was a good idea. I did see your #3 issue and took that into consideration. Pratt/TDOP parsing is a really nice way to handle operator precedence, and I expect the new parser implementation to be very quick, as we only need one token of lookahead and shouldn't need to backtrack. (But I doubt parsing speed will be an issue in practice anyway; Fe codebases probably won't be very big.)

Another thing I considered was the couple of bugs in the tokenizer that agroce's fuzzing turned up. There were easy fixes for these, but they did make me question whether we might be limited by the current design. E.g. #226: I'd like the parser to detect extraneous or unbalanced brackets/braces/parens, emit an error, and recover gracefully. To do so, it'll probably need some broader context of what's being parsed. The current situation is that the bracket mismatch is a lexer error, and the lexer probably doesn't/shouldn't have the contextual knowledge to try to recover from that error. So I thought that a reasonable solution to this particular problem would be to make the lexer dumb and track things like indentation and parens in the parser.

I also thought about how to handle contextual error recovery in the combinator-heavy parsing code. Maybe I'm lacking in creativity, but I think in my hands it would end up looking more like a boring recursive descent parser; e.g. we'd maybe replace some many1 calls with loops so we can recognize a syntax issue partway through, emit an error, skip past it, and carry on. Or maybe one could achieve smart error recovery while maintaining the functional style, but in my imagination that either looks like pretty cryptic code, or takes on a verbosity that basically negates the appeal of the combinator style.

I should say that I came into this with some bias against combinator-heavy parsing, specifically relating to error recovery and reporting. I worked on a previous language project, and after failing to achieve satisfactory errors from a generated parser (an old version of ANTLR), then failing again with a prototype combinator parser (using parsec), we decided to hand-roll a recursive descent parser with Pratt-style expression parsing, and we were quite happy with that decision.
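A rough sketch of the "loops instead of many1" style of recovery described above: collect what parses, record a diagnostic for what doesn't, skip ahead to a token we recognize, and keep going. All names here (Parser, Token, parse_arg_list) are made up for illustration and are not the actual Fe parser code.

```rust
// Sketch of loop-based error recovery with a synchronization point.

#[derive(Debug, Clone, PartialEq)]
enum Token {
    Name(String),
    Comma,
    ParenClose,
    Unexpected(char),
}

struct Parser {
    tokens: Vec<Token>,
    pos: usize,
    diagnostics: Vec<String>,
}

impl Parser {
    fn peek(&self) -> Option<&Token> {
        self.tokens.get(self.pos)
    }

    fn bump(&mut self) -> Option<Token> {
        let tok = self.tokens.get(self.pos).cloned();
        self.pos += 1;
        tok
    }

    /// Parse `name, name, ... )`, collecting every valid name and emitting a
    /// diagnostic (instead of bailing out) when something unexpected shows up.
    fn parse_arg_list(&mut self) -> Vec<String> {
        let mut args = Vec::new();
        loop {
            match self.bump() {
                Some(Token::Name(name)) => args.push(name),
                Some(Token::ParenClose) | None => break,
                Some(other) => {
                    // Record the problem and keep going, so later errors in
                    // the same file are still reported.
                    self.diagnostics
                        .push(format!("unexpected token: {:?}", other));
                    // Skip ahead to something we recognize (a comma or `)`).
                    while !matches!(
                        self.peek(),
                        Some(Token::Comma) | Some(Token::ParenClose) | None
                    ) {
                        self.pos += 1;
                    }
                }
            }
            // Consume a trailing comma, if present.
            if self.peek() == Some(&Token::Comma) {
                self.pos += 1;
            }
        }
        args
    }
}
```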
I don't doubt that it's possible to achieve good results using the combinator style, but I've banged my head against that wall and given up in the past, and that's certainly influencing the decision here. One could probably achieve good results with just about any parsing technique given enough effort, but I figure that since most first-class compilers have landed on hand-written recursive descent (gcc, clang, rustc, rust-analyzer, ...), doing the same is at least defensible, particularly if we're going to invest significant time in parser improvements.

Regarding external dependencies: as mentioned above, I decided to handle the indentation/paren state tracking in the parser and keep the lexer simple. When there's no tricky business like that in the lexer, IMO there's no reason not to outsource that fairly boring job to a library; logos claims to generate super fast code (it's all proc macro wizardry), it's easy to use, and it's no_std (and thus presumably wasm) friendly. I doubt it'll ever cause us trouble, but if it does, it'll be easy to replace with custom lexing code.

I'm also using codespan-reporting for the fancy underlining and code context printing; I believe it's wasm friendly. Neither of these is critical to the design, but they're both definitely worth using IMO.
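For reference, the general shape of a logos-based lexer, assuming a logos version from around the time of this PR (~0.12, which requires an #[error] catch-all variant; newer versions change this). The token set below is invented for illustration and is not Fe's actual lexer.

```rust
use logos::Logos;

// A tiny made-up token set, just to show the shape of a logos lexer.
#[derive(Logos, Debug, PartialEq)]
enum Token {
    #[token("fn")]
    Fn,
    #[token("(")]
    ParenOpen,
    #[token(")")]
    ParenClose,
    #[token(":")]
    Colon,
    #[regex("[a-zA-Z_][a-zA-Z0-9_]*")]
    Name,
    // Skip whitespace; anything unrecognized becomes an Error token.
    #[regex(r"[ \t\n]+", logos::skip)]
    #[error]
    Error,
}

fn main() {
    let mut lex = Token::lexer("fn foo():");
    while let Some(tok) = lex.next() {
        println!("{:?} {:?}", tok, lex.slice());
    }
}
```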
Hey @sbillig. Thanks for that long and well-thought-out response. I'm really happy to hear that you put so much thought into this; given that, I have a lot fewer concerns. It sounds like you had already considered a lot of the things I mentioned. I think I agree that a hand-written recursive descent parser is nice since it's very straightforward to implement, and, as you mentioned (and as I had also previously seen in my own work), rustc uses one and the code for it is pretty readable. So no real concerns there. Also, your explanation for moving the paren and indentation handling into the parser makes sense given your goal of allowing the parser to recover from errors and output more messages at once.

Anyway, I just wanted a chance to voice some of my reactions to this since I was so heavily involved in this part of the code. Based on your response, it sounds like the work is in good hands.
@sbillig By the way, I'm interested in helping with this work if you could use another person on it.
@davesque That would be great! The current state is basically that it works (aside from not supporting some stuff that the rest of the compiler doesn't support yet anyway), but it needs more tests, a lot of the error messages are pretty basic, and it could generally use some polish. I'm currently working on swapping the new parser in and tying up a few loose ends to get it ready for review. I'm hoping we can merge it soon to unblock things like #337 and #333, and make it easier for you and whoever else to jump in.
@cburgdorf @g-r-a-n-t Ok, this should be ready now. Like I mentioned, I plan to keep iterating on this for a while, but I'd like to do it in separate PRs; this one is big enough. Also, there should be more parser tests, but I'd like to work on some nicer ast printing and/or fe code generation (pretty printing) first, so the parser (and lowering) test fixtures are easier to comprehend. I'll create a bunch of gh issues for the missing functionality and ast refactorings that are on my mental todo list.
Codecov Report
```
@@            Coverage Diff             @@
##           master     #346       +/-   ##
===========================================
- Coverage   92.81%   82.80%   -10.01%
===========================================
  Files          59       62        +3
  Lines        4022     3862      -160
===========================================
- Hits         3733     3198      -535
- Misses        289      664      +375
```
Continue to review full report at Codecov.
Amazing work 👏 ❤️
Left some inline comments but all just minor stuff. I have to confess that I rushed through it (I have a limited time budget today). I'll try to give it another pass in the evening when I have more time.
common/src/files.rs
```rust
#[derive(PartialEq, Copy, Clone, Eq, Hash)]
pub struct SourceFileId(u128);
```
Any specific reason for choosing u128 here? Just efficiency, because it should be big enough? I would probably have chosen usize here, so I'm just curious 😄
No great reason. I thought it would be reasonable to use the content hash as the file id (though I'm not convinced that's ideal; rustc just uses the file path, I think). keccak256 is available in the same crate, and u128 is the biggest built-in number, so I figured I'd just take half a keccak, stick it in the u128, and call it good :)
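A sketch of that "half a keccak in a u128" scheme, using the tiny-keccak crate (with its keccak feature) purely for illustration; the actual Fe code may use a different keccak helper, and the id scheme itself was described as tentative above.

```rust
use tiny_keccak::{Hasher, Keccak};

#[derive(PartialEq, Eq, Copy, Clone, Hash, Debug)]
pub struct SourceFileId(u128);

impl SourceFileId {
    /// Hash the file contents and keep the first half of the 32-byte digest.
    pub fn from_content(content: &str) -> Self {
        let mut hasher = Keccak::v256();
        hasher.update(content.as_bytes());
        let mut digest = [0u8; 32];
        hasher.finalize(&mut digest);

        // Pack the first 16 bytes of the digest into a u128.
        let mut half = [0u8; 16];
        half.copy_from_slice(&digest[..16]);
        SourceFileId(u128::from_be_bytes(half))
    }
}
```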
@cburgdorf I think I've addressed everything; thanks for the feedback! I also noticed a missing error message (on a missing colon in …)
Awesome 🎊
Looking forward to pulling this down and checking out the error messages. I'm good with merging this.
```rust
#[derive(Serialize, Deserialize, Debug, PartialEq, Clone)]
pub enum FuncStmt {
    Return {
        value: Option<Node<Expr>>,
    },
    VarDecl {
        target: Node<Expr>, // TODO: change to String
        // ... (excerpt truncated)
```
I imagine this would prevent us from destructuring in variable declarations.
(foo, bar): (u256, u256) = self.get_my_tuple()
Maybe it's worth creating a special expression kind for the subset of expressions that may be used as targets in declarations. For now, that would just be single identifiers and multiple identifiers in a tuple.
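One possible shape for that suggestion, as a sketch only: the names VarDeclTarget and Node are illustrative and not necessarily what the codebase ends up using, and Node here only approximates Fe's span-carrying AST wrapper.

```rust
// Stand-in for Fe's span-carrying AST wrapper.
pub struct Node<T> {
    pub kind: T,
    // span fields omitted for brevity
}

// A dedicated declaration-target kind: either a single name, or a tuple of
// nested targets, which covers `(foo, bar)` and allows further nesting.
pub enum VarDeclTarget {
    Name(String),
    Tuple(Vec<Node<VarDeclTarget>>),
}
```

Typing the target this way would let the parser reject arbitrary expressions in declaration position up front, rather than leaving that check to the analyzer.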
I also looked this whole thing over just now and I'm pretty psyched to see how everything looks. Very readable. Amazing work, @sbillig !
Awesome! I went ahead and merged it. Amazing work @sbillig ❤️
What was wrong?
It'd be very tricky to emit nice diagnostic messages from the existing parser. (See discussion below)
How was it fixed?
A complete rewrite, of course!
To-Do
OPTIONAL: Update Spec if applicable
Add entry to the release notes (may forgo for trivial changes)
Clean up commit history