
New parser #346

Merged · 4 commits, merged into ethereum:master on Apr 24, 2021

Conversation

@sbillig (Collaborator) commented Mar 29, 2021

What was wrong?

It'd be very tricky to emit nice diagnostic messages from the existing parser. (See discussion below)

How was it fixed?

A complete rewrite, of course!

To-Do

  • OPTIONAL: Update Spec if applicable

  • Add entry to the release notes (may forgo for trivial changes)

  • Clean up commit history

@davesque (Contributor) commented:

Hey folks. I haven't been directly involved in this project lately, but I'm curious what about the current parser makes it so hard to emit diagnostic messages. I did think about that a bit when I was originally writing it and I didn't foresee too much difficulty. Is there any issue ticket somewhere that has some more discussion on this topic?

@davesque (Contributor) commented Apr 17, 2021

Also, were there any other issues, besides the diagnostic message thing, that are becoming a problem with the current parser design? I had foreseen some of those, which I documented in #3. But I'm curious if any other problems have become apparent since then. Also, I took some cues from the Rust project parser itself:

Some of the code in those modules is surprisingly readable. They take an entirely different approach to their parser architecture (no use of combinators, more traditional), but it's still a good place for inspiration or to answer some questions about how you might go about parsing things. Of course, Rust is very straightforward to lex, so it's kind of a different animal.

Also of note, at least the last time I read through those modules, there's no dependence on external packages in the Rust parser. It's all custom parser code. That made me think that it might not be entirely unreasonable to avoid external dependencies in the parser. At first, I was using the nom library to write the Fe parser. Eventually, it seemed like the generality and flexibility of nom (through its use of generics and parseable type traits and such) actually made it a pain to use. It necessitated adding one or two layers of abstraction to the parsing result types and such and lots of additional type variables and trait constraints to all parser signatures. So I just copied their API and re-implemented their combinators in such a way that was suited very specifically to our use of tokens coming out of the lexer. It meant that the result and cursor types and parser code could all be a lot simpler. And it didn't feel like a ton of code to incorporate directly into the Fe parser lib.
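
For illustration, here is roughly what a combinator specialized to a plain token slice can look like once the nom-style generics are dropped; all of the names here (`Token`, `Cursor`, `ParseResult`, `many1`) are hypothetical stand-ins, not the actual Fe parser API:

```rust
// Hypothetical sketch of combinators specialized to a lexed token stream,
// rather than nom's generic input traits.
#[derive(Debug, Clone, PartialEq)]
struct Token {
    kind: String, // stand-in for a real token-kind enum
    text: String,
}

#[derive(Debug, Clone, Copy)]
struct Cursor<'a> {
    tokens: &'a [Token],
    pos: usize,
}

#[derive(Debug)]
struct ParseError {
    message: String,
    pos: usize,
}

/// A parser consumes a cursor and either produces a value plus the advanced
/// cursor, or fails with an error at a known position.
type ParseResult<'a, T> = Result<(Cursor<'a>, T), ParseError>;

/// `many1`: apply `parser` one or more times, collecting the results.
/// Because the input is just a slice of tokens, no extra trait machinery
/// (generic input types, parseable traits) is needed.
fn many1<'a, T>(
    mut parser: impl FnMut(Cursor<'a>) -> ParseResult<'a, T>,
) -> impl FnMut(Cursor<'a>) -> ParseResult<'a, Vec<T>> {
    move |input| {
        let (mut cursor, first) = parser(input)?;
        let mut results = vec![first];
        while let Ok((next, item)) = parser(cursor) {
            cursor = next;
            results.push(item);
        }
        Ok((cursor, results))
    }
}
```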

Anyway, I mention all this because I wonder if it's necessary to bring in too many external dependencies. Python is a weird language to lex and I had a lot of trouble initially finding parsing or lexing libraries that seemed like they would accommodate the requirements of python lexing easily. So I wonder, has the use of an external lib such as logos to do lexing really ended up simplifying the parser re-write? I suppose I could take a closer look at this PR and answer my own question, but I thought I'd just pose it directly to the author to spur some discussion.

By the way, love that this project is getting so much attention. It's kind of a thrill to see so much of that initial work I did getting used for things. Maybe that's also why I have some of these questions about the parser re-write: the original parser was one of my main contributions to this project!

@davesque (Contributor) commented:

I should also mention that one goal I had with minimizing external deps was to ease the distribution of Fe as a WASM binary. Also, I had this sort of crazy idea that eventually Fe could even be a smart contract itself, since Ethereum had a plan to move towards WASM as the generally accepted binary format. If that ever happened, minimizing binary size (and thus external dependencies) would be nice. Of course, that's a pretty far-flung scenario at this point. But the question of binary size still seems like a relevant one for other reasons.

@sbillig (Collaborator, Author) commented Apr 19, 2021

Hi @davesque,
The parser rewrite was instigated by me, and I talked Grant and Christoph into the idea :)

I've tasked myself with improving the error/diagnostic messages across the parser/analyzer/compiler. The goal is to (eventually) have rust/elm-quality diagnostics with nice underlines and hints. For the parser, we'd like to attempt to recover from some amount of bad syntax, continue parsing, and emit diagnostic messages for several issues at once, rather than stopping on the first. Longer term, I'd like this error recovery to be robust enough for use in an lsp-server/IDE use case, where the parser is running as the user enters code, and it always produces a syntax tree, perhaps with some Error nodes (this also requires the parser to be very fast).

I've made a couple of small changes to Fe's current parser, and found it to be a fine experience; I can't say there have been any major pains. FWIW, my suspicion is that a more traditional recursive descent parser may be easier for newcomers to the codebase to understand and contribute to, because it's such boring code (or should be; not claiming that my current code is necessarily super approachable yet), but I think the combinator style is fine to work with.

Despite my glib description in this PR, I spent a decent amount of time considering whether a rewrite was a good idea. I did see your #3 issue, and took that into consideration. Pratt/TDOP parsing is a really nice way to handle operator precedence, and I expect the new parser implementation to be very quick, as we only need one token of lookahead and shouldn't need to backtrack. (But I doubt parsing speed will be an issue in practice anyway; Fe codebases probably won't be very big.) Another thing I considered was the couple of bugs in the tokenizer that agroce's fuzzing turned up. There were easy fixes for these, but they did make me question whether we might be limited by the current design. Eg #226: I'd like the parser to detect extraneous or unbalanced brackets/braces/parens, emit an error, and recover gracefully. To do so, it'll probably need some broader context of what's being parsed. The current situation is that the bracket mismatch is a lexer error, and the lexer probably doesn't/shouldn't have the contextual knowledge to try to recover from that error. So I thought that a reasonable solution to this particular problem would be to make the lexer dumb, and track things like indent and parens in the parser.
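
As an aside for readers unfamiliar with the technique: a minimal Pratt-style (precedence-climbing) expression parser with a single token of lookahead might look like the sketch below. The token and AST types are made up for illustration and are not Fe's:

```rust
// Illustrative Pratt-style expression parsing with one token of lookahead.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Tok { Num(i64), Plus, Star, Eof }

#[derive(Debug)]
enum Expr {
    Num(i64),
    Binary(Box<Expr>, Tok, Box<Expr>),
}

struct Parser { tokens: Vec<Tok>, pos: usize }

impl Parser {
    fn peek(&self) -> Tok { self.tokens.get(self.pos).copied().unwrap_or(Tok::Eof) }
    fn bump(&mut self) -> Tok { let tok = self.peek(); self.pos += 1; tok }

    /// Binding power of an infix operator; `None` means "not an operator".
    fn infix_binding_power(tok: Tok) -> Option<u8> {
        match tok {
            Tok::Plus => Some(10),
            Tok::Star => Some(20),
            _ => None,
        }
    }

    /// Parse an expression whose operators all bind at least as tightly as `min_bp`.
    fn parse_expr(&mut self, min_bp: u8) -> Expr {
        // Prefix/primary position: only number literals in this sketch.
        let mut lhs = match self.bump() {
            Tok::Num(n) => Expr::Num(n),
            other => panic!("unexpected token {:?}", other),
        };
        // One token of lookahead decides whether to keep extending `lhs`.
        while let Some(bp) = Self::infix_binding_power(self.peek()) {
            if bp < min_bp { break; }
            let op = self.bump();
            let rhs = self.parse_expr(bp + 1); // +1 makes operators left-associative
            lhs = Expr::Binary(Box::new(lhs), op, Box::new(rhs));
        }
        lhs
    }
}

fn main() {
    // 1 + 2 * 3  =>  Binary(1, +, Binary(2, *, 3))
    let mut parser = Parser {
        tokens: vec![Tok::Num(1), Tok::Plus, Tok::Num(2), Tok::Star, Tok::Num(3)],
        pos: 0,
    };
    println!("{:?}", parser.parse_expr(0));
}
```

The appeal is that each precedence level is just a number in one table, and no backtracking is needed: the next token alone determines whether the loop continues.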

I also thought about how to handle contextual error recovery in the combinator-heavy parsing code. Maybe I'm lacking in creativity, but I think in my hands it would end up looking more like a boring recursive descent parser; eg we'd maybe replace some many1 calls with loops so we can recognize some syntax issue partway through, emit an error, skip past it and carry on. Or maybe one could achieve smart error recovery while maintaining the functional style, but in my imagination that either looks like pretty cryptic code, or takes on a verbosity that basically negates the appeal of the combinator style.
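
A toy, self-contained sketch of that "loop instead of many1" recovery style; all names and token kinds are illustrative, not Fe's:

```rust
#[derive(Debug, Clone, PartialEq)]
enum Tok { Let, Name(String), Newline }

#[derive(Debug)]
enum Stmt { Let(String) }

struct Parser { tokens: Vec<Tok>, pos: usize, errors: Vec<String> }

impl Parser {
    fn peek(&self) -> Option<&Tok> { self.tokens.get(self.pos) }
    fn bump(&mut self) -> Option<Tok> {
        let tok = self.peek().cloned();
        self.pos += 1;
        tok
    }

    // `let <name>` followed by a newline; anything else is an error we recover from.
    fn parse_stmt(&mut self) -> Result<Stmt, String> {
        match self.bump() {
            Some(Tok::Let) => {
                if let Some(Tok::Name(name)) = self.peek().cloned() {
                    self.bump();
                    Ok(Stmt::Let(name))
                } else {
                    Err("expected a name after `let`".to_string())
                }
            }
            other => Err(format!("expected a statement, found {:?}", other)),
        }
    }

    // Roughly what a `many1(parse_stmt)` becomes once recovery is wanted:
    // record the error, skip to a synchronization point (a newline), and
    // carry on, so later problems in the same block are reported too.
    fn parse_block(&mut self) -> Vec<Stmt> {
        let mut stmts = Vec::new();
        while self.peek().is_some() {
            match self.parse_stmt() {
                Ok(stmt) => {
                    stmts.push(stmt);
                    self.bump(); // consume the trailing newline
                }
                Err(msg) => {
                    self.errors.push(msg);
                    while !matches!(self.bump(), None | Some(Tok::Newline)) {}
                }
            }
        }
        stmts
    }
}

fn main() {
    let tokens = vec![
        Tok::Let, Tok::Name("x".into()), Tok::Newline,
        Tok::Let, Tok::Newline, // malformed: missing name
        Tok::Let, Tok::Name("y".into()), Tok::Newline,
    ];
    let mut parser = Parser { tokens, pos: 0, errors: Vec::new() };
    let stmts = parser.parse_block();
    println!("stmts: {:?}", stmts);          // both `x` and `y` statements survive
    println!("errors: {:?}", parser.errors); // one error for the malformed line
}
```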

I should say that I came into this with some bias against combinator-heavy parsing, specifically relating to error recovery and reporting. I worked on a previous language project, and after failing to achieve satisfactory errors from a generated parser (an old version of ANTLR), then failing again with a prototype combinator parser (using parsec), we decided to hand-roll a recursive descent parser with Pratt-style expression parsing, and were quite happy with that decision. I don't doubt that it's possible to achieve good results using the combinator style, but I've banged my head against that wall and given up in the past, and that's certainly influencing the decision here. One could probably achieve good results with just about any parsing technique with enough effort, but I figure that since most first-class compilers have landed on hand-written recursive descent (gcc, clang, rustc, rust-analyzer, ...), doing the same is at least defensible, particularly if we're going to invest significant time in parser improvements.

Regarding external dependencies: as mentioned above, I decided to handle the indentation/paren state tracking in the parser, and keep the lexer simple. When there's no tricky business like that in the lexer, IMO there's no reason not to outsource that fairly boring job to a library; logos claims to generate super fast code (it's all proc macro wizardry), it's easy to use, and is no_std (and thus presumably wasm)-friendly. I doubt it'll ever cause us trouble, but if it does it'll be easy to replace with custom lexing code. I'm also using codespan-reporting for the fancy underlining and code context printing; I believe it's wasm friendly. Neither of these are critical to the design, but they're both definitely worth using IMO.
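
For context, lexing with logos looks roughly like the sketch below, assuming a logos 0.12-era API (the `#[error]` variant and `logos::skip` callback); the token set here is invented for illustration and is not Fe's actual token enum:

```rust
use logos::Logos;

#[derive(Logos, Debug, PartialEq)]
enum Token {
    #[token("contract")]
    Contract,
    #[token(":")]
    Colon,
    #[regex("[a-zA-Z_][a-zA-Z0-9_]*")]
    Name,
    // Whitespace is skipped in this toy; Fe's indentation is significant, so
    // the real lexer/parser has to track it rather than throw it away.
    #[regex(r"[ \t\n]+", logos::skip)]
    #[error]
    Error,
}

fn main() {
    let mut lexer = Token::lexer("contract Foo:");
    while let Some(token) = lexer.next() {
        // `span()` gives the byte range and `slice()` the matched text —
        // enough to hand a diagnostics library a labeled source range.
        println!("{:?} {:?} {:?}", token, lexer.span(), lexer.slice());
    }
}
```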

@davesque (Contributor) commented:

Hey @sbillig . Thanks for that long and well thought out response. I'm really happy to hear that you put so much thought into this. Given that, I have a lot fewer concerns. It sounds like you had already considered a lot of the things I mentioned. I think I agree that a hand-written recursive descent parser is nice since it's very straightforward to implement. And, as you mentioned (and as I had also previously seen in my own work), rustc uses one and the code for it is pretty readable. So no real concerns there. Also, your explanation for moving the paren and indentation handling into the parser makes sense given your goal of allowing the parser to recover from errors and output more messages at once. Anyway, I just wanted a chance to voice some of my reactions to this since I was so heavily involved in this part of the code. Based on your response, it sounds like the work is in good hands.

@davesque (Contributor) commented Apr 20, 2021

@sbillig By the way, I'm interested in helping with this work if you could use another person on it.

@sbillig (Collaborator, Author) commented Apr 21, 2021

> @sbillig By the way, I'm interested in helping with this work if you could use another person on it.

@davesque That would be great! The current state is basically that it works (other than not supporting some stuff that the rest of the compiler doesn't support yet anyway); it needs more tests, a lot of the error messages are pretty basic, and it could generally use some polish. I'm currently working on swapping the new parser in and tying up a few loose ends to get it ready for review. I'm hoping we can merge it in soon to unblock things like #337 and #333, and make it easier for you and whoever else to jump in.

@sbillig force-pushed the newparser branch 3 times, most recently from ad04121 to 90b0aee, on April 23, 2021 04:19
@sbillig marked this pull request as ready for review on April 23, 2021 04:20
@sbillig (Collaborator, Author) commented Apr 23, 2021

@cburgdorf @g-r-a-n-t Ok, this should be ready now. Like I mentioned, I plan to keep iterating on this for a while, but I'd like to do it in separate PRs. This one is big enough. Also there should be more parser tests, but I'd like to work on some nicer ast printing and/or fe code generation (pretty printing) first, so the parser (and lowering) test fixtures are easier to comprehend.

I'll create a bunch of gh issues for the missing functionality and ast refactorings that are on my mental todo list.

@codecov-commenter commented Apr 23, 2021

Codecov Report

Merging #346 (b7fb706) into master (3f10a8f) will decrease coverage by 10.00%.
The diff coverage is 66.13%.


@@             Coverage Diff             @@
##           master     #346       +/-   ##
===========================================
- Coverage   92.81%   82.80%   -10.01%     
===========================================
  Files          59       62        +3     
  Lines        4022     3862      -160     
===========================================
- Hits         3733     3198      -535     
- Misses        289      664      +375     
Impacted Files Coverage Δ
analyzer/src/namespace/types.rs 86.12% <ø> (ø)
compiler/src/errors.rs 0.00% <0.00%> (-23.53%) ⬇️
parser/src/lexer/token.rs 11.11% <11.11%> (ø)
common/src/files.rs 32.14% <32.14%> (ø)
parser/src/parser.rs 47.32% <47.32%> (ø)
common/src/span.rs 58.82% <58.82%> (ø)
parser/src/grammar/module.rs 63.63% <63.63%> (ø)
compiler/src/lib.rs 78.94% <64.70%> (-11.38%) ⬇️
parser/src/grammar/contracts.rs 65.30% <65.30%> (ø)
parser/src/grammar/types.rs 73.33% <73.33%> (ø)
... and 23 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@cburgdorf (Collaborator) left a comment:

Amazing work 👏 ❤️
Left some inline comments but all just minor stuff. I have to confess that I rushed through it (I have a limited time budget today). I'll try to give it another pass in the evening when I have more time.

common/src/files.rs:
}

#[derive(PartialEq, Copy, Clone, Eq, Hash)]
pub struct SourceFileId(u128);
@cburgdorf (Collaborator):

Any specific reason for choosing u128 here? Just efficiency, because it should be big enough? I would probably have chosen usize here, so I'm just curious 😄

@sbillig (Collaborator, Author):

No great reason. I thought it would be reasonable to use content hash as the file id (though I'm not convinced that's ideal; rustc just uses file path I think), keccak256 is available in the same crate, u128 is the biggest built-in number, so I figured I'd just take half a keccak and stick it in the u128 and call it good :)
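
A rough sketch of that idea, purely for illustration (using the tiny-keccak crate here, which may not be the exact crate or code used in the PR):

```rust
use std::convert::TryInto;
use tiny_keccak::{Hasher, Keccak};

#[derive(PartialEq, Copy, Clone, Eq, Hash, Debug)]
pub struct SourceFileId(u128);

impl SourceFileId {
    pub fn from_content(source: &str) -> Self {
        let mut hasher = Keccak::v256();
        hasher.update(source.as_bytes());
        let mut digest = [0u8; 32];
        hasher.finalize(&mut digest);
        // Take the first 16 of the 32 digest bytes ("half a keccak") and
        // pack them into the u128 file id.
        SourceFileId(u128::from_be_bytes(digest[..16].try_into().unwrap()))
    }
}

fn main() {
    let id = SourceFileId::from_content("contract Foo:\n    pass\n");
    println!("{:?}", id);
}
```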

(Other inline review threads on common/src/files.rs, newsfragments/346.internal.md, parser/src/grammar/contracts.rs, and parser/src/newparser/syntax_notes.md were resolved.)
@sbillig (Collaborator, Author) commented Apr 23, 2021

@cburgdorf I think I've addressed everything; thanks for the feedback! I also noticed a missing error message (on a missing colon in Parser::enter_block), and that the compile() function wasn't failing when there were non-fatal parser errors.

@g-r-a-n-t (Member) left a comment:

Awesome 🎊

Looking forward to pulling this down and checking out the error messages. I'm good with merging this.

#[derive(Serialize, Deserialize, Debug, PartialEq, Clone)]
pub enum FuncStmt {
    Return {
        value: Option<Node<Expr>>,
    },
    VarDecl {
        target: Node<Expr>, // TODO: change to String
@g-r-a-n-t (Member):

I imagine this would prevent us from destructuring in variable declarations.

(foo, bar): (u256, u256) = self.get_my_tuple()

Maybe it's worth creating a special expression kind for the subset of expressions that may be used as targets in declarations. For now that would just be single identifiers and multiple identifiers in a tuple.
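
One possible shape for that suggestion, purely illustrative and not a settled design: a dedicated AST node for declaration targets instead of reusing the full `Expr` type.

```rust
#[derive(Debug, PartialEq, Clone)]
pub enum VarDeclTarget {
    /// `x: u256 = ...`
    Name(String),
    /// `(foo, bar): (u256, u256) = self.get_my_tuple()`
    Tuple(Vec<VarDeclTarget>),
}
```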

@davesque (Contributor) commented:

I also looked this whole thing over just now and I'm pretty psyched to see how everything looks. Very readable. Amazing work, @sbillig !

@cburgdorf merged commit e64b98e into ethereum:master on Apr 24, 2021
@cburgdorf (Collaborator) commented:

Awesome! I went ahead and merged it. Amazing work @sbillig ❤️
