New parser #346
Conversation
Hey folks. I'm not directly involved in this project lately, but I'm curious what about the current parser makes it so hard to emit diagnostic messages. I did think about that a bit when I was originally writing it and I didn't foresee too much difficulty. Is there any issue ticket somewhere that has some more discussion on this topic?
Also, were there any other issues, besides the diagnostic message thing, that are becoming a problem with the current parser design? I had foreseen some issues, which I documented in #3, but I'm curious if any other problems have become apparent since then. Also, I took some cues from the Rust project parser itself:
Some of the code in those modules is surprisingly readable. They take a different approach entirely to their parser architecture (no use of combinators, more traditional), but it's still a good place for inspiration or to answer some questions about how you might go about parsing things. Of course, Rust is very straightforward to lex, so it's kind of a different animal. Also of note, at least the last time I read through those modules, there's no dependence on external packages in the Rust parser. It's all custom parser code. That made me think that it might not be entirely unreasonable to avoid external dependencies in the parser.

At first, I was using the nom library to write the Fe parser. Eventually, it seemed like the generality and flexibility of nom (through its use of generics and parseable type traits and such) actually made it a pain to use. It necessitated adding one or two layers of abstraction to the parsing result types, plus lots of additional type variables and trait constraints on every parser signature. So I just copied their API and re-implemented their combinators in a way that was suited very specifically to our use of tokens coming out of the lexer. It meant that the result and cursor types and parser code could all be a lot simpler, and it didn't feel like a ton of code to incorporate directly into the Fe parser lib.

Anyway, I mention all this because I wonder if it's necessary to bring in too many external dependencies. Python is a weird language to lex, and I initially had a lot of trouble finding parsing or lexing libraries that seemed like they would accommodate the requirements of Python lexing easily. So I wonder: has the use of an external lib such as logos to do the lexing really ended up simplifying the parser rewrite? I suppose I could take a closer look at this PR and answer my own question, but I thought I'd pose it directly to the author to spur some discussion.

By the way, love that this project is getting so much attention. It's kind of a thrill to see so much of that initial work I did getting used for things. Maybe that's also why I have some of these questions about the parser rewrite: the original parser was one of my main contributions to this project!
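A minimal sketch of what that token-specific combinator approach could look like, i.e. a nom-style `many1` written directly against a concrete token cursor instead of nom's generic input traits. All names here (Cursor, Token, ParseResult, many1) are hypothetical and are not the actual Fe parser API; this is only an illustration of the design trade-off described above.

```rust
// Sketch only: a combinator specialized to one concrete cursor type, which
// keeps the result and cursor types simple compared to fully generic nom.

#[derive(Debug, Clone, PartialEq)]
enum Token {
    Name(String),
    Comma,
    ParenClose,
}

#[derive(Clone, Copy)]
struct Cursor<'a> {
    tokens: &'a [Token],
    pos: usize,
}

// The result type can stay simple because it's tied to a single cursor type.
type ParseResult<'a, T> = Result<(Cursor<'a>, T), String>;

impl<'a> Cursor<'a> {
    fn next(self) -> ParseResult<'a, Token> {
        match self.tokens.get(self.pos) {
            Some(tok) => Ok((Cursor { pos: self.pos + 1, ..self }, tok.clone())),
            None => Err("unexpected end of input".into()),
        }
    }
}

/// Apply `parser` one or more times, collecting the results (nom-like `many1`).
fn many1<'a, T>(
    parser: impl Fn(Cursor<'a>) -> ParseResult<'a, T>,
) -> impl Fn(Cursor<'a>) -> ParseResult<'a, Vec<T>> {
    move |start| {
        let (mut cursor, first) = parser(start)?;
        let mut items = vec![first];
        while let Ok((next, item)) = parser(cursor) {
            items.push(item);
            cursor = next;
        }
        Ok((cursor, items))
    }
}
```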
I should also mention that one goal I had with minimizing external deps was to ease the distribution of Fe as a WASM binary. Also, I had this sort of crazy idea that eventually Fe could even be a smart contract itself, since Ethereum had a plan to move towards WASM as the generally accepted binary format. If that ever happened, minimizing binary size (and thus external dependencies) would be nice. Of course, that's a pretty far-flung scenario at this point, but the question of binary size still seems like a relevant one for other reasons.
Hi @davesque, I've tasked myself with improving the error/diagnostic messages across the parser/analyzer/compiler. The goal is to (eventually) have rust/elm-quality diagnostics with nice underlines and hints. For the parser, we'd like to attempt to recover from some amount of bad syntax, continue parsing, and emit diagnostic messages for several issues at once, rather than stopping on the first. Longer term, I'd like this error recovery to be robust enough for an lsp-server/IDE use case, where the parser runs as the user enters code and always produces a syntax tree, perhaps with some Error nodes (this also requires the parser to be very fast).

I've made a couple of small changes to Fe's current parser, and I found it to be a fine experience; I can't say whether there have been any major pains. FWIW, my suspicion is that a more traditional recursive descent parser may be easier for newcomers to the codebase to understand and contribute to, because it's such boring code (or should be; I'm not claiming that my current code is necessarily super approachable yet), but I think the combinator style is fine to work with.

Despite my glib description in this PR, I spent a decent amount of time considering whether a rewrite was a good idea. I did see your #3 issue and took that into consideration. Pratt/TDOP parsing is a really nice way to handle operator precedence, and I expect the new parser implementation to be very quick, as we only need one token of lookahead and shouldn't need to backtrack. (But I doubt parsing speed will be an issue in practice anyway; Fe codebases probably won't be very big.)

Another thing I considered was the couple of bugs in the tokenizer that agroce's fuzzing turned up. There were easy fixes for these, but they did make me question whether we might be limited by the current design. E.g. #226: I'd like the parser to detect extraneous or unbalanced brackets/braces/parens, emit an error, and recover gracefully. To do so, it'll probably need some broader context of what's being parsed. The current situation is that the bracket mismatch is a lexer error, and the lexer probably doesn't/shouldn't have the contextual knowledge to try to recover from that error. So I thought that a reasonable solution to this particular problem would be to make the lexer dumb and track things like indentation and parens in the parser.

I also thought about how to handle contextual error recovery in the combinator-heavy parsing code. Maybe I'm lacking in creativity, but I think in my hands it would end up looking more like a boring recursive descent parser; e.g. we'd maybe replace some many1 calls with loops so we can recognize a syntax issue partway through, emit an error, skip past it, and carry on. Or maybe one could achieve smart error recovery while maintaining the functional style, but in my imagination that either looks like pretty cryptic code, or takes on a verbosity that basically negates the appeal of the combinator style.

I should say that I came into this with some bias against combinator-heavy parsing, specifically relating to error recovery and reporting. I worked on a previous language project, and after failing to achieve satisfactory errors from a generated parser (an old version of ANTLR), then failing again with a prototype combinator parser (using parsec), we decided to hand-roll a recursive descent parser with Pratt-style expression parsing, and we were quite happy with that decision.
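A rough sketch of the "loops instead of many1" style of recovery described above: collect what parses, record a diagnostic for what doesn't, skip ahead to a token we recognize, and keep going. All names here (Parser, Token, parse_arg_list) are made up for illustration and are not the actual Fe parser code.

```rust
// Sketch of loop-based error recovery with a synchronization point.

#[derive(Debug, Clone, PartialEq)]
enum Token {
    Name(String),
    Comma,
    ParenClose,
    Unexpected(char),
}

struct Parser {
    tokens: Vec<Token>,
    pos: usize,
    diagnostics: Vec<String>,
}

impl Parser {
    fn peek(&self) -> Option<&Token> {
        self.tokens.get(self.pos)
    }

    fn bump(&mut self) -> Option<Token> {
        let tok = self.tokens.get(self.pos).cloned();
        self.pos += 1;
        tok
    }

    /// Parse `name, name, ... )`, collecting every valid name and emitting a
    /// diagnostic (instead of bailing out) when something unexpected shows up.
    fn parse_arg_list(&mut self) -> Vec<String> {
        let mut args = Vec::new();
        loop {
            match self.bump() {
                Some(Token::Name(name)) => args.push(name),
                Some(Token::ParenClose) | None => break,
                Some(other) => {
                    // Record the problem and keep going, so later errors in
                    // the same file are still reported.
                    self.diagnostics
                        .push(format!("unexpected token: {:?}", other));
                    // Skip ahead to something we recognize (a comma or `)`).
                    while !matches!(
                        self.peek(),
                        Some(Token::Comma) | Some(Token::ParenClose) | None
                    ) {
                        self.pos += 1;
                    }
                }
            }
            // Consume a trailing comma, if present.
            if self.peek() == Some(&Token::Comma) {
                self.pos += 1;
            }
        }
        args
    }
}
```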
I don't doubt that it's possible to achieve good results using the combinator style, but I've banged my head against that wall and given up in the past, and that's certainly influencing the decision here. One could probably achieve good results with just about any parsing technique given enough effort, but I figure that since most first-class compilers have landed on hand-written recursive descent (gcc, clang, rustc, rust-analyzer, ...), doing the same is at least defensible, particularly if we're going to invest significant time in parser improvements.

Regarding external dependencies: as mentioned above, I decided to handle the indentation/paren state tracking in the parser and keep the lexer simple. When there's no tricky business like that in the lexer, IMO there's no reason not to outsource that fairly boring job to a library; logos claims to generate super fast code (it's all proc macro wizardry), it's easy to use, and it's no_std (and thus presumably wasm) friendly. I doubt it'll ever cause us trouble, but if it does, it'll be easy to replace with custom lexing code.

I'm also using codespan-reporting for the fancy underlining and code context printing; I believe it's wasm friendly. Neither of these is critical to the design, but they're both definitely worth using IMO.
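For reference, the general shape of a logos-based lexer, assuming a logos version from around the time of this PR (~0.12, which requires an #[error] catch-all variant; newer versions change this). The token set below is invented for illustration and is not Fe's actual lexer.

```rust
use logos::Logos;

// A tiny made-up token set, just to show the shape of a logos lexer.
#[derive(Logos, Debug, PartialEq)]
enum Token {
    #[token("fn")]
    Fn,
    #[token("(")]
    ParenOpen,
    #[token(")")]
    ParenClose,
    #[token(":")]
    Colon,
    #[regex("[a-zA-Z_][a-zA-Z0-9_]*")]
    Name,
    // Skip whitespace; anything unrecognized becomes an Error token.
    #[regex(r"[ \t\n]+", logos::skip)]
    #[error]
    Error,
}

fn main() {
    let mut lex = Token::lexer("fn foo():");
    while let Some(tok) = lex.next() {
        println!("{:?} {:?}", tok, lex.slice());
    }
}
```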
Hey @sbillig. Thanks for that long and well-thought-out response. I'm really happy to hear that you put so much thought into this; given that, I have a lot fewer concerns. It sounds like you had already considered a lot of the things I mentioned. I think I agree that a hand-written recursive descent parser is nice since it's very straightforward to implement, and, as you mentioned (and as I had also previously seen in my own work), rustc uses one and the code for it is pretty readable. So no real concerns there. Also, your explanation for moving the paren and indentation handling into the parser makes sense given your goal of allowing the parser to recover from errors and output more messages at once.

Anyway, I just wanted a chance to voice some of my reactions to this since I was so heavily involved in this part of the code. Based on your response, it sounds like the work is in good hands.
@sbillig By the way, I'm interested in helping with this work if you could use another person on it.
@davesque That would be great! The current state is basically that it works (aside from not supporting some stuff that the rest of the compiler doesn't support yet anyway), but it needs more tests, a lot of the error messages are pretty basic, and it could generally use some polish. I'm currently working on swapping the new parser in and tying up a few loose ends to get it ready for review. I'm hoping we can merge it soon to unblock things like #337 and #333, and make it easier for you and whoever else to jump in.
@cburgdorf @g-r-a-n-t Ok, this should be ready now. Like I mentioned, I plan to keep iterating on this for a while, but I'd like to do it in separate PRs; this one is big enough. Also, there should be more parser tests, but I'd like to work on some nicer ast printing and/or fe code generation (pretty printing) first, so the parser (and lowering) test fixtures are easier to comprehend. I'll create a bunch of gh issues for the missing functionality and ast refactorings that are on my mental todo list.
Codecov Report
```
@@            Coverage Diff             @@
##           master     #346       +/-   ##
===========================================
- Coverage   92.81%   82.80%   -10.01%
===========================================
  Files          59       62        +3
  Lines        4022     3862      -160
===========================================
- Hits         3733     3198      -535
- Misses        289      664      +375
```
Continue to review full report at Codecov.
Amazing work 👏 ❤️
Left some inline comments but all just minor stuff. I have to confess that I rushed through it (I have a limited time budget today). I'll try to give it another pass in the evening when I have more time.
common/src/files.rs
```rust
#[derive(PartialEq, Copy, Clone, Eq, Hash)]
pub struct SourceFileId(u128);
```
Any specific reason for choosing u128 here? Just efficiency, because it should be big enough? I would probably have chosen usize here, so I'm just curious 😄
No great reason. I thought it would be reasonable to use the content hash as the file id (though I'm not convinced that's ideal; rustc just uses the file path, I think). keccak256 is available in the same crate, and u128 is the biggest built-in number, so I figured I'd just take half a keccak, stick it in the u128, and call it good :)
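A sketch of that "half a keccak in a u128" scheme, using the tiny-keccak crate (with its keccak feature) purely for illustration; the actual Fe code may use a different keccak helper, and the id scheme itself was described as tentative above.

```rust
use tiny_keccak::{Hasher, Keccak};

#[derive(PartialEq, Eq, Copy, Clone, Hash, Debug)]
pub struct SourceFileId(u128);

impl SourceFileId {
    /// Hash the file contents and keep the first half of the 32-byte digest.
    pub fn from_content(content: &str) -> Self {
        let mut hasher = Keccak::v256();
        hasher.update(content.as_bytes());
        let mut digest = [0u8; 32];
        hasher.finalize(&mut digest);

        // Pack the first 16 bytes of the digest into a u128.
        let mut half = [0u8; 16];
        half.copy_from_slice(&digest[..16]);
        SourceFileId(u128::from_be_bytes(half))
    }
}
```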
@cburgdorf I think I've addressed everything; thanks for the feedback! I also noticed a missing error message (on a missing colon in …)
Awesome 🎊
Looking forward to pulling this down and checking out the error messages. I'm good with merging this.
```rust
#[derive(Serialize, Deserialize, Debug, PartialEq, Clone)]
pub enum FuncStmt {
    Return {
        value: Option<Node<Expr>>,
    },
    VarDecl {
        target: Node<Expr>, // TODO: change to String
        // ... (excerpt truncated)
```
I imagine this would prevent us from destructuring in variable declarations.
(foo, bar): (u256, u256) = self.get_my_tuple()
Maybe it's worth creating a special expression kind for the subset of expressions that may be used as targets in declarations. For now, that would just be single identifiers and multiple identifiers in a tuple.
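One possible shape for that suggestion, as a sketch only: the names VarDeclTarget and Node are illustrative and not necessarily what the codebase ends up using, and Node here only approximates Fe's span-carrying AST wrapper.

```rust
// Stand-in for Fe's span-carrying AST wrapper.
pub struct Node<T> {
    pub kind: T,
    // span fields omitted for brevity
}

// A dedicated declaration-target kind: either a single name, or a tuple of
// nested targets, which covers `(foo, bar)` and allows further nesting.
pub enum VarDeclTarget {
    Name(String),
    Tuple(Vec<Node<VarDeclTarget>>),
}
```

Typing the target this way would let the parser reject arbitrary expressions in declaration position up front, rather than leaving that check to the analyzer.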
I also looked this whole thing over just now and I'm pretty psyched to see how everything looks. Very readable. Amazing work, @sbillig !
Awesome! I went ahead and merged it. Amazing work @sbillig ❤️
What was wrong?
It'd be very tricky to emit nice diagnostic messages from the existing parser. (See discussion below)
How was it fixed?
A complete rewrite, of course!
To-Do
OPTIONAL: Update Spec if applicable
Add entry to the release notes (may forgo for trivial changes)
Clean up commit history