-
Notifications
You must be signed in to change notification settings - Fork 12.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
librustc_lexer: Refactor the module #66015
Conversation
// string with single quotes). | ||
if self.first() == '\'' { | ||
self.bump(); | ||
let kind = Char { terminated: true }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the way, I'm not sure why we're consuming the literal suffix above, but do not consume here.
As a result, we have a different errors for non-single character single-quoted literals with suffixes depending on the first symbol:
Playground 1 / Playground 2
That's pretty esoteric, I know, but nevertheless it seems a bit inconsistent to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I'm not sure if that can even be called a bug since the code in example is completely invalid)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if we detect error in the char literal, it's better to recover the next token as identifier, rather than treat it as a suffix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't have strong opinions on any changes suggested here :)
rutsc_lexer
is mostly a from-scratch implementation, so the current shape of code is mostly what I'd consider the most readable implementation.
I am also feel slightly uneasy because it's not trivially clear that this doesn't change behavior. I wish we did lexer spec and full-coverage test-suite already :)
// string with single quotes). | ||
if self.first() == '\'' { | ||
self.bump(); | ||
let kind = Char { terminated: true }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if we detect error in the char literal, it's better to recover the next token as identifier, rather than treat it as a suffix
fn float_exponent(&mut self) -> Result<(), ()> { | ||
/// Eats the float exponent. Returns true if at least one digit was met, | ||
/// and returns false otherwise. | ||
fn eat_float_exponent(&mut self) -> bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All other eat_x
functions have a contract that, if they return false
, they don't consume anything.
This function always consumed something, and, if it returns an Err
, you must report it, hence this weird owl-result/bool. It definitely could use a comment though :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, do they? For example, eat_decimal_digits
will consume _______
and return false.
I don't think this split into multiple files is an improvement. |
Could you put the code into its old place? Then I'll be able to review the remaining diff. |
Sure, I'll bring everything back soon. |
f671200
to
b93c988
Compare
+1 on |
b93c988
to
13da2aa
Compare
// Newline without following '\'' means unclosed quote, stop parsing. | ||
'\n' if self.nth_char(1) != '\'' => break, | ||
// End of file, stop parsing. | ||
EOF_CHAR if self.is_eof() => break, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure why the order of match
arms was changed here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, I had two-level motivation here:
- I ordered match arms depending on the termination level (first match has
return
, then go exceptional cases withbreak
, then go char-skipping arms (escaped char and any other char)). - I thought that it's a bit more readable to have the normal exit condition to be the first match arm.
r? @matklad |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
r=me with a comment about '0
expanded to mention error recovery, and with 13da2aa squashed into fdf74a3 (such that there's no back and forth with outlining and inlining methods).
Overall, I must say that some changes here are clearly wins, while others seem more like just a different equivalent way to write the same code. In the future, I would advise splitting uncontroversial strict improvements from stylistic changes, such that it becomes easier to access and merge both independently.
src/librustc_lexer/src/lib.rs
Outdated
false | ||
} else { | ||
// If the first symbol is valid for identifier | ||
// or it's a digit, it can be a lifetime. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can't really be a lifetime if second is a digit. Rather, this is a special-cased error recovery.
@@ -682,15 +670,33 @@ impl Cursor<'_> { | |||
if self.eat_decimal_digits() { Ok(()) } else { Err(()) } | |||
} | |||
|
|||
// Eats the suffix if it's an identifier. | |||
// Eats the suffix of the literal, e.g. "_u8". | |||
fn eat_literal_suffix(&mut self) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems like this method can be removed now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's better to have it for readability. It's obvious why we are calling "eat_literal_suffix" after parsing the literal, but it's not that obvious when we'll call "eat_identifier" instead.
} | ||
|
||
(n_hashes, started, finished) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not entirely sure if this is a simplification:
- started/finished can be const-propagated so that there's no need to keep state in your head
- mutable predicates are odd
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be honest, I don't completely like this too, but IMO it's better than the current approach.
With the current approach instead of having values of "started" and "finished" in the head, you had to remember what exactly "n_hashes, true, false" mean (which at least for me was a bit harder).
Regarding the mutable predicate: that's the cost of not having a eat_at_most_while
. However, the scope of this predicate is pretty small and the context is simple, so I don't find it confusing.
Nevertheless I can revert those changes if you insist :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to revert this: I don't claim that the old version was better. That's really just different ways to work-around the bad signature of the method.
Regarding the splitting changes into several PRs: sure, I've got my lesson, and sorry for putting it all-together this time. |
13da2aa
to
31735b0
Compare
@bors r+ Thanks! |
📌 Commit 31735b0 has been approved by |
…matklad librustc_lexer: Refactor the module This PR introduces a refactoring of the `librustc_lexer` in order to improve readability. All the changes performed are only cosmetic and do not introduce any changes the lexer logic or performance. Newly introduced modules `literal`, `token` and `utils` are just copy-pasted from the `lib.rs` and do not contain even cosmetic changes (I decided to do so so it'll be easier to review changes looking only on diff). r? @petrochenkov cc @Centril @matklad
Rollup of 9 pull requests Successful merges: - #65776 (Rename `LocalInternedString` and more) - #65973 (caller_location: point to macro invocation sites, like file!/line!, and use in core::panic!.) - #66015 (librustc_lexer: Refactor the module) - #66062 (Configure LLVM module PIC level) - #66086 (bump smallvec to 1.0) - #66092 (Use KERN_ARND syscall for random numbers on NetBSD, same as FreeBSD.) - #66103 (Add target thumbv7neon-unknown-linux-musleabihf) - #66133 (Update the bundled `wasi-libc` repository) - #66139 (use American spelling for `pluralize!`) Failed merges: r? @ghost
This PR introduces a refactoring of the
librustc_lexer
in order to improve readability.All the changes performed are only cosmetic and do not introduce any changes the lexer logic or performance.
Newly introduced modules
literal
,token
andutils
are just copy-pasted from thelib.rs
and do not contain even cosmetic changes (I decided to do so so it'll be easier to review changes looking only on diff).r? @petrochenkov
cc @Centril @matklad