Track quoting style in the tokenizer #10256
Conversation
It looks like some of the benchmarks are showing a 2-3% slowdown for the lexer.
Linter (preview)
ℹ️ ecosystem check detected linter changes. (+2 -0 violations, +0 -0 fixes in 1 projects; 42 projects unchanged)
rotki/rotki (+2 -0 violations, +0 -0 fixes)
ruff check --no-cache --exit-zero --ignore RUF9 --output-format concise --preview
+ rotkehlchen/chain/ethereum/modules/eigenlayer/constants.py:9:36: Q004 [*] Unnecessary escape on inner quote character
+ rotkehlchen/chain/evm/decoding/cowswap/decoder.py:41:45: Q004 [*] Unnecessary escape on inner quote character
Changes by rule (1 rules affected)
code | total | + violation | - violation | + fix | - fix |
---|---|---|---|---|---|
Q004 | 2 | 2 | 0 | 0 | 0 |
Formatter (stable)
✅ ecosystem check detected no format changes.
Formatter (preview)
✅ ecosystem check detected no format changes.
(I'll also look through the ecosystem results tomorrow.)
@AlexWaygood - It's possible that you have a slightly different version of LALRPOP than whatever was used most recently to generate the parser, hence the thousands of lines of changes in the generated code.
Ah, I'm using 0.20.2 locally :( I did wonder...
It's one way to buff up your contribution stats.
Force-pushed from 82c1673 to c18e409.
I fixed up the huge changes from using the wrong lalrpop version, with some help from @BurntSushi on the necessary steps.
Review thread on crates/ruff_linter/src/rules/pycodestyle/rules/invalid_escape_sequence.rs (outdated, resolved)
Force-pushed from c18e409 to a2513c3.
I think this looks great!
Force-pushed from 81ac61f to eab0e4f.
Can you check if the ecosystem changes are intended?
This looks good. Nice work. I really like how you abstracted away the bit representation and I think it allows us to abstract away even more.
I recommend changing the API of `StringFlags` by introducing a new `StringPrefix` enum (similar to the existing `StringKind` enum) that guarantees that constructing invalid prefixes is impossible. This should reduce the changes necessary in the lexer and remove the unwrap/expect calls (which could be the cause of the perf regression).

I further recommend removing `StringFlags` from `FStringMiddle` and `FStringEnd`, unless we have a very specific use case where knowing the flags is required.
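A minimal sketch of the kind of enum being suggested here. The variant names and bit positions are illustrative (the bits mirror the flags quoted later in this thread); this is not the actual Ruff implementation.

```rust
/// Hypothetical prefix enum in which invalid combinations
/// (e.g. a `u` prefix on a bytes literal or f-string) cannot be constructed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StringPrefix {
    /// No prefix at all, or just `r`/`R`.
    Regular { raw: bool },
    /// A `u`/`U` prefix; cannot be combined with any other prefix.
    Unicode,
    /// A `b`/`B` prefix, optionally raw (`rb"..."`, `br"..."`).
    Bytes { raw: bool },
    /// An `f`/`F` prefix, optionally raw (`rf"..."`, `fr"..."`).
    FString { raw: bool },
}

impl StringPrefix {
    /// Convert to a bit representation; the bit positions mirror the
    /// flags quoted further down in this review.
    fn as_bits(self) -> u8 {
        const U: u8 = 1 << 2;
        const B: u8 = 1 << 3;
        const F: u8 = 1 << 4;
        const R: u8 = 1 << 5;
        match self {
            StringPrefix::Regular { raw } => {
                if raw {
                    R
                } else {
                    0
                }
            }
            StringPrefix::Unicode => U,
            StringPrefix::Bytes { raw } => B | if raw { R } else { 0 },
            StringPrefix::FString { raw } => F | if raw { R } else { 0 },
        }
    }
}

fn main() {
    let prefix = StringPrefix::FString { raw: true };
    println!("{prefix:?} -> {:#010b}", prefix.as_bits());
}
```

Because each variant only exposes combinations Python actually allows, the lexer would never need to `unwrap` a possibly invalid flag combination.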
Review thread on crates/ruff_linter/src/rules/flake8_quotes/rules/avoidable_escaped_quote.rs (outdated, resolved)
Review thread on crates/ruff_linter/src/rules/flake8_quotes/rules/avoidable_escaped_quote.rs (outdated, resolved)
Review thread on crates/ruff_linter/src/rules/pycodestyle/rules/invalid_escape_sequence.rs (outdated, resolved)
Review thread on crates/ruff_linter/src/rules/pylint/rules/bad_string_format_character.rs (outdated, resolved)
Review thread on crates/ruff_linter/src/rules/pyupgrade/rules/printf_string_formatting.rs (outdated, resolved)
Review thread on ..._python_parser/src/snapshots/ruff_python_parser__lexer__tests__fstring_with_format_spec.snap (outdated, resolved)
Force-pushed from 216f582 to 5990dd9.
I wish I could say I knew why these changes cause these two lines to be flagged when previously they weren't. But the good news is... I think these are true positives! The escaped inner quote makes no difference at runtime:

>>> x = b"\xe7\xeb\x0c\xa1\x1b\x83tN\xce=x\xe9\xbe\x01\xb9\x13B_\xba\xe7\x0c2\xce\'rm\x0e\xcd\xe9.\xf8\xd2"
>>> y = b"\xe7\xeb\x0c\xa1\x1b\x83tN\xce=x\xe9\xbe\x01\xb9\x13B_\xba\xe7\x0c2\xce'rm\x0e\xcd\xe9.\xf8\xd2"
>>> x == y
True

I'll add some tests to make sure these don't regress in the future.
The codspeed benchmarks are still showing some regressions in the lexer, but are also now showing some speedups in the linter.
Reviews have mostly been addressed -- thanks! The outstanding questions are:
Two review threads on crates/ruff_linter/src/rules/flake8_quotes/rules/avoidable_escaped_quote.rs (resolved)
I would recommend going with …

We could also have a subset of information stored in the …
The performance regressions in the lexer seem to have disappeared from the benchmarks. Some of the benchmarks on codspeed are still showing regressions of around 1%, but I can't really make much sense of them -- I think they're just noise (correct me if I'm wrong!). There are also some nice speedups on some of the lint rules that have been reworked as part of this PR to make use of the information that's now tracked in the tokens. Overall, codspeed measures this PR as performance-neutral.
(Consider me signed off, though I'll leave it to Micha / Dhruv to give the green light!)
Great work here! Small nits but otherwise good to go from my side.
/// The string has a `u` or `U` prefix.
/// While this prefix is a no-op at runtime,
/// strings with this prefix can have no other prefixes set.
const U_PREFIX = 1 << 2;

/// The string has a `b` or `B` prefix.
/// This means that the string is a sequence of `int`s at runtime,
/// rather than a sequence of `str`s.
/// Strings with this flag can also be raw strings,
/// but can have no other prefixes.
const B_PREFIX = 1 << 3;

/// The string has a `f` or `F` prefix, meaning it is an f-string.
/// F-strings can also be raw strings,
/// but can have no other prefixes.
const F_PREFIX = 1 << 4;

/// The string has an `r` or `R` prefix, meaning it is a raw string.
/// F-strings and byte-strings can be raw,
/// as can strings with no other prefixes.
const R_PREFIX = 1 << 5;
nit: we can have the same names as in `StringPrefix` for consistency
I have a weak preference for keeping them "inconsistent", as I think it emphasises the fact that just because something has the `R_PREFIX` flag set doesn't necessarily mean that's the only prefix it has -- it might have other prefixes as well. In that way, it's semantically different to `StringPrefix` in an important way, since `StringPrefix` enumerates all the ways in which the prefixes can be validly combined, whereas the bitflag does not.
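To make that distinction concrete, here is a small sketch assuming the `bitflags` crate (which the quoted snippet appears to be built on) and illustrative flag names: a raw f-string sets two bits at once, so seeing `R_PREFIX` set never tells you it is the only prefix.

```rust
use bitflags::bitflags;

bitflags! {
    /// Illustrative subset of the flags quoted above; not the exact Ruff definition.
    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
    struct StringFlags: u8 {
        const U_PREFIX = 1 << 2;
        const B_PREFIX = 1 << 3;
        const F_PREFIX = 1 << 4;
        const R_PREFIX = 1 << 5;
    }
}

fn main() {
    // A raw f-string such as `rf"..."` sets both F_PREFIX and R_PREFIX,
    // which is exactly the kind of combination a single enum variant name hides.
    let raw_fstring = StringFlags::F_PREFIX | StringFlags::R_PREFIX;
    assert!(raw_fstring.contains(StringFlags::R_PREFIX));
    assert!(raw_fstring.contains(StringFlags::F_PREFIX));
    println!("{raw_fstring:?}");
}
```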
Review thread on ...python_parser/src/snapshots/ruff_python_parser__lexer__tests__triple_quoted_windows_eol.snap (resolved)
Force-pushed from 07ed0d0 to 4695ec3.
Thanks, all! On to the AST 🚀
This PR modifies our AST so that nodes for string literals, bytes literals and f-strings all retain the following information:
- The quoting style used (double or single quotes)
- Whether the string is triple-quoted or not
- Whether the string is raw or not

This PR is a followup to #10256. Like with that PR, this PR does not, in itself, fix any bugs. However, it means that we will have the necessary information to preserve quoting style and rawness of strings in the `ExprGenerator` in a followup PR, which will allow us to provide a fix for #7799.

The information is recorded on the AST nodes using a bitflag field on each node, similarly to how we recorded the information on `Tok::String`, `Tok::FStringStart` and `Tok::FStringMiddle` tokens in #10298. Rather than reusing the bitflag I used for the tokens, however, I decided to create a custom bitflag for each AST node. Using different bitflags for each node allows us to make invalid states unrepresentable: it is valid to set a `u` prefix on a string literal, but not on a bytes literal or an f-string. It also allows us to have better debug representations for each AST node modified in this PR.
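A hedged sketch of what "a custom bitflag for each AST node" could look like, again assuming the `bitflags` crate and illustrative names rather than the actual Ruff definitions. Because the bytes-literal flags simply have no `u`-prefix bit, the invalid state "u-prefixed bytes literal" cannot be represented.

```rust
use bitflags::bitflags;

bitflags! {
    /// Illustrative flags for string-literal AST nodes.
    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
    struct StringLiteralFlags: u8 {
        const DOUBLE_QUOTES = 1 << 0;
        const TRIPLE_QUOTED = 1 << 1;
        const RAW = 1 << 2;
        /// A `u` prefix is only valid on plain string literals.
        const U_PREFIX = 1 << 3;
    }
}

bitflags! {
    /// Illustrative flags for bytes-literal AST nodes: there is no
    /// `U_PREFIX` bit here, so that invalid state cannot be constructed.
    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
    struct BytesLiteralFlags: u8 {
        const DOUBLE_QUOTES = 1 << 0;
        const TRIPLE_QUOTED = 1 << 1;
        const RAW = 1 << 2;
    }
}

fn main() {
    let s = StringLiteralFlags::U_PREFIX | StringLiteralFlags::DOUBLE_QUOTES;
    let b = BytesLiteralFlags::RAW | BytesLiteralFlags::TRIPLE_QUOTED;
    println!("{s:?} / {b:?}");
}
```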
Summary
This PR changes the tokenizer so that all information about quoting style and prefixes is captured in a bitflag; this bitflag is then stored as a field on `String`, `FStringStart`, `FStringMiddle` and `FStringEnd` tokens.

By itself, this change does not fix any bugs. However, it's a necessary first step if we want to start tracking this information in the AST, which is necessary if we want to solve #7799 in a principled and universal way.
It should be easiest to review this PR one commit at a time.

Test Plan

`cargo test`