Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Track quoting style in the tokenizer #10256

Merged
merged 10 commits into from
Mar 8, 2024

Conversation

AlexWaygood
Copy link
Member

Summary

This PR changes the tokenizer so that all information about quoting style and prefixes is captured in a bitflag; this bitflag is then stored as a field on String, FStringStart, FStringMiddle and FStringEnd tokens.

By itself, this change does not fix any bugs. However, it's a necessary first step if we want to start tracking this information in the AST, which is necessary if we want to solve #7799 in a principled and universal way.

It should be easiest to review this PR one commit at a time:

  1. The first commit adds the bitflag and starts storing it on String tokens.
  2. The second commit also changes f-string tokens so that they store the same information
  3. The third commit applies various cleanups and simplifications to various linter rules that are now possible with this refactor.

Test Plan

cargo test

@AlexWaygood AlexWaygood requested a review from MichaReiser as a code owner March 6, 2024 19:39
@AlexWaygood
Copy link
Member Author

AlexWaygood commented Mar 6, 2024

It looks like some of the benchmarks are showing a 2-3% slowdown for lexer::Lexer::next_token(): https://codspeed.io/astral-sh/ruff/branches/AlexWaygood:quotestyle-tokenizer. I'll see tomorrow if there's anything I can do to ameliorate that.

Copy link
Contributor

github-actions bot commented Mar 6, 2024

ruff-ecosystem results

Linter (stable)

ℹ️ ecosystem check detected linter changes. (+2 -0 violations, +0 -0 fixes in 1 projects; 42 projects unchanged)

rotki/rotki (+2 -0 violations, +0 -0 fixes)

+ rotkehlchen/chain/ethereum/modules/eigenlayer/constants.py:9:36: Q004 [*] Unnecessary escape on inner quote character
+ rotkehlchen/chain/evm/decoding/cowswap/decoder.py:41:45: Q004 [*] Unnecessary escape on inner quote character

Changes by rule (1 rules affected)

code total + violation - violation + fix - fix
Q004 2 2 0 0 0

Linter (preview)

ℹ️ ecosystem check detected linter changes. (+2 -0 violations, +0 -0 fixes in 1 projects; 42 projects unchanged)

rotki/rotki (+2 -0 violations, +0 -0 fixes)

ruff check --no-cache --exit-zero --ignore RUF9 --output-format concise --preview

+ rotkehlchen/chain/ethereum/modules/eigenlayer/constants.py:9:36: Q004 [*] Unnecessary escape on inner quote character
+ rotkehlchen/chain/evm/decoding/cowswap/decoder.py:41:45: Q004 [*] Unnecessary escape on inner quote character

Changes by rule (1 rules affected)

code total + violation - violation + fix - fix
Q004 2 2 0 0 0

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.

@AlexWaygood
Copy link
Member Author

(I'll also look through the ecosystem results tomorrow.)

@charliermarsh
Copy link
Member

@AlexWaygood - It's possible that you have a slightly different version of LALRPOP than whatever was used most recently to generate the parser, hence the thousands of lines of changes in the generated python.rs. Can you check if you're using // auto-generated: "lalrpop 0.20.0"?

@AlexWaygood
Copy link
Member Author

AlexWaygood commented Mar 6, 2024

Can you check if you're using // auto-generated: "lalrpop 0.20.0"?

Ah, I'm using 0.20.2 locally :(

I did wonder...

@charliermarsh
Copy link
Member

It's one way to buff up your contribution stats.

@AlexWaygood AlexWaygood force-pushed the quotestyle-tokenizer branch from 82c1673 to c18e409 Compare March 6, 2024 21:48
@AlexWaygood
Copy link
Member Author

I fixed up the huge changes from using the wrong lalrpop version, with some help from @BurntSushi on the necessary git-fu to get there 🥳

@AlexWaygood AlexWaygood force-pushed the quotestyle-tokenizer branch from c18e409 to a2513c3 Compare March 6, 2024 22:14
Copy link
Member

@charliermarsh charliermarsh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks great!

crates/ruff_python_parser/src/string_token_flags.rs Outdated Show resolved Hide resolved
crates/ruff_python_parser/src/string_token_flags.rs Outdated Show resolved Hide resolved
crates/ruff_python_parser/src/string_token_flags.rs Outdated Show resolved Hide resolved
crates/ruff_python_parser/src/lexer.rs Outdated Show resolved Hide resolved
crates/ruff_python_parser/src/lexer.rs Outdated Show resolved Hide resolved
crates/ruff_python_parser/src/lexer.rs Outdated Show resolved Hide resolved
@AlexWaygood AlexWaygood force-pushed the quotestyle-tokenizer branch 2 times, most recently from 81ac61f to eab0e4f Compare March 6, 2024 23:02
@charliermarsh
Copy link
Member

Can you check if the ecosystem changes are intended?

Copy link
Member

@MichaReiser MichaReiser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good. Nice work. I really like how you abstracted away the bit representation and I think it allows us to abstract away even more.

I recommend changing the API of StringFlags by introducing a new StringPrefix enum (similar to the existing StringKind enum) that guarantees that constructing invalid prefixes is impossible. This should reduce the changes necessary in the lexer and remove the unwrap/except calls (that could be the cause of the perf regression).

I further recommend removing StringFlags from FStringMiddle and FStringEnd except if we have a very specific use case where knowing the flags is required.

@MichaReiser MichaReiser added the internal An internal refactor or improvement label Mar 7, 2024
@AlexWaygood AlexWaygood force-pushed the quotestyle-tokenizer branch from 216f582 to 5990dd9 Compare March 7, 2024 13:06
@AlexWaygood
Copy link
Member Author

I wish I could say I knew why these changes causes these two lines to be flagged when previously they weren't. But the good news is... I think these are true positives! The \' escape inside both strings does seem to be unnecessary, since the string uses double quotes:

>>> x = b"\xe7\xeb\x0c\xa1\x1b\x83tN\xce=x\xe9\xbe\x01\xb9\x13B_\xba\xe7\x0c2\xce\'rm\x0e\xcd\xe9.\xf8\xd2"
>>> y = b"\xe7\xeb\x0c\xa1\x1b\x83tN\xce=x\xe9\xbe\x01\xb9\x13B_\xba\xe7\x0c2\xce'rm\x0e\xcd\xe9.\xf8\xd2"
>>> x == y
True

I'll add some tests to make sure these don't regress in the future.

@AlexWaygood
Copy link
Member Author

The codspeed benchmarks are still showing some regressions in the lexer, but are also now showing some speedups in the linter.

@AlexWaygood
Copy link
Member Author

Reviews have mostly been addressed -- thanks! The outstanding questions are:

  1. What to do about f-string tokens: Track quoting style in the tokenizer #10256 (comment)
  2. Whether StringKind et al deserve to be in their own module/file, or whether they should be moved to string.rs: Track quoting style in the tokenizer #10256 (comment)
  3. The way StringKind stores its data: Track quoting style in the tokenizer #10256 (comment)
  4. Getting a better Debug implementation for StringKind (depends on resolving (3) first): Track quoting style in the tokenizer #10256 (comment)

@dhruvmanila
Copy link
Member

  1. What to do about f-string tokens: Track quoting style in the tokenizer #10256 (comment)

I would recommend to go with FStringStart and FStringMiddle. I think the next steps, as mentioned in the PR description, is to encode the information in the AST. The parser would use the FStringStart token to get the information. Later, we could potentially move some of the rules from token-based to AST-based which would then make having this information in FStringMiddle not required.

We could also have a subset of information stored in the FStringMiddle but I think that's what we're trying to avoid here.

@AlexWaygood
Copy link
Member Author

AlexWaygood commented Mar 7, 2024

The performance regressions in the lexer seem to have disappeared from the benchmarks. Some of the benchmarks on codspeed are still showing regressions of around 1%, but I can't really make much sense of them -- I think they're just noise (correct me if I'm wrong!). There's also some nice speedups on some of the lint rules that have been reworked as part of this PR to make use of the information that's now tracked in the tokens:

Performance improvements on some lint rules

image
image

Overall, codspeed measures this PR as performance-neutral.

@charliermarsh
Copy link
Member

(Consider me signed off, though I'll leave it to Micha / Dhruv to give the green light!)

Copy link
Member

@dhruvmanila dhruvmanila left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work here! Small nits but otherwise good to go from my side.

Comment on lines +20 to +47
/// The string has a `u` or `U` prefix.
/// While this prefix is a no-op at runtime,
/// strings with this prefix can have no other prefixes set.
const U_PREFIX = 1 << 2;

/// The string has a `b` or `B` prefix.
/// This means that the string is a sequence of `int`s at runtime,
/// rather than a sequence of `str`s.
/// Strings with this flag can also be raw strings,
/// but can have no other prefixes.
const B_PREFIX = 1 << 3;

/// The string has a `f` or `F` prefix, meaning it is an f-string.
/// F-strings can also be raw strings,
/// but can have no other prefixes.
const F_PREFIX = 1 << 4;

/// The string has an `r` or `R` prefix, meaning it is a raw string.
/// F-strings and byte-strings can be raw,
/// as can strings with no other prefixes.
/// U-strings cannot be raw.
const R_PREFIX = 1 << 5;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we can have the same names as in StringPrefix for consistency

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a weak preference for keeping them "inconsistent", as I think it emphasises the fact that just because something has the R_PREFIX flag set doesn't necessarily mean that's the only prefix it has -- it might have other prefixes as well. In that way, it's semantically different to StringPrefix in an important way, since StringPrefix enumerates all the ways in which the prefixes can be validly combined, whereas the bitflag does not.

crates/ruff_python_parser/src/string_token_flags.rs Outdated Show resolved Hide resolved
crates/ruff_python_parser/src/lexer/fstring.rs Outdated Show resolved Hide resolved
@AlexWaygood AlexWaygood force-pushed the quotestyle-tokenizer branch from 07ed0d0 to 4695ec3 Compare March 8, 2024 08:33
@AlexWaygood
Copy link
Member Author

Thanks, all! On to the AST 🚀

@AlexWaygood AlexWaygood enabled auto-merge (squash) March 8, 2024 08:34
@AlexWaygood AlexWaygood merged commit c504d7a into astral-sh:main Mar 8, 2024
17 checks passed
AlexWaygood added a commit that referenced this pull request Mar 8, 2024
This PR modifies our AST so that nodes for string literals, bytes literals and f-strings all retain the following information:
- The quoting style used (double or single quotes)
- Whether the string is triple-quoted or not
- Whether the string is raw or not

This PR is a followup to #10256. Like with that PR, this PR does not, in itself, fix any bugs. However, it means that we will have the necessary information to preserve quoting style and rawness of strings in the `ExprGenerator` in a followup PR, which will allow us to provide a fix for #7799.

The information is recorded on the AST nodes using a bitflag field on each node, similarly to how we recorded the information on `Tok::String`, `Tok::FStringStart` and `Tok::FStringMiddle` tokens in #10298. Rather than reusing the bitflag I used for the tokens, however, I decided to create a custom bitflag for each AST node.

Using different bitflags for each node allows us to make invalid states unrepresentable: it is valid to set a `u` prefix on a string literal, but not on a bytes literal or an f-string. It also allows us to have better debug representations for each AST node modified in this PR.
@AlexWaygood AlexWaygood deleted the quotestyle-tokenizer branch March 8, 2024 23:13
nkxxll pushed a commit to nkxxll/ruff that referenced this pull request Mar 10, 2024
nkxxll pushed a commit to nkxxll/ruff that referenced this pull request Mar 10, 2024
This PR modifies our AST so that nodes for string literals, bytes literals and f-strings all retain the following information:
- The quoting style used (double or single quotes)
- Whether the string is triple-quoted or not
- Whether the string is raw or not

This PR is a followup to astral-sh#10256. Like with that PR, this PR does not, in itself, fix any bugs. However, it means that we will have the necessary information to preserve quoting style and rawness of strings in the `ExprGenerator` in a followup PR, which will allow us to provide a fix for astral-sh#7799.

The information is recorded on the AST nodes using a bitflag field on each node, similarly to how we recorded the information on `Tok::String`, `Tok::FStringStart` and `Tok::FStringMiddle` tokens in astral-sh#10298. Rather than reusing the bitflag I used for the tokens, however, I decided to create a custom bitflag for each AST node.

Using different bitflags for each node allows us to make invalid states unrepresentable: it is valid to set a `u` prefix on a string literal, but not on a bytes literal or an f-string. It also allows us to have better debug representations for each AST node modified in this PR.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
internal An internal refactor or improvement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants