
Move file-level rule exemption to lexer-based approach #5567

Merged · merged 3 commits into main from charlie/exemption-parser on Jul 7, 2023

Conversation

charliermarsh (Member)

Summary

In addition to `# noqa` codes, we also support file-level exemptions, which look like:

  • `# flake8: noqa` (ignore all rules in the file; supported for Flake8 compatibility)
  • `# ruff: noqa` (ignore all rules in the file)
  • `# ruff: noqa: F401` (ignore F401 in the file; Flake8 doesn't support this)

This PR moves that logic to something that looks a lot more like our `# noqa` parser. Performance is actually quite a bit worse than the previous approach (lexing `# flake8: noqa` goes from 2ns to 11ns; lexing `# ruff: noqa: F401, F841` is about the same; lexing `# type: ignore # noqa: E501` goes from 4ns to 6ns), but the numbers are very small, so it's... maybe worth it?

The primary benefit is that we now properly support flexible whitespace, like `#flake8:noqa`. Previously, we required exact string matching, and we also didn't support all case-insensitive variants of `noqa`.
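As an illustration (a sketch, not code from this PR), these are the kinds of variants the lexer-based parser is described as accepting, using the `try_extract` entry point shown in the diff below:

// Sketch only: variants described in the summary above.
for line in [
    "# flake8: noqa",     // exact form, previously supported
    "#flake8:noqa",       // flexible whitespace, newly supported
    "# flake8: NOQA",     // case-insensitive `noqa`, newly supported
    "# ruff: noqa: F401", // file-level exemption for a specific code
] {
    assert!(ParsedFileExemption::try_extract(line).is_some());
}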

@charliermarsh charliermarsh requested a review from MichaReiser July 6, 2023 18:10
charliermarsh (Author)

I thought this might end up being faster, but given that it's slower, IDK; open to not merging. I do think handling `#flake8: noqa` (for example) is nice, though.

github-actions bot commented Jul 6, 2023

PR Check Results

Ecosystem

✅ ecosystem check detected no changes.

Benchmark

Linux

group                                      main                                   pr
-----                                      ----                                   --
formatter/large/dataset.py                 1.00      7.9±0.03ms     5.2 MB/sec    1.00      7.9±0.03ms     5.2 MB/sec
formatter/numpy/ctypeslib.py               1.00   1753.3±2.03µs     9.5 MB/sec    1.00   1756.2±3.49µs     9.5 MB/sec
formatter/numpy/globals.py                 1.00    197.9±0.85µs    14.9 MB/sec    1.01    199.6±0.52µs    14.8 MB/sec
formatter/pydantic/types.py                1.01      3.8±0.01ms     6.7 MB/sec    1.00      3.8±0.00ms     6.8 MB/sec
linter/all-rules/large/dataset.py          1.06     14.5±1.27ms     2.8 MB/sec    1.00     13.6±0.06ms     3.0 MB/sec
linter/all-rules/numpy/ctypeslib.py        1.02      3.5±0.10ms     4.8 MB/sec    1.00      3.4±0.01ms     4.9 MB/sec
linter/all-rules/numpy/globals.py          1.00    437.9±0.47µs     6.7 MB/sec    1.00    437.4±1.19µs     6.7 MB/sec
linter/all-rules/pydantic/types.py         1.01      6.0±0.03ms     4.2 MB/sec    1.00      6.0±0.02ms     4.3 MB/sec
linter/default-rules/large/dataset.py      1.01      6.8±0.02ms     6.0 MB/sec    1.00      6.7±0.02ms     6.0 MB/sec
linter/default-rules/numpy/ctypeslib.py    1.01   1481.7±3.54µs    11.2 MB/sec    1.00   1468.8±3.68µs    11.3 MB/sec
linter/default-rules/numpy/globals.py      1.00    169.7±0.21µs    17.4 MB/sec    1.00    170.5±1.10µs    17.3 MB/sec
linter/default-rules/pydantic/types.py     1.01      3.1±0.02ms     8.3 MB/sec    1.00      3.0±0.01ms     8.4 MB/sec

Windows

group                                      main                                   pr
-----                                      ----                                   --
formatter/large/dataset.py                 1.00      7.6±0.03ms     5.3 MB/sec    1.00      7.6±0.03ms     5.3 MB/sec
formatter/numpy/ctypeslib.py               1.00  1592.3±13.92µs    10.5 MB/sec    1.01  1601.8±11.61µs    10.4 MB/sec
formatter/numpy/globals.py                 1.00    170.8±1.46µs    17.3 MB/sec    1.01    172.0±3.35µs    17.2 MB/sec
formatter/pydantic/types.py                1.00      3.5±0.01ms     7.2 MB/sec    1.00      3.6±0.02ms     7.2 MB/sec
linter/all-rules/large/dataset.py          1.04     13.0±0.15ms     3.1 MB/sec    1.00     12.6±0.04ms     3.2 MB/sec
linter/all-rules/numpy/ctypeslib.py        1.01      3.3±0.02ms     5.1 MB/sec    1.00      3.2±0.01ms     5.2 MB/sec
linter/all-rules/numpy/globals.py          1.00    347.7±3.18µs     8.5 MB/sec    1.00    346.2±6.27µs     8.5 MB/sec
linter/all-rules/pydantic/types.py         1.00      5.5±0.04ms     4.7 MB/sec    1.00      5.5±0.02ms     4.6 MB/sec
linter/default-rules/large/dataset.py      1.02      6.7±0.10ms     6.0 MB/sec    1.00      6.6±0.02ms     6.2 MB/sec
linter/default-rules/numpy/ctypeslib.py    1.01  1339.7±13.14µs    12.4 MB/sec    1.00  1330.5±20.62µs    12.5 MB/sec
linter/default-rules/numpy/globals.py      1.01    147.8±2.21µs    20.0 MB/sec    1.00    145.7±1.00µs    20.2 MB/sec
linter/default-rules/pydantic/types.py     1.00      2.9±0.01ms     8.8 MB/sec    1.00      2.9±0.01ms     8.8 MB/sec

@charliermarsh charliermarsh force-pushed the charlie/exemption-parser branch 2 times, most recently from 3c26030 to 92f34e5 Compare July 6, 2023 20:16
MichaReiser (Member) left a comment:

Neat. As mentioned in the other PR, an alternative would have been to write a more "traditional" lexer that returns a sequence of tokens (with their ranges) instead. See SimpleTokenizer for an example. But this works too.
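For context, a minimal sketch of the shape such a token-returning lexer might take (the `TokenKind`, `Token`, and `lex` names here are hypothetical, not the actual SimpleTokenizer API):

use std::ops::Range;

// Hypothetical token kinds for lexing a `noqa` comment line.
#[derive(Debug, PartialEq)]
enum TokenKind {
    Hash,  // `#`
    Colon, // `:`
    Comma, // `,`
    Word,  // `flake8`, `ruff`, `noqa`, or a rule code like `F401`
}

// Each token carries its byte range in the source line.
#[derive(Debug)]
struct Token {
    kind: TokenKind,
    range: Range<usize>,
}

// A tiny whitespace-skipping lexer over a single comment line; a small
// parser would then consume the resulting token sequence.
fn lex(line: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut iter = line.char_indices().peekable();
    while let Some(&(start, c)) = iter.peek() {
        if c.is_whitespace() {
            iter.next();
        } else if let Some(kind) = match c {
            '#' => Some(TokenKind::Hash),
            ':' => Some(TokenKind::Colon),
            ',' => Some(TokenKind::Comma),
            _ => None,
        } {
            iter.next();
            tokens.push(Token { kind, range: start..start + c.len_utf8() });
        } else {
            // Consume a run of word characters up to whitespace or punctuation.
            let mut end = start + c.len_utf8();
            iter.next();
            while let Some(&(i, c)) = iter.peek() {
                if c.is_whitespace() || matches!(c, '#' | ':' | ',') {
                    break;
                }
                end = i + c.len_utf8();
                iter.next();
            }
            tokens.push(Token { kind: TokenKind::Word, range: start..end });
        }
    }
    tokens
}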

@@ -256,38 +257,148 @@ enum ParsedFileExemption<'a> {
impl<'a> ParsedFileExemption<'a> {
/// Return a [`ParsedFileExemption`] for a given comment line.
fn try_extract(line: &'a str) -> Option<Self> {
-    let line = line.trim_whitespace_start();
+    let line = ParsedFileExemption::lex_whitespace(line);
MichaReiser (Member):
nit: `ParsedFileExemption` -> `Self`

let mut chars = line.chars();
if chars
.next()
.map_or(false, |c| c.to_ascii_lowercase() == 'n')
MichaReiser (Member):
I'm surprised there isn't a better way to write this, but at least I found the issue for the missing method: rust-lang/rfcs#2566
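(As an aside, not from the PR: one workaround with only std methods is to compare a fixed-length prefix case-insensitively via `str::eq_ignore_ascii_case`; the helper name below is hypothetical.)

// Sketch: case-insensitive check for a `noqa` prefix using std only.
// `get(..4)` returns `None` if the line is shorter than four bytes (or if
// the boundary would split a multi-byte character), so this never panics.
fn starts_with_noqa(line: &str) -> bool {
    line.get(..4)
        .map_or(false, |prefix| prefix.eq_ignore_ascii_case("noqa"))
}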

MichaReiser (Member) commented Jul 7, 2023:

We could match on the byte slice, because we only compare against single-byte characters. But you then still need to maintain the invariant, which is annoying?

// `as_bytes()` (rather than `bytes()`) yields a `&[u8]`, which supports slice patterns.
match line.as_bytes() {
    [b'n' | b'N', b'o' | b'O', b'q' | b'Q', b'a' | b'A'] => tada,
    _ => nope,
}

MichaReiser (Member):

The problem is that we want to select the string after "noqa", and converting back from bytes isn't safe.

Which reminds me, would `line.strip_prefix("noqa")` work?

charliermarsh (Author):

I don't think `line.strip_prefix("noqa")` matches case-insensitively, right?

MichaReiser (Member):

> The problem is that we want to select the string after "noqa" and converting back from bytes isn't safe.

I think it would be safe here, because we know that all variations of noqa have an exact length of 4 bytes.

Anyway, I think the current code is fine. It's verbose, but it does the job.
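(To make the safety argument concrete, a hedged sketch; the helper name `strip_noqa` is hypothetical. Because a successful match proves the first four bytes are ASCII, byte offset 4 is a valid char boundary, and the remainder can be sliced from the original `&str` without any bytes-to-str conversion.)

// Sketch: matching the first four bytes and slicing off the rest is safe
// here, since a match guarantees those bytes are ASCII.
fn strip_noqa(line: &str) -> Option<&str> {
    match line.as_bytes() {
        [b'n' | b'N', b'o' | b'O', b'q' | b'Q', b'a' | b'A', ..] => Some(&line[4..]),
        _ => None,
    }
}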

charliermarsh (Author):

I like this, I changed it.

konstin (Member) commented Jul 7, 2023

How did you compute those ns timings?

charliermarsh (Author) commented Jul 7, 2023

> How did you compute those ns timings?

I use `cargo bench` in the `ruff` crate. I add something like:

use criterion::{black_box, criterion_group, criterion_main, BenchmarkId, Criterion};

use ruff::noqa::ParsedFileExemption;

pub fn directive_benchmark(c: &mut Criterion) {
    let mut group = c.benchmark_group("Directive");
    for i in [
        "# noqa: F401",
        "# noqa: F401, F841",
        "# flake8: noqa: F401, F841",
        "# ruff: noqa: F401, F841",
        "# flake8: noqa",
        "# ruff: noqa",
        "# noqa",
        "# type: ignore # noqa: E501",
        "# type: ignore # nosec",
        "# some very long comment that # is interspersed with characters but # no directive",
    ]
    .iter()
    {
        group.bench_with_input(BenchmarkId::new("Regex", i), i, |b, _i| {
            b.iter(|| ParsedFileExemption::try_regex(black_box(i)))
        });
        group.bench_with_input(BenchmarkId::new("Lexer", i), i, |b, _i| {
            b.iter(|| ParsedFileExemption::try_extract(black_box(i)))
        });
    }
    group.finish();
}

criterion_group!(benches, directive_benchmark);
criterion_main!(benches);

That goes in `crates/ruff/benches/benchmark.rs`. Then add:

[[bench]]
name = "benchmark"
harness = false

To `crates/ruff/Cargo.toml`, along with `criterion = "0.5.1"`.

Then `cargo bench` in `crates/ruff` gives you benchmarks!

@charliermarsh charliermarsh force-pushed the charlie/exemption-parser branch from 92f34e5 to 5f78342 Compare July 7, 2023 15:34
@charliermarsh charliermarsh enabled auto-merge (squash) July 7, 2023 15:34
@charliermarsh charliermarsh merged commit 5640c31 into main Jul 7, 2023
@charliermarsh charliermarsh deleted the charlie/exemption-parser branch July 7, 2023 15:41
charliermarsh added a commit that referenced this pull request Jul 11, 2023
## Summary

Similar to #5567, we can remove the use of regex, plus simplify the
representation (use `Option`), add snapshot tests, etc.

This is about 100x faster than using a regex for cases that match (2.5ns
vs. 250ns). It's obviously not a hot path, but I prefer the consistency
with other similar comment-parsing. I may DRY these up into some common
functionality later on.
konstin pushed a commit that referenced this pull request Jul 19, 2023