Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf(linter): apply small file optimization, up to 30% faster #6600

Merged

Conversation

camchenry
Copy link
Contributor

@camchenry camchenry commented Oct 15, 2024

Theory: iterating over the rules three times has slightly worse cache locality, because the prior iterations have pushed rule out of the cache by the time we iterate over it again. By iterating over each rule only once, we improve cache performance (hopefully). We also don't need to collect rules to a Vec, so it saves some CPU/memory there too.

In practice: the behavior here actually depends on the number of AST nodes that are in the program. If the number of nodes is large, then it's better to iterate over the nodes only once and iterate the rules multiple times. But if the number of nodes is small, then it's better to iterate over nodes multiple times and only iterate over the rules once. See this comment for more context: #6600 (comment), as well as the comment inside the PR: https://github.com/oxc-project/oxc/pull/6600/files#diff-207225884c5e031ffd802bb99e4fbacbd8364b1343a1cec5485bf50f29186300R131-R143.

In practice, this can make linting a file 1-45% faster, depending on the size of the file, number of AST nodes, number of files, CPU cache size, etc. To accommodate large and small files better, we have an explicit threshold of 200,000 AST nodes, which is an arbitrary number picked based on some benchmarks on my laptop. For large files, the linter behavior doesn't change. For small files, we switch to iterating over nodes in the inner loop and iterating over rules once in the outer loop.

@github-actions github-actions bot added A-linter Area - Linter C-performance Category - Solution not expected to change functional behavior, only performance labels Oct 15, 2024
Copy link
Contributor Author

camchenry commented Oct 15, 2024

This stack of pull requests is managed by Graphite. Learn more about stacking.

Join @camchenry and the rest of your teammates on Graphite Graphite

Copy link

codspeed-hq bot commented Oct 15, 2024

CodSpeed Performance Report

Merging #6600 will improve performances by 35.14%

Comparing 10-15-perf_linter_iterate_over_each_rule_only_once (8387bac) with main (619d06f)

Summary

⚡ 2 improvements
✅ 28 untouched benchmarks

Benchmarks breakdown

Benchmark main 10-15-perf_linter_iterate_over_each_rule_only_once Change
linter[RadixUIAdoptionSection.jsx] 3.2 ms 2.4 ms +35.14%
linter[cal.com.tsx] 1.3 s 1 s +27.54%

@camchenry camchenry force-pushed the 10-15-perf_linter_iterate_over_each_rule_only_once branch 2 times, most recently from 4206d0c to 06000dc Compare October 15, 2024 17:03
@camchenry camchenry changed the base branch from 10-15-perf_linter_no-unused-vars_do_not_construct_regex_for_default_ignore_pattern to main October 15, 2024 17:03
@overlookmotel overlookmotel self-requested a review October 15, 2024 17:17
@overlookmotel
Copy link
Contributor

Benchmark results are wild! I'll take a look at this tomorrow.

@camchenry camchenry force-pushed the 10-15-perf_linter_iterate_over_each_rule_only_once branch from 06000dc to 5faeea2 Compare October 15, 2024 17:34
@overlookmotel
Copy link
Contributor

overlookmotel commented Oct 15, 2024

This is rather mysterious! Personally I think unlikely this effect is due to cache locality. I could be wrong, but it seems unlikely to me that'd have such a large effect. More likely it's just doing a lot less work turning the loops this way around.

The big question is: what's the big difference between cal.com.tsx and checker.ts?

I suggest running before/after with a proper flamegraph (better than what CodSpeed gives) and see what that reveals.

Also, you might want to remove the intermediate Vec of rules and see if that makes any difference. Just loop over the iterator instead of .collect::<Vec<_>>() first.

@camchenry
Copy link
Contributor Author

Also, you might want to remove the intermediate Vec of rules and see if that makes any difference. Just loop over the iterator instead of .collect::<Vec<_>>() first.

I did try this first, I had just re-added it to see if it made any difference. It didn't improve the regression, so I'm going to leave it out.

This is rather mysterious! Personally I think unlikely this effect is due to cache locality. I could be wrong, but it seems unlikely to me that'd have such a large effect. More likely it's just doing a lot less work turning the loops this way around.

The big question is: what's the big difference between cal.com.tsx and checker.ts?

I suggest running before/after with a proper flamegraph (better than what CodSpeed gives) and see what that reveals.

Yeah, I'm going to work on profiling this to see what's up.

@camchenry camchenry force-pushed the 10-15-perf_linter_iterate_over_each_rule_only_once branch 2 times, most recently from 8601a72 to eab9176 Compare October 16, 2024 01:17
@camchenry
Copy link
Contributor Author

Samply didn't help too much in profiling, I'm still a bit confused. The only thing that sticks out is this branch doesn't have a big call to bumpalo::dealloc for some reason. But I feel like the debug symbols are not quite attributed correctly and I wonder if that contributes to some of the weird behavior.

main:
Screenshot 2024-10-15 at 11 19 27 PM

This branch:
Screenshot 2024-10-15 at 11 18 25 PM

The call tree shows some strange differences like oxc_parser::js::module::::parse_export_named_declaration dropping out from the top calls, but that seems unrelated?

main:
Screenshot 2024-10-15 at 11 28 13 PM

This branch:
Screenshot 2024-10-15 at 11 28 04 PM

@Boshen
Copy link
Member

Boshen commented Oct 16, 2024

This is a benchmark fallacy where all rules are turn on.

For the two loops, that's 420 rules * total number of AST nodes, so performance will obviously be different for different file sizes.

For real usages, it's around 100 rules so performance will be different as well.

@camchenry
Copy link
Contributor Author

This is a benchmark fallacy where all rules are turn on.

For the two loops, that's 420 rules * total number of AST nodes, so performance will obviously be different for different file sizes.

For real usages, it's around 100 rules so performance will be different as well.

Yeah, the number on codspeed is much larger than what I am seeing locally in more realistic scenarios. But still when I do -W all with ~100 rules, I do see improvements of ~5-10% which are consistent across runs and different files.

I'm still not sure why checker.ts here has such a large regression in CodSpeed here. I have never seen it slow down when running locally.

@Boshen
Copy link
Member

Boshen commented Oct 16, 2024

What about applying a strategy where we switch the loop when the total number of ast nodes is smaller than a threshold?

@overlookmotel
Copy link
Contributor

overlookmotel commented Oct 16, 2024

I don't see any good reason why parse_export_named_declaration should be dropping anything. That's worth investigating in its own right. I'd suggest trying to resolve that first, and see if it solves the wacky benchmark results. It's hard to determine for sure if this change is good or not when we have this oddity in the mix.

Maybe #6623 will make that go away, but if it does, that'd probably indicate another problem! It would suggest we are storing a large amount of temporary data in the arena, which is an anti-pattern.

@camchenry camchenry force-pushed the 10-15-perf_linter_iterate_over_each_rule_only_once branch from eab9176 to 90a93af Compare October 17, 2024 11:10
@camchenry
Copy link
Contributor Author

camchenry commented Oct 17, 2024

I'm not in a rush to merge this, so I'm going to continue investigating as best as I can. However, the performance gains here are definitely real and very statistically significant. With default settings used for oxlint, I see ~5-10% performance improvement on my M1 laptop. With all rules/plugins enabled, the performance improvement goes up to ~22%.

hyperfine -i -N --warmup 10 --runs 100 './oxlint-rules-iter --silent ./checker.ts' './oxlint-main --silent ./checker.ts'

Benchmark 1: ./oxlint-rules-iter --silent ./checker.ts
  Time (mean ± σ):     172.7 ms ±   0.8 ms    [User: 162.4 ms, System: 9.2 ms]
  Range (min … max):   171.0 ms … 174.5 ms    100 runs

Benchmark 2: ./oxlint-main --silent ./checker.ts
  Time (mean ± σ):     192.0 ms ±   0.6 ms    [User: 181.8 ms, System: 9.1 ms]
  Range (min … max):   188.7 ms … 193.5 ms    100 runs

Summary
  ./oxlint-rules-iter --silent ./checker.ts ran
    1.11 ± 0.01 times faster than ./oxlint-main --silent ./checker.ts
hyperfine -i -N --warmup 10 --runs 100 './oxlint-rules-iter --deny=all --import-plugin --jsdoc-plugin --jest-plugin --vitest-plugin --jsx-a11y-plugin --nextjs-plugin --react-perf-plugin --promise-plugin --node-plugin --threads 1 --silent ./checker.ts' './oxlint-main --deny=all --import-plugin --jsdoc-plugin --jest-plugin --vitest-plugin --jsx-a11y-plugin --nextjs-plugin --react-perf-plugin --promise-plugin --node-plugin --threads 1 --silent ./checker.ts'

Benchmark 1: ./oxlint-rules-iter --deny=all --import-plugin --jsdoc-plugin --jest-plugin --vitest-plugin --jsx-a11y-plugin --nextjs-plugin --react-perf-plugin --promise-plugin --node-plugin --threads 1 --silent ./checker.ts
  Time (mean ± σ):     612.6 ms ±   2.0 ms    [User: 596.0 ms, System: 15.3 ms]
  Range (min … max):   608.1 ms … 617.4 ms    100 runs

  Warning: Ignoring non-zero exit code.

Benchmark 2: ./oxlint-main --deny=all --import-plugin --jsdoc-plugin --jest-plugin --vitest-plugin --jsx-a11y-plugin --nextjs-plugin --react-perf-plugin --promise-plugin --node-plugin --threads 1 --silent ./checker.ts
  Time (mean ± σ):     753.6 ms ±   6.7 ms    [User: 735.7 ms, System: 16.3 ms]
  Range (min … max):   733.5 ms … 795.0 ms    100 runs

  Warning: Ignoring non-zero exit code.
  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  ./oxlint-rules-iter --deny=all --import-plugin --jsdoc-plugin --jest-plugin --vitest-plugin --jsx-a11y-plugin --nextjs-plugin --react-perf-plugin --promise-plugin --node-plugin --threads 1 --silent ./checker.ts ran
    1.23 ± 0.01 times faster than ./oxlint-main --deny=all --import-plugin --jsdoc-plugin --jest-plugin --vitest-plugin --jsx-a11y-plugin --nextjs-plugin --react-perf-plugin --promise-plugin --node-plugin --threads 1 --silent ./checker.ts

I did some debugging in Xcode and did find that this new version actually has more L1 data cache misses than the code on main which is interesting, but doesn't explain the performance difference quite yet. The main thing that seems to explain the performance is that this branch simply executes fewer cycles.

@camchenry camchenry force-pushed the 10-15-perf_linter_iterate_over_each_rule_only_once branch from 90a93af to 14ada18 Compare October 17, 2024 12:31
@camchenry
Copy link
Contributor Author

So I noticed an interesting pattern here between main and this PR. In main, there are a bunch of L1 data cache misses at the beginning, which then go away which I assume means that we are using the cache a lot more for the remainder of the process. There are also ~2M L1 cache misses.

image

However, for this PR, the L1 data cache misses are evenly distributed over the duration of the process, and there are ~20x more cache misses with ~45M L1 cache misses:

image

For some reason, ~35M of these cache misses are being attributed to the jest/valid-expect rule:
image

And specifically, it looks like it's a lot of time being spent getting these possible jest nodes:
image

So I'm wondering if by changing the data access pattern here we've actually made the caching worse because now the Jest nodes are constantly getting pushed out of the cache and then put back into the cache. It might be worth working on #6038 and then revisiting this to see if it has improved. It's possible that the performance regression in CodSpeed could maybe be explained by different caching behavior versus my M1 laptop?

@camchenry camchenry force-pushed the 10-15-perf_linter_iterate_over_each_rule_only_once branch 5 times, most recently from b3c3c2f to 318bf0e Compare October 21, 2024 16:53
@camchenry camchenry force-pushed the 10-15-perf_linter_iterate_over_each_rule_only_once branch 2 times, most recently from a8e6348 to 41980c3 Compare October 21, 2024 21:12
@camchenry
Copy link
Contributor Author

I think I finally figured this one out. Re-profiling with the latest code changes shows a much healthier usage of L1 cache, where we have a bunch of misses initially but it falls off to a smaller amount.

image

I think the reason why we were encountering issues in the checker.ts benchmark file is because checker.ts is the largest of the three benchmarked files by far. That means checker.ts also has the most AstNode to iterate over in semantic.nodes().iter(). When the number of nodes is relatively small (<200K), the nodes seem more able to fit in the cache with each iteration over the rules. However, when the number of nodes is large, it is better to have rules in the inner loop so that the nodes stay in the cache. Otherwise, the number of nodes completely negates the benefit of only having to iterate over the rules once.

So I've taken @Boshen's suggestion from earlier to have a threshold for when we apply this strategy. This makes this essentially a "small file" optimization where smaller files are linted in a slightly different order but large files are unaffected. This does complicate the linter running code, but I believe it's worth the benefit. Benchmarking shows this can speed up the linter by 10-40%, depending on the number of rules and plugins, but generally the higher the number of rules the better this performs. I think the extra complication here is well worth the slightly duplicated implementations.

@camchenry camchenry marked this pull request as ready for review October 21, 2024 22:13
@overlookmotel
Copy link
Contributor

Great that the perf drop on checker.ts has gone. Spectacular!

Out of interest, how many nodes do the various benchmarks have? Clearly checker.ts has more than 200,000. What about the others?

But... (and I'm sorry to be the guy to piss on your parade when you've achieved such a stellar result)... are you sure you're not over-fitting the data? i.e. optimizing for our specific benchmarks, rather than the general case?

And did you ever get to the bottom of what's going on with the expensive drop calls in parse_export_named_declaration?

Just to be clear... despite me asking questions, what I'm not questioning is that this is excellent work. Bravo!

@overlookmotel
Copy link
Contributor

I think the extra complication here is well worth the slightly duplicated implementations.

Totally agree. It's not a huge complication, and you've made a comment which clearly explains the rationale.

@camchenry
Copy link
Contributor Author

Out of interest, how many nodes do the various benchmarks have? Clearly checker.ts has more than 200,000. What about the others?

Good question, the AST node counts are as follows:

  • checker.ts: 303,463 nodes
  • RadixUIAdoptionSection.jsx: 392 nodes
  • cal.com.tsx: 153,029 nodes

And for some other targets:

  • target/antd.js: 571,672 nodes
  • target/pdf.mjs: 108,379 nodes

But... (and I'm sorry to be the guy to piss on your parade when you've achieved such a stellar result)... are you sure you're not over-fitting the data? i.e. optimizing for our specific benchmarks, rather than the general case?

Not at all, and thank you for your help on this as well! I had written out a much longer response but I accidentally closed the tab and deleted it 😅 It's definitely possible that I've overfitted this to these particular benchmarks and to my personal machine. I've tried to mitigate this somewhat by running this against not just single files but larger codebases that have files of many different shapes and sizes. But I haven't tried this on other computers. We should continue to tweak this threshold.

Even still, I am able to replicate the CodSpeed results across whole codebases not just single files. For example:

Linting supabase (3,748 files) is only ~3% faster:

Screenshot 2024-10-21 at 6 37 45 PM

Linting gitlab (12,598 files) is ~16% faster:

Screenshot 2024-10-21 at 6 43 02 PM

Linting vscode (5,107 files) is ~33% faster:

Screenshot 2024-10-21 at 5 51 43 PM

@camchenry
Copy link
Contributor Author

And did you ever get to the bottom of what's going on with the expensive drop calls in parse_export_named_declaration?

I might have misspoken earlier, when I said "dropped out" I think I just meant it didn't appear in the profiler call tree. But I think there was some strange debug symbol attribution going on there, I didn't see this happening again when profiling in Xcode Instruments. I didn't see anything that pointed to this method being a big issue currently.

@camchenry camchenry force-pushed the 10-15-perf_linter_iterate_over_each_rule_only_once branch from 41980c3 to f5e1426 Compare October 21, 2024 22:58
@overlookmotel
Copy link
Contributor

Thanks very much for giving the real-world results on various codebases. That's pretty convincing.

Perhaps we could finesse the 200,000 "tipping point", but it seems to me it's clearly good enough. In the real world who (apart from the maniac authors of TS) writes code in files anywhere near as large as that? We might encounter files that big in transformer - large libraries pre-bundled into a single file in node_modules. But that's not relevant for the linter - you don't lint your node_modules folder!

I might have misspoken earlier, when I said "dropped out" I think I just meant it didn't appear in the profiler call tree. But I think there was some strange debug symbol attribution going on there, I didn't see this happening again when profiling in Xcode Instruments. I didn't see anything that pointed to this method being a big issue currently.

Can I ask you a favour? Would you mind just checking if benchmark result on checker.ts is still the same as it was before if you disable the "large file" branch, so checker.ts goes through same branch as the rest? I think very likely your cache lines explanation of what's going on is correct, but I would just like to double-check that the drop calls were not significant. It's possible that it was significant, but was solved by #6623, and that would be another explanation of why it's now disappeared from the flamegraph after that PR was merged.

Beyond that, LGTM!

@camchenry
Copy link
Contributor Author

Can I ask you a favour? Would you mind just checking if benchmark result on checker.ts is still the same as it was before if you disable the "large file" branch, so checker.ts goes through same branch as the rest? I think very likely your cache lines explanation of what's going on is correct, but I would just like to double-check that the drop calls were not significant. It's possible that it was significant, but was solved by #6623, and that would be another explanation of why it's now disappeared from the flamegraph after that PR was merged.

@overlookmotel I think I did benchmark this on main after #6623 was merged (and rebased this PR on top of it). Just want to make sure I understand what you're asking. Should I comment out the large branch case and benchmark it, using the latest changes from main? Or did you want me to benchmark against an older commit?

@camchenry camchenry force-pushed the 10-15-perf_linter_iterate_over_each_rule_only_once branch from f5e1426 to bdf6007 Compare October 22, 2024 01:22
@Boshen
Copy link
Member

Boshen commented Oct 22, 2024

The codspeed benchmark has all the benchmark turned on. Is your hyperfine benchmark running all rules or just the default? I would optimize for the default case, where most people run ~100 rules, not 430.

@camchenry
Copy link
Contributor Author

The codspeed benchmark has all the benchmark turned on. Is your hyperfine benchmark running all rules or just the default? I would optimize for the default case, where most people run ~100 rules, not 430.

This benchmarking was done with the default rules and plugins, so it was 98 rules in total.

@Boshen
Copy link
Member

Boshen commented Oct 22, 2024

I'm super happy with the results, eager to merge!

@Boshen Boshen changed the title perf(linter): iterate over each rule only once perf(linter): apply small file optimization, up to 30% faster Oct 22, 2024
@Boshen Boshen added the 0-merge Merge with Graphite Merge Queue label Oct 22, 2024
Copy link
Member

Boshen commented Oct 22, 2024

Merge activity

  • Oct 21, 10:23 PM EDT: The merge label '0-merge' was detected. This PR will be added to the Graphite merge queue once it meets the requirements.
  • Oct 21, 10:24 PM EDT: A user added this pull request to the Graphite merge queue.
  • Oct 21, 10:32 PM EDT: A user merged this pull request with the Graphite merge queue.

Theory: iterating over the rules three times has slightly worse cache locality, because the prior iterations have pushed `rule` out of the cache by the time we iterate over it again. By iterating over each rule only once, we improve cache performance (hopefully). We also don't need to collect rules to a Vec, so it saves some CPU/memory there too.

In practice: the behavior here actually depends on the number of AST nodes that are in the program. If the number of nodes is large, then it's better to iterate over the nodes only once and iterate the rules multiple times. But if the number of nodes is small, then it's better to iterate over nodes multiple times and only iterate over the rules once. See this comment for more context: #6600 (comment), as well as the comment inside the PR: https://github.com/oxc-project/oxc/pull/6600/files#diff-207225884c5e031ffd802bb99e4fbacbd8364b1343a1cec5485bf50f29186300R131-R143.

In practice, this can make linting a file 1-45% faster, depending on the size of the file, number of AST nodes, number of files, CPU cache size, etc. To accommodate large and small files better, we have an explicit threshold of 200,000 AST nodes, which is an arbitrary number picked based on some benchmarks on my laptop. For large files, the linter behavior doesn't change. For small files, we switch to iterating over nodes in the inner loop and iterating over rules once in the outer loop.
@Boshen Boshen force-pushed the 10-15-perf_linter_iterate_over_each_rule_only_once branch from bdf6007 to 8387bac Compare October 22, 2024 02:24
@graphite-app graphite-app bot merged commit 8387bac into main Oct 22, 2024
26 checks passed
@graphite-app graphite-app bot deleted the 10-15-perf_linter_iterate_over_each_rule_only_once branch October 22, 2024 02:32
Boshen added a commit that referenced this pull request Oct 22, 2024
## [0.10.2] - 2024-10-22

### Features

- dbe1972 linter: Import/no-cycle should turn on ignore_types by default
(#6761) (Boshen)
- 619d06f linter: Fix suggestion for `eslint:no_empty_static_block` rule
(#6732) (Tapan Prakash)

### Bug Fixes


### Performance

- 8387bac linter: Apply small file optimization, up to 30% faster
(#6600) (camchenry)

### Refactor

- b884577 linter: All ast_util functions take Semantic (#6753)
(DonIsaac)
- 744aa74 linter: Impl `Deref<Target = Semantic>` for `LintContext`
(#6752) (DonIsaac)
- 6ffdcc0 oxlint: Lint/mod.rs -> lint.rs (#6746) (Boshen)

### Testing

- b03cec6 oxlint: Add `--fix` test case (#6747) (Boshen)

---------

Co-authored-by: Boshen <1430279+Boshen@users.noreply.github.com>
Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0-merge Merge with Graphite Merge Queue A-linter Area - Linter C-performance Category - Solution not expected to change functional behavior, only performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants