Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce ChunkedBitSet and use it for some dataflow analyses. #93984

Merged
merged 1 commit into from
Feb 23, 2022

Conversation

nnethercote
Copy link
Contributor

This reduces peak memory usage significantly for some programs with very
large functions.

r? @ghost

@rustbot rustbot added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Feb 14, 2022
@nnethercote
Copy link
Contributor Author

This will be a complex one to evaluate, because there are various contradictory metrics results, e.g. some improving, some worsening. Anyway, let's get a perf run started to serve as the basis for any discussion:

@bors try @rust-timer queue

@rust-timer
Copy link
Collaborator

Awaiting bors try build completion.

@rustbot label: +S-waiting-on-perf

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Feb 14, 2022
@nnethercote nnethercote marked this pull request as draft February 14, 2022 06:25
@nnethercote
Copy link
Contributor Author

@bors try

@bors
Copy link
Contributor

bors commented Feb 14, 2022

⌛ Trying commit fd9228607b35e1094193fe9e423e356544b527f2 with merge 9eec9616e3c91ff7231a3fadff261984e83f251f...

@bors
Copy link
Contributor

bors commented Feb 14, 2022

☀️ Try build successful - checks-actions
Build commit: 9eec9616e3c91ff7231a3fadff261984e83f251f (9eec9616e3c91ff7231a3fadff261984e83f251f)

@rust-timer
Copy link
Collaborator

Queued 9eec9616e3c91ff7231a3fadff261984e83f251f with parent 52dd59e, future comparison URL.

@rust-timer
Copy link
Collaborator

Finished benchmarking commit (9eec9616e3c91ff7231a3fadff261984e83f251f): comparison url.

Summary: This benchmark run shows 115 relevant regressions 😿 to instruction counts.

  • Average relevant regression: 1.1%
  • Largest regression in instruction counts: 5.5% on full builds of clap-rs check

If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR led to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please fix the regressions and do another perf run. If the next run shows neutral or positive results, the label will be automatically removed.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf +perf-regression

@rustbot rustbot added perf-regression Performance regression. S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-perf Status: Waiting on a perf run to be completed. labels Feb 15, 2022
@nnethercote
Copy link
Contributor Author

nnethercote commented Feb 15, 2022

If you look at instruction counts, it looks bad. If you look at every other metric, it looks good. One of the cases where instruction counts it misleading, I think.

A big part of this: x86-64 has "string processing" instructions like REPE MOVS. They effectively implement a loop, and things like memcpy can be implemented with them. A single string processing instruction is counted as one instruction by hardware counters, which perf uses. But Cachegrind counts them as N instructions, where N is the number of repetitions. (See section 4.1 of this paper for details.) And this commit gets rid of a lot of memcpy calls within iterate_to_fixpoint. If we went by Cachegrind's instruction counts, rather than perf's instruction counts, things would look substantially better.

@nnethercote
Copy link
Contributor Author

This also improves the http crate quite a bit, though not as much as keccak. Plus I have some draft code locally that uses Rc within chunks that reduces memory usage quite a bit more, mostly on keccak.

@nnethercote
Copy link
Contributor Author

BTW, this will fix #54208, if it ends up being merged.

@Mark-Simulacrum Mark-Simulacrum self-assigned this Feb 15, 2022
@Mark-Simulacrum
Copy link
Member

Not sure if draft status is intended to reflect some unreadiness, but assigning myself to at least take an initial look.

@nnethercote
Copy link
Contributor Author

Not sure if draft status is intended to reflect some unreadiness, but assigning myself to at least take an initial look.

Thanks! The code quality isn't great right now, there are numerous "njn:" comments indicating things that need to be fixed, and the Rc addition isn't uploaded yet. But I'm happy to hear early feedback :)

@nnethercote
Copy link
Contributor Author

I have updated the code. The second commit contains the use of Rc.

@bors try @rust-timer queue

@rust-timer
Copy link
Collaborator

Awaiting bors try build completion.

@rustbot label: +S-waiting-on-perf

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Feb 15, 2022
@bors
Copy link
Contributor

bors commented Feb 15, 2022

⌛ Trying commit 3ba04f6462ebd682fb50c5b7123c50e0ee748f99 with merge 5a267675a45b7dcaef6afd739f55f3b207eaba3b...

Copy link
Member

@Mark-Simulacrum Mark-Simulacrum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left high-level commentary on data structures for now, didn't look at usage/impls.

compiler/rustc_index/src/bit_set.rs Outdated Show resolved Hide resolved
compiler/rustc_index/src/bit_set.rs Outdated Show resolved Hide resolved
compiler/rustc_index/src/bit_set.rs Outdated Show resolved Hide resolved
@bors
Copy link
Contributor

bors commented Feb 15, 2022

☀️ Try build successful - checks-actions
Build commit: 5a267675a45b7dcaef6afd739f55f3b207eaba3b (5a267675a45b7dcaef6afd739f55f3b207eaba3b)

@rust-timer
Copy link
Collaborator

Queued 5a267675a45b7dcaef6afd739f55f3b207eaba3b with parent 6655109, future comparison URL.

@rust-timer
Copy link
Collaborator

Finished benchmarking commit (5a267675a45b7dcaef6afd739f55f3b207eaba3b): comparison url.

Summary: This benchmark run shows 6 relevant improvements 🎉 but 121 relevant regressions 😿 to instruction counts.

  • Average relevant regression: 1.1%
  • Average relevant improvement: -3.8%
  • Largest improvement in instruction counts: -5.5% on full builds of keccak debug
  • Largest regression in instruction counts: 6.3% on full builds of clap-rs check

If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR led to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please fix the regressions and do another perf run. If the next run shows neutral or positive results, the label will be automatically removed.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf +perf-regression

@bors
Copy link
Contributor

bors commented Feb 18, 2022

☀️ Try build successful - checks-actions
Build commit: 9003e92b6f7298995cc9772da0ec54c403a09a7e (9003e92b6f7298995cc9772da0ec54c403a09a7e)

@rust-timer
Copy link
Collaborator

Queued 9003e92b6f7298995cc9772da0ec54c403a09a7e with parent 73a7423, future comparison URL.

@rust-timer
Copy link
Collaborator

Finished benchmarking commit (9003e92b6f7298995cc9772da0ec54c403a09a7e): comparison url.

Summary: This benchmark run shows 12 relevant improvements 🎉 but 105 relevant regressions 😿 to instruction counts.

  • Average relevant regression: 1.0%
  • Average relevant improvement: -2.8%
  • Largest improvement in instruction counts: -5.4% on full builds of keccak debug
  • Largest regression in instruction counts: 6.1% on full builds of clap-rs check

If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR led to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please fix the regressions and do another perf run. If the next run shows neutral or positive results, the label will be automatically removed.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf +perf-regression

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Feb 18, 2022
@nnethercote
Copy link
Contributor Author

nnethercote commented Feb 18, 2022

Finished benchmarking commit (9003e92): comparison url.

Summary: This benchmark run shows 12 relevant improvements tada but 105 relevant regressions crying_cat_face to instruction counts.

As mentioned here and here, this is a rare case where instruction counts are misleading. Instead look at wall-time and cycles (some big wins along with the usual noise) along with max-rss and faults (some massive wins along with the usual noise).

@rustbot label: +perf-regression-triaged

@rustbot rustbot added the perf-regression-triaged The performance regression has been triaged. label Feb 18, 2022
Copy link
Member

@Mark-Simulacrum Mark-Simulacrum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good overall

compiler/rustc_index/src/bit_set.rs Outdated Show resolved Hide resolved
compiler/rustc_index/src/bit_set.rs Outdated Show resolved Hide resolved
compiler/rustc_index/src/bit_set.rs Outdated Show resolved Hide resolved
compiler/rustc_index/src/bit_set.rs Show resolved Hide resolved
compiler/rustc_index/src/bit_set.rs Outdated Show resolved Hide resolved
@Mark-Simulacrum
Copy link
Member

Mostly nits, r=me if you don't make major changes

@Mark-Simulacrum Mark-Simulacrum added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Feb 22, 2022
This reduces peak memory usage significantly for some programs with very
large functions, such as:
- `keccak`, `unicode_normalization`, and `match-stress-enum`, from
  the `rustc-perf` benchmark suite;
- `http-0.2.6` from crates.io.

The new type is used in the analyses where the bitsets can get huge
(e.g. 10s of thousands of bits): `MaybeInitializedPlaces`,
`MaybeUninitializedPlaces`, and `EverInitializedPlaces`.

Some refactoring was required in `rustc_mir_dataflow`. All existing
analysis domains are either `BitSet` or a trivial wrapper around
`BitSet`, and access in a few places is done via `Borrow<BitSet>` or
`BorrowMut<BitSet>`. Now that some of these domains are `ClusterBitSet`,
that no longer works. So this commit replaces the `Borrow`/`BorrowMut`
usage with a new trait `BitSetExt` containing the needed bitset
operations. The impls just forward these to the underlying bitset type.
This required fiddling with trait bounds in a few places.

The commit also:
- Moves `static_assert_size` from `rustc_data_structures` to
  `rustc_index` so it can be used in the latter; the former now
  re-exports it so existing users are unaffected.
- Factors out some common "clear excess bits in the final word"
  functionality in `bit_set.rs`.
- Uses `fill` in a few places instead of loops.
@nnethercote
Copy link
Contributor Author

@bors r=Mark-Simulacrum

@bors
Copy link
Contributor

bors commented Feb 23, 2022

📌 Commit 36b495f has been approved by Mark-Simulacrum

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Feb 23, 2022
@bors
Copy link
Contributor

bors commented Feb 23, 2022

⌛ Testing commit 36b495f with merge bafe8d0...

@nnethercote
Copy link
Contributor Author

I took some extra measurements on my machine for affected benchmarks:

  • From rustc-perf: cranelift-codegen, inflate, keccak, match-stress-enum, unicode_normalization
  • External: http-0.2.6, rust-language-tags-0.3.2, tinyvec-1.5.1, vte-0.10.1

instructions:u

As mentioned above, these aren't useful for this change, but I've put them here for completeness.

Benchmark & Profile Scenario % Change Significance Factor? Chunk0 Chunk1
match-stress-enum check full 3.56% 17.82x 6206276362.00 6427489110.00
unicode_normalization check full 2.61% 13.05x 4394337716.00 4509070318.00
http check full -2.17% 10.85x 5639733100.00 5517400820.00
inflate check full 1.07% 5.33x 4265448539.00 4310912103.00
keccak check full -1.05% 5.23x 23069621635.00 22828169903.00
cranelift-codegen check full 0.82% 4.08x 14411424274.00 14529074725.00
vte check full -0.73% 3.67x 3260641577.00 3236722513.00
tinyvec check full 0.56% 2.82x 3378759222.00 3397785087.00

cycles

The cycle results for this PR were consistently better on CI than they were for my local builds. Maybe because my local builds don't have PGO? Anyway, these aren't all that reliable.

Benchmark & Profile Scenario % Change Significance Factor? Chunk0 Chunk1
match-stress-enum check full 5.49% 27.47x 2037211439.00 2149120206.00
vte check full -5.06% 25.31x 2192701285.00 2081692906.00
keccak check full -5.05% 25.26x 13616846225.00 12929031896.00
http check full -4.34% 21.71x 4499140807.00 4303815599.00
cranelift-codegen check full 3.07% 15.37x 12658880013.00 13047971256.00
rust-language-tags check full -2.57% 12.85x 2098872505.00 2044922945.00
unicode_normalization check full -1.91% 9.53x 2663754160.00 2613007641.00

wall-time

Same story for wall-time as for cycles.

Benchmark & Profile Scenario % Change Significance Factor? Chunk0 Chunk1
match-stress-enum check full 12.25% 61.26x 0.59 0.67
http check full -8.70% 43.51x 1.26 1.15
keccak check full -7.74% 38.69x 3.56 3.29
vte check full -7.53% 37.67x 0.72 0.67
unicode_normalization check full 4.79% 23.95x 0.84 0.88
inflate check full 1.64% 8.20x 0.63 0.64
cranelift-codegen check full 1.16% 5.80x 3.28 3.31
rust-language-tags check full -0.99% 4.95x 0.68 0.68
tinyvec check full -0.95% 4.75x 0.62 0.62

max-rss

Huge wins here.

Benchmark & Profile Scenario % Change Significance Factor? Chunk0 Chunk1
keccak check full -59.05% 295.25x 974100.00 398892.00
http check full -32.11% 160.54x 390080.00 264832.00
vte check full -26.96% 134.78x 256144.00 187096.00
unicode_normalization check full -15.45% 77.27x 264228.00 223396.00
match-stress-enum check full -11.79% 58.94x 162364.00 143224.00
inflate check full -9.10% 45.49x 205048.00 186392.00
rust-language-tags check full -8.39% 41.93x 229596.00 210344.00
tinyvec check full -7.16% 35.79x 163636.00 151924.00
cranelift-codegen check full -4.03% 20.14x 400552.00 384420.00

faults

Huge wins here, too.

Benchmark & Profile Scenario % Change Significance Factor? Chunk0 Chunk1
keccak check full -60.21% 301.06x 240390.00 95647.00
http check full -46.54% 232.69x 68926.00 36849.00
vte check full -46.42% 232.08x 43003.00 23043.00
unicode_normalization check full -27.03% 135.17x 39006.00 28461.00
match-stress-enum check full -24.91% 124.55x 19325.00 14511.00
inflate check full -21.48% 107.39x 23373.00 18353.00
tinyvec check full -18.21% 91.07x 18809.00 15383.00
rust-language-tags check full -15.42% 77.08x 33191.00 28074.00
cranelift-codegen check full -5.83% 29.13x 75575.00 71172.00

@bors
Copy link
Contributor

bors commented Feb 23, 2022

☀️ Test successful - checks-actions
Approved by: Mark-Simulacrum
Pushing bafe8d0 to master...

@bors bors added the merged-by-bors This PR was explicitly merged by bors. label Feb 23, 2022
@bors bors merged commit bafe8d0 into rust-lang:master Feb 23, 2022
@rustbot rustbot added this to the 1.61.0 milestone Feb 23, 2022
@rust-timer
Copy link
Collaborator

Finished benchmarking commit (bafe8d0): comparison url.

Summary: This benchmark run shows 6 relevant improvements 🎉 but 107 relevant regressions 😿 to instruction counts.

  • Average relevant regression: 1.0%
  • Average relevant improvement: -3.8%
  • Largest improvement in instruction counts: -5.3% on full builds of keccak debug
  • Largest regression in instruction counts: 6.0% on full builds of clap-rs check

If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.

Next Steps: If you can justify the regressions found in this perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please open an issue or create a new PR that fixes the regressions, add a comment linking to the newly created issue or PR, and then add the perf-regression-triaged label to this PR.

@rustbot label: +perf-regression

@nnethercote nnethercote deleted the ChunkedBitSet branch February 23, 2022 06:24
@nnethercote
Copy link
Contributor Author

As mentioned here and here, this is a rare case where instruction counts are misleading. Instead look at wall-time and cycles (some big wins along with the usual noise) along with max-rss and faults (some massive wins along with the usual noise).

@rustbot label: +perf-regression-triaged

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
merged-by-bors This PR was explicitly merged by bors. perf-regression Performance regression. perf-regression-triaged The performance regression has been triaged. S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants