
Performance regression in Rust 1.71 #115917

Open · dantheman3333 opened this issue Sep 17, 2023 · 17 comments

Labels: A-mir-opt (Area: MIR optimizations) · A-mir-opt-inlining (Area: MIR inlining) · I-slow (Issue: Problems and improvements with respect to performance of generated code) · T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue) · T-libs (Relevant to the library team, which will review and decide on the PR/issue)

@dantheman3333
Hi, on Linux x86, I'm seeing a 120+ ms regression in wall time with 1.71. It's small, but noticeable when running the CLI application, as the tool is already quite fast.

1.70:

```
frost$ hyperfine -w 5 "./target/release/frost info frost/tests/fixtures/test_large.bag"
Benchmark 1: ./target/release/frost info frost/tests/fixtures/test_large.bag
  Time (mean ± σ):     580.8 ms ±   0.9 ms    [User: 311.6 ms, System: 269.4 ms]
  Range (min … max):   579.7 ms … 582.2 ms    10 runs
```

1.71:

```
Benchmark 1: ./target/release/frost info frost/tests/fixtures/test_large.bag
  Time (mean ± σ):     708.8 ms ±   4.9 ms    [User: 447.1 ms, System: 261.6 ms]
  Range (min … max):   703.3 ms … 716.7 ms    10 runs
```
```
********************************************************************************
Regression in 915aa06700af4a2363639bae70201cd7387470ad
********************************************************************************

searched nightlies: from nightly-2023-01-01 to nightly-2023-09-16
regressed nightly: nightly-2023-04-24
searched commit range: https://github.com/rust-lang/rust/compare/b628260df0587ae559253d8640ecb8738d3de613...7f94b314cead7059a71a265a8b64905ef2511796
regressed commit: https://github.com/rust-lang/rust/commit/915aa06700af4a2363639bae70201cd7387470ad
```

<details>
<summary>bisected with <a href='https://github.com/rust-lang/cargo-bisect-rustc'>cargo-bisect-rustc</a> v0.6.7</summary>

Host triple: x86_64-unknown-linux-gnu
Reproduce with:

```bash
cargo bisect-rustc --script=./bench.sh --start=2023-01-01 --end=2023-09-16 --preserve
```

</details>

The PR in question does have a perf signoff after weighing the trade-off, but I am raising the issue in case further investigation could lead to more optimizations.

The project is pretty messy, but I have steps on how to benchmark here: dantheman3333/frost#33.

A differential flamegraph shows that FlatMap's next has taken a hit. I am unsure whether this is an aggregate view of all the next calls or a specific call site; is it possible to get line numbers from this?

I am unsure how to proceed from here, but I would love to learn if anyone has tips.

@rustbot added the needs-triage (This issue may need triage. Remove it if it has been sufficiently triaged.) label on Sep 17, 2023
@saethlin

> The PR in question does have a perf signoff after weighing the trade-off, but I am raising the issue in case further investigation could lead to more optimizations.

Those perf reports are about how fast the compiler runs, which often has little to do with code generation quality. So perf reports and perf triage are not going to be relevant here. There are now some runtime benchmarks, but very few so far.

I'm looking into your benchmark.

@saethlin

I think the cause of the regression is inlining of calls to std::io::Error::new. If this were my code, I would resolve the perf regression by not using std::io::Error the way you are. It's a very expensive type to construct and to drop. For this codebase, nearly all parsing errors can be reported just as accurately by an enum with unit variants, so I'd be trying to apply that strategy where possible.
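
As a minimal sketch of that strategy (hypothetical variant names, not from the frost codebase): a fieldless enum is only a discriminant to construct, has a trivial Drop, and can still carry the original messages, which are rendered only when the error is reported:

```rust
use std::fmt;

// Hypothetical parse-error enum with unit variants: construction is free
// (no allocation, no Box<dyn Error>), and dropping it is a no-op.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ParseError {
    TruncatedHeader,
    BadMagic,
    InvalidLength,
}

impl fmt::Display for ParseError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // The human-readable message is materialized only at report time.
        f.write_str(match self {
            ParseError::TruncatedHeader => "header was truncated",
            ParseError::BadMagic => "bad magic bytes",
            ParseError::InvalidLength => "invalid length field",
        })
    }
}

impl std::error::Error for ParseError {}
```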

This issue surfaces here because that patch made the MIR inliner slightly more aggressive; not because actually looking at the size of locals was a useful heuristic. -Zmir-inline-threshold=40 improves performance on this benchmark (the default is 50) and we no longer inline std::io::Error::new in the final binary.
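
For anyone reproducing that experiment, the threshold is a nightly-only flag passed through RUSTFLAGS (a sketch; on newer nightlies the flag is spelled -Zinline-mir-threshold, as comes up later in this thread):

```bash
# Nightly-only: lower the MIR inlining threshold from its default of 50 to 40.
RUSTFLAGS="-Zinline-mir-threshold=40" cargo +nightly build --release
```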

Perhaps this indicates it is appropriate to put #[inline(never)] on std::io::Error::new.

@saethlin

Demo of the problematic MIR inlining choice: https://godbolt.org/z/5vseT5oqb

@saethlin added the I-slow, T-compiler, A-mir-opt, A-mir-opt-inlining, and T-libs labels and removed the needs-triage label on Sep 18, 2023
@saethlin

I'm tagging @rust-lang/wg-mir-opt: Does this indicate a need for a new MIR inlining heuristic?
And libs, but really just @thomcc: Do you think it is a good idea to add #[inline(never)] to std::io::Error::new, or do you think that is too heavy-handed?

@thomcc

thomcc commented Sep 18, 2023

I think putting it on `Error::_new` would be more reasonable, but I could be convinced about `Error::new`.

@dantheman3333

@saethlin thank you!

@saethlin

Adding #[inline(never)] to Error::_new would only protect one of the allocation codepaths from being inlined. Why do you think putting it on that function is better?

@thomcc

thomcc commented Sep 18, 2023

Because it's intentionally outlined for that purpose. My concern with making Error::new inline(never) is that it could be monomorphized several times; in many cases, we'd want those copies to be inlined.
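
For context, the shape being described looks roughly like this (a sketch of the pattern, not the actual std source): a generic entry point that stays cheap to monomorphize and inline, funneling into a single non-generic, outlined function that holds the expensive construction code:

```rust
use std::error::Error as StdError;

pub struct Error { /* kind, boxed payload, ... */ }

#[derive(Clone, Copy)]
pub enum ErrorKind { InvalidData /* ... */ }

impl Error {
    // Generic shim: monomorphized once per error type E, so it should stay
    // small and inlinable; it immediately funnels into the non-generic path.
    pub fn new<E>(kind: ErrorKind, error: E) -> Error
    where
        E: Into<Box<dyn StdError + Send + Sync>>,
    {
        Self::_new(kind, error.into())
    }

    // Non-generic and intentionally outlined: exactly one copy of the
    // allocating constructor ends up in the binary. This is the function
    // where #[inline(never)] is least likely to hurt.
    #[inline(never)]
    fn _new(kind: ErrorKind, error: Box<dyn StdError + Send + Sync>) -> Error {
        let _ = (kind, error);
        unimplemented!("sketch only: allocate and store the payload here")
    }
}
```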

@dantheman3333

Hey @saethlin, I just got around to trying to replace all of the expensive errors in the hot loop, but I am still seeing significant differences.

With `RUSTFLAGS="-Zinline-mir-threshold=40" cargo build --release` (has it been renamed from `mir-inline-threshold` in newer nightlies?) and varying the threshold, I am still not seeing any improvement:

```
Benchmark 1: /tmp/frost_170 info frost/tests/fixtures/test_large.bag
  Time (mean ± σ):     606.1 ms ±   2.3 ms    [User: 338.9 ms, System: 267.1 ms]
  Range (min … max):   602.9 ms … 608.9 ms    10 runs

Benchmark 2: /tmp/frost_174 info frost/tests/fixtures/test_large.bag
  Time (mean ± σ):     914.3 ms ±   7.8 ms    [User: 656.0 ms, System: 258.2 ms]
  Range (min … max):   906.9 ms … 926.0 ms    10 runs

Benchmark 3: /tmp/frost_nightly_mir_0 info frost/tests/fixtures/test_large.bag
  Time (mean ± σ):     905.8 ms ±   6.2 ms    [User: 641.9 ms, System: 263.8 ms]
  Range (min … max):   898.8 ms … 915.0 ms    10 runs

Summary
  '/tmp/frost_170 info frost/tests/fixtures/test_large.bag' ran
    1.49 ± 0.01 times faster than '/tmp/frost_nightly_mir_0 info frost/tests/fixtures/test_large.bag'
    1.51 ± 0.01 times faster than '/tmp/frost_174 info frost/tests/fixtures/test_large.bag'
```

Do you have any recommendations on how to proceed from here?

@saethlin

saethlin commented Dec 26, 2023

Your patch still has a huge amount of code associated with error reporting; you just moved it from constructing the error types themselves to eprintln! expansions. But that's just general advice, not particularly pertinent to this issue, because even after deleting all the eprintln! calls I still see the perf difference.

But anyway, I locally deleted them and re-bisected with this --script:

RUSTFLAGS="-Zmir-opt-level=0" cargo b --release --bin frost -Zbuild-std --target=x86_64-unknown-linux-gnu
perf stat -r10 $CARGO_TARGET_DIR/x86_64-unknown-linux-gnu/release/frost info frost/tests/fixtures/test_large.bag 2>&1 | grep "6..\... msec task-clock"

And I found:

```
searched toolchains nightly-2023-04-01 through nightly-2023-12-22

********************************************************************************
Regression in nightly-2023-06-27
********************************************************************************
...
found 8 bors merge commits in the specified range
  commit[0] 2023-06-25: Auto merge of #113037 - TaKO8Ki:rollup-pqfbxwk, r=TaKO8Ki
  commit[1] 2023-06-25: Auto merge of #113038 - matthiaskrgr:rollup-sdcfkxa, r=matthiaskrgr
  commit[2] 2023-06-26: Auto merge of #111850 - the8472:external-step-by, r=scottmcm
  commit[3] 2023-06-26: Auto merge of #112884 - klensy:ri-drop-old-clap, r=albertlarsan68
  commit[4] 2023-06-26: Auto merge of #113024 - Jerrody:master, r=thomcc
  commit[5] 2023-06-26: Auto merge of #113049 - matthiaskrgr:rollup-41wo5w8, r=matthiaskrgr
  commit[6] 2023-06-26: Auto merge of #112887 - WaffleLapkin:become_unuwuable_in_hir, r=compiler-errors,Nilstrieb
  commit[7] 2023-06-26: Auto merge of #113057 - TaKO8Ki:rollup-071lc9g, r=TaKO8Ki
ERROR: no CI builds available between 8084f397c6710e4748994dd1488b1f399283d469 and 36fb58e433c782e27dd41034284e157cf86d587f within last 167 days
```

I can see by eye that the hot code that gets inlined into FlatMap::next is indeed different, though to my untrained eye it just looks like blocks of instructions have been moved around the function. There are 4 rollups, which is really not encouraging, but I'm going to try locally reverting #111850 to see if that moves us back to the faster codegen. But considering my bisection points at code that's 6 months old, it's possible that the revert will be too challenging for me.

@dantheman3333

Hey, very much appreciate you doing another deep dive!

> Your patch still has a huge amount of code associated with error reporting; you just moved it from constructing the error types themselves to eprintln! expansions. But that's just general advice, not particularly pertinent to this issue, because even after deleting all the eprintln! calls I still see the perf difference.

I tried to balance creating enum variants with the log messages, as I was trying to avoid an explosion of variants just to log specific messages (some folks appreciate the error messages, as they were working on their own bag-writing tool, so I wanted to keep the messages). Do you think for high performance there's no way around this, and I just need to continue making more specific variants for each message that I want to keep and eventually print out with fmt::Display?

Thanks for the perf stat cmd, it's much simpler than the original bisection script I used. Let me know how the revert goes 🤞

@saethlin

I think errors should nearly always be rendered late because that moves the code size to error reporting and not error detection, which often has to be inside a hot code path. The goal is to minimize the impact on the happy path.
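
Concretely, "rendering late" means the hot loop only constructs and propagates a cheap discriminant, and the formatting machinery is reached exactly once, at the reporting boundary. A minimal sketch, reusing the hypothetical ParseError enum from the earlier sketch in this thread:

```rust
// Hot path: detection just returns a discriminant; no formatting, no
// allocation, and no eprintln! expansion inlined into the parsing loop.
fn parse_record(buf: &[u8]) -> Result<u64, ParseError> {
    let bytes: [u8; 8] = buf.try_into().map_err(|_| ParseError::TruncatedHeader)?;
    Ok(u64::from_le_bytes(bytes))
}

// Cold path: the error is rendered once, when it is actually reported.
fn report(err: ParseError) {
    eprintln!("parse failed: {err}");
}
```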

I haven't been able to revert, but by checking out the commit directly before that PR I linked and running the benchmarks on that and the merge commit for that PR, I can see that it is the cause of the regression. Which is interesting, considering it was supposed to make things faster. I suppose at this point it would make sense to ask @the8472; do you have any idea why #111850 would have made something slower?

I suspect the answer is no, and in that case the program here needs to be reduced to a minimal program that exhibits slower performance on 1.71 than on 1.70 by exercising the StepBy specialization. Then we could decide if the situation is resolvable in rustc or in LLVM.

@the8472

the8472 commented Dec 27, 2023

> do you have any idea why #111850 would have made something slower?

No.

The PR does several things to StepBy:

- changes the default next() implementation, which also simplifies the call tree
- specializes for unsigned integer ranges; this deepens the call tree in all cases
- implements TrustedLen for integer ranges, which affects other specializations

Other than how they might affect inlining, they should all be improvements.

@saethlin

At least that ticks off a possibility, thanks. Most likely LLVM is just tripping over its own feet here. I'll try reducing this at some point. I have an idea of how to start, but it'll be a slow process to confirm that a reduction still reproduces the important behavior.

@the8472

the8472 commented Dec 27, 2023

Looking through the frost codebase I'm only seeing two uses of step_by. Both have this pattern:

```rust
slice
    .windows(N)
    .step_by(N)
    .flat_map(... -> Result<_, _>)
    .collect::<Vec<_>>()
```

The range specialization is not applicable here, and flat-mapping Result makes TrustedLen not applicable. So it's likely the changed default next() impl.
Previously it called the underlying next() on the first iteration and then nth(N) on each subsequent one. Now it always calls nth() with a computed value. And I only measured that specific change on loops that had a fixed iteration count.

It could be that the computed step size + less predictable loops make things more difficult for LLVM.

Replacing

```rust
default fn spec_next(&mut self) -> Option<I::Item> {
    let step_size = if self.first_take { 0 } else { self.step };
    self.first_take = false;
    self.iter.nth(step_size)
}
```

with the old impl

```rust
default fn spec_next(&mut self) -> Option<I::Item> {
    if self.first_take {
        self.first_take = false;
        self.iter.next()
    } else {
        self.iter.nth(self.step)
    }
}
```

might be worth a try.

If that's not it then it's inlining.

@the8472

the8472 commented Dec 27, 2023

@dantheman3333, btw, you should be able to replace

```rust
slice
    .windows(N)
    .step_by(N)
```

with

```rust
slice
    .chunks_exact(N)
```
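
As a quick sanity check (my own snippet, not from the thread): the two expressions yield the same non-overlapping blocks, because windows(N).step_by(N) only visits the windows starting at offsets 0, N, 2N, … and both drop a trailing partial block:

```rust
fn main() {
    // 25 is deliberately not a multiple of 8: both iterators drop the
    // trailing partial block, so the outputs still match.
    let data: Vec<u8> = (0..25).collect();
    let via_windows: Vec<&[u8]> = data.windows(8).step_by(8).collect();
    let via_chunks: Vec<&[u8]> = data.chunks_exact(8).collect();
    assert_eq!(via_windows, via_chunks);
}
```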

@dantheman3333

```diff
--- a/frost/src/lib.rs
+++ b/frost/src/lib.rs
@@ -844,8 +844,7 @@ fn parse_chunk_info<R: Read + Seek>(
     let data = get_lengthed_bytes(reader)?;
 
     let chunk_info_data: Vec<ChunkInfoData> = data
-        .windows(8)
-        .step_by(8)
+        .chunks_exact(8)
         .flat_map(ChunkInfoData::from)
         .collect();
 
@@ -866,8 +865,7 @@ fn parse_index<R: Read + Seek>(
     let data = get_lengthed_bytes(reader)?;
 
     let index_data: Vec<IndexData> = data
-        .windows(12)
-        .step_by(12)
+        .chunks_exact(12)
         .flat_map(|buf| IndexData::from(buf, chunk_header_pos, index_data_header.connection_id))
         .collect();
```

```
$ hyperfine -w 5 "/tmp/frost_174 info frost/tests/fixtures/test_large.bag" "/tmp/frost_174_chunks info frost/tests/fixtures/test_large.bag"
Benchmark 1: /tmp/frost_174 info frost/tests/fixtures/test_large.bag
  Time (mean ± σ):     902.8 ms ±   7.5 ms    [User: 654.2 ms, System: 248.5 ms]
  Range (min … max):   893.5 ms … 908.9 ms    10 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark 2: /tmp/frost_174_chunks info frost/tests/fixtures/test_large.bag
  Time (mean ± σ):     426.6 ms ±   0.4 ms    [User: 180.3 ms, System: 246.2 ms]
  Range (min … max):   426.0 ms … 427.1 ms    10 runs

Summary
  '/tmp/frost_174_chunks info frost/tests/fixtures/test_large.bag' ran
    2.12 ± 0.02 times faster than '/tmp/frost_174 info frost/tests/fixtures/test_large.bag'
```

Wow, just this change produced much more performant code. Thank you @the8472!
