
Performance regression in Rust 1.71 #115917

Open · dantheman3333 opened this issue Sep 17, 2023 · 17 comments

Labels: A-mir-opt (Area: MIR optimizations) · A-mir-opt-inlining (Area: MIR inlining) · I-slow (Issue: Problems and improvements with respect to performance of generated code) · T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue) · T-libs (Relevant to the library team, which will review and decide on the PR/issue)

@dantheman3333
Hi, on Linux x86, I'm seeing a 120+ ms regression in wall time with 1.71. It's small, but noticeable when running the CLI application, as the tool is already quite fast.

1.70:

```
frost$ hyperfine -w 5 "./target/release/frost info frost/tests/fixtures/test_large.bag"
Benchmark 1: ./target/release/frost info frost/tests/fixtures/test_large.bag
  Time (mean ± σ):     580.8 ms ±   0.9 ms    [User: 311.6 ms, System: 269.4 ms]
  Range (min … max):   579.7 ms … 582.2 ms    10 runs
```

1.71:

```
Benchmark 1: ./target/release/frost info frost/tests/fixtures/test_large.bag
  Time (mean ± σ):     708.8 ms ±   4.9 ms    [User: 447.1 ms, System: 261.6 ms]
  Range (min … max):   703.3 ms … 716.7 ms    10 runs
```
```
********************************************************************************
Regression in 915aa06700af4a2363639bae70201cd7387470ad
********************************************************************************

searched nightlies: from nightly-2023-01-01 to nightly-2023-09-16
regressed nightly: nightly-2023-04-24
searched commit range: https://github.com/rust-lang/rust/compare/b628260df0587ae559253d8640ecb8738d3de613...7f94b314cead7059a71a265a8b64905ef2511796
regressed commit: https://github.com/rust-lang/rust/commit/915aa06700af4a2363639bae70201cd7387470ad
```

<details>
<summary>bisected with <a href='https://github.com/rust-lang/cargo-bisect-rustc'>cargo-bisect-rustc</a> v0.6.7</summary>

Host triple: x86_64-unknown-linux-gnu
Reproduce with:

```bash
cargo bisect-rustc --script=./bench.sh --start=2023-01-01 --end=2023-09-16 --preserve
```

</details>

The PR in question does have a perf signoff after weighing the trade-off, but I am raising the issue in case further investigation could lead to more optimizations.

The project is pretty messy, but I have steps on how to benchmark here: dantheman3333/frost#33.

A differential flamegraph shows that FlatMap's next has taken a hit. I am unsure whether this is an aggregate view of all the next calls or a specific call site; is it possible to get line numbers from this?

I am unsure how to proceed from here, but I would love to learn if anyone has tips.

@rustbot added the needs-triage (This issue may need triage. Remove it if it has been sufficiently triaged.) label on Sep 17, 2023
@saethlin

> The PR in question does have a perf signoff after weighing the trade-off, but I am raising the issue in case further investigation could lead to more optimizations.

Those perf reports are about how fast the compiler runs, which often has little to do with code generation quality. So perf reports and perf triage are not going to be relevant here. There are now some runtime benchmarks, but very few so far.

I'm looking into your benchmark.

@saethlin

I think the cause of the regression is inlining of calls to std::io::Error::new. If this were my code, I would resolve the perf regression by not using std::io::Error the way you are. It's a very expensive type to construct and to drop. For this codebase, nearly all parsing errors can be reported just as accurately by an enum with unit variants, so I'd be trying to apply that strategy where possible.
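
As a minimal sketch of that strategy (hypothetical variant names, not from the frost codebase): a fieldless enum is only a discriminant to construct, has a trivial Drop, and can still carry the original messages, which are rendered only when the error is reported:

```rust
use std::fmt;

// Hypothetical parse-error enum with unit variants: construction is free
// (no allocation, no Box<dyn Error>), and dropping it is a no-op.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ParseError {
    TruncatedHeader,
    BadMagic,
    InvalidLength,
}

impl fmt::Display for ParseError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // The human-readable message is materialized only at report time.
        f.write_str(match self {
            ParseError::TruncatedHeader => "header was truncated",
            ParseError::BadMagic => "bad magic bytes",
            ParseError::InvalidLength => "invalid length field",
        })
    }
}

impl std::error::Error for ParseError {}
```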

This issue surfaces here because that patch made the MIR inliner slightly more aggressive; not because actually looking at the size of locals was a useful heuristic. -Zmir-inline-threshold=40 improves performance on this benchmark (the default is 50) and we no longer inline std::io::Error::new in the final binary.
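
For anyone reproducing that experiment, the threshold is a nightly-only flag passed through RUSTFLAGS (a sketch; on newer nightlies the flag is spelled -Zinline-mir-threshold, as comes up later in this thread):

```bash
# Nightly-only: lower the MIR inlining threshold from its default of 50 to 40.
RUSTFLAGS="-Zinline-mir-threshold=40" cargo +nightly build --release
```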

Perhaps this indicates it is appropriate to put #[inline(never)] on std::io::Error::new.

@saethlin

Demo of the problematic MIR inlining choice: https://godbolt.org/z/5vseT5oqb

@saethlin added the I-slow, T-compiler, A-mir-opt, A-mir-opt-inlining, and T-libs labels and removed the needs-triage label on Sep 18, 2023
@saethlin

I'm tagging @rust-lang/wg-mir-opt: Does this indicate a need for a new MIR inlining heuristic?
And libs, but really just @thomcc: Do you think it is a good idea to add #[inline(never)] to std::io::Error::new, or do you think that is too heavy-handed?

@thomcc

thomcc commented Sep 18, 2023

I think putting it on `Error::_new` would be more reasonable, but I could be convinced about `Error::new`.

@dantheman3333

@saethlin thank you!

@saethlin

Adding #[inline(never)] to Error::_new would only protect one of the allocation codepaths from being inlined. Why do you think putting it on that function is better?

@thomcc

thomcc commented Sep 18, 2023

Because it's intentionally outlined for that purpose. My concern with making Error::new inline(never) is that it could be monomorphized several times; in many cases, we'd want those copies to be inlined.
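
For context, the shape being described looks roughly like this (a sketch of the pattern, not the actual std source): a generic entry point that stays cheap to monomorphize and inline, funneling into a single non-generic, outlined function that holds the expensive construction code:

```rust
use std::error::Error as StdError;

pub struct Error { /* kind, boxed payload, ... */ }

#[derive(Clone, Copy)]
pub enum ErrorKind { InvalidData /* ... */ }

impl Error {
    // Generic shim: monomorphized once per error type E, so it should stay
    // small and inlinable; it immediately funnels into the non-generic path.
    pub fn new<E>(kind: ErrorKind, error: E) -> Error
    where
        E: Into<Box<dyn StdError + Send + Sync>>,
    {
        Self::_new(kind, error.into())
    }

    // Non-generic and intentionally outlined: exactly one copy of the
    // allocating constructor ends up in the binary. This is the function
    // where #[inline(never)] is least likely to hurt.
    #[inline(never)]
    fn _new(kind: ErrorKind, error: Box<dyn StdError + Send + Sync>) -> Error {
        let _ = (kind, error);
        unimplemented!("sketch only: allocate and store the payload here")
    }
}
```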

@dantheman3333

Hey @saethlin, I just got around to trying to replace all of the expensive errors in the hot loop, but I am still seeing significant differences.

With `RUSTFLAGS="-Zinline-mir-threshold=40" cargo build --release` (has it been renamed from `mir-inline-threshold` in newer nightlies?) and varying the threshold, I am still not seeing any improvement:

```
Benchmark 1: /tmp/frost_170 info frost/tests/fixtures/test_large.bag
  Time (mean ± σ):     606.1 ms ±   2.3 ms    [User: 338.9 ms, System: 267.1 ms]
  Range (min … max):   602.9 ms … 608.9 ms    10 runs

Benchmark 2: /tmp/frost_174 info frost/tests/fixtures/test_large.bag
  Time (mean ± σ):     914.3 ms ±   7.8 ms    [User: 656.0 ms, System: 258.2 ms]
  Range (min … max):   906.9 ms … 926.0 ms    10 runs

Benchmark 3: /tmp/frost_nightly_mir_0 info frost/tests/fixtures/test_large.bag
  Time (mean ± σ):     905.8 ms ±   6.2 ms    [User: 641.9 ms, System: 263.8 ms]
  Range (min … max):   898.8 ms … 915.0 ms    10 runs

Summary
  '/tmp/frost_170 info frost/tests/fixtures/test_large.bag' ran
    1.49 ± 0.01 times faster than '/tmp/frost_nightly_mir_0 info frost/tests/fixtures/test_large.bag'
    1.51 ± 0.01 times faster than '/tmp/frost_174 info frost/tests/fixtures/test_large.bag'
```

Do you have any recommendations on how to proceed from here?

@saethlin

saethlin commented Dec 26, 2023

Your patch still has a huge amount of code associated with error reporting; you just moved it from constructing the error types themselves to eprintln! expansions. But that's just general advice, not particularly pertinent to this issue, because even after deleting all the eprintln! calls I still see the perf difference.

But anyway, I locally deleted them and re-bisected with this --script:

RUSTFLAGS="-Zmir-opt-level=0" cargo b --release --bin frost -Zbuild-std --target=x86_64-unknown-linux-gnu
perf stat -r10 $CARGO_TARGET_DIR/x86_64-unknown-linux-gnu/release/frost info frost/tests/fixtures/test_large.bag 2>&1 | grep "6..\... msec task-clock"

And I found:

```
searched toolchains nightly-2023-04-01 through nightly-2023-12-22

********************************************************************************
Regression in nightly-2023-06-27
********************************************************************************
...
found 8 bors merge commits in the specified range
  commit[0] 2023-06-25: Auto merge of #113037 - TaKO8Ki:rollup-pqfbxwk, r=TaKO8Ki
  commit[1] 2023-06-25: Auto merge of #113038 - matthiaskrgr:rollup-sdcfkxa, r=matthiaskrgr
  commit[2] 2023-06-26: Auto merge of #111850 - the8472:external-step-by, r=scottmcm
  commit[3] 2023-06-26: Auto merge of #112884 - klensy:ri-drop-old-clap, r=albertlarsan68
  commit[4] 2023-06-26: Auto merge of #113024 - Jerrody:master, r=thomcc
  commit[5] 2023-06-26: Auto merge of #113049 - matthiaskrgr:rollup-41wo5w8, r=matthiaskrgr
  commit[6] 2023-06-26: Auto merge of #112887 - WaffleLapkin:become_unuwuable_in_hir, r=compiler-errors,Nilstrieb
  commit[7] 2023-06-26: Auto merge of #113057 - TaKO8Ki:rollup-071lc9g, r=TaKO8Ki
ERROR: no CI builds available between 8084f397c6710e4748994dd1488b1f399283d469 and 36fb58e433c782e27dd41034284e157cf86d587f within last 167 days
```

I can see by eye that the hot code that gets inlined into FlatMap::next is indeed different, though to my untrained eye it just looks like blocks of instructions have been moved around the function. There are 4 rollups, which is really not encouraging, but I'm going to try locally reverting #111850 to see if that moves us back to the faster codegen. But considering my bisection points at code that's 6 months old, it's possible that the revert will be too challenging for me.

@dantheman3333

Hey, very much appreciate you doing another deep dive!

> Your patch still has a huge amount of code associated with error reporting; you just moved it from constructing the error types themselves to eprintln! expansions. But that's just general advice, not particularly pertinent to this issue, because even after deleting all the eprintln! calls I still see the perf difference.

I tried to balance creating enum variants with the log messages, as I was trying to avoid an explosion of variants just to log specific messages (some folks appreciate the error messages, as they were working on their own bag-writing tool, so I wanted to keep the messages). Do you think for high performance there's no way around this, and I just need to continue making more specific variants for each message that I want to keep and eventually print out with fmt::Display?

Thanks for the perf stat cmd, it's much simpler than the original bisection script I used. Let me know how the revert goes 🤞

@saethlin

I think errors should nearly always be rendered late because that moves the code size to error reporting and not error detection, which often has to be inside a hot code path. The goal is to minimize the impact on the happy path.
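
Concretely, "rendering late" means the hot loop only constructs and propagates a cheap discriminant, and the formatting machinery is reached exactly once, at the reporting boundary. A minimal sketch, reusing the hypothetical ParseError enum from the earlier sketch in this thread:

```rust
// Hot path: detection just returns a discriminant; no formatting, no
// allocation, and no eprintln! expansion inlined into the parsing loop.
fn parse_record(buf: &[u8]) -> Result<u64, ParseError> {
    let bytes: [u8; 8] = buf.try_into().map_err(|_| ParseError::TruncatedHeader)?;
    Ok(u64::from_le_bytes(bytes))
}

// Cold path: the error is rendered once, when it is actually reported.
fn report(err: ParseError) {
    eprintln!("parse failed: {err}");
}
```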

I haven't been able to revert, but by checking out the commit directly before that PR I linked and running the benchmarks on that and the merge commit for that PR, I can see that it is the cause of the regression. Which is interesting, considering it was supposed to make things faster. I suppose at this point it would make sense to ask @the8472; do you have any idea why #111850 would have made something slower?

I suspect the answer is no, and in that case the program here needs to be reduced to a minimal program that exhibits slower performance on 1.71 than on 1.70 by exercising the StepBy specialization. Then we could decide if the situation is resolvable in rustc or in LLVM.

@the8472

the8472 commented Dec 27, 2023

> do you have any idea why #111850 would have made something slower?

No.

The PR does several things to StepBy:

- changes the default next() implementation, which also simplifies the call tree
- specializes for unsigned integer ranges; this deepens the call tree in all cases
- implements TrustedLen for integer ranges, which affects other specializations

Other than how they might affect inlining, they should all be improvements.

@saethlin

At least that ticks off a possibility, thanks. Most likely LLVM is just tripping over its own feet here. I'll try reducing this at some point. I have an idea of how to start, but it'll be a slow process to confirm that a reduction still reproduces the important behavior.

@the8472

the8472 commented Dec 27, 2023

Looking through the frost codebase I'm only seeing two uses of step_by. Both have this pattern:

```rust
slice
    .windows(N)
    .step_by(N)
    .flat_map(... -> Result<_, _>)
    .collect::<Vec<_>>()
```

The range specialization is not applicable here, and flat-mapping Result makes TrustedLen not applicable. So it's likely the changed default next() impl.
Previously it called the underlying next() on the first iteration and then nth(N) on each subsequent one. Now it always calls nth() with a computed value. And I only measured that specific change on loops that had a fixed iteration count.

It could be that the computed step size + less predictable loops make things more difficult for LLVM.

Replacing

```rust
default fn spec_next(&mut self) -> Option<I::Item> {
    let step_size = if self.first_take { 0 } else { self.step };
    self.first_take = false;
    self.iter.nth(step_size)
}
```

with the old impl

```rust
default fn spec_next(&mut self) -> Option<I::Item> {
    if self.first_take {
        self.first_take = false;
        self.iter.next()
    } else {
        self.iter.nth(self.step)
    }
}
```

might be worth a try.

If that's not it then it's inlining.

@the8472

the8472 commented Dec 27, 2023

@dantheman3333, btw, you should be able to replace

```rust
slice
    .windows(N)
    .step_by(N)
```

with

```rust
slice
    .chunks_exact(N)
```
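
As a quick sanity check (my own snippet, not from the thread): the two expressions yield the same non-overlapping blocks, because windows(N).step_by(N) only visits the windows starting at offsets 0, N, 2N, … and both drop a trailing partial block:

```rust
fn main() {
    // 25 is deliberately not a multiple of 8: both iterators drop the
    // trailing partial block, so the outputs still match.
    let data: Vec<u8> = (0..25).collect();
    let via_windows: Vec<&[u8]> = data.windows(8).step_by(8).collect();
    let via_chunks: Vec<&[u8]> = data.chunks_exact(8).collect();
    assert_eq!(via_windows, via_chunks);
}
```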

@dantheman3333

```diff
--- a/frost/src/lib.rs
+++ b/frost/src/lib.rs
@@ -844,8 +844,7 @@ fn parse_chunk_info<R: Read + Seek>(
     let data = get_lengthed_bytes(reader)?;
 
     let chunk_info_data: Vec<ChunkInfoData> = data
-        .windows(8)
-        .step_by(8)
+        .chunks_exact(8)
         .flat_map(ChunkInfoData::from)
         .collect();
 
@@ -866,8 +865,7 @@ fn parse_index<R: Read + Seek>(
     let data = get_lengthed_bytes(reader)?;
 
     let index_data: Vec<IndexData> = data
-        .windows(12)
-        .step_by(12)
+        .chunks_exact(12)
         .flat_map(|buf| IndexData::from(buf, chunk_header_pos, index_data_header.connection_id))
         .collect();
```

```
$ hyperfine -w 5 "/tmp/frost_174 info frost/tests/fixtures/test_large.bag" "/tmp/frost_174_chunks info frost/tests/fixtures/test_large.bag"
Benchmark 1: /tmp/frost_174 info frost/tests/fixtures/test_large.bag
  Time (mean ± σ):     902.8 ms ±   7.5 ms    [User: 654.2 ms, System: 248.5 ms]
  Range (min … max):   893.5 ms … 908.9 ms    10 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark 2: /tmp/frost_174_chunks info frost/tests/fixtures/test_large.bag
  Time (mean ± σ):     426.6 ms ±   0.4 ms    [User: 180.3 ms, System: 246.2 ms]
  Range (min … max):   426.0 ms … 427.1 ms    10 runs

Summary
  '/tmp/frost_174_chunks info frost/tests/fixtures/test_large.bag' ran
    2.12 ± 0.02 times faster than '/tmp/frost_174 info frost/tests/fixtures/test_large.bag'
```

Wow, just this change produced much more performant code. Thank you @the8472!
