Performance regression in Rust 1.71 #115917
Those perf reports are about how fast the compiler runs, which often has little to do with code generation quality. So perf reports and perf triage are not going to be relevant here. There are now some runtime benchmarks, but very few so far. I'm looking into your benchmark.
I think the cause of the regression is inlining of calls to `Error::new`. This issue surfaces here because that patch made the MIR inliner slightly more aggressive, not because actually looking at the size of locals was a useful heuristic. Perhaps this indicates it is appropriate to put `#[inline(never)]` on `Error::new`.
Demo of the problematic MIR inlining choice: https://godbolt.org/z/5vseT5oqb
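The godbolt link above has the actual demo; as a rough, hypothetical illustration of the kind of pattern under discussion (not code from frost or from the demo), think of a hot loop whose rare error path pulls `Error::new` and its message formatting inline:

```rust
use std::io::{Error, ErrorKind};

// Hypothetical hot loop: the error is rare, but if `Error::new` and the
// `format!` machinery get inlined here, the loop body grows substantially.
fn sum_records(records: &[[u8; 8]]) -> Result<u64, Error> {
    let mut sum = 0u64;
    for (i, rec) in records.iter().enumerate() {
        if rec[0] == 0xFF {
            // Rare error path; ideally this stays out of line.
            return Err(Error::new(
                ErrorKind::InvalidData,
                format!("bad record at index {i}"),
            ));
        }
        sum = sum.wrapping_add(u64::from_le_bytes(*rec));
    }
    Ok(sum)
}
```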
I'm tagging @rust-lang/wg-mir-opt: Does this indicate a need for a new MIR inlining heuristic?
I think putting it on `Error::new` is reasonable.
@saethlin thank you!
Adding
Because it's intentionally outlined for that purpose. My concern with making `Error::new` `#[inline(never)]` is that it could be monomorphized several times; in many cases, we'd want those to be inlined.
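For what it's worth, the explicit version of that outlining pattern on the caller side would look something like this (a hypothetical sketch, not something frost or std does verbatim):

```rust
use std::io::{Error, ErrorKind};

// Hypothetical outlined constructor: `#[cold]` and `#[inline(never)]` keep the
// construction and message-formatting code out of the hot caller.
#[cold]
#[inline(never)]
fn bad_record_error(index: usize) -> Error {
    Error::new(ErrorKind::InvalidData, format!("bad record at index {index}"))
}

fn check_record(rec: &[u8; 8], index: usize) -> Result<(), Error> {
    if rec[0] == 0xFF {
        return Err(bad_record_error(index));
    }
    Ok(())
}
```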
Hey @saethlin, I just got around to trying to replace all of the expensive errors in the hot loop, but with those changes I am still seeing significant differences.

Do you have any recommendations on how to proceed from here?
Your patch still has a huge amount of code associated with error reporting; you just moved it out of constructing the error types themselves. But anyway, I locally deleted them and re-bisected with this:

```
RUSTFLAGS="-Zmir-opt-level=0" cargo b --release --bin frost -Zbuild-std --target=x86_64-unknown-linux-gnu
perf stat -r10 $CARGO_TARGET_DIR/x86_64-unknown-linux-gnu/release/frost info frost/tests/fixtures/test_large.bag 2>&1 | grep "6..\... msec task-clock"
```

And I found #111850.
I can see by eye that the hot code that gets inlined into the `flat_map` call differs between the two versions.
Hey, very much appreciate you doing another deep dive!
I tried to balance creating enum variants with the log messages, as I was trying to avoid an explosion of variants just to log specific messages (some folks appreciated the error messages while working on their own bag-writing tool, so I wanted to keep them). Do you think there's no way around this for high performance, and I just need to keep making more specific variants for each message that I want to keep and eventually print out with fmt::Display?

Thanks for the perf stat command, it's much simpler than the original bisection script I used. Let me know how the revert goes 🤞
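For concreteness, the variants-plus-`Display` shape being discussed might look like this (hypothetical types, not frost's actual error enum): the hot path only stores raw data in a variant, and the message is built in `Display`, which only runs when the error is actually reported.

```rust
use std::fmt;

// Hypothetical error enum: detection stores data, rendering happens in Display.
#[derive(Debug)]
enum BagError {
    BadOpCode { expected: u8, got: u8 },
    TruncatedRecord { len: usize },
}

impl fmt::Display for BagError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            BagError::BadOpCode { expected, got } => {
                write!(f, "expected op code {expected:#x}, got {got:#x}")
            }
            BagError::TruncatedRecord { len } => {
                write!(f, "record truncated after {len} bytes")
            }
        }
    }
}

impl std::error::Error for BagError {}
```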
I think errors should nearly always be rendered late, because that moves the code size to error reporting and not error detection, which often has to be inside a hot code path. The goal is to minimize the impact on the happy path.

I haven't been able to revert, but by checking out the commit directly before that PR I linked and running the benchmarks on that and on the merge commit for that PR, I can see that it is the cause of the regression. Which is interesting, considering it was supposed to make things faster. I suppose at this point it would make sense to ask @the8472: do you have any idea why #111850 would have made something slower? I suspect the answer is no, and in that case the program here needs to be reduced to a minimal program that exhibits slower performance on 1.71 than on 1.70 by exercising the `StepBy` code that PR touched.
No. The PR does several things to `StepBy`. Other than how it might affect inlining, they should all be improvements.
At least that ticks off a possibility, thanks. Most likely LLVM is just tripping over its own feet here. I'll try reducing this at some point. I have an idea of how to start, but it'll be a slow process to confirm that a reduction still reproduces the important behavior.
Looking through the frost codebase I'm only seeing two uses of

```rust
slice
    .windows(N)
    .step_by(N)
    .flat_map(... -> Result<_, _>)
    .collect::<Vec<_>>()
```

The range specialization is not applicable here, and with the flat-mapping it could be that the computed step size + less predictable loops make things more difficult for LLVM. Replacing

`rust/library/core/src/iter/adapters/step_by.rs`, lines 204 to 208 (at 625c2c4)

with the old impl

```rust
if self.first_take {
    self.first_take = false;
    self.iter.next()
} else {
    self.iter.nth(self.step)
}
```

might be worth a try. If that's not it then it's inlining.
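For reference, the generic fallback at those linked lines folds the branch above into a computed step size that is fed straight to `nth`. Roughly (a simplified sketch of the post-#111850 shape, with stand-in names, not the verbatim standard-library source):

```rust
// Stand-in for the fields of `core::iter::StepBy` (in core, `step` stores the
// step minus one, which is why the old impl calls `nth(self.step)`).
struct StepBySketch<I> {
    iter: I,
    step: usize,
    first_take: bool,
}

impl<I: Iterator> StepBySketch<I> {
    fn next(&mut self) -> Option<I::Item> {
        // The two-arm branch becomes one `nth` call with a computed skip count.
        let step_size = if self.first_take { 0 } else { self.step };
        self.first_take = false;
        self.iter.nth(step_size)
    }
}
```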
@dantheman3333, btw, you should be able to replace

```rust
slice
    .windows(N)
    .step_by(N)
```

with

```rust
slice
    .chunks_exact(N)
```
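A quick standalone check (not code from frost) that the two forms visit the same groups: both only yield full N-element groups and skip any trailing partial one, so they produce identical sub-slices.

```rust
fn main() {
    // 26 bytes with N = 8 leaves a 2-byte tail that both forms ignore.
    let data: Vec<u8> = (0u8..26).collect();

    let via_windows: Vec<&[u8]> = data.windows(8).step_by(8).collect();
    let via_chunks: Vec<&[u8]> = data.chunks_exact(8).collect();

    assert_eq!(via_windows, via_chunks);
    println!("{} groups either way", via_chunks.len());
}
```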
```diff
--- a/frost/src/lib.rs
+++ b/frost/src/lib.rs
@@ -844,8 +844,7 @@ fn parse_chunk_info<R: Read + Seek>(
     let data = get_lengthed_bytes(reader)?;
     let chunk_info_data: Vec<ChunkInfoData> = data
-        .windows(8)
-        .step_by(8)
+        .chunks_exact(8)
         .flat_map(ChunkInfoData::from)
         .collect();
@@ -866,8 +865,7 @@ fn parse_index<R: Read + Seek>(
     let data = get_lengthed_bytes(reader)?;
     let index_data: Vec<IndexData> = data
-        .windows(12)
-        .step_by(12)
+        .chunks_exact(12)
         .flat_map(|buf| IndexData::from(buf, chunk_header_pos, index_data_header.connection_id))
         .collect();
```

```
$ hyperfine -w 5 "/tmp/frost_174 info frost/tests/fixtures/test_large.bag" "/tmp/frost_174_chunks info frost/tests/fixtures/test_large.bag"
Benchmark 1: /tmp/frost_174 info frost/tests/fixtures/test_large.bag
  Time (mean ± σ):     902.8 ms ±   7.5 ms    [User: 654.2 ms, System: 248.5 ms]
  Range (min … max):   893.5 ms … 908.9 ms    10 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark 2: /tmp/frost_174_chunks info frost/tests/fixtures/test_large.bag
  Time (mean ± σ):     426.6 ms ±   0.4 ms    [User: 180.3 ms, System: 246.2 ms]
  Range (min … max):   426.0 ms … 427.1 ms    10 runs

Summary
  '/tmp/frost_174_chunks info frost/tests/fixtures/test_large.bag' ran
    2.12 ± 0.02 times faster than '/tmp/frost_174 info frost/tests/fixtures/test_large.bag'
```

Wow, just this change produced much more performant code, thank you @the8472!
Hi, on Linux x86 I'm seeing a 120+ ms regression in wall time with 1.71, which is small but noticeable when running the CLI application, as it is already quite fast.

The PR in question does have a perf signoff after weighing the trade-off, but I am raising the issue in case further investigation could lead to more optimizations.

The project is pretty messy, but I have steps on how to benchmark here: dantheman3333/frost#33.

A differential flamegraph shows that flat_map's `next` has taken a hit. I am unsure whether this is an aggregate view of all of the `next` calls or a specific call site; is it possible to get line numbers from this?

I am unsure how to proceed from here, but I would love to learn if anyone has tips.