[Experiment] revert issue #26494 associated pulls #76986 and #79547 #91719
Conversation
Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @oli-obk (or someone else) soon. Please see the contribution instructions for more information. |
@bors try @rust-timer queue |
Awaiting bors try build completion. @rustbot label: +S-waiting-on-perf |
⌛ Trying commit 9ea4892 with merge 9b34e70f603fd47fefd65888b574e4277e7fde39... |
☀️ Try build successful - checks-actions |
Queued 9b34e70f603fd47fefd65888b574e4277e7fde39 with parent 0b42dea, future comparison URL. |
Finished benchmarking commit (9b34e70f603fd47fefd65888b574e4277e7fde39): comparison url. Summary: This change led to very large relevant mixed results 🤷 in compiler performance.
If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf. Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR led to changes in compiler perf. Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @bors rollup=never |
Wow, I have hit the regression at work, so I was following the associated issue closely, and the results here are pretty good! Thank you for pushing the analysis forward! |
Thanks @shampoofactory for pursuing the experimentation, the results look much better than what I got in #91507. This might be an acceptable hit for the few crates that were affected. |
r? rust-lang/wg-llvm |
hmm.. highfive does not work with some teams... r? @cuviper |
@oli-obk I think it looks not at github teams, but at the groups in https://github.com/rust-lang/highfive/blob/master/highfive/configs/rust-lang/rust.json#L2 |
It's possible that more improvements in autovectorization have been included, or that a subtle change in pass ordering has improved things, yes. Specifically, in order to autovectorize scalar code, LLVM must first discover that the scalar code is vectorizable. And in order to do this LLVM performs various flattenings and renestings of code, amongst other transformations. But it's possible for subtle changes like the ones tried before to be seemingly more optimal in the short run yet make it harder to discover vectorization in scalar code, especially if the primary ways of detecting the vectorization involve looking for scalar values being operated on in a certain way. One transformation may get in the way of other transformations that are looking for vectorization. This remark doesn't constitute a specific advisory re: code existing in LLVM right now; it's a sort of general accreted understanding from reading... many... papers on improvements in LLVM autovectorization. |
Please cherry pick (e.g., |
Also, what's the status of this PR? |
my understanding is that some author feedback is needed, so I'll tentatively switch the flag. @rustbot author |
Hi. The goal is to restore basic auto-vectorization (issue #85265). One method is to simply revert the breaking changes, as this commit does. The other, probably the ideal, is to fix some of the underlying ABI issues as discussed above and here. However, this will take a level of expertise that I do not possess. I was rather hoping someone would have time to tackle this. There are some question marks around benchmarking. I'll take some time at the weekend to run the Benchmarks Game program suite with hyperfine. If anyone has a better benchmarking solution then I'm all ears. Assuming the benchmarks show a clear benefit, my personal preference would be to revert the breaking changes. However, as I've previously stated, I'm no expert and my opinions should be taken with that in mind. |
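For the hyperfine runs mentioned above, one convenient way to get raw samples out for later significance testing is hyperfine's `--export-json` flag, which writes per-command timing samples. A minimal sketch of pulling those samples back out in Python (the binary names and timings below are made up for illustration; the JSON shape follows hyperfine's documented export format):

```python
import json

# Hypothetical report in the shape hyperfine emits with --export-json:
# a top-level "results" list, one entry per benchmarked command, with
# the individual wall-clock samples under "times".
report_text = """
{"results": [
  {"command": "./nbody-stable",  "mean": 1.52, "times": [1.50, 1.53, 1.53]},
  {"command": "./nbody-patched", "mean": 1.31, "times": [1.30, 1.32, 1.31]}
]}
"""

def times_by_command(report_json):
    """Map each benchmarked command to its list of timing samples."""
    report = json.loads(report_json)
    return {r["command"]: r["times"] for r in report["results"]}

samples = times_by_command(report_text)
```

The two lists of samples can then be fed to whatever statistical test is used to compare the stable and patched builds.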
Hello. I am in favor of this landing. The actual thing that is missing from this, as I see it, is a regression test. I would like to see an assembly or codegen test included with this PR to verify that the source patterns

```rust
pub fn case_1(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    [
        a[0] + b[0],
        a[1] + b[1],
        a[2] + b[2],
        a[3] + b[3],
    ]
}

pub fn case_2(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    let mut c = [0.0; 4];
    for i in 0..4 {
        c[i] = a[i] + b[i];
    }
    c
}
```

autovectorize into an assembly pattern that looks like

```asm
example::case_1:
        mov     rax, rdi
        movups  xmm0, xmmword ptr [rsi]
        movups  xmm1, xmmword ptr [rdx]
        addps   xmm1, xmm0
        movups  xmmword ptr [rdi], xmm1
        ret
```

or something closely equivalent for the x86-64 target. These are such obvious cases for autovectorization that they should essentially Always Work: there are no real "heuristics" required, they simply statically match the patterns enabled by SSE2 registers. So if we are missing them, then any other improvement we can gain is fairly unimportant. Since we know they should always work essentially unconditionally, we don't want another change in this area to regress them, and we need to be testing to prevent that. Instructions for how to write such a test are present here, but I can help if you ping me on Zulip or Discord. |
@workingjubilee You might want to check my PR #93564, which is a more targeted and less invasive way to fix the issue. |
LLVM 14 recently hit nightly, and I want to see if this changes now with the latest release. |
Awaiting bors try build completion. @rustbot label: +S-waiting-on-perf |
Bors didn't see your message: @bors try |
⌛ Trying commit 84672f71fc3939b68a04e62ddadb61f8ca2985eb with merge c2278bfabb9c64fa778e94c9607f7321bb9b29fd... |
☀️ Try build successful - checks-actions |
Queued c2278bfabb9c64fa778e94c9607f7321bb9b29fd with parent c651ba8, future comparison URL. |
Finished benchmarking commit (c2278bfabb9c64fa778e94c9607f7321bb9b29fd): comparison url. Summary: This benchmark run shows 108 relevant improvements 🎉 but 36 relevant regressions 😿 to instruction counts.
If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf. Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR led to changes in compiler perf. Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @bors rollup=never |
That's even nicer than the last run. The highs aren't actually that much higher, but the dips are shallower and it's on average much better. It might be within random variance, but I suspect this actually neatly explains why what was once an optimization is now a regression: the Rust ABI was diverted down a path that LLVM did not focus much on optimizing in the next 5 versions, so as LLVM 14 is somewhat better at this than LLVM 13, likely LLVM 13 was better than LLVM 12, etc., all the way back to LLVM 9, when it was actually a win to take the other path. @rustbot label: +perf-regression-triaged |
Those perf results look great! Especially when they're also exposing autovectorization opportunities. Looks like this is still missing a test for it, so here's a codegen test you can add:

```rust
// compile-flags: -C opt-level=3
// only-x86_64
#![crate_type = "lib"]

// CHECK-LABEL: @auto_vectorize_direct
#[no_mangle]
pub fn auto_vectorize_direct(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    // CHECK: load <4 x float>
    // CHECK: load <4 x float>
    // CHECK: fadd <4 x float>
    // CHECK: store <4 x float>
    [
        a[0] + b[0],
        a[1] + b[1],
        a[2] + b[2],
        a[3] + b[3],
    ]
}

// CHECK-LABEL: @auto_vectorize_loop
#[no_mangle]
pub fn auto_vectorize_loop(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    // CHECK: load <4 x float>
    // CHECK: load <4 x float>
    // CHECK: fadd <4 x float>
    // CHECK: store <4 x float>
    let mut c = [0.0; 4];
    for i in 0..4 {
        c[i] = a[i] + b[i];
    }
    c
}
```

(I don't think it needs to be an assembly test; seeing that LLVM is using vector ops is good enough for me.) |
Sorry for the delay. What I assumed would be a simple task turned out not to be. The Benchmarks Game repo, as far as I could fathom, does not have a simple pre-configured script to build and bench the Rust program contributions. So I wrote and documented a small collection of Python scripts that do. I used a simple Welch's t-test modified from here. The usual caveats apply when interpreting benchmark results. I'm not endorsing the validity of the included benchmark programs. I've not disassembled any binaries and I'm not sure to what extent any auto-vectorization opportunities may arise.

I've benchmarked the latest stable Rust release against its patched version. The results are below. Most tests show no discernible difference and are rejected (p threshold 0.05). The remainder are a mixed bag.

The benchmarking scripts are located here along with documentation. It should be trivial to run them yourself. If there is a problem then kindly open a ticket. Finally, I'm not a benchmarking or stats expert and any feedback would be appreciated. |
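The Welch's t-test mentioned above (the unequal-variances variant of the two-sample t-test) can be sketched in a few lines of dependency-free Python; this is an illustrative re-derivation, not the author's actual script:

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and degrees of freedom for two samples
    with possibly unequal variances (Welch-Satterthwaite equation)."""
    va, vb = variance(a), variance(b)   # sample variances (n - 1 denominator)
    na, nb = len(a), len(b)
    se2 = va / na + vb / nb             # squared standard error of the mean difference
    t = (mean(a) - mean(b)) / sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df
```

The p-value then comes from the t-distribution with `df` degrees of freedom (e.g. via `scipy.stats.t.sf`), which is where the 0.05 rejection threshold from the comment would be applied.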
Quickly looking through the results, a lot of the binaries get 30kb bigger with this PR. Cargo however gets 50kb (release mode) or 10kb (debug mode) smaller. |
I did not speak at first because I thought the result might still be interesting, but for the "benchmarks game" repo, unfortunately, if the primary effect is on autovectorization, then we can't expect any appreciable benefit from this. That is because all the Rust code written for the benchmarks game has already been carefully hand-tuned to generate specific assembly instructions using explicit vectorization intrinsics. It is enough to see that it does not greatly penalize most. You would need all of those benchmarks to be written in the "naive" fashion, and then to bench them against both themselves and these more hand-tuned versions. |
I should also note that doing so isn't actually required to accept this PR, IMO; only the missing codegen test is. And I agree with @scottmcm that the codegen test for LLVM IR vector-of-floats usage is enough. @shampoofactory Also, if you need any help with the rebase issue I am happy to fix that up, or you can ping me on Zulip if you have any other questions. |
@workingjubilee Hi. About the rebase, if you could fix that up it would be great. Left to my own devices, I'm more likely to make things worse. |
...oops, lol. That should not have happened. |
If anyone wants to follow the exciting saga from here, check out #94570. |
Reopen 91719: Reopened rust-lang#91719, which was closed inadvertently due to technical difficulties.
Issue #26494 introduces an auto-vectorization regression, as discussed here. This pull reverts those changes with a view to profiling the outcome. As a matter of expediency, I've ignored the 'array-equality.rs' test for now. If performance degrades, we can take this option off the table. Otherwise, failing a timely fix to the underlying issue, a reversion could be discussed with concrete performance data.
If someone could kindly start a perf run.
Thank you.