
[Experiment] revert issue #26494 associated pulls #76986 and #79547 #91719

Closed
wants to merge 0 commits into from

Conversation

shampoofactory
Contributor

@shampoofactory shampoofactory commented Dec 9, 2021

Issue #26494 introduces a P-high auto-vectorization regression as discussed here. This pull reverts those changes with a view to profiling the outcome. As a matter of expediency, I've ignored the 'array-equality.rs' test for now.

If performance degrades we can take this option off the table. Otherwise, failing a timely fix to the underlying issue, a reversion could be discussed with concrete performance data.

It would be great if someone could kindly start a perf run.

Thank you.

@rust-highfive
Collaborator

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @oli-obk (or someone else) soon.

Please see the contribution instructions for more information.

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Dec 9, 2021
@oli-obk
Contributor

oli-obk commented Dec 9, 2021

@bors try @rust-timer queue

@rust-timer
Collaborator

Awaiting bors try build completion.

@rustbot label: +S-waiting-on-perf

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Dec 9, 2021
@bors
Contributor

bors commented Dec 9, 2021

⌛ Trying commit 9ea4892 with merge 9b34e70f603fd47fefd65888b574e4277e7fde39...

@bors
Contributor

bors commented Dec 9, 2021

☀️ Try build successful - checks-actions
Build commit: 9b34e70f603fd47fefd65888b574e4277e7fde39

@rust-timer
Collaborator

Queued 9b34e70f603fd47fefd65888b574e4277e7fde39 with parent 0b42dea, future comparison URL.

@rust-timer
Collaborator

Finished benchmarking commit (9b34e70f603fd47fefd65888b574e4277e7fde39): comparison url.

Summary: This change led to very large relevant mixed results 🤷 in compiler performance.

  • Very large improvement in instruction counts (up to -6.9% on full builds of deeply-nested-async)
  • Moderate regression in instruction counts (up to 3.0% on full builds of piston-image)

If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR led to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please fix the regressions and do another perf run. If the next run shows neutral or positive results, the label will be automatically removed.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf +perf-regression

@rustbot rustbot added perf-regression Performance regression. and removed S-waiting-on-perf Status: Waiting on a perf run to be completed. labels Dec 9, 2021
@marmeladema
Contributor

Wow, I have hit this regression at work, so I was following the associated issue closely, and the results here are pretty good! Thank you for pushing the analysis forward!

@Urgau
Member

Urgau commented Dec 9, 2021

Thanks @shampoofactory for pursuing the experimentation; the results look much better than what I got in #91507. This might be an acceptable cost for the few crates that are still hit.

@oli-obk
Contributor

oli-obk commented Dec 10, 2021

r? rust-lang/wg-llvm

@oli-obk
Contributor

oli-obk commented Dec 10, 2021

hmm.. highfive does not work with some teams...

r? @cuviper

@rust-highfive rust-highfive assigned cuviper and unassigned oli-obk Dec 10, 2021
@scottmcm
Member

@oli-obk I think it looks not at github teams, but at the groups in https://github.com/rust-lang/highfive/blob/master/highfive/configs/rust-lang/rust.json#L2

@cuviper
Member

cuviper commented Dec 10, 2021

It's uncommon to speak of reverting an issue -- this is really reverting pull request #79547.

I'm surprised that the new perf changes are so much bigger than before, but maybe it's related to the new pass manager or something else new in LLVM. Cc @rust-lang/wg-llvm in case others have insights.

@shampoofactory
Contributor Author

shampoofactory commented Dec 10, 2021

@cuviper

It's uncommon to speak of reverting an issue -- this is really reverting pull request #79547.

Understood. I should have been more careful in my phrasing.

To be specific, this merge reverts pulls #79547 and #76986.

@workingjubilee
Member

workingjubilee commented Dec 10, 2021

I'm surprised that the new perf changes are so much bigger than before, but maybe it's related to the new pass manager or something else new in LLVM. Cc @rust-lang/wg-llvm in case others have insights.

It's possible that more improvements in autovectorization have been included, or that a subtle change in pass ordering has improved things, yes. Specifically, in order to autovectorize scalar code, LLVM must

  • discover the opportunity (witness a scalar code pattern that can be translated into a vector code pattern)
  • exploit the opportunity (actually do the translation, picking between multiple possible versions)
  • optimize the result (usual opts, like dead store elimination, still have to happen)

In order to do this, LLVM performs various flattenings and renestings of the code, amongst other transformations. But subtle changes like the ones tried before can be seemingly more optimal in the short run yet make it harder to discover vectorization in scalar code, especially if the primary ways of detecting the vectorization involve looking for scalar values being operated on in a certain way: one transformation may get in the way of other transformations that are looking for vectorization.

This remark doesn't constitute a specific advisory about code existing in LLVM right now; it is a sort of general accreted understanding from reading... many... papers on improvements in LLVM autovectorization.
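
To make those three stages concrete, here is a minimal editorial sketch (my own illustration, not code from this PR) of the loop form under discussion, with comments describing what LLVM would be expected to do to it at -C opt-level=3 on x86-64:

// 1. Discover: every iteration is an independent f32 add on adjacent lanes of
//    the two input arrays.
// 2. Exploit: LLVM fully unrolls the four iterations and SLP-vectorizes the
//    body into a single `fadd <4 x float>`, a width one SSE register can hold.
// 3. Optimize: the usual passes (e.g. dead store elimination on the zeroed `c`,
//    load/store cleanup) tidy the result, ideally leaving one packed add.
pub fn add4(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    let mut c = [0.0; 4];
    for i in 0..4 {
        c[i] = a[i] + b[i];
    }
    c
}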

@shampoofactory shampoofactory changed the title [Experiment] revert #26494 [Experiment] revert issue #26494 associated pulls #76986 and #79547 Dec 10, 2021
@apiraino apiraino added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Dec 14, 2021
@camelid
Member

camelid commented Jan 25, 2022

Please cherry pick (e.g., git cherry-pick abcd123) @scottmcm's PR instead of merging it if you can. Thanks!

@camelid
Member

camelid commented Jan 25, 2022

Also, what's the status of this PR?

@apiraino
Contributor

my understanding is that some author feedback is needed, so I'll tentatively switch the flag.

@rustbot author

@rustbot rustbot added the S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. label Feb 17, 2022
@shampoofactory
Contributor Author

my understanding is that some author feedback is needed, so I'll tentatively switch the flag.

@rustbot author

Hi. The goal is to restore basic auto-vectorization (issue #85265).

One method is to simply revert the breaking changes, as this commit does.

The other, probably the ideal, is to fix some of the underlying ABI issues as discussed above and here. However, this will take a level of expertise that I do not possess. I was rather hoping someone would have time to tackle this.

There are some question marks around benchmarking. I'll take some time at the weekend to run the Benchmarks Game program suite with hyperfine. If anyone has a better benchmarking solution then I'm all ears.

Assuming the benchmarks show a clear benefit, my personal preference would be to revert the breaking changes. However, as I've previously stated, I'm no expert and my opinions should be taken with that in mind.

@workingjubilee
Member

workingjubilee commented Feb 20, 2022

Hello. I am in favor of this landing. The one thing actually missing from this, as I see it, is a regression test. I would like to see an assembly or codegen test included with this PR to verify that the source patterns

pub fn case_1(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    [
        a[0] + b[0],
        a[1] + b[1],
        a[2] + b[2],
        a[3] + b[3],
    ]
}

pub fn case_2(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    let mut c = [0.0; 4];
    for i in 0..4 {
        c[i] = a[i] + b[i];
    }
    c
}

autovectorize into an assembly pattern that looks like

example::case_1:
        mov     rax, rdi
        movups  xmm0, xmmword ptr [rsi]
        movups  xmm1, xmmword ptr [rdx]
        addps   xmm1, xmm0
        movups  xmmword ptr [rdi], xmm1
        ret

or something closely equivalent for the x86-64 target. These must emit addps, not addss, and any mov-like instruction should be movups (however, as noted, ideally they would include no mov at all). A codegen test for appropriate LLVM IR that verifies it winds up adding <4 x float> vectors is also fine.

These are such obvious cases for autovectorization that they should essentially Always Work: there are no real "heuristics" required; they simply statically match the patterns enabled by SSE2 registers. So if we are missing them, then any other improvement we can gain is fairly unimportant. Since we know they should work essentially unconditionally, we don't want another change in this area to regress them, and we need a test to prevent that.

Instructions for how to write such a test are present here, but I can help if you ping me on Zulip or Discord.

@Urgau
Member

Urgau commented Feb 20, 2022

@workingjubilee You might want to check my PR #93564, which is a more targeted and less invasive way to fix the issue.

@workingjubilee
Member

workingjubilee commented Feb 23, 2022

LLVM 14 recently hit nightly, and I want to see if this changes now with the latest release.
@bors try @rust-timer queue

@rust-timer
Collaborator

Awaiting bors try build completion.

@rustbot label: +S-waiting-on-perf

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Feb 23, 2022
@camelid
Member

camelid commented Feb 23, 2022

Bors didn't see your message: @bors try

@bors
Contributor

bors commented Feb 23, 2022

⌛ Trying commit 84672f71fc3939b68a04e62ddadb61f8ca2985eb with merge c2278bfabb9c64fa778e94c9607f7321bb9b29fd...

@bors
Contributor

bors commented Feb 23, 2022

☀️ Try build successful - checks-actions
Build commit: c2278bfabb9c64fa778e94c9607f7321bb9b29fd

@rust-timer
Collaborator

Queued c2278bfabb9c64fa778e94c9607f7321bb9b29fd with parent c651ba8, future comparison URL.

@rust-timer
Collaborator

Finished benchmarking commit (c2278bfabb9c64fa778e94c9607f7321bb9b29fd): comparison url.

Summary: This benchmark run shows 108 relevant improvements 🎉 but 36 relevant regressions 😿 to instruction counts.

  • Average relevant regression: 0.5%
  • Average relevant improvement: -1.4%
  • Largest improvement in instruction counts: -7.3% on full builds of deeply-nested-async check
  • Largest regression in instruction counts: 1.2% on incr-full builds of deeply-nested opt

If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR led to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please fix the regressions and do another perf run. If the next run shows neutral or positive results, the label will be automatically removed.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf +perf-regression

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-perf Status: Waiting on a perf run to be completed. labels Feb 24, 2022
@workingjubilee
Member

workingjubilee commented Feb 26, 2022

That's even nicer than the last run. The highs aren't actually that much higher, but the dips are shallower and it's on average much better. It might be within random variance, but I suspect this actually neatly explains why what was once an optimization is now a regression: the Rust ABI was diverted down a path that LLVM did not focus much on optimizing in the next 5 versions, so as LLVM 14 is somewhat better at this than LLVM 13, likely LLVM 13 was better than LLVM 12, etc., all the way back to LLVM 9, when it was actually a win to take the other path.

@rustbot label: +perf-regression-triaged

@rustbot rustbot added the perf-regression-triaged The performance regression has been triaged. label Feb 26, 2022
@scottmcm
Member

Those perf results look great! Especially when they're also exposing autovectorization opportunities.

Looks like this is still missing a test for it, so here's a codegen test you can add:

// compile-flags: -C opt-level=3
// only-x86_64
#![crate_type = "lib"]

// CHECK-LABEL: @auto_vectorize_direct
#[no_mangle]
pub fn auto_vectorize_direct(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
// CHECK: load <4 x float>
// CHECK: load <4 x float>
// CHECK: fadd <4 x float>
// CHECK: store <4 x float>
    [
        a[0] + b[0],
        a[1] + b[1],
        a[2] + b[2],
        a[3] + b[3],
    ]
}

// CHECK-LABEL: @auto_vectorize_loop
#[no_mangle]
pub fn auto_vectorize_loop(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
// CHECK: load <4 x float>
// CHECK: load <4 x float>
// CHECK: fadd <4 x float>
// CHECK: store <4 x float>
    let mut c = [0.0; 4];
    for i in 0..4 {
        c[i] = a[i] + b[i];
    }
    c
}

(I don't think it needs to be an assembly test; seeing that LLVM is using vector ops is good enough for me.)

@shampoofactory
Contributor Author

There are some question marks around benchmarking. I'll take some time at the weekend to run the Benchmarks Game program suite with hyperfine. If anyone has a better benchmarking solution then I'm all ears.

Sorry for the delay. What I assumed would be a simple task turned out not to be. The Benchmarks Game repo, as far as I could fathom, does not have a simple pre-configured script to build and bench the Rust program contributions. So I wrote and documented a small collection of Python scripts that do. I used a simple Welch's t-test modified from here.
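
For reference, this is roughly the statistic those scripts compute; the actual scripts are Python, so this Rust sketch and its function names are mine, for illustration only:

// Unbiased sample mean and variance of one set of timing measurements.
fn mean_and_var(xs: &[f64]) -> (f64, f64) {
    let n = xs.len() as f64;
    let mean = xs.iter().sum::<f64>() / n;
    let var = xs.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0);
    (mean, var)
}

// Welch's t-test for two samples with possibly unequal variances. Returns the
// t statistic and the Welch-Satterthwaite degrees of freedom; the p-value is
// then taken from a Student's t distribution with that many degrees of freedom.
fn welch_t(a: &[f64], b: &[f64]) -> (f64, f64) {
    let (ma, va) = mean_and_var(a);
    let (mb, vb) = mean_and_var(b);
    let (na, nb) = (a.len() as f64, b.len() as f64);
    let se2 = va / na + vb / nb;
    let t = (ma - mb) / se2.sqrt();
    let df = se2.powi(2)
        / ((va / na).powi(2) / (na - 1.0) + (vb / nb).powi(2) / (nb - 1.0));
    (t, df)
}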

The usual caveats apply when interpreting benchmark results. I'm not endorsing the validity of the included benchmark programs. I've not disassembled any binaries, and I'm not sure to what extent any auto-vectorization opportunities may arise.

I've benchmarked the latest stable Rust release against its patched version. The results are below. Most tests show no discernible difference and are discarded as insignificant (p threshold 0.05). The remainder are a mixed bag.

The benchmarking scripts are located here along with documentation. It should be trivial to run them yourself. If there is a problem then kindly open a ticket.

Finally, I'm not a benchmarking or stats expert, so any feedback would be appreciated.

Median times are in seconds, with lower values being faster.
Relative times are calculated from the fastest median time, with 1.000 being the fastest and higher values being slower.
p threshold = 0.05
T = true (significant)
F = false (not significant)

1.58.1, unpatched Rust release
1.58.1.revert, patched Rust release
                                1.58.1               1.58.1.revert    
executable                      relative   median    relative   median    significant
-------------------------------------------------------------------------------------
empty.rs                        1.042      0.001s    1.000      0.001s    T    0.000p
binarytrees.rust                1.005      2.268s    1.000      2.257s    F    0.315p
binarytrees.rust-2.rust         1.014      1.221s    1.000      1.204s    F    0.351p
binarytrees.rust-3.rust         1.000      1.566s    1.018      1.595s    F    0.105p
binarytrees.rust-4.rust         1.007      1.581s    1.000      1.570s    F    0.440p
binarytrees.rust-5.rust         1.000      1.177s    1.104      1.298s    T    0.000p
fannkuchredux.rust-2.rust       1.011     20.180s    1.000     19.953s    F    0.456p
fannkuchredux.rust-3.rust       1.000      6.819s    1.008      6.877s    T    0.000p
fannkuchredux.rust-4.rust       1.000      6.822s    1.006      6.861s    F    0.605p
fannkuchredux.rust-5.rust       1.000      6.970s    1.065      7.421s    T    0.000p
fannkuchredux.rust-6.rust       1.016      3.284s    1.000      3.232s    T    0.000p
fasta.rust                      1.000      4.131s    1.002      4.139s    F    0.111p
fasta.rust-2.rust               1.003      0.913s    1.000      0.910s    F    0.232p
fasta.rust-3.rust               1.000      1.825s    1.027      1.874s    T    0.000p
fasta.rust-4.rust               1.009      3.632s    1.000      3.599s    T    0.000p
fasta.rust-5.rust               1.005      0.911s    1.000      0.907s    F    0.307p
fasta.rust-6.rust               1.000      3.636s    1.002      3.644s    T    0.009p
fasta.rust-7.rust               1.000      0.734s    1.000      0.734s    F    0.628p
knucleotide.rust                1.005     19.766s    1.000     19.668s    F    0.360p
knucleotide.rust-2.rust         1.000      6.383s    1.008      6.433s    F    0.890p
knucleotide.rust-4.rust         1.008      6.948s    1.000      6.891s    F    0.068p
knucleotide.rust-5.rust         1.007      3.761s    1.000      3.733s    T    0.000p
knucleotide.rust-6.rust         1.002      2.651s    1.000      2.645s    F    0.447p
knucleotide.rust-7.rust         1.000      4.100s    1.022      4.190s    F    0.186p
knucleotide.rust-8.rust         1.000      4.637s    1.033      4.787s    F    0.296p
knucleotide.rust-9.rust         1.000      6.404s    1.004      6.431s    F    0.326p
mandelbrot.rust                 1.000      2.743s    1.013      2.777s    T    0.000p
mandelbrot.rust-3.rust          1.000      1.254s    1.005      1.261s    T    0.000p
mandelbrot.rust-5.rust          1.000      1.085s    1.000      1.085s    F    0.415p
mandelbrot.rust-6.rust          1.000      0.998s    1.000      0.998s    F    0.407p
mandelbrot.rust-7.rust          1.000      0.829s    1.003      0.832s    F    0.404p
mandelbrot.rust-8.rust          1.000      0.832s    1.001      0.833s    F    0.532p
nbody.rust                      1.000      7.093s    1.008      7.151s    F    0.081p
nbody.rust-2.rust               1.002      7.704s    1.000      7.686s    F    0.126p
nbody.rust-3.rust               1.000      7.291s    1.001      7.295s    F    0.060p
nbody.rust-4.rust               1.002      7.634s    1.000      7.620s    F    0.119p
nbody.rust-5.rust               1.000      8.076s    1.005      8.113s    F    0.959p
nbody.rust-6.rust               1.000      4.625s    1.018      4.710s    F    0.956p
nbody.rust-7.rust               1.005      3.169s    1.000      3.154s    F    0.273p
nbody.rust-8.rust               1.000      7.922s    1.003      7.947s    F    0.070p
nbody.rust-9.rust               1.025      2.841s    1.000      2.772s    T    0.000p
pidigits.rust                   1.000      0.741s    1.015      0.752s    T    0.000p
pidigits.rust-2.rust            1.002      0.748s    1.000      0.747s    F    0.694p
pidigits.rust-3.rust            1.009      0.750s    1.000      0.744s    T    0.026p
pidigits.rust-4.rust            1.000      0.717s    1.000      0.717s    F    0.913p
pidigits.rust-6.rust            1.000      0.655s    1.001      0.656s    F    0.273p
regexredux.rust                 1.000      1.314s    1.000      1.314s    F    0.583p
regexredux.rust-2.rust          1.000      1.338s    1.015      1.358s    T    0.000p
regexredux.rust-3.rust          1.000      1.289s    1.017      1.310s    T    0.011p
regexredux.rust-4.rust          1.001      1.309s    1.000      1.307s    F    0.330p
regexredux.rust-5.rust          1.039      1.311s    1.000      1.262s    T    0.000p
regexredux.rust-6.rust          1.000      1.201s    1.011      1.214s    T    0.000p
regexredux.rust-7.rust          1.001      0.755s    1.000      0.754s    F    0.724p
revcomp.rust                    1.001      0.424s    1.000      0.423s    F    0.955p
revcomp.rust-2.rust             1.000      0.696s    1.001      0.697s    F    0.785p
revcomp.rust-3.rust             1.006      0.723s    1.000      0.719s    T    0.037p
spectralnorm.rust               1.000      1.025s    1.000      1.025s    F    0.147p
spectralnorm.rust-2.rust        1.000      1.025s    1.000      1.025s    F    0.596p
spectralnorm.rust-3.rust        1.000      1.027s    1.008      1.035s    T    0.006p
spectralnorm.rust-4.rust        1.002      1.028s    1.000      1.026s    F    0.533p
spectralnorm.rust-5.rust        1.000      1.021s    1.003      1.024s    F    0.084p
spectralnorm.rust-6.rust        1.000      1.026s    1.000      1.026s    F    0.852p
spectralnorm.rust-7.rust        1.000      1.025s    1.000      1.025s    F    0.498p

Summary:
  test count: 63
  discarded (insignificant) tests: 43
  1.58.1 fastest: 12 times
  1.58.1.revert fastest: 8 times
Machine:
Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz
Gigabyte Technology Co., Ltd. Z68XP-UD3P
Corsair XMS 8GB DDR3
Crucial MX500 SATA

OS:
Ubuntu 20.04.4 LTS (Focal Fossa)
5.11.0-25-generic

@bjorn3
Member

bjorn3 commented Mar 1, 2022

Quickly looking through the results, a lot of the binaries get 30kb bigger with this PR. Cargo however gets 50kb (release mode) or 10kb (debug mode) smaller.

@workingjubilee
Member

workingjubilee commented Mar 1, 2022

I did not speak at first because I thought the result might still be interesting, but for the Benchmarks Game repo, unfortunately, if the primary effect is on autovectorization, then we can't expect any appreciable benefit from this. That is because all the Rust code written for the Benchmarks Game has already been carefully hand-tuned to generate specific assembly instructions using explicit vectorization intrinsics. It is enough to see that this change does not greatly penalize most of them. You would need all of those benchmarks to be written in the "naive" fashion, and then to bench them against both themselves and the more hand-tuned versions.
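
To illustrate (a hypothetical sketch, not taken from any actual benchmark entry): a hand-tuned entry spells the vector operation out with explicit intrinsics, so changes to the autovectorizer barely touch it, whereas a naive version depends entirely on LLVM spotting the pattern.

#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::{_mm_add_ps, _mm_loadu_ps, _mm_storeu_ps};

// Naive form: relies on the autovectorizer to produce a packed addps.
pub fn add4_naive(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    [a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]]
}

// Hand-tuned form, in the style of the Benchmarks Game entries: the vector add
// is written out explicitly, so it is unaffected by autovectorization changes.
#[cfg(target_arch = "x86_64")]
pub fn add4_intrinsics(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    let mut out = [0.0f32; 4];
    unsafe {
        let va = _mm_loadu_ps(a.as_ptr());
        let vb = _mm_loadu_ps(b.as_ptr());
        _mm_storeu_ps(out.as_mut_ptr(), _mm_add_ps(va, vb));
    }
    out
}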

@workingjubilee
Member

I should also note that doing so isn't actually required to accept this PR, IMO; only the missing codegen test is. And I agree with @scottmcm that a codegen test for LLVM IR vector-of-floats usage is enough.

@shampoofactory Also if you need any help with the rebase issue I am happy to fix that up, or you can ping me on Zulip if you have any other questions.

@shampoofactory
Contributor Author

@workingjubilee Hi. About the rebase, if you could fix that up it would be great. Left to my own devices, I'm more likely to make things worse.

@workingjubilee
Member

workingjubilee commented Mar 3, 2022

...oops, lol. That should not have happened.

@shampoofactory shampoofactory mentioned this pull request Mar 3, 2022
@workingjubilee
Member

If anyone wants to follow the exciting saga from here, check out #94570.

bors added a commit to rust-lang-ci/rust that referenced this pull request Mar 4, 2022
…gjubilee

Reopen 91719

Reopened rust-lang#91719, which was closed inadvertently due to technical difficulties.