Don't aggregate homogeneous floats in the Rust ABI #93564
Conversation
(rust-highfive has picked a reviewer for you, use r? to override)
```rust
// We don't want to aggregate floats as Integer because this will
// hurt (#93490) the codegen and performance, so instead use a
// vector (this is similar to what clang is doing in C++).
RegKind::Float => arg.cast_to(Reg { kind: RegKind::Vector, size }),
```
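For a concrete sense of what this arm changes, here is a hedged sketch; the struct and the before/after notes are my own illustration, not taken from the PR:

```rust
// A small homogeneous-float aggregate: four f32s, 128 bits total, which is
// within the Rust ABI's by-value limit of ptr-size * 2 on x86_64.
// Before this PR such a value was aggregated into a single integer (i128);
// the arm above would instead cast it to a vector Reg (<4 x float>).
#[derive(Clone, Copy, Debug)]
struct Quad(f32, f32, f32, f32);

#[inline(never)]
fn sum(q: Quad) -> f32 {
    q.0 + q.1 + q.2 + q.3
}

fn main() {
    println!("{}", sum(Quad(1.0, 2.0, 3.0, 4.0)));
}
```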
This is unsound. Different -Ctarget-feature settings in different crates can cause ABI incompatibilities.
Given the comment below, that was also what I thought, but trying my PR with code that uses different target features gives me the correct output:
```rust
#![feature(bench_black_box)]
#![allow(non_camel_case_types)]

#[derive(Debug)]
struct f32x4(f32, f32, f32, f32);

#[target_feature(enable = "avx")]
unsafe fn foo() -> f32x4 {
    std::hint::black_box(f32x4(0., 1., 2., std::hint::black_box(3.)))
}

#[target_feature(enable = "sse3")]
unsafe fn bar(arg: f32x4) {
    println!("{:?} == f32x4(0.0, 1.0, 2.0, 3.0)", std::hint::black_box(arg));
}

fn main() {
    unsafe { bar(std::hint::black_box(foo())) }
}
```
```
$ rustc +stage1 -C opt-level=0 a.rs && ./a
f32x4(0.0, 1.0, 2.0, 3.0) == f32x4(0.0, 1.0, 2.0, 3.0)
$ rustc +stage1 -C opt-level=3 a.rs && ./a
f32x4(0.0, 1.0, 2.0, 3.0) == f32x4(0.0, 1.0, 2.0, 3.0)
```
That's just a 128-bit vector, which fits into a single XMM register. And since that is the lower half of the YMM register in AVX code, it probably just happens to work by accident on x86_64.
I guess using an x86 target with the SSE target feature enabled on only foo or bar might fail (careful, IIRC the i686 arch enables SSE2 by default). Or maybe some of the other supported archs can trigger a bug here as well.
The i586 targets disable SSE by default, I believe. You can also manually disable it.
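For concreteness, a hedged sketch of how one might exercise that with the example above; the target and flags are my guesses, mirroring the earlier transcript:

```
$ rustc +stage1 -C opt-level=3 --target i586-unknown-linux-gnu a.rs && ./a
$ rustc +stage1 -C opt-level=3 -C target-feature=-sse a.rs && ./a
```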
@dotdash Yes and no; above this code there is a check for the max size by value, which is ptr-size * 2, so the created vector cannot be (at least on x86_64) larger than 128 bits, which, as you said, fits in the lower half of the YMM register. This is also consistent with clang++, which (according to my tests) never passes vectors larger than ptr-size * 2 by value.
Right, looks like it doesn't error. However, it does change the ABI: the value is returned in XMM versus on the stack. This means that this makes disabling SSE support UB where previously it wasn't, which is still a breaking change. See https://llvm.godbolt.org/z/YThqqs9Ts (left: SSE disabled, right: SSE enabled)
Noooo... Well, this is bad, really bad. @bjorn3 Any recommendation(s) on how to proceed from here?
Not sure. For vector types we "fix" this by forcing them to always be passed on the stack for the Rust ABI and hoping LLVM's mem2reg pass will eliminate the stack usage after inlining.
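To make that strategy concrete, here is a minimal self-contained sketch; the ArgAbi/PassMode names are hypothetical stand-ins for rustc's real types in rustc_target, not the compiler's actual code:

```rust
// Hypothetical stand-ins for rustc's ArgAbi / PassMode, only to make the
// "force vectors onto the stack" idea concrete.
#[derive(Debug, PartialEq)]
enum PassMode {
    Direct,
    Indirect, // passed via a pointer to a stack slot
}

#[allow(dead_code)]
enum LayoutKind {
    Scalar,
    Vector,
    Aggregate,
}

struct ArgAbi {
    layout: LayoutKind,
    mode: PassMode,
}

impl ArgAbi {
    fn make_indirect(&mut self) {
        self.mode = PassMode::Indirect;
    }
}

// For the Rust ABI, pass vectors indirectly so caller and callee agree no
// matter which -Ctarget-feature flags each crate was built with; LLVM's
// mem2reg is then relied upon to remove the stack traffic after inlining.
fn fixup(arg: &mut ArgAbi) {
    if let LayoutKind::Vector = arg.layout {
        arg.make_indirect();
    }
}

fn main() {
    let mut arg = ArgAbi { layout: LayoutKind::Vector, mode: PassMode::Direct };
    fixup(&mut arg);
    assert_eq!(arg.mode, PassMode::Indirect);
}
```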
@bjorn3 I've updated the PR to not modify the Reg representation of homogeneous floats. I think this now completely resolves your concern about ABI compatibility, as the generated IR doesn't use a vector type and is the same as Rust <= 1.47.
For information, as before, it doesn't compile with -Ctarget-feature=-sse but does compile with -Ctarget-feature=-sse,+soft-float.
Are you satisfied with the changes, @bjorn3?
Passing structs of four f32 values as a single vector isn't always what you want. For general-purpose structs where you don't perform SIMD-like operations, the code might end up worse this way. https://godbolt.org/z/Ecoz3ThWz shows a (made-up) example where the single 4-element vector variant produces worse code than the one that uses two 2-element vectors. There are trade-offs to be made here. I suppose that in the long run, the Rust ABI may have to become arch-specific as well. There's a lot to consider.
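In the same spirit as that link, here is a made-up example (the type and operations are mine) of a homogeneous-float struct whose fields are consumed independently, where a single 4-element vector can cost extract/shuffle instructions:

```rust
// Four f32 fields used as independent scalars, not SIMD lanes: packing the
// struct into one <4 x float> forces the callee to extract lanes before it
// can do scalar math on them.
#[derive(Clone, Copy)]
pub struct Rect {
    x: f32,
    y: f32,
    w: f32,
    h: f32,
}

#[inline(never)]
pub fn right_edge(r: Rect) -> f32 {
    r.x + r.w
}

#[inline(never)]
pub fn bottom_edge(r: Rect) -> f32 {
    r.y + r.h
}

fn main() {
    let r = Rect { x: 1.0, y: 2.0, w: 3.0, h: 4.0 };
    println!("{} {}", right_edge(r), bottom_edge(r));
}
```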
(force-pushed from 9150cc5 to bb38320)
I have updated this PR to now only take homogeneous floats into account, and only on x86_64 (for now) use a vector to represent them. On other archs their current representation is left unmodified, instead of being aggregated into a single Integer as before. I think this should resolve the concern raised by @bjorn3 and @dotdash.
I agree this is not ideal, but this is definitely better than the current situation.
For now this PR does the Vector cast only on x86_64, per your concern. It may be extended in the future.
(force-pushed from bb38320 to 5a71f32)
Following the new concern raised by @bjorn3, I've once again updated the logic in this PR to not change the Reg representation at all.
If this is intended to fix a regression in the assembly emitted, then this requires assembly tests before it can land.
(force-pushed from 5a71f32 to 5a9f3d9)
(force-pushed from 5a9f3d9 to 7b69d21)
This is basically a rip-off of src/test/ui/simd/target-feature-mixup.rs, but for floats and without #[repr(simd)].
This is required because, depending on the version of LLVM FileCheck, it could fail.
Then I think I have the requisite investiture to do this...
Awaiting bors try build completion. @rustbot label: +S-waiting-on-perf
⌛ Trying commit 74a429e with merge d557e0f29befbe3f15344eb15c19364a660ce35b...
☀️ Try build successful - checks-actions
Queued d557e0f29befbe3f15344eb15c19364a660ce35b with parent 45e2c28, future comparison URL.
Finished benchmarking commit (d557e0f29befbe3f15344eb15c19364a660ce35b): comparison URL. Summary: This benchmark run shows 4 relevant regressions 😿 to instruction counts.

If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf. Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so, since this PR led to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @bors rollup=never
The regressions are purely in the …
Investigating locally …
Anyway, I don't think we should worry about these localized regressions because: …
@rustbot label: +perf-regression-triaged |
Note: this benchmark was reduced from real-world code triggering #65147
Yeah, the benchmark is artificial, but the regression would hit real code. If it was a handful of tiny regressions, I wouldn't think anything of it, but it's a very telling regression on a single case, and I can only infer that this test case was added precisely to prevent such a regression. I don't know that this isn't worth the changes, either; I just am not sure I have sufficient proof for posterity, for when, e.g., someone files a regression report after this gets merged and files a PR to revert this change. We also have the option to land #91719 instead.
This PR removes the aggregate "optimization" done to small (<= ptr-size * 2) homogeneous float arguments as a result of these problems: …

The logic is replaced with a more inner-type-aware approach (integers vs floats, homogeneous vs heterogeneous).

Specifically, the behavior before this PR was to aggregate every type (struct, array) that was passed or returned and was small enough (<= ptr-size * 2) into a single, unique integer, no matter the underlying types. This meant, for example, that [f32; 3] (which is a very common representation for a vector in 3 dimensions) would be represented as a single i96 in the LLVM IR, which would need to be packed to be returned and then unpacked at the caller to be used. #91447 (comment)

The new behavior with this PR is to consider homogeneity:

- …
- integer types keep being aggregated into a single, unique integer

Rust code: …
ASM Before: …
ASM After: …
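As a hedged stand-in for those snippets, here is my own minimal example of the [f32; 3] case described above (not the PR's actual test code):

```rust
// Returning [f32; 3], a common 3-D vector representation. Before this PR the
// Rust ABI packed it into a single i96, so the callee had to bit-pack three
// floats into an integer and the caller had to unpack them again.
#[inline(never)]
pub fn scale(v: [f32; 3], k: f32) -> [f32; 3] {
    [v[0] * k, v[1] * k, v[2] * k]
}

fn main() {
    println!("{:?}", scale([1.0, 2.0, 3.0], 2.0));
}
```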
Note that this pull request is a continuation of #93405, which I closed due to the performance regressions.
Fixes #85265
Fixes #91447
Fixes #93490