Don't aggregate homogeneous floats in the Rust ABI #93564

Closed
wants to merge 4 commits

Conversation

Member

@Urgau Urgau commented Feb 1, 2022

This PR removes the aggregation "optimization" applied to small (<= ptr-size * 2) homogeneous float arguments, as a result of the problems listed in the linked issues below.

The logic is replaced with an approach that is more aware of the inner types (integers vs. floats, homogeneous vs. heterogeneous).

Specifically, the behavior before this PR was to aggregate every type (struct, array) that was passed or returned and was small enough (<= ptr-size * 2) into a single integer, no matter the underlying type. This meant, for example, that [f32; 3] (a very common representation for a 3-dimensional vector) was represented as a single i96 in the LLVM IR, which had to be packed to be returned and then unpacked at the caller to be used. #91447 (comment)

The new behavior with this PR is to handle homogeneous aggregates according to their inner type:

  • integers: keep aggregating them into a single integer
  • floats: pass them as an array if <= ptr-size, and by pointer to an array if > ptr-size, as an optimization
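
As an illustration of the [f32; 3] case mentioned above, here is a hypothetical function in the same style as the examples below (it is not part of this PR or its tests):

// Hypothetical example: [f32; 3] is a common 3D-vector representation.
// Before this PR it was packed into a single i96 in the LLVM IR (#91447);
// under the new rule it is 12 bytes, which is > ptr-size on x86_64, so it
// would be passed by pointer to a float array instead.
#[no_mangle]
pub fn sum_f32x3(a: [f32; 3], b: [f32; 3]) -> [f32; 3] {
    [
        a[0] + b[0],
        a[1] + b[1],
        a[2] + b[2],
    ]
}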

Rust code:

#[no_mangle]
pub fn sum_f32(a: f32, b: f32) -> f32 {
    a + b
}

#[no_mangle]
pub fn sum_f32x2(a: [f32; 2], b: [f32; 2]) -> [f32; 2] {
    [
        a[0] + b[0],
        a[1] + b[1],
    ]
}

#[no_mangle]
pub fn sum_f32x4(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    [
        a[0] + b[0],
        a[1] + b[1],
        a[2] + b[2],
        a[3] + b[3],
    ]
}

ASM Before:

sum_f32:
	addss	xmm0, xmm1
	ret

sum_f32x2:
	.cfi_startproc
	movd	xmm0, edi
	shr	rdi, 32
	movd	xmm1, edi
	movd	xmm2, esi
	addss	xmm2, xmm0
	shr	rsi, 32
	movd	xmm0, esi
	addss	xmm0, xmm1
	movd	eax, xmm2
	movd	ecx, xmm0
	shl	rcx, 32
	or	rax, rcx
	ret

sum_f32x4:
	movd	xmm0, esi
	mov	rax, rsi
	shld	rax, rdi, 32
	shr	rsi, 32
	movq	xmm1, rax
	movq	xmm2, rsi
	punpckldq	xmm2, xmm1
	movd	xmm1, edi
	movd	xmm3, edx
	addps	xmm3, xmm1
	movd	xmm1, ecx
	addss	xmm1, xmm0
	shrd	rdx, rcx, 32
	shr	rcx, 32
	movq	xmm0, rdx
	movq	xmm4, rcx
	punpckldq	xmm4, xmm0
	addps	xmm4, xmm2
	movd	eax, xmm3
	movd	ecx, xmm1
	movd	edx, xmm4
	shufps	xmm4, xmm4, 85
	movd	esi, xmm4
	shl	rsi, 32
	shl	rdx, 32
	or	rdx, rcx
	or	rax, rsi
	ret

ASM After:

sum_f32:
	addss	xmm0, xmm1
	ret

sum_f32x2:
	addss	xmm0, xmm2
	addss	xmm1, xmm3
	ret

sum_f32x4:
	mov	rax, rdi
	movups	xmm0, xmmword ptr [rsi]
	movups	xmm1, xmmword ptr [rdx]
	addps	xmm1, xmm0
	movups	xmmword ptr [rdi], xmm1
	ret

Note that this pull request is a continuation of #93405, which I closed due to the performance regressions.

Fixes #85265
Fixes #91447
Fixes #93490

@rustbot rustbot added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Feb 1, 2022
@rust-highfive
Collaborator

r? @matthewjasper

(rust-highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Feb 1, 2022
// We don't want to aggregate floats as an integer because this would
// hurt codegen and performance (#93490), so instead use a
// vector (this is similar to what clang does in C++).
RegKind::Float => arg.cast_to(Reg { kind: RegKind::Vector, size }),
Member

This is unsound. Different -Ctarget-features of different crates can cause ABI incompatibilities.

Member Author

Given the comment below, that was also what I thought, but trying my PR with code that uses different target features gives me the correct output:

#![feature(bench_black_box)]
#![allow(non_camel_case_types)]

#[derive(Debug)]
struct f32x4(f32, f32, f32, f32);

#[target_feature(enable = "avx")]
unsafe fn foo() -> f32x4 { std::hint::black_box(f32x4(0., 1., 2., std::hint::black_box(3.))) }

#[target_feature(enable = "sse3")]
unsafe fn bar(arg: f32x4) {
    println!("{:?} == f32x4(0.0, 1.0, 2.0, 3.0)", std::hint::black_box(arg));
}

fn main() { unsafe { bar(std::hint::black_box(foo())); } }
$ rustc +stage1 -C opt-level=0 a.rs && ./a
f32x4(0.0, 1.0, 2.0, 3.0) == f32x4(0.0, 1.0, 2.0, 3.0)
$ rustc +stage1 -C opt-level=3 a.rs && ./a
f32x4(0.0, 1.0, 2.0, 3.0) == f32x4(0.0, 1.0, 2.0, 3.0)

Contributor

That's just a 128-bit vector, which fits into a single XMM register. And since that is the lower half of the YMM register in AVX code, it probably just happens to work by accident on x86_64.

I guess using an x86 target with code that has the SSE target feature enabled on only foo or bar might fail (careful, IIRC the i686 arch enables SSE2 by default). Or maybe some of the other supported archs can trigger a bug here as well.

Member

The i586 targets disable SSE by default I believe. You can also manually disable it.

Member Author

@dotdash Yes and no: above this code there is a check on the maximum size passed by value, which is ptr-size * 2, so the created vector cannot be larger than 128 bits (at least on x86_64), which, as you said, fits in the lower half of the YMM register. This is also consistent with clang++, which (according to my tests) never passes vectors larger than ptr-size * 2 by value.
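
For illustration, a rough stand-alone sketch of that size guard (hypothetical names, not the actual rustc code):

// Hypothetical sketch of the guard described above: aggregates larger than
// two pointer-sized slots never become a by-value vector, so on x86_64 the
// resulting vector is at most 128 bits (one XMM register).
const PTR_SIZE_BYTES: u64 = 8; // assuming x86_64

fn may_cast_to_vector(total_size_bytes: u64) -> bool {
    total_size_bytes <= 2 * PTR_SIZE_BYTES
}

fn main() {
    assert!(may_cast_to_vector(16));  // e.g. [f32; 4] -> 128-bit vector is allowed
    assert!(!may_cast_to_vector(32)); // larger aggregates take another path
}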

Member

@bjorn3 bjorn3 Feb 2, 2022

Right, looks like it doesn't error. However, it does change the ABI from returning the value in an XMM register to returning it on the stack. This means that it makes disabling SSE support UB where previously it wasn't, which is still a breaking change. See https://llvm.godbolt.org/z/YThqqs9Ts (left: SSE disabled, right: SSE enabled)

Member Author

Noooo... Well, this is bad, really bad. @bjorn3 Any recommendation(s) on how to proceed from here?

Member

Not sure. For vector types we "fix" this by forcing them to always be passed on the stack for the Rust ABI and hoping LLVM's mem2reg pass will eliminate the stack usage after inlining.
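
For context, a minimal stand-alone sketch of that idea (hypothetical stand-in types, not rustc's real ArgAbi):

// Hypothetical stand-in types so the sketch compiles on its own; the real
// logic lives in rustc's fn-ABI adjustment code.
#[allow(dead_code)]
enum ReprSketch { Vector, Scalar, Aggregate }

struct ArgSketch { repr: ReprSketch, pass_indirectly: bool }

impl ArgSketch {
    // Mark the argument as passed through memory instead of in registers.
    fn make_indirect(&mut self) { self.pass_indirectly = true; }
}

// Vector layouts depend on target features (SSE/AVX/...), so for the Rust ABI
// they always go through the stack; LLVM's mem2reg is then expected to remove
// the stack traffic after inlining.
fn adjust_for_rust_abi(arg: &mut ArgSketch) {
    if let ReprSketch::Vector = arg.repr {
        arg.make_indirect();
    }
}

fn main() {
    let mut arg = ArgSketch { repr: ReprSketch::Vector, pass_indirectly: false };
    adjust_for_rust_abi(&mut arg);
    assert!(arg.pass_indirectly);
}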

Member Author

@bjorn3 I've updated the PR to not modify the Reg representation of homogeneous floats. I think this now completely resolves your concern about ABI compatibility, as the generated IR doesn't use a vector type and is the same as Rust <= 1.47.
For information: as before, it doesn't compile with -Ctarget-feature=-sse but does compile with -C target-feature=-sse,+soft-float.

Member

Are you satisfied with the changes, @bjorn3?

@dotdash
Contributor

dotdash commented Feb 2, 2022

Passing structs of four f32 values as a single vector isn't always what you want. For general purpose structs where you don't perform SIMD-like operations, the code might end up worse this way.

https://godbolt.org/z/Ecoz3ThWz shows a (made up) example where the single 4-element vector variant produces worse code than the one that uses two 2-element vectors.
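
For illustration, a made-up example (not the code behind the godbolt link) of the kind of non-SIMD struct meant here:

// Made-up example of a "general purpose" struct of four f32s that is used
// field-wise with no SIMD-like operation; packing it into one 4-wide vector
// can force extra lane extractions/shuffles compared with scalar or 2+2 code.
pub struct Rect { pub x: f32, pub y: f32, pub w: f32, pub h: f32 }

#[no_mangle]
pub fn right_edge(r: Rect) -> f32 {
    r.x + r.w
}

fn main() {
    let r = Rect { x: 1.0, y: 2.0, w: 3.0, h: 4.0 };
    println!("right edge = {}", right_edge(r));
}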

There are trade-offs to be made here. I suppose that in the long run, the Rust ABI may have to become arch-specific as well. There's a lot to consider.

@Urgau Urgau force-pushed the abi-arguments-optimization branch from 9150cc5 to bb38320 Compare February 2, 2022 13:36
@Urgau Urgau changed the title Only aggregate homogeneous types in the Rust ABI Don't aggregate homogeneous floats in the Rust ABI Feb 2, 2022
@Urgau
Member Author

Urgau commented Feb 2, 2022

I have updated this PR to now only take homogeneous floats into account, and to use a vector to represent them only on x86_64 (for now). On other archs their current representation is left unmodified, instead of being aggregated into a single integer as before. I think this should resolve the concerns raised by @bjorn3 and @dotdash.

@Urgau
Member Author

Urgau commented Feb 2, 2022

Passing structs of four f32 values as a single vector isn't always what you want. For general purpose structs where you don't perform SIMD-like operations, the code might end up worse this way.

I agree this is not ideal but this is definitely better than the current situation.

There are trade-offs to be made here. I suppose that in the long run, the Rust ABI may have to become arch-specific as well. There's a lot to consider.

For now this PR does the Vector cast only on x86_64, per your concern. It may be extended in the future.

@Urgau Urgau force-pushed the abi-arguments-optimization branch from bb38320 to 5a71f32 Compare February 2, 2022 23:52
@Urgau
Member Author

Urgau commented Feb 3, 2022

Following the (new) concern from @bjorn3, I've once again updated the logic in this PR to not change the Reg representation at all.
The "new" logic is basically the same as Rust 1.47 and below (only for homogeneous floats): pass them as an array if <= ptr-size, and by pointer to an array if > ptr-size, as an optimization.

@workingjubilee
Member

If this is intended to fix a regression in the assembly emitted then this requires assembly tests before it can land.

@Urgau Urgau force-pushed the abi-arguments-optimization branch from 5a71f32 to 5a9f3d9 Compare February 21, 2022 00:06
@Urgau Urgau force-pushed the abi-arguments-optimization branch from 5a9f3d9 to 7b69d21 Compare February 21, 2022 00:27
This is basically a rip-off of src/test/ui/simd/target-feature-mixup.rs,
but for floats and without #[repr(simd)].
This is required because, depending on the version of LLVM FileCheck,
it could fail.
@workingjubilee
Member

Then I think I have the requisite investiture to do this...
@bors try @rust-timer queue

@rust-timer
Collaborator

Awaiting bors try build completion.

@rustbot label: +S-waiting-on-perf

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Feb 21, 2022
@bors
Contributor

bors commented Feb 21, 2022

⌛ Trying commit 74a429e with merge d557e0f29befbe3f15344eb15c19364a660ce35b...

@bors
Contributor

bors commented Feb 21, 2022

☀️ Try build successful - checks-actions
Build commit: d557e0f29befbe3f15344eb15c19364a660ce35b

@rust-timer
Collaborator

Queued d557e0f29befbe3f15344eb15c19364a660ce35b with parent 45e2c28, future comparison URL.

@rust-timer
Collaborator

Finished benchmarking commit (d557e0f29befbe3f15344eb15c19364a660ce35b): comparison url.

Summary: This benchmark run shows 4 relevant regressions 😿 to instruction counts.

  • Average relevant regression: 8.5%
  • Largest regression in instruction counts: 8.8% on full builds of await-call-tree opt

If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR led to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please fix the regressions and do another perf run. If the next run shows neutral or positive results, the label will be automatically removed.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf +perf-regression

@rustbot rustbot added the perf-regression Performance regression. label Feb 21, 2022
@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Feb 21, 2022
@Urgau
Member Author

Urgau commented Feb 21, 2022

The regressions are purely in the await-call-tree compiler benchmark. Looking closely at the performance data, the regressions are located in the fn_abi_of_instance section, which is expected as this PR touches this area. The time delta of these 4 regressions is well over 3000%, from ~0.000s to ~0.001s (according to rustc-perf).

Investigating await-call-tree locally with my changes showed that there are now 21 calls to the homogeneous_aggregate function, most of them redundant (4 different types), but adding caching seems overkill and counterproductive, as computing the hash would most certainly take more time than just doing the homogeneous_aggregate call. Maybe adding #[inline] to the homogeneous_aggregate function could help? (Some of the matching is unnecessary; we only pass Abi::Aggregate.)

Anyway, I don't think we should worry about these localized regressions because:

  • await-call-tree is a micro-benchmark (not representative of any real-world code)
  • the regressions are very small (in terms of time)
  • the changes in this PR fix major regressions with floats (re-enabling auto-vectorization in most cases)
  • the rest of the benchmarks don't show regressions

@rustbot label: +perf-regression-triaged

@rustbot rustbot added the perf-regression-triaged The performance regression has been triaged. label Feb 21, 2022
@lqd
Member

lqd commented Feb 21, 2022

await-call-tree is a micro-benchmark (not representative of any real-world code)

Note: this benchmark was reduced from real-world code triggering #65147

@workingjubilee
Member

workingjubilee commented Feb 23, 2022

Yeah, the benchmark is artificial but the regression would hit real code. If it was a handful of tiny regressions, I wouldn't think anything of it, but it's a very telling regression on a single case, and I can only infer that this test case was added precisely to prevent such a regression. I don't know that this isn't worth the changes, either, I just am not sure I have sufficient proof for posterity, when e.g. someone files a regression report after this gets merged and files a PR to revert this change.

We also have the option to land #91719 instead.

@workingjubilee
Member

We landed the other PR #91719 in #94570, which is unfortunately incompatible with this approach. Thank you for exploring this issue with us, however.

Labels
perf-regression Performance regression.
perf-regression-triaged The performance regression has been triaged.
S-waiting-on-review Status: Awaiting review from the assignee but also interested parties.
T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
10 participants