LLVM generates branch soup for array partition point #129530
Comments
This is loop unrolling. So...
Why do you think this would be better? If you think the unrolled version is slower, do you have a benchmark? If you think the code size is problematic, does
std's
Is there a reason that std's Is this code not faster for the small array size you are concerned about?
The "a bunch of jumps" approach will need fewer comparisons than the "a bunch of cmovs" approach if the zero is usually near the start of the array. Since the compiler doesn't know if this is often the case in your workload, it assumes that you know what you're doing and therefore preserves what your code does.
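To make the comparison-count argument concrete, here is a minimal sketch that counts comparisons for both strategies. The function names `scan_early_exit` and `scan_full` are hypothetical and only illustrate the trade-off; whether the compiler actually emits jumps or `cmov`s for either one is not guaranteed.

```rust
/// Early-exit linear scan: stops at the first zero, so the work done
/// depends on where that zero sits. Returns (index, comparisons made).
fn scan_early_exit(array: &[usize]) -> (usize, usize) {
    let mut comparisons = 0;
    for (i, &x) in array.iter().enumerate() {
        comparisons += 1;
        if x == 0 {
            return (i, comparisons);
        }
    }
    (array.len(), comparisons)
}

/// Full scan with no early exit: every element is compared, whatever
/// the data looks like. Returns (index, comparisons made).
fn scan_full(array: &[usize]) -> (usize, usize) {
    let mut first_zero = array.len();
    let mut comparisons = 0;
    // Walk backwards so the lowest zero index is the last one written.
    for (i, &x) in array.iter().enumerate().rev() {
        comparisons += 1;
        // Written as a select so it can lower to a conditional move.
        first_zero = if x == 0 { i } else { first_zero };
    }
    (first_zero, comparisons)
}
```

With a zero at index 2 of a 24-element array, the early-exit scan makes 3 comparisons while the full scan always makes 24. When the array holds no zero at all, both make 24.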
FWIW, if you want a branchless implementation, then SIMD will likely be 2-5x faster than the scalar version, depending on what target features you can afford to use.

```rust
#![feature(portable_simd)]
use std::simd::cmp::SimdPartialEq;
use std::simd::Simd;

pub fn partition_point_simd(array: &[usize; 24]) -> usize {
    // Pad to 32 lanes; the 8 padding lanes are zero, so an input with
    // no zeros yields 24.
    let mut array_zext = [0; 32];
    array_zext[..24].copy_from_slice(array);
    let array = Simd::from_array(array_zext);
    // One mask bit per lane, set where the lane equals zero.
    let mask = array.simd_eq(Simd::splat(0));
    // The lowest set bit is the index of the first zero.
    mask.to_bitmask().trailing_zeros() as usize
}
```

However, in the latest nightly,

```rust
pub fn partition_point_std(array: &[usize; 24]) -> usize {
    array.partition_point(|x| *x != 0)
}
```

It's also remarkably small:

```asm
playground::partition_point_std:
    mov rax, qword ptr [rcx + 96]
    test rax, rax
    mov edx, 12
    cmove rdx, rax
    lea rax, [rdx + 6]
    mov r8d, eax
    cmp qword ptr [rcx + 8*r8], 0
    cmovne rdx, rax
    lea rax, [rdx + 3]
    mov r8d, eax
    cmp qword ptr [rcx + 8*r8], 0
    cmovne rdx, rax
    lea r8, [rdx + 1]
    cmp qword ptr [rcx + 8*rdx + 8], 0
    cmove r8, rdx
    lea rax, [r8 + 1]
    cmp qword ptr [rcx + 8*r8 + 8], 0
    cmove rax, r8
    cmp qword ptr [rcx + 8*rax], 1
    sbb rax, -1
    ret
```
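```rust
fn partition_point_branchless<const N: usize>(array: &[u32; N]) -> usize {
    // Bit i of `mask` is set when array[i] is zero; a u32 mask assumes N <= 32.
    let mut mask: u32 = 0;
    for i in 0..N {
        if array[i] == 0 {
            mask |= 1 << i;
        }
    }
    // The lowest set bit is the first zero. With no zero at all,
    // trailing_zeros() is 32, so clamp the result to N.
    (mask.trailing_zeros() as usize).min(N)
}
```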
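A `u32` mask can only track 32 elements. One possible alternative for longer arrays, sketched here under the assumption that the input really is partitioned (every nonzero element comes before every zero, which `partition_point` also requires), is to count the nonzero elements. The name `partition_point_count` is hypothetical, and this is not a claim about what std or LLVM does.

```rust
/// Counts nonzero elements. For a partitioned array (all nonzero values
/// first, then all zeros) the count equals the partition point.
/// If the array is not partitioned, this is NOT the index of the first zero.
fn partition_point_count<const N: usize>(array: &[u32; N]) -> usize {
    array
        .iter()
        // `usize::from(bool)` maps false to 0 and true to 1, so the sum
        // needs no data-dependent control flow in the source.
        .map(|&x| usize::from(x != 0))
        .sum()
}
```

The loop has no early exit, so it always touches all `N` elements, which is the same trade-off as the mask version: constant work in exchange for no dependence on where the first zero is.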
I think this can be closed - LLVM is doing something reasonable here with the context it has and there are some alternatives provided. |
I tried this code:
I expected to see this happen: Assembly composed of `cmov` and/or `adox` instruction. Or at least `mov` + `je` to a single exit branch.

Instead, this happened:
First occurs in 1.19.0 with the alternative code snippet: