Slice::contains generates suboptimal assembly code #88204
Comments
@rustbot label +I-slow |
It can be made vectorizer-friendly by using |
It looks like something changed in one of the latest nightly versions (I retested this on I can't figure out why the assembly generated when using arrays is so different from the assembly generated when using borrowed slices. |
Update to LLVM 13. #87570 |
On the current nightly available in godbolt ( The only weird thing I found was in the assembly generated by the I would like to know if there is a reason for this discrepancy, and if a series of
PS: rustc 1.55 vs current nightly for
pub fn test(slice: &[u64; 16], val: u64) -> bool {
    slice[0] == val
        || slice[1] == val
        || slice[2] == val
        || slice[3] == val
        || slice[4] == val
        || slice[5] == val
        || slice[6] == val
        || slice[7] == val
        || slice[8] == val
        || slice[9] == val
        || slice[10] == val
        || slice[11] == val
        || slice[12] == val
        || slice[13] == val
        || slice[14] == val
        || slice[15] == val
}
rustc_1_55::test:
vmovq xmm0, rsi
vpbroadcastq ymm0, xmm0
vpcmpeqq ymm1, ymm0, ymmword ptr [rdi + 96]
vpcmpeqq ymm2, ymm0, ymmword ptr [rdi + 64]
vpcmpeqq ymm3, ymm0, ymmword ptr [rdi + 32]
vpcmpeqq ymm0, ymm0, ymmword ptr [rdi]
vpackssdw ymm1, ymm2, ymm1
vpackssdw ymm0, ymm0, ymm3
vpermq ymm1, ymm1, 216
vpermq ymm0, ymm0, 216
vpackssdw ymm0, ymm0, ymm1
vpmovmskb eax, ymm0
test eax, -1431655766
setne al
vzeroupper
ret
nightly::test:
mov al, 1
cmp qword ptr [rdi], rsi
je .LBB0_16
cmp qword ptr [rdi + 8], rsi
je .LBB0_16
cmp qword ptr [rdi + 16], rsi
je .LBB0_16
cmp qword ptr [rdi + 24], rsi
je .LBB0_16
cmp qword ptr [rdi + 32], rsi
je .LBB0_16
cmp qword ptr [rdi + 40], rsi
je .LBB0_16
cmp qword ptr [rdi + 48], rsi
je .LBB0_16
cmp qword ptr [rdi + 56], rsi
je .LBB0_16
cmp qword ptr [rdi + 64], rsi
je .LBB0_16
cmp qword ptr [rdi + 72], rsi
je .LBB0_16
cmp qword ptr [rdi + 80], rsi
je .LBB0_16
cmp qword ptr [rdi + 88], rsi
je .LBB0_16
cmp qword ptr [rdi + 96], rsi
je .LBB0_16
cmp qword ptr [rdi + 104], rsi
je .LBB0_16
cmp qword ptr [rdi + 112], rsi
je .LBB0_16
cmp qword ptr [rdi + 120], rsi
sete al
.LBB0_16:
ret |
If you use |
Could this be linked to #83623, in connection with the update to LLVM 13? The code vectorizes when using rustc 1.53-1.55 but not on the current nightly. |
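For anyone wanting to experiment with the 16-element case, here is a minimal sketch of a variant that avoids the early exits implied by the `||` chain: it folds with the non-short-circuiting `|` operator, so every element is compared unconditionally. The name `contains_branchless` is hypothetical and is not taken from this thread, and whether a given rustc/LLVM version actually vectorizes it is an assumption that should be checked in godbolt.

```rust
/// Hypothetical helper, named only for this illustration.
/// Compares every element against `val` and combines the results with
/// the non-short-circuiting `|`, so there is no early-exit branch.
pub fn contains_branchless(slice: &[u64; 16], val: u64) -> bool {
    slice.iter().fold(false, |acc, &e| acc | (e == val))
}

/// Reference version using the standard library method, for comparison.
pub fn contains_std(slice: &[u64; 16], val: u64) -> bool {
    slice.contains(&val)
}
```

Both functions return the same result for any input; only the shape of the generated code may differ.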
Given val: u8, slice: &[u8; 8] and arr: [u8; 8], I expected the following statements to compile down to the same thing:

However, the resulting assembly differs quite a lot:

Statement a compiles down to a loop, checking one element at a time, except for T = u8|i8 and N < 16, where it instead falls onto the fast path of memchr, which gets optimized a little better.

Statement b compiles down to an unrolled loop, checking one element at a time in a branchless fashion. Most of the time it doesn't produce any SIMD instructions.

Statement c always compiles down to a loop, checking one element at a time, except for T = u8|i8 and N >= 16, where it instead calls memchr_general_case.

Statement d always compiles down to a few branchless SIMD instructions for any primitive type used and any array size.

Because the slice/array size is known at compile time and the type checker guarantees that it will be respected by any calling function, I expected the compiler to take this into account while optimizing the resulting assembly. However, this information seems to be lost at some point when using the contains method.

If I'm not mistaken, arr.contains(&val) and slice.contains(&val) are simplified to arr.as_ref().iter().any(|e| *e == val) and slice.iter().any(|e| *e == val) respectively (which is weird, because for some N and T they don't yield the same assembly). The compiler does not seem to be able to unroll this case.

godbolt links for:
T=u8; N=8
T=u16; N=8
T=u32; N=8
T=u64; N=8
T=u8; N=16
T=u16; N=16
T=u32; N=16
T=u64; N=16
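
The contains / iter().any pairing described above can be put side by side with a small set of wrappers. This is only a sketch: the four function names are hypothetical, they are not the statements a to d from the issue (those are not reproduced here), and the element type is fixed to u8 for brevity. The functions are generic over N, so to inspect the assembly one would instantiate them with a concrete size such as 8 or 16.

```rust
/// Hypothetical wrappers, named only for this illustration.

/// `contains` called through a reference to a fixed-size array.
pub fn slice_contains<const N: usize>(slice: &[u8; N], val: u8) -> bool {
    slice.contains(&val)
}

/// The `iter().any(..)` form that `contains` is described as reducing to.
pub fn slice_any<const N: usize>(slice: &[u8; N], val: u8) -> bool {
    slice.iter().any(|e| *e == val)
}

/// `contains` called on an array passed by value.
pub fn array_contains<const N: usize>(arr: [u8; N], val: u8) -> bool {
    arr.contains(&val)
}

/// The `as_ref().iter().any(..)` form for an array passed by value.
pub fn array_any<const N: usize>(arr: [u8; N], val: u8) -> bool {
    // The explicit type annotation picks the `AsRef<[u8]>` impl.
    let s: &[u8] = arr.as_ref();
    s.iter().any(|e| *e == val)
}
```

All four return the same boolean for the same input, which is why one might expect them to compile to the same code.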