Slice::contains generates suboptimal assembly code #88204

SadiinsoSnowfall · 2021-08-21T09:16:34Z

Given val: u8, slice: &[u8; 8] and arr: [u8; 8], I expected the following statements to compile down to the same thing :

// a
slice.contains(&val);

// b
slice[0] == val
  || slice[1] == val
  || slice[2] == val
  || slice[3] == val
  || slice[4] == val
  || slice[5] == val
  || slice[6] == val
  || slice[7] == val;
  
// c
arr.contains(&val);

// d
arr[0] == val
  || arr[1] == val
  || arr[2] == val
  || arr[3] == val
  || arr[4] == val
  || arr[5] == val
  || arr[6] == val
  || arr[7] == val;

However, the resulting assembly differs quite a lot:

The a statement compiles down to a loop, checking one element at a time, except for T = u8|i8 and N < 16 where it instead call fall on the fast path of memchr which gets optimized a little bit better.
The b statement compiles down to a unrolled-loop, checking one element at a time in a branchless fashion. Most of the time it doesn't give any SIMD instructions.
The c statement always compiles down to a loop, checking one element at a time, except for T = u8|i8 and N >= 16 where it instead call memchr_general_case
The d statement always compiles down to a few branchless SIMD instructions for any primitive type used and any array size.

Because the slice/array size is known at compile-time and the type checker guarantees that it will be respected by any calling function, I expected the compiler to take this into account while optimizing the resulting assembly. However, this information seems to be lost at some point when using the contains method.

arr.contains(&val) and slice.contains(&val) are simplified as arr.as_ref().iter().any(|e| *e == val) and slice.iter().any(|e| *e == val) if I'm not mistaken (which is wierd because for some N and T, they don't yield the same assembly). The compiler does not seem to be able to unroll this case.

godbolt links for
T=u8; N=8
T=u16; N=8
T=u32; N=8
T=u64; N=8

T=u8; N=16
T=u16; N=16
T=u32; N=16
T=u64; N=16

The text was updated successfully, but these errors were encountered:

hkratz · 2021-08-21T12:55:09Z

@rustbot label +I-slow

the8472 · 2021-08-21T13:44:33Z

It can be made vectorizer-friendly by using as_chunks. It also works for slices of unknown size.

https://rust.godbolt.org/z/4GKrxoKxz

SadiinsoSnowfall · 2021-08-25T12:12:13Z

It looks like something changed in one of the latest nightly version (I retested this on a49e38e67 2021-08-23 according to godbolt), the contains method is now full-unrolled by the compiler and produced a serie of cmp and jumps but almost never produce any SIMD code. Maybe using an unrolled loop like that is preferable for small slices. But when using larger slices, the compiler does not even unroll the loop.

I can't figure out why the generated assembly when using arrays is so much different than when using borrowed slices.

SadiinsoSnowfall · 2021-08-25T12:52:57Z

Apparently, both Clang and GCC also avoid using SIMD in this case (preferring unrolled loops or not depending on the targeted architecture). So it may be a non-issue.

the8472 · 2021-08-28T12:22:24Z

It looks like something changed in one of the latest nightly version

Update to LLVM 13. #87570

SadiinsoSnowfall · 2021-09-16T14:34:57Z

On the current nightly available in godbolt (rustc 1.57.0-nightly (9bb77da74 2021-09-13)). Every function compiles to a serie of cmp and jmp instructions except for the memchr_general_case case (u8 slice with more than 8 elements) because this function is not marked as inline.

The only weird thign I found was on the assembly generated by the b and d expressions (see first message) on [T=u8; N=16] slices/array. In this case, the d expression produce a serie of jmp and cmp instructions, but the b expression produce a lot more assembly code including SIMD instructions.

I would like to know if there is a reason for this discrepency, and if a serie of cmp/jmp is in fact the fastest way to perform the operation described in my first message. If everything is ok, I will close the issue.

PS: rustc 1.55 vs current nightly for [T=u64; N=16] case:

pub fn test(slice: &[u64; 16], val: u64) -> bool {
    slice[0] == val
    || slice[1] == val
    || slice[2] == val
    || slice[3] == val
    || slice[4] == val
    || slice[5] == val
    || slice[6] == val
    || slice[7] == val
    || slice[8] == val
    || slice[9] == val
    || slice[10] == val
    || slice[11] == val
    || slice[12] == val
    || slice[13] == val
    || slice[14] == val
    || slice[15] == val
}

rustc_1_55::test:
        vmovq   xmm0, rsi
        vpbroadcastq    ymm0, xmm0
        vpcmpeqq        ymm1, ymm0, ymmword ptr [rdi + 96]
        vpcmpeqq        ymm2, ymm0, ymmword ptr [rdi + 64]
        vpcmpeqq        ymm3, ymm0, ymmword ptr [rdi + 32]
        vpcmpeqq        ymm0, ymm0, ymmword ptr [rdi]
        vpackssdw       ymm1, ymm2, ymm1
        vpackssdw       ymm0, ymm0, ymm3
        vpermq  ymm1, ymm1, 216
        vpermq  ymm0, ymm0, 216
        vpackssdw       ymm0, ymm0, ymm1
        vpmovmskb       eax, ymm0
        test    eax, -1431655766
        setne   al
        vzeroupper
        ret
        
nightly::test:
        mov     al, 1
        cmp     qword ptr [rdi], rsi
        je      .LBB0_16
        cmp     qword ptr [rdi + 8], rsi
        je      .LBB0_16
        cmp     qword ptr [rdi + 16], rsi
        je      .LBB0_16
        cmp     qword ptr [rdi + 24], rsi
        je      .LBB0_16
        cmp     qword ptr [rdi + 32], rsi
        je      .LBB0_16
        cmp     qword ptr [rdi + 40], rsi
        je      .LBB0_16
        cmp     qword ptr [rdi + 48], rsi
        je      .LBB0_16
        cmp     qword ptr [rdi + 56], rsi
        je      .LBB0_16
        cmp     qword ptr [rdi + 64], rsi
        je      .LBB0_16
        cmp     qword ptr [rdi + 72], rsi
        je      .LBB0_16
        cmp     qword ptr [rdi + 80], rsi
        je      .LBB0_16
        cmp     qword ptr [rdi + 88], rsi
        je      .LBB0_16
        cmp     qword ptr [rdi + 96], rsi
        je      .LBB0_16
        cmp     qword ptr [rdi + 104], rsi
        je      .LBB0_16
        cmp     qword ptr [rdi + 112], rsi
        je      .LBB0_16
        cmp     qword ptr [rdi + 120], rsi
        sete    al
.LBB0_16:
        ret

the8472 · 2021-09-16T17:16:30Z

If you use | instead of || it vectorizes again.

SadiinsoSnowfall · 2021-09-16T18:10:25Z

Could this be linked to #83623 in relation with the update to LLVM 13 ? The code vectorizes when using rustc 1.53-1.55 but not on the current nightly.

rustbot added the I-slow Issue: Problems and improvements with respect to performance of generated code. label Aug 21, 2021

the8472 mentioned this issue Jan 25, 2022

Function calls in logic expressions sometimes cause optimization misses #93279

Open

SadiinsoSnowfall closed this as completed May 15, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Slice::contains generates suboptimal assembly code #88204

Slice::contains generates suboptimal assembly code #88204

SadiinsoSnowfall commented Aug 21, 2021 •

edited

Loading

hkratz commented Aug 21, 2021

the8472 commented Aug 21, 2021

SadiinsoSnowfall commented Aug 25, 2021 •

edited

Loading

SadiinsoSnowfall commented Aug 25, 2021

the8472 commented Aug 28, 2021

SadiinsoSnowfall commented Sep 16, 2021 •

edited

Loading

the8472 commented Sep 16, 2021

SadiinsoSnowfall commented Sep 16, 2021

Slice::contains generates suboptimal assembly code #88204

Slice::contains generates suboptimal assembly code #88204

Comments

SadiinsoSnowfall commented Aug 21, 2021 • edited Loading

hkratz commented Aug 21, 2021

the8472 commented Aug 21, 2021

SadiinsoSnowfall commented Aug 25, 2021 • edited Loading

SadiinsoSnowfall commented Aug 25, 2021

the8472 commented Aug 28, 2021

SadiinsoSnowfall commented Sep 16, 2021 • edited Loading

the8472 commented Sep 16, 2021

SadiinsoSnowfall commented Sep 16, 2021

SadiinsoSnowfall commented Aug 21, 2021 •

edited

Loading

SadiinsoSnowfall commented Aug 25, 2021 •

edited

Loading

SadiinsoSnowfall commented Sep 16, 2021 •

edited

Loading