Voultapher opened this issue on Oct 24, 2023 · 0 comments
Labels: A-LLVM, C-bug, I-slow
A common idiom in branchless (jumpless) code is to increment a pointer by either 0 or 1, where the increment is derived from a bool. The obvious way to do this is `ptr = ptr.add(bool_val as usize)`, however doing so results in slightly sub-optimal code-gen on both x86_64 and Arm.
A minimal example:
```rust
pub unsafe fn inc_direct(v: &[u64], mut side_effect_out: *mut *const u64) {
    for elem in v {
        let cond = *elem < 30;
        *side_effect_out = elem;
        side_effect_out = side_effect_out.add(cond as usize);
    }
}
```
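For context (not part of the original report), a hypothetical caller sketch showing how such a function is typically used: branchlessly collecting pointers to all elements that match a predicate. The name `collect_small` and the buffer handling are assumptions for illustration only.

```rust
// Hypothetical usage sketch: gather pointers to every element < 30 without
// data-dependent branches. `inc_direct` writes each element's address to the
// current cursor slot and only advances the cursor when the predicate holds,
// so only the first `count` slots end up holding pointers to matching elements.
fn collect_small(v: &[u64]) -> Vec<*const u64> {
    let mut out: Vec<*const u64> = vec![std::ptr::null(); v.len()];
    // SAFETY: `out` has room for `v.len()` pointers, which is the furthest
    // the cursor inside `inc_direct` can ever advance.
    unsafe { inc_direct(v, out.as_mut_ptr()) };
    let count = v.iter().filter(|&&x| x < 30).count();
    out.truncate(count);
    out
}
```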
Generated machine-code for the `inc_direct` loop (the x86_64 and Arm listings are not reproduced here): notably, on x86_64 the use of xor + setb + lea, and cset on Arm. Compared to a version using an additional counter with `counter += cond as usize;`, the generated loop instead uses adc on x86_64 and cinc on Arm. Using uiCA and llvm-mca to simulate the performance of the respective loops we get:
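The source of the counter-based variant is not included in this capture; below is a minimal sketch of what an `inc_via_counter` function could look like, assuming the same signature and predicate as `inc_direct` (the name matches the tables below, the body is an assumption).

```rust
// Sketch only: assumed shape of the counter-based variant referred to as
// inc_via_counter in the measurements below. The output cursor is expressed
// as base pointer + running counter instead of being bumped in place.
pub unsafe fn inc_via_counter(v: &[u64], side_effect_out: *mut *const u64) {
    let mut counter = 0usize;
    for elem in v {
        let cond = *elem < 30;
        *side_effect_out.add(counter) = elem;
        counter += cond as usize;
    }
}
```

The `counter += cond as usize;` update is the line quoted above, and it is this formulation that the adc/cinc observation refers to.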
inc_direct cycles per loop iteration:

| Tool | Skylake | Sunny Cove |
| --- | --- | --- |
| uiCA | 2.01 | 1.87 |
| llvm-mca | 1.59 | 1.59 |

inc_via_counter cycles per loop iteration:

| Tool | Skylake | Sunny Cove |
| --- | --- | --- |
| uiCA | 1.31 | 1.56 |
| llvm-mca | 1.33 | 1.33 |
Simulations as part of some larger code structure are less conclusive, but one thing is certain: for both architectures it is fewer instructions and, at worst, the same performance. Ideally rustc could generate the more efficient code for the obvious direct-increment version.
nikic added the A-LLVM and I-slow labels and removed needs-triage on Oct 24, 2023
Meta
`rustc --version --verbose`: