
align_offset seems to generate worse code than is desirable (unless size_of::<T>() == 1) #98809

Closed · thomcc opened this issue on Jul 2, 2022 · 2 comments · Fixed by #98866

Labels: A-codegen (Area: Code generation) · A-raw-pointers (Area: raw pointers, MaybeUninit, NonNull) · I-slow (Issue: Problems and improvements with respect to performance of generated code) · T-libs (Relevant to the library team, which will review and decide on the PR/issue)

Comments

thomcc (Member) commented on Jul 2, 2022

There are many examples where align_offset produces worse code than it should (worse than you'd write by hand, at the very least, and in the case of slices, worse than I think it needs to). You can often work around this by going through a *const u8 pointer, although this is surprising and not always an option (for example, it is not an option for users of align_to).

Here's a minimal example extracted from some real code but massaged somewhat: https://rust.godbolt.org/z/aq8Yo8oYj. For posterity, the generated code here currently is:

example::simd_align1:
        mov     eax, edi
        and     eax, 12
        mov     ecx, edi
        shr     ecx, 2
        neg     ecx
        and     ecx, 3
        test    rax, rax
        cmove   rcx, rax
        lea     rax, [rdi + 4*rcx]
        ret

example::simd_align2:
        lea     rax, [rdi + 15]
        and     rax, -16
        ret

(And simd_align1 here is actually better than I was getting in my real code, where it ended up as a branch. That said, this is bad enough to demonstrate the problem).
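For reference, here is a hypothetical reconstruction of what the two functions behind the godbolt link look like (the f32 element type and the exact signatures are assumptions made for illustration, not copied from the link):

    // Hypothetical reconstruction of the godbolt example; the f32 element type
    // and the exact signatures are assumptions made for illustration.
    pub fn simd_align1(s: &[f32]) -> *const f32 {
        let p = s.as_ptr();
        // align_offset on *const f32 returns a count of f32 elements.
        unsafe { p.add(p.align_offset(16)) }
    }

    pub fn simd_align2(s: &[f32]) -> *const f32 {
        // Going through *const u8 hits the stride == 1 special case.
        let p = s.as_ptr().cast::<u8>();
        unsafe { p.add(p.align_offset(16)).cast::<f32>() }
    }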

I guess this is because align_offset gets passed a *const T which itself may not be aligned to T (for example, (1 as *const f32).align_offset(16) can't be satisfied). That problem is avoided by going through u8 (as is done with simd_align2), since that can always succeed (outside const evaluation, which isn't relevant here) and hits the stride == 1 special case added to fix #75579.
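For reference, the stride == 1 case reduces to simple mask arithmetic, roughly the following (a sketch of the idea, not the actual core implementation):

    // Sketch only: the arithmetic the stride == 1 case boils down to,
    // not the actual core::ptr implementation.
    fn byte_align_offset(addr: usize, align: usize) -> usize {
        debug_assert!(align.is_power_of_two());
        // Distance, in bytes, from `addr` to the next multiple of `align`.
        addr.wrapping_neg() & (align - 1)
    }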

However, in this case the compiler should know that the pointer is aligned to 4, since it comes from a slice (and the pointer does end up with align(4) in the LLVM IR). I think this means there's no actual reason simd_align1 must be less efficient than simd_align2 (unless I'm missing something); the issue is just that our align_offset impl doesn't have a fast path for this situation. That said, maybe I'm wrong and the generated code is bad for some other reason.

Either way, it would be very nice for this to do better for the slice use case -- as I mentioned, going through *const u8 is not viable when align_offset is invoked from an API like align_to. (I suspect that if we can't fix it for align_offset, we could work around it in align_to more directly, but it would be better to do it in align_offset if possible. That said, if I'm wrong about why this is happening, all bets are off.)
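For context, a typical caller of align_to looks something like the following; the Align16 wrapper type is made up for illustration. The point is that align_to calls align_offset on the slice's own element pointer type internally, so the caller has no way to substitute *const u8:

    // Illustration only: a caller of slice::align_to. Align16 is a made-up type.
    #[repr(C, align(16))]
    struct Align16([u32; 4]);

    fn aligned_middle_len(xs: &[u32]) -> usize {
        // SAFETY: Align16 is plain old data, so any four u32s form a valid value.
        let (_head, mid, _tail) = unsafe { xs.align_to::<Align16>() };
        mid.len()
    }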

See also #72356, which seems related (and my hunch is that it will be fixed if we fix this issue, at least in this case).

CC @nagisa, since you asked.

P.S. Sorry that this is a bit of a rambling issue.

thomcc added the I-slow, T-libs, and A-raw-pointers labels on Jul 2, 2022
thomcc changed the title from "align_offset seems to generate worse code than needed (unless size_of::<T>() == 1)" to "align_offset seems to generate worse code than is desirable (unless size_of::<T>() == 1)" on Jul 2, 2022
CAD97 (Contributor) commented on Jul 3, 2022

Saw, got interested. Some drive-by notes:

  • simd_align1 is exactly the align_offset codegen plus an lea to multiply it by the stride.
  • pmoda, which is used to gate the already-aligned fast path, is necessarily not constant folded. Interestingly, removing the fast path thus seems to improve codegen without changing results (see the sketch after this list).
  • Everything not derived from pmoda successfully constant folds.
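For reference, pmoda is the pointer address modulo the requested alignment, and the gate it controls is conceptually the following (a sketch, not the actual core::ptr code):

    /// Conceptual sketch, not the actual core::ptr implementation: `a` is the
    /// requested alignment (a power of two), `p_addr` is the pointer address.
    fn already_aligned(p_addr: usize, a: usize) -> bool {
        let pmoda = p_addr & (a - 1); // address modulo alignment
        pmoda == 0
    }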

Regarding "as I mentioned, going through *const u8 is not viable when align_offset is invoked from an API like align_to":

fn align_offset_aligned<T>(ptr: &[T; 0], align: usize) -> usize {
    if !align.is_power_of_two() {
        panic!("align_offset: align is not a power-of-two");
    }

    // The &[T; 0] reference guarantees the pointer is aligned for T, so the
    // stride == 1 fast path through *const u8 is sound here.
    let byte_align_offset = ptr.as_ptr().cast::<u8>().align_offset(align);
    // Convert the byte offset to an element count; this assumes size_of::<T>()
    // is non-zero and divides `align`.
    byte_align_offset / core::mem::size_of::<T>()
}

This optimizes perfectly, even at -O1, with no changes to align_offset.

(EDIT: I'm on mobile and can't edit on godbolt, but: the linked code implements the &[T] -> &[T; 0] conversion incorrectly as slice.try_into().unwrap_unchecked(), where it should be slice[..0].try_into().unwrap(). In this case that does not change the results; the UB lets the optimizer assume the slice has length zero, but that has no impact on aligning the pointer.)
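A hypothetical caller (the wrapper name is made up) that does the conversion the corrected way would look like:

    // Hypothetical wrapper, not from the issue: the corrected &[T] -> &[T; 0]
    // conversion feeding align_offset_aligned.
    fn slice_align_offset<T>(slice: &[T], align: usize) -> usize {
        // slice[..0] is always in bounds, so try_into() cannot fail.
        let head: &[T; 0] = slice[..0].try_into().unwrap();
        align_offset_aligned(head, align)
    }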

I figured this out while trying (and failing) to add a fast path for pmods == 0. I tried delegating inside align_offset_impl, which I couldn't get to work, then had the idea to just do it from the outside.

Even inside align_offset_impl, pmods is correctly constant folded, so it should be possible to improve performance internally, but I couldn't figure out how to add a correct fast path for that case.
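For reference, the arithmetic such a fast path would compute is roughly the following (a sketch under the assumption that pmods is the pointer address modulo the stride; this is not the actual align_offset_impl code):

    // Sketch only, not the actual core::ptr implementation. `p` is the pointer
    // address, `stride` is size_of::<T>(), and `a` is the requested alignment
    // (a power of two).
    fn fast_path(p: usize, stride: usize, a: usize) -> Option<usize> {
        if stride != 0 && a % stride == 0 && p % stride == 0 {
            // The distance to the next `a`-aligned address is a whole number
            // of elements, so a mask and a divide suffice.
            Some((p.wrapping_neg() & (a - 1)) / stride)
        } else {
            // Fall back to the general path (modular inverse of the stride).
            None
        }
    }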

nagisa (Member) commented on Jul 3, 2022

Thanks for the report. I think I have an idea of how this can be improved, but I need to think about it a little further to see if it really is correct in all instances, and to evaluate whether the tradeoffs in code quality are positive enough to warrant the change I'm thinking of.

@rustbot claim

workingjubilee added the A-codegen label on Jul 6, 2022
bors added a commit to rust-lang-ci/rust that referenced this issue on Jul 16, 2022:

Add a special case for align_offset with stride != 1

This generalizes the previous `stride == 1` special case to apply to any
situation where the requested alignment is divisible by the stride. This
in turn allows the test case from rust-lang#98809 to produce ideal assembly,
along the lines of:

    leaq 15(%rdi), %rax
    andq $-16, %rax

This also produces pretty high quality code for situations where the
alignment of the input pointer isn’t known:

    pub unsafe fn ptr_u32(slice: *const u32) -> *const u32 {
        slice.offset(slice.align_offset(16) as isize)
    }

    // =>

    movl %edi, %eax
    andl $3, %eax
    leaq 15(%rdi), %rcx
    andq $-16, %rcx
    subq %rdi, %rcx
    shrq $2, %rcx
    negq %rax
    sbbq %rax, %rax
    orq  %rcx, %rax
    leaq (%rdi,%rax,4), %rax

Here LLVM is smart enough to replace the `usize::MAX` special case with
a branch-less bitwise-OR approach, where the mask is constructed using
the neg and sbb instructions. This appears to work across various
architectures I’ve tried.
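(Illustrative sketch of the branch-free form described above; the function and names are made up, and in practice LLVM derives this automatically.)

    // Illustrative sketch only; LLVM derives this form automatically.
    // `misaligned` is the pointer address modulo size_of::<u32>().
    fn select_offset(misaligned: usize, element_offset: usize) -> usize {
        // All-ones (usize::MAX) when the pointer can never be aligned, zero otherwise.
        let mask = ((misaligned != 0) as usize).wrapping_neg();
        element_offset | mask
    }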

This change ends up introducing more branches and code in situations
where less is known about the arguments, for example when the requested
alignment is entirely unknown. This use-case was never really a focus of
this function, so I'm not particularly worried, especially since llvm-mca
says the new code is still appreciably faster, despite all the new
branching.

Fixes rust-lang#98809.
Sadly, this does not help with rust-lang#72356.
bors closed this as completed in 62a182c on Jul 17, 2022