Optimise align_offset for stride=1 further #75728

Merged
merged 1 commit on Oct 26, 2020

Commits on Aug 20, 2020

  1. Optimise align_offset for stride=1 further

    The `stride == 1` case can be computed more efficiently through `-p (mod
    a)`. That then translates to a nice and short sequence of LLVM
    instructions:
    
        %address = ptrtoint i8* %p to i64
        %negptr = sub i64 0, %address
        %offset = and i64 %negptr, %a_minus_one
    
    This produces pretty much ideal code-gen when the function is used in
    isolation.
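
    For illustration, here is a Rust-level sketch of that computation (the
    function name is made up for this example; it assumes, as `align_offset`
    does, that `a` is a power of two):

        fn stride1_offset(p: *const u8, a: usize) -> usize {
            let address = p as usize;
            // -address (mod a) for a power-of-two `a`:
            // negate the address and mask with `a - 1`.
            address.wrapping_neg() & (a - 1)
        }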
    
    Typical use of this function will, however, involve using the result to
    offset a pointer, i.e.
    
        %aligned = getelementptr inbounds i8, i8* %p, i64 %offset
    
    This still looks very good, but LLVM does not really translate it into
    what would be considered ideal machine code (on any target). For example,
    this is the codegen we obtain for an unknown alignment:
    
        ; x86_64
        dec     rsi
        mov     rax, rdi
        neg     rax
        and     rax, rsi
        add     rax, rdi
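
    (For context, a Rust-level pattern that lowers to roughly the IR and
    assembly above could look like the following; the helper is purely
    illustrative, not code from this change:)

        /// Illustrative only: advance `p` to the next `align`-byte boundary.
        unsafe fn align_up(p: *const u8, align: usize) -> *const u8 {
            // For `u8` (stride 1) and a power-of-two `align`, `align_offset`
            // always returns a usable offset.
            let offset = p.align_offset(align);
            // SAFETY: the caller must guarantee that `p + offset` stays in
            // bounds of the same allocation.
            unsafe { p.add(offset) }
        }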
    
    In particular, negating a pointer is not something that the design of
    CISC architectures like x86_64 optimises for; they are much better at
    offsetting pointers. So we’d love to utilise this ability and produce
    code that’s more like this:
    
        ; x86_64
        lea     rax, [rsi + rdi - 1]
        neg     rsi
        and     rax, rsi
    
    To achieve this we need to give LLVM an opportunity to apply the various
    peephole optimisations it performs during DAG selection. In particular,
    the `and` instruction appears to be a major inhibitor here. We cannot,
    sadly, get rid of this load-bearing operation, but we can reorder
    operations such that LLVM has more to work with around this instruction.
    
    One such ordering is proposed in rust-lang#75579 and results in LLVM IR that
    looks broadly like this:
    
        ; using add enables `lea` and similar CISCisms
        %offset_ptr = add i64 %address, %a_minus_one
        %mask = sub i64 0, %a
        %masked = and i64 %offset_ptr, %mask
        ; can be folded with `gepi` that may follow
        %offset = sub i64 %masked, %address
    
    …and generates the intended x86_64 machine code. One might also wonder
    how the increased amount of code would impact a RISC target. Turns out
    not much:
    
        ; aarch64 previous                 ; aarch64 new
        sub     x8, x1, #1                 add     x8, x1, x0
        neg     x9, x0                     sub     x8, x8, #1
        and     x8, x9, x8                 neg     x9, x1
        add     x0, x0, x8                 and     x0, x8, x9
    
        (and similarly for ppc, sparc, mips, riscv, etc)
    
    The only target that seems to do worse is… wasm32.
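
    In Rust terms, the reordered computation from the IR above corresponds to
    roughly the following (again only an illustrative sketch, not the exact
    library code):

        fn stride1_offset_reordered(p: *const u8, a: usize) -> usize {
            let address = p as usize;
            // Adding first lets a later pointer offset fold into `lea`.
            let offset_ptr = address.wrapping_add(a - 1);
            let mask = a.wrapping_neg();    // 0 - a
            let masked = offset_ptr & mask; // round up to a multiple of `a`
            // Can be folded with a `getelementptr` that may follow.
            masked.wrapping_sub(address)
        }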
    
    Onto actual measurements: the best way to evaluate snippets like these is
    to use llvm-mca. Much as the AArch64 assembly above would suggest, there
    isn’t any performance difference to be found; both snippets execute in
    the same number of cycles for the CPUs I tried. On x86_64, however, we
    get a throughput improvement of >50%!
    nagisa committed Aug 20, 2020
    4bfacff