

Optimise align_offset for stride=1 further #75728

Merged
merged 1 commit into rust-lang:master from nagisa:improve_align_offset_2 on Oct 26, 2020

Conversation

nagisa
Member

@nagisa nagisa commented Aug 20, 2020

The `stride == 1` case can be computed more efficiently as `-p (mod a)`. That then translates to a nice and short sequence of LLVM
instructions:

    %address = ptrtoint i8* %p to i64
    %negptr = sub i64 0, %address
    %offset = and i64 %negptr, %a_minus_one

This produces pretty much ideal code-gen when the function is used in
isolation.
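
In Rust terms, that computation corresponds to roughly the following (a simplified, hypothetical sketch rather than the actual library code; `a` is assumed to be a power of two, as `align_offset` requires):

    fn offset_to_align_stride1(p: *const u8, a: usize) -> usize {
        let address = p as usize;
        // `-address (mod a)`: for a power-of-two `a`, reduction modulo `a`
        // is just a mask with `a - 1`.
        address.wrapping_neg() & (a - 1)
    }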

Typical use of this function will, however, involve using the result to
offset a pointer, e.g.:

    %aligned = getelementptr inbounds i8, i8* %p, i64 %offset

This still looks very good, but LLVM does not really translate it into
what would be considered ideal machine code on any target. For example,
this is the codegen we obtain for an unknown alignment:

    ; x86_64
    dec     rsi
    mov     rax, rdi
    neg     rax
    and     rax, rsi
    add     rax, rdi

In particular, negating a pointer is not something that CISC
architectures like x86_64 are designed to optimise for; they are much
better at offsetting pointers. So we’d love to utilise that ability and
produce code that looks more like this:

    ; x86_64
    lea     rax, [rsi + rdi - 1]
    neg     rsi
    and     rax, rsi

To achieve this we need to give LLVM an opportunity to apply the
various peephole optimisations it performs during DAG selection. In
particular, the `and` instruction appears to be a major inhibitor here.
We cannot, sadly, get rid of this load-bearing operation, but we can
reorder operations so that LLVM has more to work with around this
instruction.

One such ordering is proposed in #75579 and results in LLVM IR that
looks broadly like this:

    ; using add enables `lea` and similar CISCisms
    %offset_ptr = add i64 %address, %a_minus_one
    %mask = sub i64 0, %a
    %masked = and i64 %offset_ptr, %mask
    ; can be folded with `gepi` that may follow
    %offset = sub i64 %masked, %address

…and generates the intended x86_64 machine code.
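
In Rust terms the reordered computation corresponds to roughly this (again a hypothetical sketch, not the exact change to `align_offset` itself):

    fn offset_to_align_stride1_reordered(p: *const u8, a: usize) -> usize {
        let address = p as usize;
        // Round the address up to the next multiple of `a` (a power of two)...
        let aligned = address.wrapping_add(a - 1) & a.wrapping_neg();
        // ...then express the result as an offset from the original address,
        // which LLVM can fold into a following `getelementptr`.
        aligned.wrapping_sub(address)
    }

For any power-of-two `a` the two computations return the same offset; only the order of operations changes.
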
One might also wonder how the increased amount of code would impact a
RISC target. Turns out not much:

    ; aarch64 previous                 ; aarch64 new
    sub     x8, x1, #1                 add     x8, x1, x0
    neg     x9, x0                     sub     x8, x8, #1
    and     x8, x9, x8                 neg     x9, x1
    add     x0, x0, x8                 and     x0, x8, x9

(and similarly for ppc, sparc, mips, riscv, etc)

The only target that seems to do worse is… wasm32.

On to actual measurements – the best way to evaluate snippets like these
is to use llvm-mca. As the AArch64 assembly would lead one to suspect,
there isn’t any performance difference to be found there: both snippets
execute in the same number of cycles for the CPUs I tried. On x86_64,
however, we get a throughput improvement of more than 50%!
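
For context, the kind of snippet being measured is an isolated “round this pointer up” helper along these lines (a made-up example, compiled to assembly and then fed to llvm-mca):

    /// Hypothetical helper: bump `p` up to the next multiple of `a`
    /// (a power of two known only at run time).
    pub unsafe fn align_up(p: *const u8, a: usize) -> *const u8 {
        // `align_offset` returns how many bytes to add so that the result is
        // aligned to `a`; the `add` is the `getelementptr` shown above.
        p.add(p.align_offset(a))
    }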

Fixes #75579

@rust-highfive
Collaborator

r? @joshtriplett

(rust_highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Aug 20, 2020
@nagisa nagisa force-pushed the improve_align_offset_2 branch 3 times, most recently from 2f009ad to 9ebaf10 Compare August 20, 2020 01:45
@crlf0710 crlf0710 added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Sep 12, 2020
@jyn514 jyn514 added I-slow Issue: Problems and improvements with respect to performance of generated code. T-libs Relevant to the library team, which will review and decide on the PR/issue. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. labels Sep 15, 2020
@Dylan-DPC-zz

@joshtriplett this is ready for review

@joshtriplett
Member

@bors try @rust-timer queue

@rust-timer
Collaborator

Awaiting bors try build completion

@bors
Contributor

bors commented Oct 5, 2020

⌛ Trying commit 4bfacff with merge 911995c7ac82feb8ff493eb4328f4387f5cf506e...

@bors
Contributor

bors commented Oct 5, 2020

☀️ Try build successful - checks-actions, checks-azure
Build commit: 911995c7ac82feb8ff493eb4328f4387f5cf506e

@rust-timer
Collaborator

Queued 911995c7ac82feb8ff493eb4328f4387f5cf506e with parent ced813f, future comparison URL.

@rust-timer
Collaborator

Finished benchmarking try commit (911995c7ac82feb8ff493eb4328f4387f5cf506e): comparison url.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below by specifying rollup- to bors.

Importantly, though, if the results of this run are non-neutral do not roll this PR up -- it will mask other regressions or improvements in the roll up.

@bors rollup=never

@joshtriplett
Member

joshtriplett commented Oct 5, 2020

Seems relatively reasonable to me. Could someone else double-check this, please?

@camelid camelid added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Oct 23, 2020
@Dylan-DPC-zz

r? @Mark-Simulacrum

@Mark-Simulacrum
Member

The perf results here don't seem to really correspond nicely to the conclusions laid out in the PR description: wall times contain a bunch of regressions, and the instruction counts are slightly up as well. Maybe LLVM 11 (which I think landed since then) changed the calculus for the compiler, though -- let's generate another round of benchmarks.

@bors try @rust-timer queue

@nagisa Do the perf results match your expectations? @joshtriplett -- I think you didn't comment on them at all, perhaps you have thoughts?

Generally speaking I'm fine with landing this -- the implementation looks correct -- but it seems like if it's a regression on rustc runtime then maybe we shouldn't?

@rust-timer
Collaborator

Awaiting bors try build completion

@bors
Contributor

bors commented Oct 25, 2020

⌛ Trying commit 4bfacff with merge 2b39dba3b2c0c5abd15893f173833c02e968a7f6...

@joshtriplett
Member

@Mark-Simulacrum I didn't see the wall-times; those look really concerning.

Given the nature of the optimization, I was looking heavily at cycles, which are substantially improved across the board for everything other than incr-unchanged.

@nagisa
Member Author

nagisa commented Oct 25, 2020

I don't believe this function is used within rustc enough for it to cause a 50% wall-time regression in anything, but, much like @joshtriplett, I wasn't even aware there was such a regression.

(I feel like it's somewhat of a presentation problem in perf; I don't typically look at all measurement kinds.)

@nagisa
Member Author

nagisa commented Oct 25, 2020

#75600 made a prior optimisation to this same function, and it also saw comparatively large changes in wall-time, though in that case towards the positive side. The way I interpret those results, the variance is huge too?

@bors
Contributor

bors commented Oct 25, 2020

☀️ Try build successful - checks-actions
Build commit: 2b39dba3b2c0c5abd15893f173833c02e968a7f6

@rust-timer
Collaborator

Queued 2b39dba3b2c0c5abd15893f173833c02e968a7f6 with parent 0dce3f6, future comparison URL.

@Mark-Simulacrum
Member

I agree that a 50% wall-time regression is likely to be mostly noise - unfortunately we don't really have a good way yet of determining noise at the presentation layer, such that we could hide deltas within the noise bound and then (hopefully) show all the statistics we collect in some way.

That said, I am inclined to land this as-is, presuming this perf run comes back similarly (or better). I think most of the wall-time regressions are on tiny crates - several milliseconds - and so aren't too interesting.

@rust-timer
Collaborator

Finished benchmarking try commit (2b39dba3b2c0c5abd15893f173833c02e968a7f6): comparison url.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below by specifying rollup- to bors.

Importantly, though, if the results of this run are non-neutral do not roll this PR up -- it will mask other regressions or improvements in the roll up.

@bors rollup=never

@Mark-Simulacrum
Member

This time around, wall times are down roughly 50%. So either this PR considerably increases the amount of noise we see (which seems highly unlikely), or something has landed in the meantime that shifts things around to make this a net improvement now. Regardless, cycles and instructions seem neutral or down, and since llvm-mca points at throughput increases, I think we should merge this.

@bors r+

@bors
Contributor

bors commented Oct 26, 2020

📌 Commit 4bfacff has been approved by Mark-Simulacrum

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Oct 26, 2020
@bors
Contributor

bors commented Oct 26, 2020

⌛ Testing commit 4bfacff with merge 69e68cf...

@bors
Contributor

bors commented Oct 26, 2020

☀️ Test successful - checks-actions
Approved by: Mark-Simulacrum
Pushing 69e68cf to master...

@bors bors added the merged-by-bors This PR was explicitly merged by bors. label Oct 26, 2020
@bors bors merged commit 69e68cf into rust-lang:master Oct 26, 2020
@rustbot rustbot added this to the 1.49.0 milestone Oct 26, 2020
Successfully merging this pull request may close these issues.

ptr::align_offset generates surprisingly bad code