Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP]Use unaligned read/writes for core::mem::swap on x86_64 #98892

Conversation

AngelicosPhosphoros
Copy link
Contributor

@AngelicosPhosphoros AngelicosPhosphoros commented Jul 4, 2022

This generates better ASM: https://godbolt.org/z/Mr4rWfoad
And misaligned accesses on modern x86_64 processors are fast (see docs for core::mem::swap_chunked).

Main difference is that swapping #[repr(packed)] or aligned to 1 byte types now uses bigger chunks by utilizing movq and movl instructions.

Also, bigger types (e.g. bigger or equal to XMM register) would use SIMD more effectively. Old code used them in not very effecient way, copying data to register and storing it in stack, then reading it back. It caused unneccessary memory reads and writes and completely removed benefits from SSE because number of instructions was similar to number of instructions for simple usize chunked swapping. New code instead stores temporary SIMD chunks entirely in registers by employing eiter 4 XMM registers, 2 YMM registers or 2 XMM registers depending on type size and compiler flags.

Also, made size limit in condition for choosing chunked swap smaller to make types like std::vec::Vec<T> (especially std::str::String) use new optimizations.

@rustbot rustbot added the T-libs Relevant to the library team, which will review and decide on the PR/issue. label Jul 4, 2022
@rustbot
Copy link
Collaborator

rustbot commented Jul 4, 2022

Hey! It looks like you've submitted a new PR for the library teams!

If this PR contains changes to any rust-lang/rust public library APIs then please comment with @rustbot label +T-libs-api -T-libs to tag it appropriately. If this PR contains changes to any unstable APIs please edit the PR description to add a link to the relevant API Change Proposal or create one if you haven't already. If you're unsure where your change falls no worries, just leave it as is and the reviewer will take a look and make a decision to forward on if necessary.

Examples of T-libs-api changes:

  • Stabilizing library features
  • Introducing insta-stable changes such as new implementations of existing stable traits on existing stable types
  • Introducing new or changing existing unstable library APIs (excluding permanently unstable features / features without a tracking issue)
  • Changing public documentation in ways that create new stability guarantees
  • Changing observable runtime behavior of library APIs

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Jul 4, 2022
@AngelicosPhosphoros

This comment was marked as outdated.

@compiler-errors
Copy link
Member

@bors try @rust-timer queue

@rust-timer
Copy link
Collaborator

Awaiting bors try build completion.

@rustbot label: +S-waiting-on-perf

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Jul 4, 2022
@bors
Copy link
Contributor

bors commented Jul 4, 2022

⌛ Trying commit 4c9609245af2160d5708ac0c07610e7fc8e562bf with merge 86189bb1747c007986340e760682fc32e092f136...

@bors
Copy link
Contributor

bors commented Jul 4, 2022

☀️ Try build successful - checks-actions
Build commit: 86189bb1747c007986340e760682fc32e092f136 (86189bb1747c007986340e760682fc32e092f136)

@rust-timer
Copy link
Collaborator

Queued 86189bb1747c007986340e760682fc32e092f136 with parent a3beeaa, future comparison URL.

@rust-timer
Copy link
Collaborator

Finished benchmarking commit (86189bb1747c007986340e760682fc32e092f136): comparison url.

Instruction count

  • Primary benchmarks: mixed results
  • Secondary benchmarks: 🎉 relevant improvements found
mean1 max count2
Regressions 😿
(primary)
0.6% 1.3% 13
Regressions 😿
(secondary)
N/A N/A 0
Improvements 🎉
(primary)
-0.5% -0.9% 11
Improvements 🎉
(secondary)
-1.4% -2.3% 11
All 😿🎉 (primary) 0.1% 1.3% 24

Max RSS (memory usage)

Results
  • Primary benchmarks: mixed results
  • Secondary benchmarks: mixed results
mean1 max count2
Regressions 😿
(primary)
4.3% 4.3% 1
Regressions 😿
(secondary)
3.5% 4.9% 2
Improvements 🎉
(primary)
-7.3% -11.8% 2
Improvements 🎉
(secondary)
-2.4% -2.9% 2
All 😿🎉 (primary) -3.5% -11.8% 3

Cycles

Results
  • Primary benchmarks: 😿 relevant regression found
  • Secondary benchmarks: mixed results
mean1 max count2
Regressions 😿
(primary)
3.9% 3.9% 1
Regressions 😿
(secondary)
3.4% 3.6% 2
Improvements 🎉
(primary)
N/A N/A 0
Improvements 🎉
(secondary)
-3.4% -5.1% 10
All 😿🎉 (primary) 3.9% 3.9% 1

If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please fix the regressions and do another perf run. If the next run shows neutral or positive results, the label will be automatically removed.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf +perf-regression

Footnotes

  1. the arithmetic mean of the percent change 2 3

  2. number of relevant changes 2 3

@rustbot rustbot added perf-regression Performance regression. and removed S-waiting-on-perf Status: Waiting on a perf run to be completed. labels Jul 5, 2022
@JohnCSimon JohnCSimon added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Aug 13, 2022
@JohnCSimon JohnCSimon added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Oct 8, 2022
@scottmcm
Copy link
Member

This now has conflicts, and still says "WIP" in the name, but let me flip it to a pair of team eyes for feedback first
r? @cuviper

@rustbot rustbot assigned cuviper and unassigned scottmcm Jan 14, 2023
Copy link
Member

@cuviper cuviper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks mostly good, but I'd like to see it rebased so we can run a current perf test.

}

// Trigger static assertion.
let _ = <T as ConstifyOffsets>::STATIC_TEST;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than playing games with overflow, this can just be a const { assert!(...); } block.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is better. Updated.

@cuviper
Copy link
Member

cuviper commented Mar 3, 2023

@rustbot author

@rustbot rustbot added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Mar 3, 2023
@AngelicosPhosphoros AngelicosPhosphoros force-pushed the make_unaligned_chunked_swap_for_x64 branch from 4c96092 to 882407a Compare May 2, 2023 17:22
@AngelicosPhosphoros AngelicosPhosphoros changed the title [WIP] Make unaligned chunked swap for x64 Make unaligned chunked swap for x64 May 2, 2023
@AngelicosPhosphoros
Copy link
Contributor Author

AngelicosPhosphoros commented May 2, 2023

@rustbot label -S-waiting-on-author +S-waiting-on-review

@cuviper @scottmcm
I cherry-picked PR on top of current master and fixed requested changes.
Can we rerun perf please?

@rustbot rustbot added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label May 2, 2023
@bors
Copy link
Contributor

bors commented May 16, 2023

☀️ Try build successful - checks-actions
Build commit: cd4875ef17561bd9077809b8d65fcf4e5b439b9e (cd4875ef17561bd9077809b8d65fcf4e5b439b9e)

1 similar comment
@bors
Copy link
Contributor

bors commented May 16, 2023

☀️ Try build successful - checks-actions
Build commit: cd4875ef17561bd9077809b8d65fcf4e5b439b9e (cd4875ef17561bd9077809b8d65fcf4e5b439b9e)

@rust-timer

This comment has been minimized.

@rust-timer
Copy link
Collaborator

Finished benchmarking commit (cd4875ef17561bd9077809b8d65fcf4e5b439b9e): comparison URL.

Overall result: no relevant changes - no action needed

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

@bors rollup=never
@rustbot label: -S-waiting-on-perf -perf-regression

Instruction count

This benchmark run did not return any relevant results for this metric.

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
2.2% [0.1%, 4.6%] 3
Regressions ❌
(secondary)
1.8% [1.4%, 2.1%] 2
Improvements ✅
(primary)
-2.4% [-2.4%, -2.4%] 1
Improvements ✅
(secondary)
- - 0
All ❌✅ (primary) 1.0% [-2.4%, 4.6%] 4

Cycles

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
- - 0
Regressions ❌
(secondary)
6.1% [5.5%, 6.7%] 2
Improvements ✅
(primary)
- - 0
Improvements ✅
(secondary)
- - 0
All ❌✅ (primary) - - 0

Binary size

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
0.2% [0.0%, 1.0%] 37
Regressions ❌
(secondary)
0.1% [0.1%, 0.1%] 1
Improvements ✅
(primary)
-0.1% [-0.2%, -0.0%] 7
Improvements ✅
(secondary)
- - 0
All ❌✅ (primary) 0.1% [-0.2%, 1.0%] 44

Bootstrap: 644.092s -> 643.294s (-0.12%)

@rustbot rustbot removed S-waiting-on-perf Status: Waiting on a perf run to be completed. perf-regression Performance regression. labels May 16, 2023
This generates better ASM: https://godbolt.org/z/Mr4rWfoad
And misaligned accesses on modern x86_64 processors are fast (see docs for `core::mem::swap_chunked`).

Main difference is that swapping `#[repr(packed)]` or aligned to 1 byte types now uses bigger chunks by utilizing `movq` and `movl` instructions.

Also, bigger types (e.g. bigger or equal to XMM register) would use SIMD more effectively. Old code used them in not very effecient way, copying data to register and storing it in stack, then reading it back. It caused unneccessary memory reads and writes and completely removed benefits from SSE because number of instructions was similar to number of instructions for simple `usize` chunked swapping. New code instead stores temporary SIMD chunks entirely in registers by employing eiter 4 XMM registers, 2 YMM registers or 2 XMM registers depending on type size and compiler flags.

Also, made size limit in condition for choosing chunked swap smaller to make types like `std::vec::Vec<T>` (especially `std::str::String`) use new optimizations.
@AngelicosPhosphoros AngelicosPhosphoros force-pushed the make_unaligned_chunked_swap_for_x64 branch from 18c41fa to 4607601 Compare May 16, 2023 10:17
@AngelicosPhosphoros
Copy link
Contributor Author

@scottmcm Last perf run looks promising IMHO.

What do you think about using cfg!(miri) and const_eval_select to avoid using swapped version in const eval/miri contexts?

@AngelicosPhosphoros
Copy link
Contributor Author

I started implementing intrinsic: #111744

@scottmcm
Copy link
Member

I think #97712 suggests that it's a good thing that MIRI runs on the block optimization version, so we wouldn't want to hide it, now that MIRI supports it.

@cuviper
Copy link
Member

cuviper commented Jun 5, 2023

It's not clear to me -- is the swap intrinsic meant to fully replace this PR? Please @rustbot author if you're focusing on that direction.

bors added a commit to rust-lang-ci/rust that referenced this pull request Jun 6, 2023
…iler-errors

Use `load`+`store` instead of `memcpy` for small integer arrays

I was inspired by rust-lang#98892 to see whether, rather than making `mem::swap` do something smart in the library, we could update MIR assignments like `*_1 = *_2` to do something smarter than `memcpy` for sufficiently-small types that doing it inline is going to be better than a `memcpy` call in assembly anyway.  After all, special code may help `mem::swap`, but if the "obvious" MIR can just result in the correct thing that helps everything -- other code like `mem::replace`, people doing it manually, and just passing around by value in general -- as well as makes MIR inlining happier since it doesn't need to deal with all the complicated library code if it just sees a couple assignments.

LLVM will turn the short, known-length `memcpy`s into direct instructions in the backend, but that's too late for it to be able to remove `alloca`s.  In general, replacing `memcpy`s with typed instructions is hard in the middle-end -- even for `memcpy.inline` where it knows it won't be a function call -- is hard [due to poison propagation issues](https://rust-lang.zulipchat.com/#narrow/stream/187780-t-compiler.2Fwg-llvm/topic/memcpy.20vs.20load-store.20for.20MIR.20assignments/near/360376712).  So because we know more about the type invariants -- these are typed copies -- rustc can emit something more specific, allowing LLVM to `mem2reg` away the `alloca`s in some situations.

rust-lang#52051 previously did something like this in the library for `mem::swap`, but it ended up regressing during enabling mir inlining (rust-lang@cbbf06b), so this has been suboptimal on stable for ≈5 releases now.

The code in this PR is narrowly targeted at just integer arrays in LLVM, but works via a new method on the [`LayoutTypeMethods`](https://doc.rust-lang.org/nightly/nightly-rustc/rustc_codegen_ssa/traits/trait.LayoutTypeMethods.html) trait, so specific backends based on cg_ssa can enable this for more situations over time, as we find them.  I don't want to try to bite off too much in this PR, though.  (Transparent newtypes and simple things like the 3×usize `String` would be obvious candidates for a follow-up.)

Codegen demonstrations: <https://llvm.godbolt.org/z/fK8hT9aqv>

Before:
```llvm
define void `@swap_rgb48_old(ptr` noalias nocapture noundef align 2 dereferenceable(6) %x, ptr noalias nocapture noundef align 2 dereferenceable(6) %y) unnamed_addr #1 {
  %a.i = alloca [3 x i16], align 2
  call void `@llvm.lifetime.start.p0(i64` 6, ptr nonnull %a.i)
  call void `@llvm.memcpy.p0.p0.i64(ptr` noundef nonnull align 2 dereferenceable(6) %a.i, ptr noundef nonnull align 2 dereferenceable(6) %x, i64 6, i1 false)
  tail call void `@llvm.memcpy.p0.p0.i64(ptr` noundef nonnull align 2 dereferenceable(6) %x, ptr noundef nonnull align 2 dereferenceable(6) %y, i64 6, i1 false)
  call void `@llvm.memcpy.p0.p0.i64(ptr` noundef nonnull align 2 dereferenceable(6) %y, ptr noundef nonnull align 2 dereferenceable(6) %a.i, i64 6, i1 false)
  call void `@llvm.lifetime.end.p0(i64` 6, ptr nonnull %a.i)
  ret void
}
```
Note it going to stack:
```nasm
swap_rgb48_old:                         # `@swap_rgb48_old`
        movzx   eax, word ptr [rdi + 4]
        mov     word ptr [rsp - 4], ax
        mov     eax, dword ptr [rdi]
        mov     dword ptr [rsp - 8], eax
        movzx   eax, word ptr [rsi + 4]
        mov     word ptr [rdi + 4], ax
        mov     eax, dword ptr [rsi]
        mov     dword ptr [rdi], eax
        movzx   eax, word ptr [rsp - 4]
        mov     word ptr [rsi + 4], ax
        mov     eax, dword ptr [rsp - 8]
        mov     dword ptr [rsi], eax
        ret
```

Now:
```llvm
define void `@swap_rgb48(ptr` noalias nocapture noundef align 2 dereferenceable(6) %x, ptr noalias nocapture noundef align 2 dereferenceable(6) %y) unnamed_addr #0 {
start:
  %0 = load <3 x i16>, ptr %x, align 2
  %1 = load <3 x i16>, ptr %y, align 2
  store <3 x i16> %1, ptr %x, align 2
  store <3 x i16> %0, ptr %y, align 2
  ret void
}
```
still lowers to `dword`+`word` operations, but has no stack traffic:
```nasm
swap_rgb48:                             # `@swap_rgb48`
        mov     eax, dword ptr [rdi]
        movzx   ecx, word ptr [rdi + 4]
        movzx   edx, word ptr [rsi + 4]
        mov     r8d, dword ptr [rsi]
        mov     dword ptr [rdi], r8d
        mov     word ptr [rdi + 4], dx
        mov     word ptr [rsi + 4], cx
        mov     dword ptr [rsi], eax
        ret
```

And as a demonstration that this isn't just `mem::swap`, a `mem::replace` on a small array (since replace doesn't use swap since rust-lang#83022), which used to be `memcpy`s in LLVM changes in IR
```llvm
define void `@replace_short_array(ptr` noalias nocapture noundef sret([3 x i32]) dereferenceable(12) %0, ptr noalias noundef align 4 dereferenceable(12) %r, ptr noalias nocapture noundef readonly dereferenceable(12) %v) unnamed_addr #0 {
start:
  %1 = load <3 x i32>, ptr %r, align 4
  store <3 x i32> %1, ptr %0, align 4
  %2 = load <3 x i32>, ptr %v, align 4
  store <3 x i32> %2, ptr %r, align 4
  ret void
}
```
but that lowers to reasonable `dword`+`qword` instructions still
```nasm
replace_short_array:                    # `@replace_short_array`
        mov     rax, rdi
        mov     rcx, qword ptr [rsi]
        mov     edi, dword ptr [rsi + 8]
        mov     dword ptr [rax + 8], edi
        mov     qword ptr [rax], rcx
        mov     rcx, qword ptr [rdx]
        mov     edx, dword ptr [rdx + 8]
        mov     dword ptr [rsi + 8], edx
        mov     qword ptr [rsi], rcx
        ret
```
@AngelicosPhosphoros
Copy link
Contributor Author

@cuviper Yes, I think, it would replace it fully if it would be merged.

And it provides more noticable benefits than this PR.

I would make this PR a draft until then.

@AngelicosPhosphoros AngelicosPhosphoros marked this pull request as draft June 11, 2023 18:22
@cuviper
Copy link
Member

cuviper commented Jun 16, 2023

@rustbot author

@rustbot rustbot added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jun 16, 2023
bors added a commit to rust-lang-ci/rust that referenced this pull request Jan 20, 2024
…<try>

Tweak the threshold for chunked swapping

Thanks to `@AngelicosPhosphoros` for the tests here, which I copied from rust-lang#98892.

This is an experiment as a simple alternative to that PR that just tweaks the existing threshold, since that PR showed that 3×Align (like `String`) currently doesn't work as well as it could.
bors added a commit to rust-lang-ci/rust that referenced this pull request Jan 20, 2024
…Mark-Simulacrum

Tweak the threshold for chunked swapping

Thanks to `@AngelicosPhosphoros` for the tests here, which I copied from rust-lang#98892.

This is an experiment as a simple alternative to that PR that just tweaks the existing threshold, since that PR showed that 3×Align (like `String`) currently doesn't work as well as it could.
@Dylan-DPC
Copy link
Member

@AngelicosPhosphoros any updates on this?

@Dylan-DPC
Copy link
Member

Closing this as inactive. Feel free to reöpen this pr or create a new pr if you get the time to work on this. Thanks

@Dylan-DPC Dylan-DPC closed this Mar 12, 2024
@Dylan-DPC Dylan-DPC added S-inactive Status: Inactive and waiting on the author. This is often applied to closed PRs. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Mar 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-inactive Status: Inactive and waiting on the author. This is often applied to closed PRs. T-libs Relevant to the library team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.