Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MIRI says reverse is UB, so replace it with something LLVM can vectorize #90821

Merged
merged 2 commits into from
Nov 15, 2021

Conversation

scottmcm
Copy link
Member

For small types with padding, the current implementation is UB because it does integer operations on uninit values.

error: Undefined Behavior: using uninitialized data, but this operation requires initialized memory
   --> /playground/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/num/mod.rs:836:5
    |
836 | /     uint_impl! { u32, u32, i32, 32, 4294967295, 8, "0x10000b3", "0xb301", "0x12345678",
837 | |     "0x78563412", "0x1e6a2c48", "[0x78, 0x56, 0x34, 0x12]", "[0x12, 0x34, 0x56, 0x78]", "", "" }
    | |________________________________________________________________________________________________^ using uninitialized data, but this operation requires initialized memory
    |
    = help: this indicates a bug in the program: it performed an invalid operation, and caused Undefined Behavior
    = help: see https://doc.rust-lang.org/nightly/reference/behavior-considered-undefined.html for further information
            
    = note: inside `core::num::<impl u32>::rotate_left` at /playground/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/num/uint_macros.rs:211:13
    = note: inside `core::slice::<impl [Foo]>::reverse` at /playground/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/slice/mod.rs:701:58

https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=340739f22ca5b457e1da6f361768edc6

But LLVM has gotten smarter since I wrote the previous implementation in 2017, so this PR removes all the manual magic and just writes it in such a way that LLVM will vectorize. This code is much simpler and has very little unsafe, and is actually faster to boot!

If you're curious to see the codegen: https://rust.godbolt.org/z/Pcn13Y9E3

Before:

running 7 tests
test slice::reverse_simd_f64x4                           ... bench:      17,940 ns/iter (+/- 481) = 58448 MB/s
test slice::reverse_u128                                 ... bench:      17,758 ns/iter (+/- 205) = 59048 MB/s
test slice::reverse_u16                                  ... bench:     158,234 ns/iter (+/- 6,876) = 6626 MB/s
test slice::reverse_u32                                  ... bench:      62,047 ns/iter (+/- 1,117) = 16899 MB/s
test slice::reverse_u64                                  ... bench:      31,582 ns/iter (+/- 552) = 33201 MB/s
test slice::reverse_u8                                   ... bench:      81,253 ns/iter (+/- 1,510) = 12905 MB/s
test slice::reverse_u8x3                                 ... bench:     270,615 ns/iter (+/- 11,463) = 3874 MB/s

After:

running 7 tests
test slice::reverse_simd_f64x4                           ... bench:      17,731 ns/iter (+/- 306) = 59137 MB/s
test slice::reverse_u128                                 ... bench:      17,919 ns/iter (+/- 239) = 58517 MB/s
test slice::reverse_u16                                  ... bench:      43,160 ns/iter (+/- 607) = 24295 MB/s
test slice::reverse_u32                                  ... bench:      21,065 ns/iter (+/- 371) = 49778 MB/s
test slice::reverse_u64                                  ... bench:      21,118 ns/iter (+/- 482) = 49653 MB/s
test slice::reverse_u8                                   ... bench:      76,878 ns/iter (+/- 1,688) = 13639 MB/s
test slice::reverse_u8x3                                 ... bench:     264,723 ns/iter (+/- 5,544) = 3961 MB/s

Those are the existing benches,

macro_rules! reverse {
($name:ident, $ty:ty, $f:expr) => {
#[bench]
fn $name(b: &mut Bencher) {
// odd length and offset by 1 to be as unaligned as possible
let n = 0xFFFFF;
let mut v: Vec<_> = (0..1 + (n / mem::size_of::<$ty>() as u64)).map($f).collect();
b.iter(|| black_box(&mut v[1..]).reverse());
b.bytes = n;
}
};
}
reverse!(reverse_u8, u8, |x| x as u8);
reverse!(reverse_u16, u16, |x| x as u16);
reverse!(reverse_u8x3, [u8; 3], |x| [x as u8, (x >> 8) as u8, (x >> 16) as u8]);
reverse!(reverse_u32, u32, |x| x as u32);
reverse!(reverse_u64, u64, |x| x as u64);
reverse!(reverse_u128, u128, |x| x as u128);
#[repr(simd)]
struct F64x4(f64, f64, f64, f64);
reverse!(reverse_simd_f64x4, F64x4, |x| {
let x = x as f64;
F64x4(x, x, x, x)
});

…LLVM can vectorize

For small types with padding, the current implementation is UB because it does integer operations on uninit values.  But LLVM has gotten smarter since I wrote the previous implementation in 2017, so remove all the manual magic and just write it in such a way that LLVM will vectorize.  This code is much simpler (albeit nuanced) and has very little `unsafe`, and is actually faster to boot!
@rust-highfive
Copy link
Collaborator

r? @Mark-Simulacrum

(rust-highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Nov 12, 2021
@Mark-Simulacrum
Copy link
Member

@bors r+ rollup=never

@bors
Copy link
Contributor

bors commented Nov 14, 2021

📌 Commit 71f5cfb has been approved by Mark-Simulacrum

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Nov 14, 2021
@bors
Copy link
Contributor

bors commented Nov 14, 2021

⌛ Testing commit 71f5cfb with merge 834126ba1b45171372be457c925e7209624feeee...

@bors
Copy link
Contributor

bors commented Nov 14, 2021

💔 Test failed - checks-actions

@bors bors added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. labels Nov 14, 2021
@rust-log-analyzer

This comment has been minimized.

@scottmcm
Copy link
Member Author

I'm quite confident in the unsafe in here, so this is scary

rustc exited with signal: 11 (core dumped)
error: could not compile `bitflags`

I'll just close my eyes and
@bors retry

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Nov 14, 2021
@Mark-Simulacrum
Copy link
Member

This is #90812, almost certainly.

@bors
Copy link
Contributor

bors commented Nov 14, 2021

⌛ Testing commit 71f5cfb with merge 9e811bc4c0e96a0a804668a9f743d510ef7f2b2a...

@bors
Copy link
Contributor

bors commented Nov 14, 2021

💔 Test failed - checks-actions

@bors bors added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. labels Nov 14, 2021
@rust-log-analyzer

This comment has been minimized.

@Mark-Simulacrum
Copy link
Member

@bors retry missing log

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Nov 14, 2021
@bors
Copy link
Contributor

bors commented Nov 14, 2021

⌛ Testing commit 71f5cfb with merge ea32ff4f0641626a78ae29916df2610c1b6a2c0e...

@bors
Copy link
Contributor

bors commented Nov 14, 2021

💔 Test failed - checks-actions

@bors bors added the S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. label Nov 15, 2021
@bors
Copy link
Contributor

bors commented Nov 15, 2021

⌛ Testing commit f541dd1 with merge 8a3d50db3dad17308528f63668a9dbe628bdaa06...

@bors
Copy link
Contributor

bors commented Nov 15, 2021

💔 Test failed - checks-actions

@bors bors added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. labels Nov 15, 2021
@rust-log-analyzer
Copy link
Collaborator

The job dist-mips64el-linux failed! Check out the build log: (web) (plain)

Click to see the possible cause of the failure (guessed by this bot)
   Compiling cmake v0.1.44
error: could not compile `cc`

Caused by:
  process didn't exit successfully: `/checkout/obj/build/x86_64-unknown-linux-gnu/stage0/bin/rustc --crate-name cc --edition=2018 /cargo/registry/src/gh.neting.cc-1ecc6299db9ec823/cc-1.0.69/src/lib.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts --crate-type lib --emit=dep-info,metadata,link -C embed-bitcode=no -C debuginfo=0 -C metadata=9022b5d7a536baba -C extra-filename=-9022b5d7a536baba --out-dir /checkout/obj/build/bootstrap/debug/deps -L dependency=/checkout/obj/build/bootstrap/debug/deps --cap-lints allow -Wrust_2018_idioms -Wunused_lifetimes -Wsemicolon_in_expressions_from_macros -Dwarnings` (signal: 11, SIGSEGV: invalid memory reference)
error: build failed
failed to run: /checkout/obj/build/x86_64-unknown-linux-gnu/stage0/bin/cargo build --manifest-path /checkout/src/bootstrap/Cargo.toml --locked
Build completed unsuccessfully in 0:00:55
make: *** [prepare] Error 1
---
   Compiling globset v0.4.5
error: could not compile `regex`

Caused by:
  process didn't exit successfully: `/checkout/obj/build/x86_64-unknown-linux-gnu/stage0/bin/rustc --crate-name regex --edition=2018 /cargo/registry/src/gh.neting.cc-1ecc6299db9ec823/regex-1.5.4/src/lib.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts --crate-type lib --emit=dep-info,metadata,link -C embed-bitcode=no -C debuginfo=0 --cfg 'feature="aho-corasick"' --cfg 'feature="default"' --cfg 'feature="memchr"' --cfg 'feature="perf"' --cfg 'feature="perf-cache"' --cfg 'feature="perf-dfa"' --cfg 'feature="perf-inline"' --cfg 'feature="perf-literal"' --cfg 'feature="std"' --cfg 'feature="unicode"' --cfg 'feature="unicode-age"' --cfg 'feature="unicode-bool"' --cfg 'feature="unicode-case"' --cfg 'feature="unicode-gencat"' --cfg 'feature="unicode-perl"' --cfg 'feature="unicode-script"' --cfg 'feature="unicode-segment"' -C metadata=b0c184cb911fb168 -C extra-filename=-b0c184cb911fb168 --out-dir /checkout/obj/build/bootstrap/debug/deps -L dependency=/checkout/obj/build/bootstrap/debug/deps --extern aho_corasick=/checkout/obj/build/bootstrap/debug/deps/libaho_corasick-6acdc16a2f6d6338.rmeta --extern memchr=/checkout/obj/build/bootstrap/debug/deps/libmemchr-9d65e5e3abdde0ba.rmeta --extern regex_syntax=/checkout/obj/build/bootstrap/debug/deps/libregex_syntax-74207c8016af6cf7.rmeta --cap-lints allow -Wrust_2018_idioms -Wunused_lifetimes -Wsemicolon_in_expressions_from_macros -Dwarnings` (signal: 11, SIGSEGV: invalid memory reference)
error: build failed
failed to run: /checkout/obj/build/x86_64-unknown-linux-gnu/stage0/bin/cargo build --manifest-path /checkout/src/bootstrap/Cargo.toml --locked
Build completed unsuccessfully in 0:00:03
make: *** [prepare] Error 1
---
[RUSTC-TIMING] adler test:false 0.155
[RUSTC-TIMING] unwind test:false 0.075
[RUSTC-TIMING] libc test:false 0.721
[RUSTC-TIMING] compiler_builtins test:false 0.897
rustc exited with signal: 11 (core dumped)
error: could not compile `compiler_builtins`
Caused by:
Caused by:
  process didn't exit successfully: `/checkout/obj/build/bootstrap/debug/rustc --crate-name compiler_builtins /cargo/registry/src/gh.neting.cc-1ecc6299db9ec823/compiler_builtins-0.1.49/src/lib.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts --crate-type lib --emit=dep-info,metadata,link -C opt-level=3 -C embed-bitcode=no -C codegen-units=10000 -C debuginfo=1 --cfg 'feature="c"' --cfg 'feature="cc"' --cfg 'feature="compiler-builtins"' --cfg 'feature="core"' --cfg 'feature="default"' --cfg 'feature="rustc-dep-of-std"' -C metadata=6bdc4f8396c65e15 -C extra-filename=-6bdc4f8396c65e15 --out-dir /checkout/obj/build/x86_64-unknown-linux-gnu/stage0-std/x86_64-unknown-linux-gnu/release/deps --target x86_64-unknown-linux-gnu -L dependency=/checkout/obj/build/x86_64-unknown-linux-gnu/stage0-std/x86_64-unknown-linux-gnu/release/deps -L dependency=/checkout/obj/build/x86_64-unknown-linux-gnu/stage0-std/release/deps --extern core=/checkout/obj/build/x86_64-unknown-linux-gnu/stage0-std/x86_64-unknown-linux-gnu/release/deps/librustc_std_workspace_core-fd44d8b17c2359b1.rmeta --cap-lints allow --cfg=bootstrap -Zsymbol-mangling-version=legacy -Zmacro-backtrace '-Clink-args=-Wl,-rpath,$ORIGIN/../lib' -Cprefer-dynamic '-Zcrate-attr=doc(html_root_url="https://doc.rust-lang.org/nightly/")' -Z binary-dep-depinfo -L native=/checkout/obj/build/x86_64-unknown-linux-gnu/stage0-std/x86_64-unknown-linux-gnu/release/build/compiler_builtins-399cb2af88d0d65a/out --cfg 'feature="unstable"' --cfg '__absvdi2="optimized-c"' --cfg '__absvsi2="optimized-c"' --cfg '__absvti2="optimized-c"' --cfg '__addvdi3="optimized-c"' --cfg '__addvsi3="optimized-c"' --cfg '__addvti3="optimized-c"' --cfg '__clzdi2="optimized-c"' --cfg '__clzsi2="optimized-c"' --cfg '__clzti2="optimized-c"' --cfg '__cmpdi2="optimized-c"' --cfg '__cmpti2="optimized-c"' --cfg '__ctzdi2="optimized-c"' --cfg '__ctzsi2="optimized-c"' --cfg '__ctzti2="optimized-c"' --cfg '__divdc3="optimized-c"' --cfg '__divsc3="optimized-c"' --cfg '__divxc3="optimized-c"' --cfg '__extendhfsf2="optimized-c"' --cfg '__ffsti2="optimized-c"' --cfg '__floatdisf="optimized-c"' --cfg '__floatdixf="optimized-c"' --cfg '__floatundidf="optimized-c"' --cfg '__floatundisf="optimized-c"' --cfg '__floatundixf="optimized-c"' --cfg '__int_util="optimized-c"' --cfg '__muldc3="optimized-c"' --cfg '__mulsc3="optimized-c"' --cfg '__mulvdi3="optimized-c"' --cfg '__mulvsi3="optimized-c"' --cfg '__mulvti3="optimized-c"' --cfg '__mulxc3="optimized-c"' --cfg '__negdf2="optimized-c"' --cfg '__negdi2="optimized-c"' --cfg '__negsf2="optimized-c"' --cfg '__negti2="optimized-c"' --cfg '__negvdi2="optimized-c"' --cfg '__negvsi2="optimized-c"' --cfg '__negvti2="optimized-c"' --cfg '__paritydi2="optimized-c"' --cfg '__paritysi2="optimized-c"' --cfg '__parityti2="optimized-c"' --cfg '__popcountdi2="optimized-c"' --cfg '__popcountsi2="optimized-c"' --cfg '__popcountti2="optimized-c"' --cfg '__powixf2="optimized-c"' --cfg '__subvdi3="optimized-c"' --cfg '__subvsi3="optimized-c"' --cfg '__subvti3="optimized-c"' --cfg '__truncdfhf2="optimized-c"' --cfg '__truncdfsf2="optimized-c"' --cfg '__truncsfhf2="optimized-c"' --cfg '__ucmpdi2="optimized-c"' --cfg '__ucmpti2="optimized-c"' --cfg 'apple_versioning="optimized-c"' -l static=compiler-rt` (exit status: 254)
[RUSTC-TIMING] memchr test:false 1.375
[RUSTC-TIMING] rustc_demangle test:false 1.515
[RUSTC-TIMING] alloc test:false 2.293
[RUSTC-TIMING] core test:false 16.420

@Mark-Simulacrum
Copy link
Member

@bors retry #90812

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Nov 15, 2021
@bors
Copy link
Contributor

bors commented Nov 15, 2021

⌛ Testing commit f541dd1 with merge 891ca5f...

@bors
Copy link
Contributor

bors commented Nov 15, 2021

☀️ Test successful - checks-actions
Approved by: Mark-Simulacrum
Pushing 891ca5f to master...

@bors bors added the merged-by-bors This PR was explicitly merged by bors. label Nov 15, 2021
@bors bors merged commit 891ca5f into rust-lang:master Nov 15, 2021
@rustbot rustbot added this to the 1.58.0 milestone Nov 15, 2021
@rust-timer
Copy link
Collaborator

Finished benchmarking commit (891ca5f): comparison url.

Summary: This change led to moderate relevant regressions 😿 in compiler performance.

  • Moderate regression in instruction counts (up to 0.8% on full builds of deeply-nested)

If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.

Next Steps: If you can justify the regressions found in this perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please open an issue or create a new PR that fixes the regressions, add a comment linking to the newly created issue or PR, and then add the perf-regression-triaged label to this PR.

@rustbot label: +perf-regression

@rustbot rustbot added the perf-regression Performance regression. label Nov 16, 2021
@scottmcm scottmcm deleted the new-slice-reverse branch November 17, 2021 07:13
@riking
Copy link

riking commented Nov 26, 2021

Instruction count increase seems expected for a vectorization.

@dtolnay dtolnay added the T-libs Relevant to the library team, which will review and decide on the PR/issue. label Feb 21, 2022
Dylan-DPC added a commit to Dylan-DPC/rust that referenced this pull request Feb 24, 2022
Stop manually SIMDing in `swap_nonoverlapping`

Like I previously did for `reverse` (rust-lang#90821), this leaves it to LLVM to pick how to vectorize it, since it can know better the chunk size to use, compared to the "32 bytes always" approach we currently have.

A variety of codegen tests are included to confirm that the various cases are still being vectorized.

It does still need logic to type-erase in some cases, though, as while LLVM is now smart enough to vectorize over slices of things like `[u8; 4]`, it fails to do so over slices of `[u8; 3]`.

As a bonus, this change also means one no longer gets the spurious `memcpy`(s?) at the end up swapping a slice of `__m256`s: <https://rust.godbolt.org/z/joofr4v8Y>

<details>

<summary>ASM for this example</summary>

## Before (from godbolt)

note the `push`/`pop`s and `memcpy`

```x86
swap_m256_slice:
        push    r15
        push    r14
        push    r13
        push    r12
        push    rbx
        sub     rsp, 32
        cmp     rsi, rcx
        jne     .LBB0_6
        mov     r14, rsi
        shl     r14, 5
        je      .LBB0_6
        mov     r15, rdx
        mov     rbx, rdi
        xor     eax, eax
.LBB0_3:
        mov     rcx, rax
        vmovaps ymm0, ymmword ptr [rbx + rax]
        vmovaps ymm1, ymmword ptr [r15 + rax]
        vmovaps ymmword ptr [rbx + rax], ymm1
        vmovaps ymmword ptr [r15 + rax], ymm0
        add     rax, 32
        add     rcx, 64
        cmp     rcx, r14
        jbe     .LBB0_3
        sub     r14, rax
        jbe     .LBB0_6
        add     rbx, rax
        add     r15, rax
        mov     r12, rsp
        mov     r13, qword ptr [rip + memcpy@GOTPCREL]
        mov     rdi, r12
        mov     rsi, rbx
        mov     rdx, r14
        vzeroupper
        call    r13
        mov     rdi, rbx
        mov     rsi, r15
        mov     rdx, r14
        call    r13
        mov     rdi, r15
        mov     rsi, r12
        mov     rdx, r14
        call    r13
.LBB0_6:
        add     rsp, 32
        pop     rbx
        pop     r12
        pop     r13
        pop     r14
        pop     r15
        vzeroupper
        ret
```

## After (from my machine)

Note no `rsp` manipulation, sorry for different ASM syntax

```x86
swap_m256_slice:
	cmpq	%r9, %rdx
	jne	.LBB1_6
	testq	%rdx, %rdx
	je	.LBB1_6
	cmpq	$1, %rdx
	jne	.LBB1_7
	xorl	%r10d, %r10d
	jmp	.LBB1_4
.LBB1_7:
	movq	%rdx, %r9
	andq	$-2, %r9
	movl	$32, %eax
	xorl	%r10d, %r10d
	.p2align	4, 0x90
.LBB1_8:
	vmovaps	-32(%rcx,%rax), %ymm0
	vmovaps	-32(%r8,%rax), %ymm1
	vmovaps	%ymm1, -32(%rcx,%rax)
	vmovaps	%ymm0, -32(%r8,%rax)
	vmovaps	(%rcx,%rax), %ymm0
	vmovaps	(%r8,%rax), %ymm1
	vmovaps	%ymm1, (%rcx,%rax)
	vmovaps	%ymm0, (%r8,%rax)
	addq	$2, %r10
	addq	$64, %rax
	cmpq	%r10, %r9
	jne	.LBB1_8
.LBB1_4:
	testb	$1, %dl
	je	.LBB1_6
	shlq	$5, %r10
	vmovaps	(%rcx,%r10), %ymm0
	vmovaps	(%r8,%r10), %ymm1
	vmovaps	%ymm1, (%rcx,%r10)
	vmovaps	%ymm0, (%r8,%r10)
.LBB1_6:
	vzeroupper
	retq
```

</details>

This does all its copying operations as either the original type or as `MaybeUninit`s, so as far as I know there should be no potential abstract machine issues with reading padding bytes as integers.

<details>

<summary>Perf is essentially unchanged</summary>

Though perhaps with more target features this would help more, if it could pick bigger chunks

## Before

```
running 10 tests
test slice::swap_with_slice_4x_usize_30                            ... bench:         894 ns/iter (+/- 11)
test slice::swap_with_slice_4x_usize_3000                          ... bench:      99,476 ns/iter (+/- 2,784)
test slice::swap_with_slice_5x_usize_30                            ... bench:       1,257 ns/iter (+/- 7)
test slice::swap_with_slice_5x_usize_3000                          ... bench:     139,922 ns/iter (+/- 959)
test slice::swap_with_slice_rgb_30                                 ... bench:         328 ns/iter (+/- 27)
test slice::swap_with_slice_rgb_3000                               ... bench:      16,215 ns/iter (+/- 176)
test slice::swap_with_slice_u8_30                                  ... bench:         312 ns/iter (+/- 9)
test slice::swap_with_slice_u8_3000                                ... bench:       5,401 ns/iter (+/- 123)
test slice::swap_with_slice_usize_30                               ... bench:         368 ns/iter (+/- 3)
test slice::swap_with_slice_usize_3000                             ... bench:      28,472 ns/iter (+/- 3,913)
```

## After

```
running 10 tests
test slice::swap_with_slice_4x_usize_30                            ... bench:         868 ns/iter (+/- 36)
test slice::swap_with_slice_4x_usize_3000                          ... bench:      99,642 ns/iter (+/- 1,507)
test slice::swap_with_slice_5x_usize_30                            ... bench:       1,194 ns/iter (+/- 11)
test slice::swap_with_slice_5x_usize_3000                          ... bench:     139,761 ns/iter (+/- 5,018)
test slice::swap_with_slice_rgb_30                                 ... bench:         324 ns/iter (+/- 6)
test slice::swap_with_slice_rgb_3000                               ... bench:      15,962 ns/iter (+/- 287)
test slice::swap_with_slice_u8_30                                  ... bench:         281 ns/iter (+/- 5)
test slice::swap_with_slice_u8_3000                                ... bench:       5,324 ns/iter (+/- 40)
test slice::swap_with_slice_usize_30                               ... bench:         275 ns/iter (+/- 5)
test slice::swap_with_slice_usize_3000                             ... bench:      28,277 ns/iter (+/- 277)
```

</detail>
Dylan-DPC added a commit to Dylan-DPC/rust that referenced this pull request Feb 24, 2022
Stop manually SIMDing in `swap_nonoverlapping`

Like I previously did for `reverse` (rust-lang#90821), this leaves it to LLVM to pick how to vectorize it, since it can know better the chunk size to use, compared to the "32 bytes always" approach we currently have.

A variety of codegen tests are included to confirm that the various cases are still being vectorized.

It does still need logic to type-erase in some cases, though, as while LLVM is now smart enough to vectorize over slices of things like `[u8; 4]`, it fails to do so over slices of `[u8; 3]`.

As a bonus, this change also means one no longer gets the spurious `memcpy`(s?) at the end up swapping a slice of `__m256`s: <https://rust.godbolt.org/z/joofr4v8Y>

<details>

<summary>ASM for this example</summary>

## Before (from godbolt)

note the `push`/`pop`s and `memcpy`

```x86
swap_m256_slice:
        push    r15
        push    r14
        push    r13
        push    r12
        push    rbx
        sub     rsp, 32
        cmp     rsi, rcx
        jne     .LBB0_6
        mov     r14, rsi
        shl     r14, 5
        je      .LBB0_6
        mov     r15, rdx
        mov     rbx, rdi
        xor     eax, eax
.LBB0_3:
        mov     rcx, rax
        vmovaps ymm0, ymmword ptr [rbx + rax]
        vmovaps ymm1, ymmword ptr [r15 + rax]
        vmovaps ymmword ptr [rbx + rax], ymm1
        vmovaps ymmword ptr [r15 + rax], ymm0
        add     rax, 32
        add     rcx, 64
        cmp     rcx, r14
        jbe     .LBB0_3
        sub     r14, rax
        jbe     .LBB0_6
        add     rbx, rax
        add     r15, rax
        mov     r12, rsp
        mov     r13, qword ptr [rip + memcpy@GOTPCREL]
        mov     rdi, r12
        mov     rsi, rbx
        mov     rdx, r14
        vzeroupper
        call    r13
        mov     rdi, rbx
        mov     rsi, r15
        mov     rdx, r14
        call    r13
        mov     rdi, r15
        mov     rsi, r12
        mov     rdx, r14
        call    r13
.LBB0_6:
        add     rsp, 32
        pop     rbx
        pop     r12
        pop     r13
        pop     r14
        pop     r15
        vzeroupper
        ret
```

## After (from my machine)

Note no `rsp` manipulation, sorry for different ASM syntax

```x86
swap_m256_slice:
	cmpq	%r9, %rdx
	jne	.LBB1_6
	testq	%rdx, %rdx
	je	.LBB1_6
	cmpq	$1, %rdx
	jne	.LBB1_7
	xorl	%r10d, %r10d
	jmp	.LBB1_4
.LBB1_7:
	movq	%rdx, %r9
	andq	$-2, %r9
	movl	$32, %eax
	xorl	%r10d, %r10d
	.p2align	4, 0x90
.LBB1_8:
	vmovaps	-32(%rcx,%rax), %ymm0
	vmovaps	-32(%r8,%rax), %ymm1
	vmovaps	%ymm1, -32(%rcx,%rax)
	vmovaps	%ymm0, -32(%r8,%rax)
	vmovaps	(%rcx,%rax), %ymm0
	vmovaps	(%r8,%rax), %ymm1
	vmovaps	%ymm1, (%rcx,%rax)
	vmovaps	%ymm0, (%r8,%rax)
	addq	$2, %r10
	addq	$64, %rax
	cmpq	%r10, %r9
	jne	.LBB1_8
.LBB1_4:
	testb	$1, %dl
	je	.LBB1_6
	shlq	$5, %r10
	vmovaps	(%rcx,%r10), %ymm0
	vmovaps	(%r8,%r10), %ymm1
	vmovaps	%ymm1, (%rcx,%r10)
	vmovaps	%ymm0, (%r8,%r10)
.LBB1_6:
	vzeroupper
	retq
```

</details>

This does all its copying operations as either the original type or as `MaybeUninit`s, so as far as I know there should be no potential abstract machine issues with reading padding bytes as integers.

<details>

<summary>Perf is essentially unchanged</summary>

Though perhaps with more target features this would help more, if it could pick bigger chunks

## Before

```
running 10 tests
test slice::swap_with_slice_4x_usize_30                            ... bench:         894 ns/iter (+/- 11)
test slice::swap_with_slice_4x_usize_3000                          ... bench:      99,476 ns/iter (+/- 2,784)
test slice::swap_with_slice_5x_usize_30                            ... bench:       1,257 ns/iter (+/- 7)
test slice::swap_with_slice_5x_usize_3000                          ... bench:     139,922 ns/iter (+/- 959)
test slice::swap_with_slice_rgb_30                                 ... bench:         328 ns/iter (+/- 27)
test slice::swap_with_slice_rgb_3000                               ... bench:      16,215 ns/iter (+/- 176)
test slice::swap_with_slice_u8_30                                  ... bench:         312 ns/iter (+/- 9)
test slice::swap_with_slice_u8_3000                                ... bench:       5,401 ns/iter (+/- 123)
test slice::swap_with_slice_usize_30                               ... bench:         368 ns/iter (+/- 3)
test slice::swap_with_slice_usize_3000                             ... bench:      28,472 ns/iter (+/- 3,913)
```

## After

```
running 10 tests
test slice::swap_with_slice_4x_usize_30                            ... bench:         868 ns/iter (+/- 36)
test slice::swap_with_slice_4x_usize_3000                          ... bench:      99,642 ns/iter (+/- 1,507)
test slice::swap_with_slice_5x_usize_30                            ... bench:       1,194 ns/iter (+/- 11)
test slice::swap_with_slice_5x_usize_3000                          ... bench:     139,761 ns/iter (+/- 5,018)
test slice::swap_with_slice_rgb_30                                 ... bench:         324 ns/iter (+/- 6)
test slice::swap_with_slice_rgb_3000                               ... bench:      15,962 ns/iter (+/- 287)
test slice::swap_with_slice_u8_30                                  ... bench:         281 ns/iter (+/- 5)
test slice::swap_with_slice_u8_3000                                ... bench:       5,324 ns/iter (+/- 40)
test slice::swap_with_slice_usize_30                               ... bench:         275 ns/iter (+/- 5)
test slice::swap_with_slice_usize_3000                             ... bench:      28,277 ns/iter (+/- 277)
```

</detail>
Dylan-DPC added a commit to Dylan-DPC/rust that referenced this pull request Feb 24, 2022
Stop manually SIMDing in `swap_nonoverlapping`

Like I previously did for `reverse` (rust-lang#90821), this leaves it to LLVM to pick how to vectorize it, since it can know better the chunk size to use, compared to the "32 bytes always" approach we currently have.

A variety of codegen tests are included to confirm that the various cases are still being vectorized.

It does still need logic to type-erase in some cases, though, as while LLVM is now smart enough to vectorize over slices of things like `[u8; 4]`, it fails to do so over slices of `[u8; 3]`.

As a bonus, this change also means one no longer gets the spurious `memcpy`(s?) at the end up swapping a slice of `__m256`s: <https://rust.godbolt.org/z/joofr4v8Y>

<details>

<summary>ASM for this example</summary>

## Before (from godbolt)

note the `push`/`pop`s and `memcpy`

```x86
swap_m256_slice:
        push    r15
        push    r14
        push    r13
        push    r12
        push    rbx
        sub     rsp, 32
        cmp     rsi, rcx
        jne     .LBB0_6
        mov     r14, rsi
        shl     r14, 5
        je      .LBB0_6
        mov     r15, rdx
        mov     rbx, rdi
        xor     eax, eax
.LBB0_3:
        mov     rcx, rax
        vmovaps ymm0, ymmword ptr [rbx + rax]
        vmovaps ymm1, ymmword ptr [r15 + rax]
        vmovaps ymmword ptr [rbx + rax], ymm1
        vmovaps ymmword ptr [r15 + rax], ymm0
        add     rax, 32
        add     rcx, 64
        cmp     rcx, r14
        jbe     .LBB0_3
        sub     r14, rax
        jbe     .LBB0_6
        add     rbx, rax
        add     r15, rax
        mov     r12, rsp
        mov     r13, qword ptr [rip + memcpy@GOTPCREL]
        mov     rdi, r12
        mov     rsi, rbx
        mov     rdx, r14
        vzeroupper
        call    r13
        mov     rdi, rbx
        mov     rsi, r15
        mov     rdx, r14
        call    r13
        mov     rdi, r15
        mov     rsi, r12
        mov     rdx, r14
        call    r13
.LBB0_6:
        add     rsp, 32
        pop     rbx
        pop     r12
        pop     r13
        pop     r14
        pop     r15
        vzeroupper
        ret
```

## After (from my machine)

Note no `rsp` manipulation, sorry for different ASM syntax

```x86
swap_m256_slice:
	cmpq	%r9, %rdx
	jne	.LBB1_6
	testq	%rdx, %rdx
	je	.LBB1_6
	cmpq	$1, %rdx
	jne	.LBB1_7
	xorl	%r10d, %r10d
	jmp	.LBB1_4
.LBB1_7:
	movq	%rdx, %r9
	andq	$-2, %r9
	movl	$32, %eax
	xorl	%r10d, %r10d
	.p2align	4, 0x90
.LBB1_8:
	vmovaps	-32(%rcx,%rax), %ymm0
	vmovaps	-32(%r8,%rax), %ymm1
	vmovaps	%ymm1, -32(%rcx,%rax)
	vmovaps	%ymm0, -32(%r8,%rax)
	vmovaps	(%rcx,%rax), %ymm0
	vmovaps	(%r8,%rax), %ymm1
	vmovaps	%ymm1, (%rcx,%rax)
	vmovaps	%ymm0, (%r8,%rax)
	addq	$2, %r10
	addq	$64, %rax
	cmpq	%r10, %r9
	jne	.LBB1_8
.LBB1_4:
	testb	$1, %dl
	je	.LBB1_6
	shlq	$5, %r10
	vmovaps	(%rcx,%r10), %ymm0
	vmovaps	(%r8,%r10), %ymm1
	vmovaps	%ymm1, (%rcx,%r10)
	vmovaps	%ymm0, (%r8,%r10)
.LBB1_6:
	vzeroupper
	retq
```

</details>

This does all its copying operations as either the original type or as `MaybeUninit`s, so as far as I know there should be no potential abstract machine issues with reading padding bytes as integers.

<details>

<summary>Perf is essentially unchanged</summary>

Though perhaps with more target features this would help more, if it could pick bigger chunks

## Before

```
running 10 tests
test slice::swap_with_slice_4x_usize_30                            ... bench:         894 ns/iter (+/- 11)
test slice::swap_with_slice_4x_usize_3000                          ... bench:      99,476 ns/iter (+/- 2,784)
test slice::swap_with_slice_5x_usize_30                            ... bench:       1,257 ns/iter (+/- 7)
test slice::swap_with_slice_5x_usize_3000                          ... bench:     139,922 ns/iter (+/- 959)
test slice::swap_with_slice_rgb_30                                 ... bench:         328 ns/iter (+/- 27)
test slice::swap_with_slice_rgb_3000                               ... bench:      16,215 ns/iter (+/- 176)
test slice::swap_with_slice_u8_30                                  ... bench:         312 ns/iter (+/- 9)
test slice::swap_with_slice_u8_3000                                ... bench:       5,401 ns/iter (+/- 123)
test slice::swap_with_slice_usize_30                               ... bench:         368 ns/iter (+/- 3)
test slice::swap_with_slice_usize_3000                             ... bench:      28,472 ns/iter (+/- 3,913)
```

## After

```
running 10 tests
test slice::swap_with_slice_4x_usize_30                            ... bench:         868 ns/iter (+/- 36)
test slice::swap_with_slice_4x_usize_3000                          ... bench:      99,642 ns/iter (+/- 1,507)
test slice::swap_with_slice_5x_usize_30                            ... bench:       1,194 ns/iter (+/- 11)
test slice::swap_with_slice_5x_usize_3000                          ... bench:     139,761 ns/iter (+/- 5,018)
test slice::swap_with_slice_rgb_30                                 ... bench:         324 ns/iter (+/- 6)
test slice::swap_with_slice_rgb_3000                               ... bench:      15,962 ns/iter (+/- 287)
test slice::swap_with_slice_u8_30                                  ... bench:         281 ns/iter (+/- 5)
test slice::swap_with_slice_u8_3000                                ... bench:       5,324 ns/iter (+/- 40)
test slice::swap_with_slice_usize_30                               ... bench:         275 ns/iter (+/- 5)
test slice::swap_with_slice_usize_3000                             ... bench:      28,277 ns/iter (+/- 277)
```

</detail>
bors added a commit to rust-lang/miri that referenced this pull request Mar 24, 2022
regression test for reverse() unsoundness

Cc rust-lang/rust#90821
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
merged-by-bors This PR was explicitly merged by bors. perf-regression Performance regression. S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. T-libs Relevant to the library team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants