Use wrapping pointer arithmetic in mem/impls.rs #713

saethlin · 2024-10-17T02:41:21Z

@Amanieu you suggested wrapping arithmetic and volatile loads/stores. Some quick experimentation indicates that volatile loads/stores do indeed have an optimization impact, but wrapping pointer arithmetic maybe doesn't: https://godbolt.org/z/Y8MP46dEE

Also, since these intrinsics (i.e. memcpy) are well-known to the compiler, I suspect that if a compiler is going to optimize out a read/write to the last byte in address space, it'll do that by analyzing the call site of the memcpy and not consider whether or not the implementation is full of volatiles. So I suspect using volatiles in here would be dubious benefit for making the Rust toolchain do what our users want to use it for, and probably inflict a slowdown on everyone using compiler-builtins with optimizations enabled.

Add a comment (and fix a typo)

tgross35 · 2024-10-17T02:49:11Z

For reference we have some benchmarks in https://github.com/rust-lang/compiler-builtins/blob/cb060052ab7e4bad408c85d44be7e60096e93e38/testcrate/benches/mem.rs

nikic

I'm fine with having this workaround.

saethlin · 2024-10-17T22:14:23Z

cargo bench memcpy with master:

test memcpy_builtin_1048576           ... bench:      17,544.64 ns/iter (+/- 370.31) = 59768 MB/s
test memcpy_builtin_1048576_misalign  ... bench:      19,614.00 ns/iter (+/- 274.26) = 53460 MB/s
test memcpy_builtin_1048576_offset    ... bench:      19,790.30 ns/iter (+/- 223.04) = 52985 MB/s
test memcpy_builtin_4096              ... bench:          30.99 ns/iter (+/- 0.21) = 136533 MB/s
test memcpy_builtin_4096_misalign     ... bench:          33.73 ns/iter (+/- 0.71) = 124121 MB/s
test memcpy_builtin_4096_offset       ... bench:          33.70 ns/iter (+/- 0.78) = 124121 MB/s
test memcpy_rust_1048576              ... bench:      89,594.09 ns/iter (+/- 853.49) = 11703 MB/s
test memcpy_rust_1048576_misalign     ... bench:      91,075.34 ns/iter (+/- 903.46) = 11513 MB/s
test memcpy_rust_1048576_offset       ... bench:      90,658.27 ns/iter (+/- 1,009.27) = 11566 MB/s
test memcpy_rust_4096                 ... bench:          53.95 ns/iter (+/- 0.16) = 77283 MB/s
test memcpy_rust_4096_misalign        ... bench:          60.52 ns/iter (+/- 0.64) = 68266 MB/s
test memcpy_rust_4096_offset          ... bench:          60.95 ns/iter (+/- 0.20) = 68266 MB/s

With this PR:

test memcpy_builtin_1048576           ... bench:      20,317.31 ns/iter (+/- 420.51) = 51610 MB/s
test memcpy_builtin_1048576_misalign  ... bench:      19,815.48 ns/iter (+/- 95.81) = 52918 MB/s
test memcpy_builtin_1048576_offset    ... bench:      19,716.34 ns/iter (+/- 154.43) = 53184 MB/s
test memcpy_builtin_4096              ... bench:          31.88 ns/iter (+/- 0.23) = 132129 MB/s
test memcpy_builtin_4096_misalign     ... bench:          34.32 ns/iter (+/- 0.93) = 120470 MB/s
test memcpy_builtin_4096_offset       ... bench:          34.35 ns/iter (+/- 0.46) = 120470 MB/s
test memcpy_rust_1048576              ... bench:      86,453.85 ns/iter (+/- 1,088.59) = 12128 MB/s
test memcpy_rust_1048576_misalign     ... bench:      90,903.52 ns/iter (+/- 833.46) = 11535 MB/s
test memcpy_rust_1048576_offset       ... bench:      90,620.70 ns/iter (+/- 896.45) = 11571 MB/s
test memcpy_rust_4096                 ... bench:          55.54 ns/iter (+/- 0.19) = 74472 MB/s
test memcpy_rust_4096_misalign        ... bench:          62.19 ns/iter (+/- 0.08) = 66064 MB/s
test memcpy_rust_4096_offset          ... bench:          62.62 ns/iter (+/- 0.23) = 66064 MB/s

With read_volatile/write_volatile:

test memcpy_builtin_1048576           ... bench:      19,837.37 ns/iter (+/- 58.06) = 52859 MB/s
test memcpy_builtin_1048576_misalign  ... bench:      20,710.08 ns/iter (+/- 70.59) = 50631 MB/s
test memcpy_builtin_1048576_offset    ... bench:      20,723.41 ns/iter (+/- 130.66) = 50599 MB/s
test memcpy_builtin_4096              ... bench:          31.55 ns/iter (+/- 0.18) = 132129 MB/s
test memcpy_builtin_4096_misalign     ... bench:          34.02 ns/iter (+/- 1.01) = 120470 MB/s
test memcpy_builtin_4096_offset       ... bench:          34.35 ns/iter (+/- 0.42) = 120470 MB/s
test memcpy_rust_1048576              ... bench:      90,437.67 ns/iter (+/- 358.93) = 11594 MB/s
test memcpy_rust_1048576_misalign     ... bench:      91,661.44 ns/iter (+/- 596.94) = 11439 MB/s
test memcpy_rust_1048576_offset       ... bench:      93,593.03 ns/iter (+/- 489.66) = 11203 MB/s
test memcpy_rust_4096                 ... bench:          55.17 ns/iter (+/- 0.07) = 74472 MB/s
test memcpy_rust_4096_misalign        ... bench:          61.81 ns/iter (+/- 0.09) = 67147 MB/s
test memcpy_rust_4096_offset          ... bench:          62.25 ns/iter (+/- 0.45) = 66064 MB/s

There are no significant differences, but I'd still rather not use volatile in here because it just seems like overkill, and I wouldn't be shocked if my nice x86_64 CPU is compensating for the codegen regression in microbenchmarks.

tgross35

Thanks for the results, I don’t think there is any reason not to do this change if it helps the copy-to-end situation.

tgross35 · 2024-10-17T23:56:27Z

rust-lang/rust can bump to 0.1.134 after #714 merges

jonathanpallant · 2024-10-18T09:37:38Z

#714 is now merged :)

saethlin · 2024-10-18T20:57:23Z

Posted rust-lang/rust#131907

…=aDotInTheVoid Update `compiler-builtins` to 0.1.134 I'm modeling this PR after rust-lang#131314. This pulls in rust-lang/compiler-builtins#713 which should mitigate the problem reported and discussed in https://rust-lang.zulipchat.com/#narrow/channel/131828-t-compiler/topic/Hello.20World.20on.20sparc-unknown-none-elf.20crashes

…=tgross35 Update `compiler-builtins` to 0.1.134 I'm modeling this PR after rust-lang#131314. This pulls in rust-lang/compiler-builtins#713 which should mitigate the problem reported and discussed in https://rust-lang.zulipchat.com/#narrow/channel/131828-t-compiler/topic/Hello.20World.20on.20sparc-unknown-none-elf.20crashes

Update `compiler-builtins` to 0.1.134 I'm modeling this PR after rust-lang/rust#131314. This pulls in rust-lang/compiler-builtins#713 which should mitigate the problem reported and discussed in https://rust-lang.zulipchat.com/#narrow/channel/131828-t-compiler/topic/Hello.20World.20on.20sparc-unknown-none-elf.20crashes

Use wrapping pointer arithmetic in mem/impls.rs

5bce2d9

Add a comment (and fix a typo)

tgross35 assigned Amanieu Oct 17, 2024

nikic approved these changes Oct 17, 2024

View reviewed changes

tgross35 approved these changes Oct 17, 2024

View reviewed changes

tgross35 merged commit 2a0c9ed into rust-lang:master Oct 17, 2024
25 checks passed

saethlin deleted the permissive-memcpy branch October 17, 2024 23:55

saethlin mentioned this pull request Oct 18, 2024

Update compiler-builtins to 0.1.134 rust-lang/rust#131907

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use wrapping pointer arithmetic in mem/impls.rs #713

Use wrapping pointer arithmetic in mem/impls.rs #713

saethlin commented Oct 17, 2024

tgross35 commented Oct 17, 2024

nikic left a comment

saethlin commented Oct 17, 2024

tgross35 left a comment

tgross35 commented Oct 17, 2024

jonathanpallant commented Oct 18, 2024

saethlin commented Oct 18, 2024

Use wrapping pointer arithmetic in mem/impls.rs #713

Use wrapping pointer arithmetic in mem/impls.rs #713

Conversation

saethlin commented Oct 17, 2024

tgross35 commented Oct 17, 2024

nikic left a comment

Choose a reason for hiding this comment

saethlin commented Oct 17, 2024

tgross35 left a comment

Choose a reason for hiding this comment

tgross35 commented Oct 17, 2024

jonathanpallant commented Oct 18, 2024

saethlin commented Oct 18, 2024