Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use wrapping pointer arithmetic in mem/impls.rs #713

Merged
merged 1 commit into from
Oct 17, 2024

Conversation

saethlin
Copy link
Member

@Amanieu you suggested wrapping arithmetic and volatile loads/stores. Some quick experimentation indicates that volatile loads/stores do indeed have an optimization impact, but wrapping pointer arithmetic maybe doesn't: https://godbolt.org/z/Y8MP46dEE

Also, since these intrinsics (i.e. memcpy) are well-known to the compiler, I suspect that if a compiler is going to optimize out a read/write to the last byte in address space, it'll do that by analyzing the call site of the memcpy and not consider whether or not the implementation is full of volatiles. So I suspect using volatiles in here would be dubious benefit for making the Rust toolchain do what our users want to use it for, and probably inflict a slowdown on everyone using compiler-builtins with optimizations enabled.

Add a comment (and fix a typo)
@tgross35
Copy link
Contributor

Copy link
Contributor

@nikic nikic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with having this workaround.

@saethlin
Copy link
Member Author

cargo bench memcpy with master:

test memcpy_builtin_1048576           ... bench:      17,544.64 ns/iter (+/- 370.31) = 59768 MB/s
test memcpy_builtin_1048576_misalign  ... bench:      19,614.00 ns/iter (+/- 274.26) = 53460 MB/s
test memcpy_builtin_1048576_offset    ... bench:      19,790.30 ns/iter (+/- 223.04) = 52985 MB/s
test memcpy_builtin_4096              ... bench:          30.99 ns/iter (+/- 0.21) = 136533 MB/s
test memcpy_builtin_4096_misalign     ... bench:          33.73 ns/iter (+/- 0.71) = 124121 MB/s
test memcpy_builtin_4096_offset       ... bench:          33.70 ns/iter (+/- 0.78) = 124121 MB/s
test memcpy_rust_1048576              ... bench:      89,594.09 ns/iter (+/- 853.49) = 11703 MB/s
test memcpy_rust_1048576_misalign     ... bench:      91,075.34 ns/iter (+/- 903.46) = 11513 MB/s
test memcpy_rust_1048576_offset       ... bench:      90,658.27 ns/iter (+/- 1,009.27) = 11566 MB/s
test memcpy_rust_4096                 ... bench:          53.95 ns/iter (+/- 0.16) = 77283 MB/s
test memcpy_rust_4096_misalign        ... bench:          60.52 ns/iter (+/- 0.64) = 68266 MB/s
test memcpy_rust_4096_offset          ... bench:          60.95 ns/iter (+/- 0.20) = 68266 MB/s

With this PR:

test memcpy_builtin_1048576           ... bench:      20,317.31 ns/iter (+/- 420.51) = 51610 MB/s
test memcpy_builtin_1048576_misalign  ... bench:      19,815.48 ns/iter (+/- 95.81) = 52918 MB/s
test memcpy_builtin_1048576_offset    ... bench:      19,716.34 ns/iter (+/- 154.43) = 53184 MB/s
test memcpy_builtin_4096              ... bench:          31.88 ns/iter (+/- 0.23) = 132129 MB/s
test memcpy_builtin_4096_misalign     ... bench:          34.32 ns/iter (+/- 0.93) = 120470 MB/s
test memcpy_builtin_4096_offset       ... bench:          34.35 ns/iter (+/- 0.46) = 120470 MB/s
test memcpy_rust_1048576              ... bench:      86,453.85 ns/iter (+/- 1,088.59) = 12128 MB/s
test memcpy_rust_1048576_misalign     ... bench:      90,903.52 ns/iter (+/- 833.46) = 11535 MB/s
test memcpy_rust_1048576_offset       ... bench:      90,620.70 ns/iter (+/- 896.45) = 11571 MB/s
test memcpy_rust_4096                 ... bench:          55.54 ns/iter (+/- 0.19) = 74472 MB/s
test memcpy_rust_4096_misalign        ... bench:          62.19 ns/iter (+/- 0.08) = 66064 MB/s
test memcpy_rust_4096_offset          ... bench:          62.62 ns/iter (+/- 0.23) = 66064 MB/s

With read_volatile/write_volatile:

test memcpy_builtin_1048576           ... bench:      19,837.37 ns/iter (+/- 58.06) = 52859 MB/s
test memcpy_builtin_1048576_misalign  ... bench:      20,710.08 ns/iter (+/- 70.59) = 50631 MB/s
test memcpy_builtin_1048576_offset    ... bench:      20,723.41 ns/iter (+/- 130.66) = 50599 MB/s
test memcpy_builtin_4096              ... bench:          31.55 ns/iter (+/- 0.18) = 132129 MB/s
test memcpy_builtin_4096_misalign     ... bench:          34.02 ns/iter (+/- 1.01) = 120470 MB/s
test memcpy_builtin_4096_offset       ... bench:          34.35 ns/iter (+/- 0.42) = 120470 MB/s
test memcpy_rust_1048576              ... bench:      90,437.67 ns/iter (+/- 358.93) = 11594 MB/s
test memcpy_rust_1048576_misalign     ... bench:      91,661.44 ns/iter (+/- 596.94) = 11439 MB/s
test memcpy_rust_1048576_offset       ... bench:      93,593.03 ns/iter (+/- 489.66) = 11203 MB/s
test memcpy_rust_4096                 ... bench:          55.17 ns/iter (+/- 0.07) = 74472 MB/s
test memcpy_rust_4096_misalign        ... bench:          61.81 ns/iter (+/- 0.09) = 67147 MB/s
test memcpy_rust_4096_offset          ... bench:          62.25 ns/iter (+/- 0.45) = 66064 MB/s

There are no significant differences, but I'd still rather not use volatile in here because it just seems like overkill, and I wouldn't be shocked if my nice x86_64 CPU is compensating for the codegen regression in microbenchmarks.

Copy link
Contributor

@tgross35 tgross35 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the results, I don’t think there is any reason not to do this change if it helps the copy-to-end situation.

@tgross35 tgross35 merged commit 2a0c9ed into rust-lang:master Oct 17, 2024
25 checks passed
@saethlin saethlin deleted the permissive-memcpy branch October 17, 2024 23:55
@tgross35
Copy link
Contributor

rust-lang/rust can bump to 0.1.134 after #714 merges

@jonathanpallant
Copy link

#714 is now merged :)

@saethlin
Copy link
Member Author

Posted rust-lang/rust#131907

bors added a commit to rust-lang-ci/rust that referenced this pull request Oct 19, 2024
…=aDotInTheVoid

Update `compiler-builtins` to 0.1.134

I'm modeling this PR after rust-lang#131314.

This pulls in rust-lang/compiler-builtins#713 which should mitigate the problem reported and discussed in https://rust-lang.zulipchat.com/#narrow/channel/131828-t-compiler/topic/Hello.20World.20on.20sparc-unknown-none-elf.20crashes
bors added a commit to rust-lang-ci/rust that referenced this pull request Oct 19, 2024
github-actions bot pushed a commit to rust-lang/miri that referenced this pull request Oct 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants