very bad codegen for thread_local! on OSX #60141
It may be interesting to see how much codegen is altered by inlining `__getit`?
@LifeIsStrange it has been inlined here. I would like to fix this, but I'm not sure if the bug is on the rustc side or libstd. Is there anyone who can give some guidance on where to look? If not, I'll just submit a platform-specific version of the …
macos tlv workaround fixes: #60141

Includes:
* Remove dead code: `requires_move_before_drop`. This hasn't been needed for a while now (oops, I should have removed it in #57655).
* Redox had a copy of `fast::Key` (not sure why?). That has been removed.
* Perform a `read_volatile` on OSX to reduce `tlv_get_addr` calls per `__getit` from 2-4 (depending on context) to 1.

`tlv_get_addr` is relatively expensive (~1.5 ns on my machine). Previously, in contexts where `__getit` was inlined, 4 calls to `tlv_get_addr` were performed per lookup. For some reason, when `__getit` is not inlined this is reduced to 2, and performance improves to match. After this PR, I have only ever seen 1 call to `tlv_get_addr` per `__getit`, and macos now benefits from situations where `__getit` is inlined.

I'm not sure if the `read_volatile(&&__KEY)` trick is working around an LLVM bug, or a rustc bug, or neither.

r? @alexcrichton
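The trick is easiest to see outside of libstd. Below is a minimal, hypothetical sketch (not the actual `thread_local!` expansion; the static and function names here are stand-ins) of how a volatile read of a reference can pin an address computation to a single lookup:

```rust
use std::ptr::read_volatile;

// Stand-in for the internal `__KEY` static (hypothetical; the real one is
// a thread-local generated inside the `thread_local!` macro expansion).
static KEY: u32 = 42;

fn getit() -> *const u32 {
    // `&&KEY` is a reference to a temporary `&u32`. Reading that reference
    // with `read_volatile` forces the address of `KEY` to be materialized
    // exactly once; the optimizer cannot recompute it at each use site
    // (which on macOS would mean extra `tlv_get_addr` calls).
    unsafe { read_volatile(&&KEY) as *const u32 }
}

fn main() {
    // Every later use goes through the single pinned address.
    println!("{}", unsafe { *getit() });
}
```

With the real thread-local `__KEY`, the same pattern is what reduces the address lookups to one `tlv_get_addr` call per `__getit`.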
This is so odd, why does using a `read_volatile` fix this?
Some background you probably already know: LLVM and gcc/clang/etc. treat the address of a thread-local as something cheap to recompute. Before the read volatile trick, the read is not memoized because the optimizer mistakenly thinks it is faster to just calculate the address again, as it probably would be for normal non-thread-local variables. This results in a minimum of 2-4 `tlv_get_addr` calls per lookup.
Ah, that makes a lot of sense. Thanks for the explanation!
This sample code demonstrates the problem (godbolt link, macos on left/top).
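The sample code itself is collapsed in the original issue; a minimal stand-in that exercises the same path (repeated accesses to a `thread_local!` holding a destructor-bearing type) might look like this:

```rust
use std::cell::Cell;

thread_local! {
    // `String` has a destructor, so this exercises the state where the
    // value is initialized and the destructor is registered, not just
    // the initialization fast path.
    static NAME: Cell<Option<String>> = Cell::new(None);
}

fn touch_twice() {
    // Each `with` goes through the generated `__getit`; on macOS, each
    // access can trigger its own `tlv_get_addr` call(s) instead of
    // reusing the address computed by the first access.
    NAME.with(|n| n.set(Some(String::from("a"))));
    NAME.with(|n| assert!(n.take().is_some()));
}

fn main() {
    touch_twice();
}
```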
(asm listing collapsed in the original)
The asm demonstrates that even when the value has been initialized, and the destructor registered but not yet running, the thread-local pointer gets looked up (`callq *(%rdi)`) four times!!! On Linux that lookup only occurs once. One potential fix is to insert a `read_volatile` inside of `__getit()`, which improves the asm to the following:
(improved asm listing collapsed in the original)
Benchmark (code collapsed in the original)
Results (collapsed in the original)
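As a rough stand-in for the collapsed benchmark (hypothetical, not the original code), one could time repeated thread-local accesses like this:

```rust
use std::cell::Cell;
use std::time::Instant;

thread_local! {
    static VALUE: Cell<u64> = Cell::new(0);
}

fn main() {
    const ITERS: u64 = 100_000_000;
    let start = Instant::now();
    for _ in 0..ITERS {
        // Each iteration performs one thread-local lookup plus a cheap
        // read-modify-write, so per-iteration time is dominated by the
        // lookup (`tlv_get_addr` on macOS).
        VALUE.with(|v| v.set(v.get().wrapping_add(1)));
    }
    let elapsed = start.elapsed();
    println!(
        "{:.2} ns per access",
        elapsed.as_nanos() as f64 / ITERS as f64
    );
    // Keep the result observable so the loop is not optimized away.
    VALUE.with(|v| assert_eq!(v.get(), ITERS));
}
```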