[Bolt] fix a wrong relocation update issue with weak references #69136
d999ee5
to
cef0e7d
Thanks for the fix!
bolt/lib/Rewrite/RewriteInstance.cpp
@@ -1974,6 +1974,14 @@ bool RewriteInstance::analyzeRelocation(
   if (!Relocation::isSupported(RType))
     return false;

   auto isWeakReference = [](const SymbolRef &Symbol) {
IsWeakUndReference
Ping, please capitalise and add Und
  .type _start, %function
_start:
.LFB6:
  .cfi_startproc
I think we need to add a second case to the test, where the symbol is actually emitted.
f54c244
to
345b857
Thanks for the patch! Please capitalise and add Und: IsWeakUndReference.
# CHECK: {{.*}} <.rodata>:
# CHECK-NEXT: {{.*}} .word 0x{{[0]+}}[[#ADDR]]
# CHECK-NEXT: {{.*}} .word 0x00000000
The func_1 check is missing now.
345b857
to
efd7d37
Hi @yota9, sorry for the late reply. I made several changes,
I still use IsWeakReference, since weak reference is the term for such a case. Is that OK?
efd7d37
to
4affa07
Hi @linsinan1995. May I ask in what way it might impact it? Do some linkers remove the symbol name?
Correct me if I'm wrong, but a weak reference doesn't always mean that the symbol is NULL. You're looking at two properties, weak and undefined, which is why I was asking to add both Weak and Und to the lambda name.
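To illustrate the distinction made in this comment, here is a minimal standalone sketch of a predicate that requires both properties. It is not the lambda from the patch: `SymbolInfo`, `makeSymbol` and the helper names are hypothetical, and the constants are redefined locally with the values the ELF specification assigns to `STB_GLOBAL`, `STB_WEAK` and `SHN_UNDEF`.

```cpp
#include <cassert>
#include <cstdint>

// Local copies of ELF constants (values from the ELF specification),
// redefined so the sketch does not depend on <elf.h>.
constexpr uint8_t kStbGlobal = 1; // STB_GLOBAL
constexpr uint8_t kStbWeak = 2;   // STB_WEAK
constexpr uint16_t kShnUndef = 0; // SHN_UNDEF

// Hypothetical, simplified view of an ELF symbol table entry.
struct SymbolInfo {
  uint8_t Info;          // st_info: binding in the high 4 bits, type in the low 4
  uint16_t SectionIndex; // st_shndx
};

// Hypothetical helper that packs binding and type the way st_info does.
inline SymbolInfo makeSymbol(uint8_t Bind, uint8_t Type, uint16_t Shndx) {
  return SymbolInfo{static_cast<uint8_t>((Bind << 4) | (Type & 0xf)), Shndx};
}

inline bool isWeak(const SymbolInfo &S) { return (S.Info >> 4) == kStbWeak; }

inline bool isUndefined(const SymbolInfo &S) {
  return S.SectionIndex == kShnUndef;
}

// A weak symbol with a definition has a real address; only the
// weak AND undefined combination resolves to zero.
inline bool isWeakUndReference(const SymbolInfo &S) {
  return isWeak(S) && isUndefined(S);
}
```

A weak symbol defined in section 5, for instance, is weak but not undefined, so the predicate rejects it; that is the case the suggested name is meant to make explicit.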
4affa07
to
6a25f55
Hi, I spent some time learning why this happened, and finally found a way to reproduce it with lld ... (we need dynamic relocation and also func_1/2 symbol values live in .rodata)
I think But both are fine. in lld/ELF/Symbols.h
@linsinan1995 Thanks for the explanation, I don't have any objections to this :) The one thing I want to ask is that you minimise the asm test: it seems unnecessary to include all of the instructions, and we want to keep tests minimal.
…eak reference symbol.

Take a common weak reference pattern for example
```
__attribute__((weak)) void undef_weak_fun();
if (&undef_weak_fun)
  undef_weak_fun();
```
In this case, an undefined weak symbol `undef_weak_fun` has an address of zero, and Bolt incorrectly changes the relocation for the corresponding symbol to symbol@PLT, leading to incorrect runtime behavior.
6a25f55
to
2acdf00
Done :)
Thank you for your fix! :)
Thanks!
Regarding the formatting/descriptions on the PR: capitalize BOLT. As a suggestion, I would also make the title shorter (following the 50/72 rule). Although we don't explicitly enforce this, a lot of devs use it because it looks nicer in git log. In BOLT, most diffs are formatted like that. For example:
The first line of the commit message should be the title of the PR, while the remaining lines, formatted to 72 columns, are the description of the PR.
BTW, I was re-reading https://reviews.llvm.org/D118088 to understand why we are redirecting these relocs to PLT entries in the first place, and I can't really tell why. @yota9, do you remember why? The problem is that we could be pessimizing some .data references if the linker concluded they should go straight to the functions, bypassing the PLT. By doing this, we take these optimized references and convert them back to go through the PLT again. In some instances this might change the intended behavior too: if the application was linked with the intent to make a given symbol non-preemptible, we make it preemptible again by redirecting the reference to a PLT entry. We can't blindly redirect everything to a PLT entry, and this weak-undef case is just one example of why doing so is incorrect.
If the assembly code was generated from C source, could you please include the source in the comments and what compiler flags were used to generate the code?
Will do. Thanks a ton for your detailed guide!
That's a really good point.
Sorry, I feel like I didn't quite understand your question. I think you're talking about these lines:
The logic behind this is simple: we found the symbol, but the symbol has no address. Usually this means that the symbol is UND and would be found in the PLT. Yes, I didn't take into account cases like the weak reference fixed here. Maybe I misunderstood you; sorry, please give me more information and I will try to answer :)
Thanks for the context, @yota9. I missed the check that guards on the symbol having no address, which makes this code much more restricted than I previously thought. But I wonder which symbols would have no address and still be strong definitions (non-weak)? I was trying to understand why this if() is necessary, especially after the fix in this patch.
Another thing I was discussing with Maksim yesterday: Maksim pointed out that we should skip processing static relocations in places that also have dynamic relocations, and this weak ref case is an example that should have a dynamic relocation. After all, the weak symbol pattern is precisely one in which the runtime can modify the reference by resolving the weak undef, so BOLT has no business updating a reference that will be resolved at runtime. But if we look at https://github.com/llvm/llvm-project/blob/main/bolt/lib/Rewrite/RewriteInstance.cpp#L2451 we only do this for non-AArch64, because of https://reviews.llvm.org/D122100.
Now, we have this exception where AArch64 processes static relocs in places that also have dynamic relocs, because AArch64 uses R_AARCH64_RELATIVE in .rela.dyn (dynamic) to fix up the load address in constant islands. So we're forced to read static relocs that also have dynamic relocs at the same offset, for this specific case, for AArch64. But that seems to be too big of a hammer to fix this problem. Perhaps we should only process those static relocations if they have the dynamic R_AARCH64_RELATIVE at the same offset, but not other kinds of dynamic relocs [that might mean that the runtime will completely recompute the address at that location].
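The narrower rule proposed at the end of this comment could be sketched roughly as follows. This is only an illustration of the idea under discussion, not code from BOLT: `DynRelocMap` and `shouldProcessStaticReloc` are hypothetical names, and the relocation type numbers are copied from the AArch64 ELF ABI.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Relocation type numbers from the AArch64 ELF ABI, redefined locally.
constexpr uint32_t kRAArch64Abs64 = 257;     // R_AARCH64_ABS64
constexpr uint32_t kRAArch64Relative = 1027; // R_AARCH64_RELATIVE

// Hypothetical index of dynamic relocations: offset -> relocation type.
using DynRelocMap = std::map<uint64_t, uint32_t>;

// Decide whether a static relocation at Offset should be processed.
// - No dynamic relocation at that offset: process as usual.
// - R_AARCH64_RELATIVE at that offset: still process, since the runtime
//   only adds the load bias (the constant-island case).
// - Any other dynamic relocation: skip, because the runtime may recompute
//   the value entirely (e.g. when resolving a weak undefined symbol).
inline bool shouldProcessStaticReloc(const DynRelocMap &DynRelocs,
                                     uint64_t Offset) {
  auto It = DynRelocs.find(Offset);
  if (It == DynRelocs.end())
    return true;
  return It->second == kRAArch64Relative;
}
```

Whether an allow-list of this shape is sufficient is exactly the open question in the thread; other dynamic relocation types might need to be admitted as well.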
Any PLT-related symbol located in another binary (DEFAULT UND)
As for the further discussion, we need to consider it carefully. Maybe you're right and we need to add a "white" list of dynamic relocations that would also have static relocations. To be honest, I can't say right away which way would be best here; it needs deep consideration and a long list of tests, I think :) I will try to ponder this at leisure, because it doesn't look like a simple question to me.
# CHECK: {{.*}} <.rodata>:
# CHECK-NEXT: {{.*}} .word 0x00000000
# CHECK-NEXT: {{.*}} .word 0x00000000
# CHECK-NEXT: {{.*}} .word 0x{{[0]+}}[[#ADDR]]
In a discussion, another thing Maksim pointed out is that this actually needs to be zero too, because this part is resolved by the dynamic linker at runtime. I've checked, and the pre-BOLT binary is indeed zeroed here. Currently, BOLT makes this non-zero because we're processing dynamic relocs for AArch64, for the reasons I discussed above.
It's true that it was 0 here before. But it is up to the linker whether to set a value here or not, and, for example, to support RELR we need to set values here. I don't think it is a problem, and I would rather set the value than not.
> It's true that it was 0 here before. But it is up to the linker whether to set a value here or not, and, for example, to support RELR we need to set values here. I don't think it is a problem, and I would rather set the value than not.
Can we rely on the linker's decision in such cases? I.e., if the linker decided to put 0, keep it at that. If it was the symbol value, then we can update it.
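A rough sketch of the policy suggested in this comment, assuming a 64-bit data slot whose pre-BOLT contents are known. The function name and parameters are hypothetical and are not part of BOLT.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical policy: respect what the linker originally wrote.
// - Slot was zero: the linker left it for the dynamic linker, keep zero.
// - Slot held the symbol's old address: update it to the new address.
// - Anything else: leave the slot untouched.
inline uint64_t rewriteDataSlot(uint64_t SlotValue, uint64_t OldSymbolAddress,
                                uint64_t NewSymbolAddress) {
  if (SlotValue == 0)
    return 0;
  if (SlotValue == OldSymbolAddress)
    return NewSymbolAddress;
  return SlotValue;
}
```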
> > It's true that it was 0 here before. But it is up to the linker whether to set a value here or not, and, for example, to support RELR we need to set values here. I don't think it is a problem, and I would rather set the value than not.
>
> Can we rely on the linker's decision in such cases? I.e., if the linker decided to put 0, keep it at that. If it was the symbol value, then we can update it.
Theoretically we can. But is there a reason behind it? It doesn't affect runtime. And to be honest, I prefer to set the values: we can easily see them in the objdump output, which is sometimes useful.
@yota9 why should BOLT change this? Isn't leaving it to the linker the right thing to do, and also less risky in terms of correctness?
@s-dag Currently I don't see any risks. If there were any trouble, I would handle such situations in a separate patch, but so far I haven't seen binaries where we should worry about these changes. Anyway, this question is out of scope for this patch.
So can we agree on merging this commit, @rafaelauler? I still think that this extra check is not useless and that it fixes the problem :)
@maksfb @rafaelauler If there are no objections, I would like to merge this patch by the end of the week and backport it to 19.x. Thanks!
Hi @linsinan1995. Please merge this patch by Wednesday too if no further comments are provided. Thank you!
Hi, I wanted to report that this seems to be the same issue we ran into while trying to apply BOLT to an AArch64 binary dynamically linked with musl. This is a very important use case for the embedded software world, and we expect more people will run into it as BOLT AArch64/embedded usage picks up. It can be reproduced with a simple hello_world. Reproduction steps are below:
1- 2-
The patch in 2acdf00 does fix this reproducer as well as our own binary. We hope to see it merged soon.
I agree. Either @linsinan1995 or I will probably merge it by the end of the day.
/cherry-pick 6c8933e
Merged, thanks!
) Take a common weak reference pattern for example
```
__attribute__((weak)) void undef_weak_fun();
if (&undef_weak_fun)
  undef_weak_fun();
```
In this case, an undefined weak symbol `undef_weak_fun` has an address of zero, and Bolt incorrectly changes the relocation for the corresponding symbol to symbol@PLT, leading to incorrect runtime behavior. (cherry picked from commit 6c8933e)
/pull-request #102295
It is legal for a weak reference to have an address of zero, but Bolt will update the relocation against this symbol to its PLT entry (a Bolt-synthesized symbol with a non-zero address), which leads to wrong runtime behaviour.
I recently encountered a problem where a segv occurs after using Bolt (the crash happens even with just `llvm-bolt app -o app.opt`). It turns out to be related to weak references, e.g. a weak function that is called only when its address is non-null, `if (&undef_weak_fun) undef_weak_fun();` (ref: https://maskray.me/blog/2021-04-25-weak-symbol)
In this case, an undefined weak symbol `undef_weak_fun` has an address of zero, and Bolt incorrectly changes the relocation for the corresponding symbol to symbol@PLT, leading to incorrect runtime behaviour.
A real-world use case of weak reference: facebook/zstd@6cee3c2
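For readers who want to try the pattern, here is a small sketch of it as compilable code. It assumes GCC or Clang targeting an ELF platform such as Linux, where an undefined weak function links successfully and its address is null. `def_weak_fun` and `callIfPresent` are hypothetical names added for contrast and are not part of the patch or its test.

```cpp
#include <cassert>

// Declared weak and never defined anywhere: on ELF targets the link still
// succeeds and the symbol's address resolves to null.
extern "C" __attribute__((weak)) void undef_weak_fun();

// A weak symbol that does have a definition, for contrast.
extern "C" __attribute__((weak)) void def_weak_fun() {}

// The guard from the PR description: call the function only when its
// address is non-null. Returns whether the call was made.
inline bool callIfPresent(void (*Fn)()) {
  if (Fn) {
    Fn();
    return true;
  }
  return false;
}
```

If the reference to `undef_weak_fun` is redirected to a PLT entry, as described above, the guard sees a non-null address and the call ends up jumping through a stub for a symbol that was never resolved, which matches the segv described in this PR.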