Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Mono] Race in init_method when using LLVM AOT. #75088

Merged

Conversation

lateralusX
Copy link
Member

@lateralusX lateralusX commented Sep 5, 2022

When using LLVM AOT codegen, init_method updates two GOT slots. These slots are initialized as part of init_method,
but there is a race between initialization of the two slots. Current implementation can have two threads running init_method for the same method, but as soon as:

[got_slots [pindex]] = addr

store is visible, it will trigger other threads to return back from init_method, but since that could happen before the corresponding LLVM AOT const slot is set, second thread will return to method calling init_method, load the LLVM aot const, and crash when trying to use it (since its still NULL).

This crash is very rare but have been identified on x86/x64 CPU's, when one thread is either preempted between updating regular GOT slot and LLVM aot const slot or store into LLVM aot const slot gets delayed in store buffer. I have also been able to emulate the scenario in debugger, triggering the issue and crashing in the method loading from LLVM aot const slot, like a vtable and then trying to allocate an object using a NULL vtable.

Fix change order of updates and make sure the update of LLVM aot const slot happens before mono_memory_barrier, since:

got [got_slots [pindex]] = addr;

have release semantics in relation to addr and update of amodule->info.llvm_init_aotconst. Fix also add acquire/release semantics for ji->type in init_method since it is used to guard if a thread ignores a patch or not and it should not be re-ordered with previous stores (on weak memory architectures), since it can cause similar race conditions with updated slots.

@lateralusX
Copy link
Member Author

lateralusX commented Sep 5, 2022

@vargaz I also looked a little at the placement of:

if (ji_type == MONO_PATCH_INFO_METHOD_JUMP)
    register_jump_target_got_slot (ji->data.method, &(got [got_slots [pindex]]));

happening after we set:

got [got_slots [pindex]] = addr;

is there a strict ordering that needs to occur between these two? Order is currently enforced due to register_jump_target_got_slot takes a mutex lock, but question is if it must happen after got [got_slots [pindex]] = addr; or if it could be moved above the mono_memory_barrier. Not sure if that would cause any side effects since I'm not full up to date when we could patch the registered jump got slot. Do you see any side effects doing that? The current placement could also trigger a race since another thread will return out from init_method as soon as got [got_slots [pindex]] = addr; is visible and before thread calls register_jump_traget_got_slot, so that could race as well, but not sure it can be exploited in practice.

@vargaz
Copy link
Contributor

vargaz commented Sep 7, 2022

So the problem was a race between setting got [...] and calling init_aotconst () ?

@vargaz
Copy link
Contributor

vargaz commented Sep 7, 2022

I think register_jump_target_got_slot () could be moved earlier as well.

@lateralusX
Copy link
Member Author

So the problem was a race between setting got [...] and calling init_aotconst () ?

Correct.

@lateralusX
Copy link
Member Author

lateralusX commented Sep 7, 2022

I think register_jump_target_got_slot () could be moved earlier as well.

OK, then I move that to happen just before the memory barrier as well, then the assignment of:

got [got_slots [pindex]] = addr;

will have release semantics for all the writes above the memory barrier (including work done in register_jump_target_got_slot)

When using LLVM AOT codegen, init_method updates two GOT slots.
These slots are initialized as part of init_method,
but there is a race between initialization of the two slots. Current
implementation can have two threads running init_method for the same
method, but as soon as:

[got_slots [pindex]] = addr

store is visible, it will trigger other threads to return back from
init_method, but since that could happen before the corresponding
LLVM AOT const slot is set, second thread will return to method
calling init_method, load the LLVM aot const, and crash when
trying to use it (since its still NULL).

This crash is very rare but have been identified on x86/x64 CPU's,
when one thread is either preempted between updating regular GOT slot
and LLVM GOT slot or store into LLVM GOT slot gets delayed in
store buffer. I have also been able to emulate the scenario in debugger,
triggering the issue and crashing in the method loading from LLVM aot
const slot.

Fix change order of updates and make sure the update of LLVM aot const
slot happens before memory_barrier, since:

got [got_slots [pindex]] = addr;

have release semantics in relation to addr and update of LLVM aot const
slot. Fix also add acquire/release semantics for ji->type in init_method
since it is used to guard if a thread ignores a patch or not and it
should not be re-ordered with previous stores, since it can cause
similar race conditions with updated slots.
@lateralusX lateralusX force-pushed the lateralusX/fix-race-in-llvm-aot-init-method branch from 370e2bb to d541a32 Compare September 8, 2022 13:39
@lateralusX
Copy link
Member Author

WasmBuildTests CI lanes fails on other PR's as well, so unrelated.

@lateralusX lateralusX merged commit f591e98 into dotnet:main Sep 9, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Oct 9, 2022
@lambdageek
Copy link
Member

/backport to release/7.0

@github-actions github-actions bot unlocked this conversation Oct 4, 2023
@github-actions
Copy link
Contributor

github-actions bot commented Oct 4, 2023

Started backporting to release/7.0: https://github.com/dotnet/runtime/actions/runs/6406322696

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 4, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-Codegen-LLVM-mono runtime-mono specific to the Mono runtime
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants