Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix class construction cycle in Lock on NativeAOT #100374

Merged
merged 1 commit into from
Apr 23, 2024
Merged

Conversation

kouvel
Copy link
Member

@kouvel kouvel commented Mar 27, 2024

When a thread reenters class construction through accessing NativeRuntimeEventSource.Log, Log would return null. Checks on IsFullyInitialized were added to ensure that the normal path would not be taken in that case, to avoid null checks in several places and in different files. That doesn't work when a different thread sees the initialization stage as Complete, as it would try to initialize NativeRuntimeEventSource and run into a class construction cycle.

Fixed by removing the IsFullyInitialized checks, introducing a new initialization stage PartiallyCompelte, and not setting the stage to Complete until it has been verified that Log does not return null. When the stage is PartiallyCompelte, a thread would retry the relevant initialization. This again guarantees that there would be at most one attempt at initialization through Lock at any given time, and prevents the class construction cycle.

Fixes #99663

When a thread reenters class construction through accessing NativeRuntimeEventSource.Log, Log would return null. Checks on IsFullyInitialized were added to ensure that the normal path would not be taken in that case, to avoid null checks in several places and in different files. That doesn't work when a different thread sees the initialization stage as Complete, as it would try to initialize NativeRuntimeEventSource and run into a class construction cycle.

Fixed by removing the IsFullyInitialized checks, introducing a new initialization stage PartiallyCompelte, and not setting the stage to Complete until it has been verified that Log does not return null. When the stage is PartiallyCompelte, a thread would retry the relevant initialization. This again guarantees that there would be at most one attempt at initialization through Lock at any given time, and prevents the class construction cycle.

Fixes dotnet#99663
@kouvel kouvel added this to the 9.0.0 milestone Mar 27, 2024
@kouvel kouvel self-assigned this Mar 27, 2024
@kouvel kouvel requested a review from VSadov March 27, 2024 22:40
Copy link
Contributor

Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas
See info in area-owners.md if you want to be subscribed.

@MichalStrehovsky
Copy link
Member

@VSadov @kouvel is there any timeline for this? I stopped looking at segfaults in native AOT testing because whenever I pulled down one it's the stack overflow from this.

Cc @dotnet/ilc-contrib

@kouvel
Copy link
Member Author

kouvel commented Apr 9, 2024

Just trying to fix the CI issue for now. I'm considering a different option, to have the finalizer thread initialize what is necessary for Lock in the background (not relying on a finalizable object), as this initialization is likely to run during startup anyway. The initialization scheme may change later.

@kouvel
Copy link
Member Author

kouvel commented Apr 18, 2024

@VSadov, would you be able to take a look? I think the alternative I suggested above would simplify some things but it needs more testing, I'd like to get the CI issue fixed first.

@kouvel
Copy link
Member Author

kouvel commented Apr 18, 2024

CC @mangod9

s_minSpinCount = DetermineMinSpinCount();
if (oldStage == StaticsInitializationStage.NotStarted)
{
// If the stage is PartiallyComplete, these will have already been initialized
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if they have already been initialized, why is it initializing them again?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are initialized when the stage is changed from NotStarted. The stage doesn't go back to NotStarted once changed, so they would only be initialized once.

@VSadov
Copy link
Member

VSadov commented Apr 19, 2024

This is the third time we are trying to fix a reentrancy issue in the Lock. Perhaps fourth, if counting the same issue in the PR that introduced the Lock. As long as the Lock initialization happens inline, while taking a Lock, and that initialization is not Lock-free (i.e. allocation-free, exception-free, etc...), we will keep finding new ways for this reentrancy to happen.

Lock`s initialization should be trivial. Anything that is nontrivial should be optional and be done "out of line".
In a finalizer of a dummy object would be the simplest. (like in: https://github.com/dotnet/runtime/pull/93879/files#r1572644521)

{
// If the stage is PartiallyComplete, these will have already been initialized
s_isSingleProcessor = Environment.IsSingleProcessor;
s_maxSpinCount = DetermineMaxSpinCount();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if DetermineMaxSpinCount() takes a lock?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would attempt to take the lock and may have to spin-wait to acquire it. Not sure what you're actually asking though, can you clarify?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking of a possibility that DetermineMaxSpinCount somehow directly or indirectly takes a lock and end up getting here.

But we are not holding anything so we are not introducing a deadlock just by reentering, so the recursive locking like this should work. Just hard to think of all the cases.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes the idea is that there is only one attempt at initialization at any given time, so this path would not reenter even in the same thread.

@kouvel
Copy link
Member Author

kouvel commented Apr 22, 2024

This is the third time we are trying to fix a reentrancy issue in the Lock. Perhaps fourth, if counting the same issue in the PR that introduced the Lock. As long as the Lock initialization happens inline, while taking a Lock, and that initialization is not Lock-free (i.e. allocation-free, exception-free, etc...), we will keep finding new ways for this reentrancy to happen.

We're not seeing new fundamental issues yet, the issue being fixed here is just an unfortunate bug that was introduced in a previous change of mine.

Lock`s initialization should be trivial. Anything that is nontrivial should be optional and be done "out of line".
In a finalizer of a dummy object would be the simplest. (like in: https://github.com/dotnet/runtime/pull/93879/files#r1572644521)

The lock does not behave as intended until it is initialized, I don't consider the initialization optional. This is a NativeAOT-specific issue and there it seems nontrivial to do some initialization that should be trivial. I agree that the initialization is better done out-of-line, as there are other potentials for issues that could be avoided. The initialization needs to be done in a timely manner, it's ok for it to be done in a background thread (eg. finalizer). Relying on a finalizable object does not offer timeliness. It also seems nontrivial to run managed initialization on the finalizer thread on startup.

I think simplest would be to use the EagerStaticClassConstruction attribute. I'm not sure that ModuleInitializer would be sufficient because there are other private modules that use Lock. Another workable solution would be for the finalizer thread to background-initialize what is necessary for Lock. Another would be to avoid the cycle altogether by avoiding using Lock in low-level code.

I'd be happy to make a change to accommodate any of the above with any relevant details that may be necessary to make it happen.

@VSadov
Copy link
Member

VSadov commented Apr 22, 2024

The lock does not behave as intended until it is initialized, I don't consider the initialization optional.

Currently the "initialization" is ensuring that logging is enabled and checking for overridden spin defaults. The lock can operate without this. Especially if it is not for long.

It also seems nontrivial to run managed initialization on the finalizer thread on startup.

Yes. Finalizer gets around that and also moves initialization from startup to the later time.
It may be easier to handle initialization failures also - if something throws and we want to handle such case (not sure if we would), but rerunning the finalizer later would be an option.

@kouvel
Copy link
Member Author

kouvel commented Apr 22, 2024

Currently the "initialization" is ensuring that logging is enabled and checking for overridden spin defaults. The lock can operate without this. Especially if it is not for long.

It operates, sure, but not as intended. I don't see a good reason to change the intention of behavior because of some artificial limitation.

@VSadov
Copy link
Member

VSadov commented Apr 22, 2024

All these "unintended" states will be in practice short enough to not matter in regular case.

Forcing the lock into a spinlock mode while trying to initialize is not the intended behavior either. We just expect that it will resolve itself quickly.

It adds many additional states though and it is hard to ensure all possible situations.

Like - I am not sure we can do waits with timeouts in such mode, but we could be still ok if the duration of such mode is bounded. But if the boundedness of the initialization may somehow depend on not taking a lock with a timeout, we might still see some circularity. I can't think of an exact scenario though.

Anyways. I think the current fix will address the issue that it tries to address. At least I do not see how it could fail.

Copy link
Member

@VSadov VSadov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@kouvel
Copy link
Member Author

kouvel commented Apr 22, 2024

I am not sure we can do waits with timeouts in such mode

Currently, the wait path is avoided until initialization completes. This is more a safety net than anything else - new issues can arise on the wait path and it would be difficult to solve those without that safety net. I think there are multiple good reasons to have a way in NativeAOT to do some managed initialization in the background. For instance, if we find that initialization of NativeRuntimeEventSource is too expensive, some of that could be delegated to the background.

@kouvel
Copy link
Member Author

kouvel commented Apr 22, 2024

Background initialization of Lock in NativeAOT, if that were feasible, would only slightly simplify the initialization (and avoid some other potential issues).

@jkotas
Copy link
Member

jkotas commented Apr 22, 2024

NativeRuntimeEventSource is too expensive, some of that could be delegated to the background

FWIW, I do not think that it would work. EventSource has to be initialized eagerly, otherwise the events fired early during startup would be lost.

@kouvel
Copy link
Member Author

kouvel commented Apr 22, 2024

FWIW, I do not think that it would work. EventSource has to be initialized eagerly, otherwise the events fired early during startup would be lost.

Fair enough. In that case, it shouldn't cost much more to initialize NativeRuntimeEventSource.

@kouvel
Copy link
Member Author

kouvel commented Apr 22, 2024

At the very least, it should be cheap to initialize NativeRuntimeEventSource, such that IsEnabled returns false until it is fully initialized, perhaps lazily in some fasion.

@VSadov
Copy link
Member

VSadov commented Apr 22, 2024

FWIW, I do not think that it would work. EventSource has to be initialized eagerly, otherwise the events fired early during startup would be lost.

These events are lock contention events. They typically do not happen until late enough that event source is initialized by someone else. That is one reason why we see these reentrancy failures only rarely.

I think this is something that we could trade for reliability. In fact we already do.

What we have right now already has the property of allowing events to be lost. We null-check the event source and if not yet set, we do not post events.
After this PR, such scenario will not happen because the slow path where contention events are fired cannot happen, since the lock is a strict spinlock until we are done initializing. There is still temporary information loss about possible contentious situations.

I understand the desire to have all the events, but it seems like something we could trade for reliability, when the loss is very rare.

@VSadov
Copy link
Member

VSadov commented Apr 22, 2024

These events are lock contention events. They typically do not happen until late enough that event source is initialized by someone else.

It would be possible for many apps to never need to send a contention event at all or even to use the slow path of the lock.
I think being eager with initialization (as in EagerStaticClassConstruction or ModuleInitializer) could work, but perhaps unnecessary use of startup time.

I think DetermineMinSpinCount() may read environment. Not sure how fast that is in the worst case.

@kouvel
Copy link
Member Author

kouvel commented Apr 23, 2024

On my x64 Windows machine the initialization appears to be taking about 2.3 us in total, of which about 0.2 us for initializing NativeRuntimeEventSource, so the env var lookups appear to be the more expensive parts. Also looks like the initialization is not triggered in simple console apps. Once the initialization starts though, it seems it would complete fairly quickly, so there probably wouldn't be much lost in contention events or in hiding contentions due to waiting for the initialization. So I'm thinking of going with this for now.

@kouvel kouvel merged commit f270565 into dotnet:main Apr 23, 2024
90 checks passed
@kouvel kouvel deleted the LockFix branch April 23, 2024 19:19
matouskozak pushed a commit to matouskozak/runtime that referenced this pull request Apr 30, 2024
When a thread reenters class construction through accessing NativeRuntimeEventSource.Log, Log would return null. Checks on IsFullyInitialized were added to ensure that the normal path would not be taken in that case, to avoid null checks in several places and in different files. That doesn't work when a different thread sees the initialization stage as Complete, as it would try to initialize NativeRuntimeEventSource and run into a class construction cycle.

Fixed by removing the IsFullyInitialized checks, introducing a new initialization stage PartiallyCompelte, and not setting the stage to Complete until it has been verified that Log does not return null. When the stage is PartiallyCompelte, a thread would retry the relevant initialization. This again guarantees that there would be at most one attempt at initialization through Lock at any given time, and prevents the class construction cycle.

Fixes dotnet#99663
michaelgsharp pushed a commit to michaelgsharp/runtime that referenced this pull request May 9, 2024
When a thread reenters class construction through accessing NativeRuntimeEventSource.Log, Log would return null. Checks on IsFullyInitialized were added to ensure that the normal path would not be taken in that case, to avoid null checks in several places and in different files. That doesn't work when a different thread sees the initialization stage as Complete, as it would try to initialize NativeRuntimeEventSource and run into a class construction cycle.

Fixed by removing the IsFullyInitialized checks, introducing a new initialization stage PartiallyCompelte, and not setting the stage to Complete until it has been verified that Log does not return null. When the stage is PartiallyCompelte, a thread would retry the relevant initialization. This again guarantees that there would be at most one attempt at initialization through Lock at any given time, and prevents the class construction cycle.

Fixes dotnet#99663
@github-actions github-actions bot locked and limited conversation to collaborators May 24, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
5 participants