Fix class construction cycle in Lock on NativeAOT #100374

kouvel · 2024-03-27T22:40:05Z

When a thread reenters class construction through accessing NativeRuntimeEventSource.Log, Log would return null. Checks on IsFullyInitialized were added to ensure that the normal path would not be taken in that case, to avoid null checks in several places and in different files. That doesn't work when a different thread sees the initialization stage as Complete, as it would try to initialize NativeRuntimeEventSource and run into a class construction cycle.

Fixed by removing the IsFullyInitialized checks, introducing a new initialization stage PartiallyCompelte, and not setting the stage to Complete until it has been verified that Log does not return null. When the stage is PartiallyCompelte, a thread would retry the relevant initialization. This again guarantees that there would be at most one attempt at initialization through Lock at any given time, and prevents the class construction cycle.

Fixes #99663

When a thread reenters class construction through accessing NativeRuntimeEventSource.Log, Log would return null. Checks on IsFullyInitialized were added to ensure that the normal path would not be taken in that case, to avoid null checks in several places and in different files. That doesn't work when a different thread sees the initialization stage as Complete, as it would try to initialize NativeRuntimeEventSource and run into a class construction cycle. Fixed by removing the IsFullyInitialized checks, introducing a new initialization stage PartiallyCompelte, and not setting the stage to Complete until it has been verified that Log does not return null. When the stage is PartiallyCompelte, a thread would retry the relevant initialization. This again guarantees that there would be at most one attempt at initialization through Lock at any given time, and prevents the class construction cycle. Fixes dotnet#99663

dotnet-policy-service · 2024-03-27T22:40:39Z

Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas
See info in area-owners.md if you want to be subscribed.

MichalStrehovsky · 2024-04-09T08:19:08Z

@VSadov @kouvel is there any timeline for this? I stopped looking at segfaults in native AOT testing because whenever I pulled down one it's the stack overflow from this.

Cc @dotnet/ilc-contrib

kouvel · 2024-04-09T10:49:08Z

Just trying to fix the CI issue for now. I'm considering a different option, to have the finalizer thread initialize what is necessary for Lock in the background (not relying on a finalizable object), as this initialization is likely to run during startup anyway. The initialization scheme may change later.

kouvel · 2024-04-18T05:19:28Z

@VSadov, would you be able to take a look? I think the alternative I suggested above would simplify some things but it needs more testing, I'd like to get the CI issue fixed first.

kouvel · 2024-04-18T05:20:24Z

CC @mangod9

mangod9 · 2024-04-18T06:10:12Z

src/coreclr/nativeaot/System.Private.CoreLib/src/System/Threading/Lock.NativeAot.cs

-                s_minSpinCount = DetermineMinSpinCount();
+                if (oldStage == StaticsInitializationStage.NotStarted)
+                {
+                    // If the stage is PartiallyComplete, these will have already been initialized


if they have already been initialized, why is it initializing them again?

These are initialized when the stage is changed from NotStarted. The stage doesn't go back to NotStarted once changed, so they would only be initialized once.

VSadov · 2024-04-19T16:44:40Z

This is the third time we are trying to fix a reentrancy issue in the Lock. Perhaps fourth, if counting the same issue in the PR that introduced the Lock. As long as the Lock initialization happens inline, while taking a Lock, and that initialization is not Lock-free (i.e. allocation-free, exception-free, etc...), we will keep finding new ways for this reentrancy to happen.

Lock`s initialization should be trivial. Anything that is nontrivial should be optional and be done "out of line".
In a finalizer of a dummy object would be the simplest. (like in: https://github.com/dotnet/runtime/pull/93879/files#r1572644521)

VSadov · 2024-04-19T16:58:41Z

src/coreclr/nativeaot/System.Private.CoreLib/src/System/Threading/Lock.NativeAot.cs

+                {
+                    // If the stage is PartiallyComplete, these will have already been initialized
+                    s_isSingleProcessor = Environment.IsSingleProcessor;
+                    s_maxSpinCount = DetermineMaxSpinCount();


What happens if DetermineMaxSpinCount() takes a lock?

It would attempt to take the lock and may have to spin-wait to acquire it. Not sure what you're actually asking though, can you clarify?

I was thinking of a possibility that DetermineMaxSpinCount somehow directly or indirectly takes a lock and end up getting here.

But we are not holding anything so we are not introducing a deadlock just by reentering, so the recursive locking like this should work. Just hard to think of all the cases.

Yes the idea is that there is only one attempt at initialization at any given time, so this path would not reenter even in the same thread.

src/coreclr/nativeaot/System.Private.CoreLib/src/System/Threading/Lock.NativeAot.cs

kouvel · 2024-04-22T15:47:48Z

This is the third time we are trying to fix a reentrancy issue in the Lock. Perhaps fourth, if counting the same issue in the PR that introduced the Lock. As long as the Lock initialization happens inline, while taking a Lock, and that initialization is not Lock-free (i.e. allocation-free, exception-free, etc...), we will keep finding new ways for this reentrancy to happen.

We're not seeing new fundamental issues yet, the issue being fixed here is just an unfortunate bug that was introduced in a previous change of mine.

Lock`s initialization should be trivial. Anything that is nontrivial should be optional and be done "out of line".
In a finalizer of a dummy object would be the simplest. (like in: https://github.com/dotnet/runtime/pull/93879/files#r1572644521)

The lock does not behave as intended until it is initialized, I don't consider the initialization optional. This is a NativeAOT-specific issue and there it seems nontrivial to do some initialization that should be trivial. I agree that the initialization is better done out-of-line, as there are other potentials for issues that could be avoided. The initialization needs to be done in a timely manner, it's ok for it to be done in a background thread (eg. finalizer). Relying on a finalizable object does not offer timeliness. It also seems nontrivial to run managed initialization on the finalizer thread on startup.

I think simplest would be to use the EagerStaticClassConstruction attribute. I'm not sure that ModuleInitializer would be sufficient because there are other private modules that use Lock. Another workable solution would be for the finalizer thread to background-initialize what is necessary for Lock. Another would be to avoid the cycle altogether by avoiding using Lock in low-level code.

I'd be happy to make a change to accommodate any of the above with any relevant details that may be necessary to make it happen.

VSadov · 2024-04-22T16:26:50Z

The lock does not behave as intended until it is initialized, I don't consider the initialization optional.

Currently the "initialization" is ensuring that logging is enabled and checking for overridden spin defaults. The lock can operate without this. Especially if it is not for long.

It also seems nontrivial to run managed initialization on the finalizer thread on startup.

Yes. Finalizer gets around that and also moves initialization from startup to the later time.
It may be easier to handle initialization failures also - if something throws and we want to handle such case (not sure if we would), but rerunning the finalizer later would be an option.

kouvel · 2024-04-22T16:33:23Z

Currently the "initialization" is ensuring that logging is enabled and checking for overridden spin defaults. The lock can operate without this. Especially if it is not for long.

It operates, sure, but not as intended. I don't see a good reason to change the intention of behavior because of some artificial limitation.

VSadov · 2024-04-22T16:47:31Z

All these "unintended" states will be in practice short enough to not matter in regular case.

Forcing the lock into a spinlock mode while trying to initialize is not the intended behavior either. We just expect that it will resolve itself quickly.

It adds many additional states though and it is hard to ensure all possible situations.

Like - I am not sure we can do waits with timeouts in such mode, but we could be still ok if the duration of such mode is bounded. But if the boundedness of the initialization may somehow depend on not taking a lock with a timeout, we might still see some circularity. I can't think of an exact scenario though.

Anyways. I think the current fix will address the issue that it tries to address. At least I do not see how it could fail.

VSadov

LGTM.

kouvel · 2024-04-22T17:14:00Z

I am not sure we can do waits with timeouts in such mode

Currently, the wait path is avoided until initialization completes. This is more a safety net than anything else - new issues can arise on the wait path and it would be difficult to solve those without that safety net. I think there are multiple good reasons to have a way in NativeAOT to do some managed initialization in the background. For instance, if we find that initialization of NativeRuntimeEventSource is too expensive, some of that could be delegated to the background.

kouvel · 2024-04-22T17:32:01Z

Background initialization of Lock in NativeAOT, if that were feasible, would only slightly simplify the initialization (and avoid some other potential issues).

jkotas · 2024-04-22T17:40:09Z

NativeRuntimeEventSource is too expensive, some of that could be delegated to the background

FWIW, I do not think that it would work. EventSource has to be initialized eagerly, otherwise the events fired early during startup would be lost.

kouvel · 2024-04-22T18:02:26Z

FWIW, I do not think that it would work. EventSource has to be initialized eagerly, otherwise the events fired early during startup would be lost.

Fair enough. In that case, it shouldn't cost much more to initialize NativeRuntimeEventSource.

kouvel · 2024-04-22T18:07:27Z

At the very least, it should be cheap to initialize NativeRuntimeEventSource, such that IsEnabled returns false until it is fully initialized, perhaps lazily in some fasion.

VSadov · 2024-04-22T19:33:05Z

FWIW, I do not think that it would work. EventSource has to be initialized eagerly, otherwise the events fired early during startup would be lost.

These events are lock contention events. They typically do not happen until late enough that event source is initialized by someone else. That is one reason why we see these reentrancy failures only rarely.

I think this is something that we could trade for reliability. In fact we already do.

What we have right now already has the property of allowing events to be lost. We null-check the event source and if not yet set, we do not post events.
After this PR, such scenario will not happen because the slow path where contention events are fired cannot happen, since the lock is a strict spinlock until we are done initializing. There is still temporary information loss about possible contentious situations.

I understand the desire to have all the events, but it seems like something we could trade for reliability, when the loss is very rare.

VSadov · 2024-04-22T19:46:33Z

These events are lock contention events. They typically do not happen until late enough that event source is initialized by someone else.

It would be possible for many apps to never need to send a contention event at all or even to use the slow path of the lock.
I think being eager with initialization (as in EagerStaticClassConstruction or ModuleInitializer) could work, but perhaps unnecessary use of startup time.

I think DetermineMinSpinCount() may read environment. Not sure how fast that is in the worst case.

kouvel · 2024-04-23T17:01:55Z

On my x64 Windows machine the initialization appears to be taking about 2.3 us in total, of which about 0.2 us for initializing NativeRuntimeEventSource, so the env var lookups appear to be the more expensive parts. Also looks like the initialization is not triggered in simple console apps. Once the initialization starts though, it seems it would complete fairly quickly, so there probably wouldn't be much lost in contention events or in hiding contentions due to waiting for the initialization. So I'm thinking of going with this for now.

When a thread reenters class construction through accessing NativeRuntimeEventSource.Log, Log would return null. Checks on IsFullyInitialized were added to ensure that the normal path would not be taken in that case, to avoid null checks in several places and in different files. That doesn't work when a different thread sees the initialization stage as Complete, as it would try to initialize NativeRuntimeEventSource and run into a class construction cycle. Fixed by removing the IsFullyInitialized checks, introducing a new initialization stage PartiallyCompelte, and not setting the stage to Complete until it has been verified that Log does not return null. When the stage is PartiallyCompelte, a thread would retry the relevant initialization. This again guarantees that there would be at most one attempt at initialization through Lock at any given time, and prevents the class construction cycle. Fixes dotnet#99663

kouvel added the area-NativeAOT-coreclr label Mar 27, 2024

kouvel added this to the 9.0.0 milestone Mar 27, 2024

kouvel self-assigned this Mar 27, 2024

kouvel requested a review from MichalStrehovsky as a code owner March 27, 2024 22:40

kouvel requested a review from VSadov March 27, 2024 22:40

mangod9 reviewed Apr 18, 2024

View reviewed changes

VSadov reviewed Apr 19, 2024

View reviewed changes

src/coreclr/nativeaot/System.Private.CoreLib/src/System/Threading/Lock.NativeAot.cs Show resolved Hide resolved

VSadov approved these changes Apr 22, 2024

View reviewed changes

kouvel merged commit f270565 into dotnet:main Apr 23, 2024
90 checks passed

kouvel deleted the LockFix branch April 23, 2024 19:19

github-actions bot locked and limited conversation to collaborators May 24, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix class construction cycle in Lock on NativeAOT #100374

Fix class construction cycle in Lock on NativeAOT #100374

kouvel commented Mar 27, 2024

dotnet-policy-service bot commented Mar 27, 2024

MichalStrehovsky commented Apr 9, 2024

kouvel commented Apr 9, 2024

kouvel commented Apr 18, 2024

kouvel commented Apr 18, 2024

mangod9 Apr 18, 2024

kouvel Apr 22, 2024

VSadov commented Apr 19, 2024 •

edited

Loading

VSadov Apr 19, 2024

kouvel Apr 22, 2024

VSadov Apr 22, 2024

kouvel Apr 22, 2024

kouvel commented Apr 22, 2024

VSadov commented Apr 22, 2024 •

edited

Loading

kouvel commented Apr 22, 2024

VSadov commented Apr 22, 2024

VSadov left a comment

kouvel commented Apr 22, 2024

kouvel commented Apr 22, 2024 •

edited

Loading

jkotas commented Apr 22, 2024

kouvel commented Apr 22, 2024

kouvel commented Apr 22, 2024

VSadov commented Apr 22, 2024 •

edited

Loading

VSadov commented Apr 22, 2024

kouvel commented Apr 23, 2024

Fix class construction cycle in Lock on NativeAOT #100374

Fix class construction cycle in Lock on NativeAOT #100374

Conversation

kouvel commented Mar 27, 2024

dotnet-policy-service bot commented Mar 27, 2024

MichalStrehovsky commented Apr 9, 2024

kouvel commented Apr 9, 2024

kouvel commented Apr 18, 2024

kouvel commented Apr 18, 2024

mangod9 Apr 18, 2024

Choose a reason for hiding this comment

kouvel Apr 22, 2024

Choose a reason for hiding this comment

VSadov commented Apr 19, 2024 • edited Loading

VSadov Apr 19, 2024

Choose a reason for hiding this comment

kouvel Apr 22, 2024

Choose a reason for hiding this comment

VSadov Apr 22, 2024

Choose a reason for hiding this comment

kouvel Apr 22, 2024

Choose a reason for hiding this comment

kouvel commented Apr 22, 2024

VSadov commented Apr 22, 2024 • edited Loading

kouvel commented Apr 22, 2024

VSadov commented Apr 22, 2024

VSadov left a comment

Choose a reason for hiding this comment

kouvel commented Apr 22, 2024

kouvel commented Apr 22, 2024 • edited Loading

jkotas commented Apr 22, 2024

kouvel commented Apr 22, 2024

kouvel commented Apr 22, 2024

VSadov commented Apr 22, 2024 • edited Loading

VSadov commented Apr 22, 2024

kouvel commented Apr 23, 2024

VSadov commented Apr 19, 2024 •

edited

Loading

VSadov commented Apr 22, 2024 •

edited

Loading

kouvel commented Apr 22, 2024 •

edited

Loading

VSadov commented Apr 22, 2024 •

edited

Loading