Rare race condition in EventSource dispose/finalizer #55441

Open · josalem opened this issue Jul 10, 2021 · 7 comments
josalem (Contributor) commented Jul 10, 2021

There is a rare race that can result in a use-after-free on the native end of EventPipeEventProvider and potentially the EtwEventProvider.

#region IDisposable Members
/// <summary>
/// Disposes of an EventSource.
/// </summary>
public void Dispose()
{
    this.Dispose(true);
    GC.SuppressFinalize(this);
}

/// <summary>
/// Disposes of an EventSource.
/// </summary>
/// <remarks>
/// Called from Dispose() with disposing=true, and from the finalizer (~EventSource) with disposing=false.
/// Guidelines:
/// 1. We may be called more than once: do nothing after the first call.
/// 2. Avoid throwing exceptions if disposing is false, i.e. if we're being finalized.
/// </remarks>
/// <param name="disposing">True if called from Dispose(), false if called from the finalizer.</param>
protected virtual void Dispose(bool disposing)
{
    if (!IsSupported)
    {
        return;
    }

    if (disposing)
    {
#if FEATURE_MANAGED_ETW
        // Send the manifest one more time to ensure circular buffers have a chance to get to this information
        // even in scenarios with a high volume of ETW events.
        if (m_eventSourceEnabled)
        {
            try
            {
                SendManifest(m_rawManifest);
            }
            catch { } // If it fails, simply give up.
            m_eventSourceEnabled = false;
        }
        if (m_etwProvider != null)
        {
            m_etwProvider.Dispose();
            m_etwProvider = null!;
        }
#endif
#if FEATURE_PERFTRACING
        if (m_eventPipeProvider != null)
        {
            m_eventPipeProvider.Dispose();
            m_eventPipeProvider = null!;
        }
#endif
    }
    m_eventSourceEnabled = false;
    m_eventSourceDisposed = true;
}

/// <summary>
/// Finalizer for EventSource
/// </summary>
~EventSource()
{
    this.Dispose(false);
}
#endregion

If one thread (A) is calling EventSource.Dispose() while another thread (B) is in the middle of writing an event, the following interleaving can occur (the numbers give the global order of operations):

A: (1) Dispose -> (3) m_eventSourceEnabled = false -> (4) m_eventPipeProvider.Dispose() -> (6) m_eventPipeProvider = null
B: (2) if (IsEnabled) -> (5) use m_eventPipeProvider

EventPipeEventProvider.Dispose() calls EventPipeEventProvider.EventUnregister(), which deletes the underlying native structures (the only time EventPipeProvider::m_pEventList is set to nullptr). The managed code does not unset the m_provHandle member, so anyone who still holds a reference to the managed object now holds a handle to freed memory. The managed provider has been marked as disabled, but not all code paths check that value. Specifically:

// Define an EventPipeEvent handle.
unsafe IntPtr IEventProvider.DefineEventHandle(uint eventID, string eventName, long keywords, uint eventVersion, uint level,
    byte* pMetadata, uint metadataLength)
{
    IntPtr eventHandlePtr = EventPipeInternal.DefineEvent(m_provHandle, eventID, keywords, eventVersion, level, pMetadata, metadataLength);
    return eventHandlePtr;
}

This DefineEventHandle implementation is where we AV in this case. We get here from TraceLoggingEventSource.WriteImpl(), which ends up in NameInfo.GetOrCreateEventHandle(), which calls DefineEventHandle on the provider:

public IntPtr GetOrCreateEventHandle(EventProvider provider, TraceLoggingEventHandleTable eventHandleTable, EventDescriptor descriptor, TraceLoggingEventTypes eventTypes)
{
    IntPtr eventHandle;
    if ((eventHandle = eventHandleTable[descriptor.EventId]) == IntPtr.Zero)
    {
        lock (eventHandleTable)
        {
            if ((eventHandle = eventHandleTable[descriptor.EventId]) == IntPtr.Zero)
            {
                byte[]? metadata = EventPipeMetadataGenerator.Instance.GenerateEventMetadata(
                    descriptor.EventId,
                    name,
                    (EventKeywords)descriptor.Keywords,
                    (EventLevel)descriptor.Level,
                    descriptor.Version,
                    (EventOpcode)descriptor.Opcode,
                    eventTypes);
                uint metadataLength = (metadata != null) ? (uint)metadata.Length : 0;

                unsafe
                {
                    fixed (byte* pMetadataBlob = metadata)
                    {
                        // Define the event.
                        eventHandle = provider.m_eventProvider.DefineEventHandle(
                            (uint)descriptor.EventId,
                            name,
                            descriptor.Keywords,
                            descriptor.Version,
                            descriptor.Level,
                            pMetadataBlob,
                            metadataLength);
                    }
                }

                // Cache the event handle.
                eventHandleTable.SetEventHandle(descriptor.EventId, eventHandle);
            }
        }
    }
    return eventHandle;
}
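
For illustration only (this is not a fix proposed in this thread): if dispose also zeroed m_provHandle, DefineEventHandle could at least bail out instead of handing a dangling handle to native code. Note that a check like this only narrows the window rather than closing it, since the provider can still be unregistered between the check and the call; a sketch under that assumption:

// Sketch only. Assumes Dispose()/EventUnregister() is changed to do
// Volatile.Write(ref m_provHandle, IntPtr.Zero), which the current code does not do.
unsafe IntPtr IEventProvider.DefineEventHandle(uint eventID, string eventName, long keywords,
    uint eventVersion, uint level, byte* pMetadata, uint metadataLength)
{
    IntPtr handle = Volatile.Read(ref m_provHandle);
    if (handle == IntPtr.Zero)
    {
        // Provider already unregistered; don't touch freed native state.
        return IntPtr.Zero;
    }

    // Still racy: the provider can be unregistered between the read above and this call,
    // so this is mitigation, not a complete fix.
    return EventPipeInternal.DefineEvent(handle, eventID, keywords, eventVersion, level, pMetadata, metadataLength);
}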

I'm still pinning down the exact sequencing of events in the hopes I can create a deterministic repro.
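
As an illustration of the kind of stress loop that might eventually hit the window, here is a minimal sketch (the RaceEventSource and EnableAllListener names are invented for this example, and because the race is timing-dependent this is not a deterministic repro):

using System.Diagnostics.Tracing;
using System.Threading;
using System.Threading.Tasks;

// Invented for this sketch; any TraceLogging-style EventSource would do.
[EventSource(Name = "Demo-DisposeRace")]
sealed class RaceEventSource : EventSource { }

// Invented for this sketch; enables the source so IsEnabled() returns true and
// Write() actually reaches NameInfo.GetOrCreateEventHandle()/DefineEventHandle().
sealed class EnableAllListener : EventListener
{
    protected override void OnEventSourceCreated(EventSource eventSource)
    {
        if (eventSource.Name == "Demo-DisposeRace")
            EnableEvents(eventSource, EventLevel.Verbose, EventKeywords.All);
    }
}

static class Program
{
    static void Main()
    {
        using var listener = new EnableAllListener();
        var source = new RaceEventSource();

        // Thread B: keep defining new self-describing events so each Write() goes
        // through DefineEventHandle on the provider.
        Task.Run(() =>
        {
            for (int i = 0; ; i++)
            {
                if (source.IsEnabled())
                    source.Write("Tick" + i, new { Value = i });
            }
        });

        // Thread A: dispose while B is still writing; this frees the native provider.
        Thread.Sleep(50);
        source.Dispose();

        // In the bad interleaving, the writer now AVs inside DefineEventHandle.
        Thread.Sleep(1000);
    }
}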

I believe this is the cause of the failures in dotnet/coreclr#28179 and #55240.

CC @tommcdon @noahfalk @dotnet/dotnet-diag

josalem (Contributor, Author) commented Jul 28, 2021

Migrating the conversation from #55862 to here:

@noahfalk:

@josalem and I chatted earlier, but just to record it - most BCL types don't provide a thread-safety guarantee for Dispose() and I wonder if it is necessary here. The synchronization necessary to provide that guarantee comes with perf costs I'm hopeful we can avoid.

  • If the runtime will automatically call Dispose() on a thread of our choosing at shutdown then we do need some thread-safety guarantee, but does the runtime need to be calling Dispose() at all? I'm not sure what benefit it gives us, but research may reveal that it is necessary.
  • Since the runtime automatically calls Write for EventCounters we would need to modify Dispose() to ensure that we stopped the EventCounter Write() calls prior to freeing the provider.

@brianrob:

I agree. I was trying to figure out what I was concerned about with regard to this change and I couldn't quite verbalize it. I think you hit it on the head @noahfalk.

@noahfalk:

I was looking at a related bug and wound up doing some research on why we call EventSource.Dispose() at shutdown. So far I see two reasons:

  • Avoiding native->managed callbacks after we've made it illegal to call back into managed code here
  • Logging ETW manifests at the end of the trace here

With a little refactoring we could create a new internal API like EventSource.HandleShutdown() that handles only these concerns while still leaving us in a state where concurrent calls to Write don't generate failures (whether any data is logged is optional). Then we would call this new API and not Dispose() in the shutdown path. Dispose() would continue to do what it does now including freeing memory, but unlike HandleShutdown() the onus could be on the caller to ensure they don't make any concurrent calls to other EventSource APIs during/after the call to Dispose().

In the case of EventPipe, HandleShutdown() could disable the callbacks without deleting the provider. In the case of ETW, past precedent suggests concurrent calls to EventUnregister and EventWrite are already safe (we've presumably been doing it for many years with no reported issues).
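
A rough sketch of what that split could look like (hypothetical shape only; DisableCallbacks() is an invented provider hook, not an existing API):

// Hypothetical internal API per the proposal above: do only the shutdown work
// (flush the manifest, stop native->managed callbacks) and deliberately do NOT
// free the native provider, so racing Write() calls cannot touch freed memory.
internal void HandleShutdown()
{
    if (!IsSupported)
        return;

    if (m_eventSourceEnabled)
    {
        try
        {
            // Same manifest flush Dispose() performs today, so circular-buffer
            // sessions still get the manifest at the end of the trace.
            SendManifest(m_rawManifest);
        }
        catch { } // Best effort at shutdown.
        m_eventSourceEnabled = false;
    }

    // Invented hooks: detach the enable/disable callbacks but leave the native
    // provider registered; freeing it remains Dispose()'s job, and the onus is
    // then on Dispose()'s caller to avoid concurrent EventSource calls.
    m_etwProvider?.DisableCallbacks();
    m_eventPipeProvider?.DisableCallbacks();
}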

@AaronRobinsonMSFT @vitek-karas @elinor-fung - The comment in the source references throwing a COMPLUS_BOOT_EXCEPTION and I find no reference to it any longer. I do recall that we used to block native->managed calls at some point during AppDomain shutdown, but do you know if that constraint is a relic of the desktop runtime that is no longer an issue on CoreCLR?

jkotas (Member) commented Aug 15, 2021

The comment in the source references throwing a COMPLUS_BOOT_EXCEPTION and I find no reference to it any longer. I do recall that we used to block native->managed calls at some point during AppDomain shutdown, but do you know if that constraint is a relic of the desktop runtime that is no longer an issue on CoreCLR?

Correct, COMPLUS_BOOT_EXCEPTION does not exist in CoreCLR. We do not try to prevent threads from running managed code as the process is shutting down.

josalem (Contributor, Author) commented Aug 17, 2021

Having investigated this a little bit, I think there are a couple of things we could do to solve it:

  • (1) In EventSource.Dispose, unregister and disable any associated counters before cleaning up resources.
    • This is problematic since it means we need to make EventSource.Dispose wait until the counters are actually disabled. They get copied while under the counter lock, so we'd need to wait until those copies are done writing to make sure the race doesn't happen.
    • This is really only needed if a user manually disposes an EventSource before the end of the process, i.e., doesn't use the public static MySource Log = new MySource(...); pattern.
  • (2) In AppContext.OnProcessExit, disable all counters and stop the PollForValues thread before the logic that disposes all EventSources.
    • This will prevent the problem from occurring, but only if the public static MySource Log = new MySource(...); pattern is followed.

Neither of the above solutions actually prevents a user from writing their own code that attempts to write events while an event source is being disabled.

It's worth noting that this has only been observed on arm32, most likely because the memory-ordering constraints there are weaker than on other architectures.

As a result, I'm leaning towards option 2.
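
A sketch of the ordering option 2 implies (hypothetical shape; StopPollingForShutdown and DisableCountersFor are invented names, while EventSource.GetSources() and EventListener.DisposeOnShutdown() appear elsewhere in this thread):

// Hypothetical shutdown ordering for option 2: quiesce the counter writers first,
// then let the existing dispose-all path free the native providers.
private static void OnProcessExit()
{
    // 1. Invented hook: stop the counter PollForValues thread so the runtime itself
    //    stops issuing EventCounter writes.
    CounterGroup.StopPollingForShutdown();

    // 2. Invented hook: disable counters on every registered source so any pending
    //    timer callbacks become no-ops.
    foreach (EventSource source in EventSource.GetSources())
        DisableCountersFor(source);

    // 3. Only now run the existing path that disposes every EventSource.
    EventListener.DisposeOnShutdown();
}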

josalem (Contributor, Author) commented Apr 18, 2022

Moving to future as there isn't a strong signal that this is a pressing issue. If we receive more reports of this specific failure we can pull it back on the backlog.

CC @tommcdon

SergeiPavlov (Contributor) commented Mar 9, 2023

We have daily crashes during process shutdown that are probably related to this issue.
The prometheus-net.DotNetRuntime package listens for events.

Platform: .NET 6, Debian
Stack trace:

> clrstack -n
OS Thread Id: 0x1d (17)
        Child SP               IP Call Site
00007EF5457F8900 00007f0a2b30fb07 [FaultingExceptionFrame: 00007ef5457f8900] 
00007EF5457F9700 00007F09BB0A7234 System.Diagnostics.Tracing.EventPipePayloadDecoder.DecodePayload(EventMetadata ByRef, System.ReadOnlySpan`1<Byte>)
00007EF5457F97C0 00007F09BB0A6AD6 System.Diagnostics.Tracing.NativeRuntimeEventSource.ProcessEvent(UInt32, UInt32, System.DateTime, System.Guid, System.Guid, System.ReadOnlySpan`1<Byte>)
00007EF5457F9820 00007F09B3402E26 System.Diagnostics.Tracing.EventPipeEventDispatcher.DispatchEventsToEventListeners()
00007EF5457F98E0 00007F09B2C343A7 System.Threading.Tasks.Task.InnerInvoke()
00007EF5457F9920 00007F09B2C0EA0C System.Threading.Tasks.Task+<>c.<.cctor>b__272_0(System.Object)
00007EF5457F9950 00007F09B2C0E47A System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
00007EF5457F99D0 00007F09B2C0DE0D System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.Task ByRef, System.Threading.Thread)
00007EF5457F9AA0 00007F09B2C0D63C System.Threading.Tasks.Task.ExecuteEntryUnsafe(System.Threading.Thread)
00007EF5457F9AD0 00007F09B2C0D58C System.Threading.Tasks.ThreadPoolTaskScheduler+<>c.<.cctor>b__10_0(System.Object)
00007EF5457F9B00 00007F09B2C0D3B2 System.Threading.Thread+StartHelper.RunWorker()
00007EF5457F9B50 00007F09B2C0D25E System.Threading.Thread+StartHelper.Run()
00007EF5457F9B80 00007F09B2C0CD65 System.Threading.Thread.StartCallback()
00007EF5457F9CF0 00007f0a2adc5e47 [DebuggerU2MCatchHandlerFrame: 00007ef5457f9cf0] 
<END_COMMAND_OUTPUT>
> pe -nested -lines
Exception object: 00007ef59c0011b8
Exception type:   System.ExecutionEngineException

To mitigate it we tried to dispose all EventSource.GetSources() manually, but then we got an NRE:

00007F4959D20C10 00007f495db45b07 [FaultingExceptionFrame: 00007f4959d20c10] 
00007F4959D21A10 00007F48F48C22F8 System.Diagnostics.Tracing.EventSource.Dispose(Boolean)
00007F4959D21A40 00007F48F48C11DE System.Diagnostics.Tracing.EventSource.Dispose()
00007F4959D21A70 00007F48F48C2930 System.Diagnostics.Tracing.EventListener.DisposeOnShutdown()
00007F4959D21AF0 00007F48F48C1248 System.AppContext.OnProcessExit()
<END_COMMAND_OUTPUT>
> pe -nested -lines
Exception object: 00007f43f9a25000
Exception type:   System.NullReferenceException
Message:          Object reference not set to an instance of an object.
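
For reference, the manual disposal presumably looked roughly like the following (our reconstruction from the description above, not their actual code); as the stack shows, it inherently races with the runtime's own EventListener.DisposeOnShutdown():

// Reconstruction of the mitigation described above: dispose every registered
// EventSource by hand before/while the runtime's shutdown path runs.
foreach (var source in System.Diagnostics.Tracing.EventSource.GetSources())
{
    try
    {
        source.Dispose();
    }
    catch
    {
        // Best effort during shutdown.
    }
}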

We hope this bug report encourages further investigation and fixing of the issue.

davmason (Member) commented:
Hi @SergeiPavlov, thanks for reaching out. Are you able to share a core dump of the crash? We have ways to share them privately if that is a concern

SergeiPavlov (Contributor) commented:
@davmason:

Hi @SergeiPavlov, thanks for reaching out. Are you able to share a core dump of the crash? We have ways to share them privately if that is a concern.

Unfortunately I cannot.
If we are able to reproduce it in a small test app, I will provide a crash dump.

BTW, the mentioned NRE is probably a race condition between our call to EventSource.Dispose() and the DisposeOnShutdown() called by OnProcessExit(), so for now we just clear the global EventSource list via reflection.

And we still have a similar crash at the stage where prometheus-net.DotNetRuntime begins to subscribe to events while the app is in the process of shutting down.
