
OutOfMemoryException at Monitor.ReliableEnterTimeout with plenty free memory available #49215

Open
baal2000 opened this issue Mar 5, 2021 · 43 comments


@baal2000

baal2000 commented Mar 5, 2021

Description

A process running under .NET Core 3.1 threw an OutOfMemoryException (OOM), then stopped itself by invoking FailFast.

There are 2 threads:

  1. The first thread catches the OOM from Monitor.ReliableEnterTimeout and places that error into an Application.Logging.Logger instance queue for processing. Its call stack:
   at System.Threading.Monitor.ReliableEnterTimeout(Object obj, Int32 timeout, Boolean& lockTaken)
   at Application.Profiling.SessionTimer.Session.DurationInstance.TryGetExpiredActionDurationInstance(ActionDurationInstance& instance)
   at Application.Profiling.SessionTimer.Session.ProcessExpiredDurationInstances()
   at Application.Profiling.SessionTimer.Session.RunFireElapsedEvent(Object state)
  2. The other thread reads the queued exception from the queue and calls FailFast with the exception information, because an OOM is assumed to be a non-recoverable, fatal state for the application:
   at System.Environment.FailFast(System.String)
   at Application.Logging.Logger.ExceptionLoggerAsync(System.Exception)
   at Application.Logging.LogProcessor`1+<<RunAsync>b__15_0>d
   at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1
   at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(System.Runtime.CompilerServices.IAsyncStateMachineBox, Boolean)
   at System.Threading.Tasks.Task.RunContinuations(System.Object)
   at System.Threading.Tasks.Task.TrySetResult()
   at System.Threading.Tasks.Task+DelayPromise.CompleteTimedOut()

The OOM came out of the blue, as the software was running on a system with 200 GB of RAM still available.

The memory graph at the moment of the OOM:
[image]

To monitor system performance we run a "continuous" PerfView collection on the box at all times:

C:\PerfView\PerfView.exe collect -CollectMultiple=1000000 -MaxCollectSec=10 -AcceptEULA -NoView -NoGui -CircularMB=1024 -BufferSize=1024 -CPUSampleMSec:10 -ClrEvents=JITSymbols+GC+GCHeapSurvivalAndMovement+Stack -KernelEvents=process+thread+ImageLoad+Profile+ThreadTime -DotNetAllocSampled -NoNGenRundown -NoV2Rundown -LogFile:log.txt

The PerfView collection closest to the moment of the crash showed about 1% of CPU activity in Monitor.Enter/TryEnter, most likely appearing as MonReliableEnter_Portable:

[image]

Further down the stack, SyncBlockCache::GetNextFreeSyncBlock shows up (3.1 SyncBlock.cpp, 5.0 SyncBlock.cpp):

[image]

There are a few places an OOM could be thrown there, for instance in SyncBlockCache::GetNextFreeSyncBlock:

            SyncBlockArray* newsyncblocks = new(SyncBlockArray);
            if (!newsyncblocks)
                COMPlusThrowOM ();

Configuration

  • Windows Server 2016
  • .NET Core 3.1.12
  • CoreCLR Version: 4.700.21.6504
  • 64 vCPUs VM in GCE
  • 600 GB RAM
  • Auto-dump disabled to speed up server restart after a crash (considering the time it takes to collect a 400 GB process dump)

Questions

  • Where else could an OOM be triggered on this particular call path, if not in syncblock.cpp?
  • Under what conditions can an OOM happen on a system with that much free RAM available?
  • Could this somehow be related to the concurrent PerfView activity?
@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged label Mar 5, 2021
@dotnet-issue-labeler

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@baal2000 baal2000 changed the title OutOfMemoryException at Monitor.ReliableEnterTimeout with 200GB of 600 GB free memory available OutOfMemoryException at Monitor.ReliableEnterTimeout with plenty free memory available Mar 5, 2021
@ghost

ghost commented Mar 5, 2021

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.


@baal2000
Author

baal2000 commented Apr 1, 2021

Just pinging back: could someone review the symptoms and whether this is a known issue (or non-issue)?

@mangod9
Member

mangod9 commented Apr 1, 2021

Sorry to have missed this. Is this consistently reproducible? If so, could you please share a dump and/or a repro?

@mangod9 mangod9 removed the untriaged label Apr 1, 2021
@mangod9 mangod9 added this to the Future milestone Apr 1, 2021
@baal2000
Author

baal2000 commented Apr 1, 2021

@mangod9 No dump, and it does not happen often. This is probably the only instance where we had some data collected.
Could you please confirm whether the OOM comes from the suspected area in syncblock.cpp, or whether there is something else that may cause Monitor to fail like this?

@mangod9
Member

mangod9 commented Apr 1, 2021

It's unlikely that it's happening within the monitor code itself. Possibly some allocation failure on a different thread is triggering FailFast, leading to the failure. It might be worthwhile to check whether it happens on a specific machine or occurs on multiple physical machines.

@baal2000
Author

baal2000 commented Apr 1, 2021

This is the FailFast thread stack: the application invokes FailFast with the exception information.

System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
at System.Threading.Monitor.ReliableEnterTimeout(Object obj, Int32 timeout, Boolean& lockTaken)

@mangod9
Member

mangod9 commented Apr 1, 2021

  • cc @kouvel in case this is something specific to Monitor, but I suspect it's thrown on a different thread. A dump would be helpful in diagnosing further.

@baal2000
Author

baal2000 commented Apr 1, 2021

There are 2 threads:

  1. The first thread catches the OOM exception and places that error into an Application.Logging.Logger instance queue for processing. The call stack there is the one mentioned in the issue description:
   at System.Threading.Monitor.ReliableEnterTimeout(Object obj, Int32 timeout, Boolean& lockTaken)
   at Application.Profiling.SessionTimer.Session.DurationInstance.TryGetExpiredActionDurationInstance(ActionDurationInstance& instance)
   at Application.Profiling.SessionTimer.Session.ProcessExpiredDurationInstances()
   at Application.Profiling.SessionTimer.Session.RunFireElapsedEvent(Object state)
  2. The other thread reads the queued exception from the queue and calls FailFast with the exception information, because an OutOfMemoryException is assumed to be a non-recoverable, fatal state for the application:
   at System.Environment.FailFast(System.String)
   at Application.Logging.Logger.ExceptionLoggerAsync(System.Exception)
   at Application.Logging.LogProcessor`1+<<RunAsync>b__15_0>d
   at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1
   at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(System.Runtime.CompilerServices.IAsyncStateMachineBox, Boolean)
   at System.Threading.Tasks.Task.RunContinuations(System.Object)
   at System.Threading.Tasks.Task.TrySetResult()
   at System.Threading.Tasks.Task+DelayPromise.CompleteTimedOut()

@baal2000
Author

baal2000 commented Apr 1, 2021

Updated the description to clarify where FailFast comes into the picture.

@DwayneGeo

We have another server throwing the same OOM exceptions on a more frequent basis, where again the server has plenty of available memory to use. We are also collecting mini dumps. I cannot post the mini dumps here, but I could, within reason, pull more information from them to share here. I found this exception within a few mini dumps, similar to the original case that @baal2000 posted.

Here is the stack from our mini dump, whereas the other was from a PerfView trace. Both are within the same class, System.Threading.Monitor.

If someone would like more information from the mini dumps I can try to provide it here.

00007EC1B7FFD1E0 00007ff53812ccb9 [HelperMethodFrame_1OBJ: 00007ec1b7ffd1e0] System.Threading.Monitor.ReliableEnter(System.Object, Boolean ByRef)
00007EC1B7FFD330 00007FF4C5011D95 System.Threading.Tasks.Task.AddException(System.Object, Boolean) [/_/src/System.Private.CoreLib/shared/System/Threading/Tasks/Task.cs @ 1807]
00007EC1B7FFD380 00007FF4BDD34DD3 System.Threading.Tasks.Task.HandleException(System.Exception) [/_/src/System.Private.CoreLib/shared/System/Threading/Tasks/Task.cs @ 2476]
00007EC1B7FFD3B0 00007FF4C25B9CE4 System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.Task ByRef, System.Threading.Thread) [/_/src/System.Private.CoreLib/shared/System

Thanks

@mangod9
Member

mangod9 commented Jun 22, 2021

Couple of questions:

  • Is there any native code on the stack when the exception occurs (!dumpstack should show that in windbg)?
  • Are multiple threads waiting on the monitor, or just that one thread which throws OOM?

Thanks!

@baal2000
Author

@mangod9 We could share a production system dump file with the MS team if there is a secure way to do that and it makes it easier to move the case forward.

@mangod9
Member

mangod9 commented Jun 23, 2021

Hi @baal2000, yeah, you could send a pointer to my MS email (it's listed in my profile). Thanks!

@baal2000
Author

baal2000 commented Jul 16, 2021

Another instance of this (under Windows): a 300 GB RAM system with more than 100 GB of free system RAM available at the time of the crash.

CoreCLR Version: 5.0.721.25508 .NET Version: 5.0.7 
Exception of type 'System.OutOfMemoryException' was thrown. 
at System.Threading.Monitor.ReliableEnter(Object obj, Boolean& lockTaken) 
at System.ComponentModel.Component.Dispose(Boolean disposing) 
at Database.SqlDatabase.DisposeCommandCoreAsync(Boolean useAsync, DbCommand command)
...

@jkotas
Member

jkotas commented Jul 16, 2021

Could you please check how many OS handles the process has open when it crashes? The most likely reason for an OOM in this situation is a failure to create a new OS event handle, which can happen when the process has too many handles open.

@baal2000
Author

What is the best way to do that under Linux? Internally (Process.HandleCount after catching the OOM), or is there an external tool to trace the count continuously?

@jkotas
Member

jkotas commented Jul 17, 2021

ls /proc/PID/fd/ | wc -l is how to find the number of open handles from the command line. Process.HandleCount works too (it does the same thing).
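
For example, a minimal in-process sketch of recording it periodically (illustrative only; the helper name is made up):

using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

static class HandleCountMonitor
{
    // Logs the current process handle count every 30 seconds until cancelled.
    public static async Task RunAsync(CancellationToken ct)
    {
        while (!ct.IsCancellationRequested)
        {
            using (Process process = Process.GetCurrentProcess())
            {
                Console.WriteLine($"{DateTime.UtcNow:O} HandleCount={process.HandleCount}");
            }

            // Cancellation surfaces as an OperationCanceledException from Task.Delay.
            await Task.Delay(TimeSpan.FromSeconds(30), ct);
        }
    }
}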

@baal2000
Author

We'll record this value during future crash events and comment here if there is new data.

@mangod9
Member

mangod9 commented Jul 22, 2021

Yup sounds good

@baal2000
Author

baal2000 commented Aug 13, 2021

@jkotas

Update

We've set up handle count monitoring across all servers and have had more OOMs in Monitor methods since. Examples:

System.OutOfMemoryException
Message: Exception of type 'System.OutOfMemoryException' was thrown.
StackTrace: 
  at System.Threading.Monitor.ReliableEnterTimeout(Object obj, Int32 timeout, Boolean& lockTaken)
System.OutOfMemoryException
  at System.Threading.Monitor.ObjWait(Boolean exitContext, Int32 millisecondsTimeout, Object obj)
  at System.Threading.ManualResetEventSlim.Wait(Int32 millisecondsTimeout, CancellationToken cancellationToken)
  at System.Threading.Tasks.Task.SpinThenBlockingWait(Int32 millisecondsTimeout, CancellationToken cancellationToken)
  at System.Threading.Tasks.Task.InternalWaitCore(Int32 millisecondsTimeout, CancellationToken cancellationToken)
  at System.Runtime.TaskHelpers.WaitForCompletion(Task task)
  at System.ServiceModel.Channels.CommunicationObject.Open(TimeSpan timeout)
  at System.ServiceModel.Channels.CommunicationObject.Open()
  at System.ServiceModel.ChannelFactory.EnsureOpened()
  at System.ServiceModel.ChannelFactory`1.CreateChannel(EndpointAddress address, Uri via)
  at System.ServiceModel.ChannelFactory`1.CreateChannel()
  at System.ServiceModel.ClientBase`1.CreateChannel()
  at System.ServiceModel.ClientBase`1.CreateChannelInternal()
  at System.ServiceModel.ClientBase`1.get_Channel()

The process handle count looks normal before and after the crash:
[image]

This is speculative, but looking at the memory pattern on the server there is a series of heap size drops that look like gen 2 GC events, with the last one coinciding with the OOM:
[image]

The vertical axis scale is in GB.
There is plenty of available RAM at the moment of the crash.

What could be the next step?

@jkotas
Member

jkotas commented Aug 13, 2021

What could be the next step?

Collect dump (https://docs.microsoft.com/en-us/dotnet/core/diagnostics/dumps#collect-dumps-on-crash) and try to figure out what happened from the dump. We can help you with the latter part.

@baal2000
Author

We do have a minidump.

@jkotas
Member

jkotas commented Aug 14, 2021

Since you are collecting the trace, could you please find the number of active syncblocks from the GCHeapStats_V2 event? The system is limited to at most 67M syncblocks. In theory, it is possible to reach this limit with a 300 GB heap, and reaching it would also lead to an OOM exception. Looking at the number of active syncblocks will give us a clue whether that may be the case.

Just to clarify - is the minidump from Windows or Linux?

A minidump is unlikely to have enough clues to figure this out. You can pretty much only look at the stack in the minidump and try to guess where the OOM happened from the left-over data on the stack.

Here is where I would start looking for clues (with a full dump):

  • Look at SyncBlockCache::s_pSyncBlockCache. Does it have any free blocks (is either m_FreeSyncBlock or m_FreeCount non-zero)? That will tell us whether it is likely that the OOM is caused by attempting to allocate a new memory block for syncblocks.
  • Find the address of the object that caused the OOM. Does the object have a syncblock allocated (look at the 4 bytes before the object address; the BIT_SBLK_* constants describe the bits)?
  • Check both managed and unmanaged heaps for corruptions (!verifyheap and !heap -s -v)

@baal2000
Author

@jkotas

The minidumps are from Linux.

Can anything be done without having a full dump, or is there some middle-of-the-road type of dump we could take that would be more useful?

To put this into perspective, we are operating in a production environment where taking a full crash dump is not feasible due to the time it takes on an application with a large memory footprint.

Thank you.

@jkotas
Member

jkotas commented Aug 17, 2021

There is no middle-of-the-road dump.

Can anything be done without having a full dump?

  • Find the number of active syncblocks from the GCHeapStats_V2 event
  • You can try to look at the unwound portion of the stack in the minidump, but it will be real detective work
  • You can try to build your own libcoreclr.so that will have extra logging on the code path that leads to this OOM exception. If you are not able to take the full dump in production, I think that this is really the only path forward.

@baal2000
Author

@jkotas
GCHeapStats_V2 event option

  • If there is a low-overhead option of running perfcollect or dotnet-trace continuously with a filter set up to observe only this event, then please share it in a comment.

  • We do have a separate in-house monitoring tool that we could modify to record the values (like we did with the handle count). Is there a way to observe such an event continuously without having to run dotnet-trace or perfcollect until the app crashes?

Custom libcoreclr.so option
We could do that with some guidance.

@jkotas
Member

jkotas commented Aug 17, 2021

If there is a low-overhead option of running perfcollect or dotnet-trace continuously with a filter set up to observe only this event, then please share it in a comment.

You can write your own event listener. Here is an example of how to do that: https://devblogs.microsoft.com/dotnet/a-portable-way-to-get-gc-events-in-process-and-no-admin-privilege-with-10-lines-of-code-and-ability-to-dynamically-enable-disable-events/
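
For reference, a rough sketch of such a listener (the payload field name SinkBlockCount is assumed here, matching the GCHeapStats_V2 field mentioned later in this thread):

using System;
using System.Diagnostics.Tracing;

// Rough sketch: subscribes in-process to the runtime's GC events and prints the
// sync block count carried by GCHeapStats_V2, which fires at the end of each GC.
sealed class SyncBlockCountListener : EventListener
{
    protected override void OnEventSourceCreated(EventSource eventSource)
    {
        if (eventSource.Name == "Microsoft-Windows-DotNETRuntime")
        {
            // Keyword 0x1 enables the GC event group at Informational level.
            EnableEvents(eventSource, EventLevel.Informational, (EventKeywords)0x1);
        }
    }

    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        if (eventData.EventName != "GCHeapStats_V2" || eventData.PayloadNames == null)
            return;

        int index = eventData.PayloadNames.IndexOf("SinkBlockCount");
        if (index >= 0)
            Console.WriteLine($"{DateTime.UtcNow:O} SinkBlockCount={eventData.Payload[index]}");
    }
}

Keeping a single instance alive for the lifetime of the process (for example, in a static field) is enough to receive the events.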

@baal2000
Author

OK thanks
We'll try GCHeapStats_V2 first.

Let's say this proves that the number of active syncblocks is too high. What could lead the application to such a state? I'm asking because we also tried running load tests and could not get the app to crash. Maybe there is a definite scenario that leads to the OOM that we could start with and see whether it could apply to our environment in any way. The only common trait between all the crashes is that the system has a very large memory footprint on an even larger-footprint Windows or Linux server (only Linux these days).

Or, if there is interest on your side in using us as guinea pigs for a custom libcoreclr.so, we could start working through that option offline as well.

@jkotas
Member

jkotas commented Aug 17, 2021

Let's say this proves that the number of active syncblocks is too high. What could lead the application to such a state?

Lock taken on more than 65M objects. If you have 100s of GB of memory, it is certainly possible, at least in theory, to get more than 65M objects with syncblocks and run into this implementation limit. So I think it is a good idea to eliminate this possibility first.

@jkotas
Member

jkotas commented Aug 17, 2021

Or, if there is interest on your side in using us as guinea pigs for a custom libcoreclr.so, we could start working through that option offline as well.

Are you able to build libcoreclr.so from the release branch with no changes?

I would start with instrumenting the relevant paths in syncblock.cpp to try to pinpoint the one that is failing with the OOM.

EX_TRY
{
     SyncBlockArray* newsyncblocks = new(SyncBlockArray);
     if (!newsyncblocks)
           COMPlusThrowOM ();
}
EX_CATCH
{
     printf("new(SyncBlockArray) failed!"); // Or some other type of logging that can be captured by the rest of your system
     EX_RETHROW;
}
EX_END_CATCH(RethrowTerminalExceptions);

@baal2000
Author

Are you able to build libcoreclr.so from the release branch with no changes?
We haven't done that in a while.

I thought that the OOM call stack

System.OutOfMemoryException
  at System.Threading.Monitor.ObjWait(Boolean exitContext, Int32 millisecondsTimeout, Object obj)
  at System.Threading.ManualResetEventSlim.Wait(Int32 millisecondsTimeout, CancellationToken cancellationToken)

would tell exactly where it happened.

lock taken on more than 65M objects.
Under what scenario would an application that runs no more than a few thousand threads at any moment in time use a disproportionately higher number of synchronization blocks?

@jkotas
Member

jkotas commented Aug 17, 2021

The call stack says that the OOM happened somewhere inside the unmanaged runtime code. There are a number of potential places that can throw an OOM in the unmanaged runtime in this method. The first step is to identify the exact place where this OOM happens in the unmanaged runtime code.

Under what scenario would an application that runs no more than a few thousand threads at any moment in time use a disproportionately higher number of synchronization blocks?

Once the syncblock gets attached to the object, it stays attached to it. A high number of live objects with a syncblock attached (i.e. objects that ever had a lock taken) would lead to the problem. For example, try running this on the big machine with 100s of GB of memory:

using System.Threading;

var a = new object[70_000_000];
for (int i = 0; i < a.Length; i++)
{
    object o = new object();
    Monitor.Enter(o);
    Monitor.Wait(o, 0);
    a[i] = o;
}

It will fail with an OOM even though the program consumes less than 10 GB and there is a lot of memory available.

@baal2000
Author

Thanks for the example. It does produce an OOM, and I thought the information might help track our bug down.

Once the syncblock gets attached to the object, it stays attached to it.

If I add Monitor.Exit(o) after Monitor.Wait(o, 0):

var a = new object[70_000_000];
for (int i = 0; i < a.Length; i++)
{
    object o = new object();
    Monitor.Enter(o);
    Monitor.Wait(o, 0);
    Monitor.Exit(o);
    a[i] = o;
}

there is no OOM. Does Monitor.Exit(o) detach the syncblock from a live instance?

In our application, the most likely call stack where the OOM happens contains Monitor.TryEnter(o, 0) with Monitor.Exit in a finally block. It would be similar to:

var a = new object[70_000_000];
for (int i = 0; i < a.Length; i++)
{
    a[i] = new object();
}
            
for (int i = 0; i < a.Length; i++)
{
    object o = a[i];
    if (Monitor.TryEnter(o, 0))
    {
        try
        {
            // do something
        }
        finally
        {
            // When commented out no OOM either.
            Monitor.Exit(o);
        }
    }
    else
    {
        throw new InvalidOperationException();
    }
}

This does not OOM. Interestingly, there is no OOM even without the Monitor.Exit(o) call.

I am not sure how to proceed now.

@jkotas
Member

jkotas commented Aug 18, 2021

Does Monitor.Exit(o) detach the syncblock from a live instance?

Syncblocks that are not actively used for locks may get detached after the GC runs. (My earlier statement was an oversimplification.) The exact conditions under which a syncblock may get detached are quite complex. For example, the following program does Enter/Exit, but it will still hit the OOM:

using System.Threading;

var a = new object[70_000_000];
for (int i = 0; i < a.Length; i++)
{
    object o = new object();
    Monitor.Enter(o);
    o.GetHashCode();
    Monitor.Exit(o);
    a[i] = o;
}

I am not sure how to proceed now.

Find the number of active syncblocks in your app from the GCHeapStats_V2 event so that we can be sure we are chasing the right problem.

@baal2000
Author

Thanks for explaining.

"I am not sure how to proceed now" was related to the repro attempt. We are working on implementing GCHeapStats_V2.SinkBlockCount tracing.

@baal2000
Author

baal2000 commented Oct 5, 2021

@jkotas

Thank you for the guidance RE: syncblock count.

We've found that a higher-than-the-limit syncblock count was the cause of the OOM in the application. A possible scenario that could have led to such a state:

  • the application routinely allocates a large number of class instances with a Monitor-based synchronization strategy;
  • the instances are placed in a queue for later processing;
  • a slow queue-processing schedule leads to the class instances accumulating in the gen 2 heap, each (potentially) with a syncblock instance attached;
  • a large amount of free memory on the host system causes long periods with no gen 2 GC collections and, therefore, an insufficient rate of syncblock recycling;
  • the syncblock count hits the limit, which causes the OOM in Monitor.

Replacing Monitor locking with a simple Interlocked.CompareExchange-based spinlock has eliminated the syncblock count growth and completely removed the OOM risk. There is a potential performance cost to that, but it looks very minor at worst or not present at all at best. A rough sketch of the idea is below.
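
For illustration, a minimal sketch of such a spinlock (names are illustrative, not the actual application code):

using System.Threading;

// Keeps the lock state in a plain int field, so no syncblock is ever attached
// to the protected object. Not a general-purpose lock: no fairness, no recursion.
sealed class SimpleSpinLock
{
    private int _state; // 0 = free, 1 = held

    public void Enter()
    {
        // Atomically transition 0 -> 1; spin while another thread holds the lock.
        while (Interlocked.CompareExchange(ref _state, 1, 0) != 0)
        {
            Thread.SpinWait(16);
        }
    }

    public void Exit()
    {
        // Publish the release with a volatile write.
        Volatile.Write(ref _state, 0);
    }
}

Callers pair Enter and Exit in a try/finally block, just as with Monitor.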

Couple questions:

  • Could you confirm that the stock-standard SpinLock is safe from a syncblock-accumulation point of view?
  • Is there any documentation or a recommended blog that explains the nuts and bolts of the relationship between an object and its syncblock, and when developers should pay attention to it? I found some, but do not want to risk endorsing any. I think it is worth placing a bit of official information, or a link to an article, somewhere close to the Object.GetHashCode and Monitor class documentation.

@jkotas
Member

jkotas commented Oct 5, 2021

Yes, SpinLock does not consume syncblocks.

I agree that we need to make this better documented, easier to diagnose, and also look into implementing a first-class lock type (#34812) that does not consume syncblocks.
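
For completeness, the usual try/finally pattern for System.Threading.SpinLock (illustrative only):

using System.Threading;

sealed class Counter
{
    // SpinLock is a struct: keep it in a field and never copy it,
    // since each copy would be an independent lock.
    private SpinLock _lock = new SpinLock(enableThreadOwnerTracking: false);
    private long _value;

    public void Increment()
    {
        bool lockTaken = false;
        try
        {
            _lock.Enter(ref lockTaken);
            _value++;
        }
        finally
        {
            if (lockTaken)
                _lock.Exit();
        }
    }
}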

@ghost

ghost commented Oct 5, 2021

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.


@baal2000
Author

baal2000 commented Oct 6, 2021

@jkotas
Thanks for adding a custom message to the OOM exception. What would be the cause of this generic OOM, though?

        if ((indx = GetHeaderSyncBlockIndex()) == 0)
            COMPlusThrowOM();

@jkotas
Member

jkotas commented Oct 6, 2021

I think this COMPlusThrowOM() is unreachable. We can turn it into an assert.

@baal2000
Author

@jkotas Could this update be patched back to .NET 5 to save everyone time on developing custom solutions for telling one OOM case from another?

@jkotas
Member

jkotas commented Oct 18, 2021

I do not think I can get it to .NET 5 (it is going out of support soon). I will try to get it to .NET 6.

github-actions bot pushed a commit that referenced this issue Oct 18, 2021
Anipik pushed a commit that referenced this issue Oct 19, 2021
…#60592)

* Use custom error message when running out of syncblocks

Contributes to #49215

* Update src/coreclr/dlls/mscorrc/mscorrc.rc

update message

Co-authored-by: Jan Kotas <jkotas@microsoft.com>

Co-authored-by: Jan Kotas <jkotas@microsoft.com>
Co-authored-by: Manish Godse <61718172+mangod9@users.noreply.github.com>