
OutOfMemoryException at Monitor.ReliableEnterTimeout with plenty free memory available #49215

Open
baal2000 opened this issue Mar 5, 2021 · 43 comments


@baal2000

baal2000 commented Mar 5, 2021

Description

A process running under .NET Core 3.1 threw an OutOfMemoryException (OOM), then stopped itself by invoking FailFast.

There are 2 threads:

  1. The first thread catches the OOM from Monitor.ReliableEnterTimeout and places that error into an Application.Logging.Logger instance queue for processing. Its call stack:
   at System.Threading.Monitor.ReliableEnterTimeout(Object obj, Int32 timeout, Boolean& lockTaken)
   at Application.Profiling.SessionTimer.Session.DurationInstance.TryGetExpiredActionDurationInstance(ActionDurationInstance& instance)
   at Application.Profiling.SessionTimer.Session.ProcessExpiredDurationInstances()
   at Application.Profiling.SessionTimer.Session.RunFireElapsedEvent(Object state)
  2. The other thread reads the queued exception from the queue and calls FailFast with the exception information, because an OOM is assumed to be a non-recoverable, fatal state for the application:
   at System.Environment.FailFast(System.String)
   at Application.Logging.Logger.ExceptionLoggerAsync(System.Exception)
   at Application.Logging.LogProcessor`1+<<RunAsync>b__15_0>d
   at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1
   at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(System.Runtime.CompilerServices.IAsyncStateMachineBox, Boolean)
   at System.Threading.Tasks.Task.RunContinuations(System.Object)
   at System.Threading.Tasks.Task.TrySetResult()
   at System.Threading.Tasks.Task+DelayPromise.CompleteTimedOut()

The OOM came out of the blue, as the software was running on a system with 200 GB of RAM still available.

The memory graph at the moment of the OOM:
[image]

To monitor system performance we run a "continuous" PerfView collection on the box at all times:

C:\PerfView\PerfView.exe collect -CollectMultiple=1000000 -MaxCollectSec=10 -AcceptEULA -NoView -NoGui -CircularMB=1024 -BufferSize=1024 -CPUSampleMSec:10 -ClrEvents=JITSymbols+GC+GCHeapSurvivalAndMovement+Stack -KernelEvents=process+thread+ImageLoad+Profile+ThreadTime -DotNetAllocSampled -NoNGenRundown -NoV2Rundown -LogFile:log.txt

The PerfView collection closest to the moment of the crash showed about 1% of CPU activity in Monitor.Enter/TryEnter, most likely appearing as MonReliableEnter_Portable:

[image]

Further down the stack, SyncBlockCache::GetNextFreeSyncBlock shows up (3.1 SyncBlock.cpp, 5.0 SyncBlock.cpp):

[image]

There are a few places an OOM could be thrown there, for instance in SyncBlockCache::GetNextFreeSyncBlock:

            SyncBlockArray* newsyncblocks = new(SyncBlockArray);
            if (!newsyncblocks)
                COMPlusThrowOM ();

Configuration

  • Windows Server 2016
  • .NET Core 3.1.12
  • CoreCLR Version: 4.700.21.6504
  • 64 vCPUs VM in GCE
  • 600 GB RAM
  • Auto-dump disabled to speed up server restart after a crash (considering the time it takes to collect a 400 GB process dump)

Questions

  • Where else could an OOM be triggered on this particular call path, if not in syncblock.cpp?
  • Under what conditions can an OOM happen on a system with that much free RAM available?
  • Could this somehow be related to the concurrent PerfView activity?
@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged label Mar 5, 2021
@dotnet-issue-labeler

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@baal2000 baal2000 changed the title OutOfMemoryException at Monitor.ReliableEnterTimeout with 200GB of 600 GB free memory available OutOfMemoryException at Monitor.ReliableEnterTimeout with plenty free memory available Mar 5, 2021
@ghost

ghost commented Mar 5, 2021

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.


@baal2000
Author

baal2000 commented Apr 1, 2021

Just pinging back: could someone review the symptoms and whether this is a known issue (or non-issue)?

@mangod9
Member

mangod9 commented Apr 1, 2021

Sorry to have missed this. Is this consistently reproducible? If so, could you please share a dump and/or a repro?

@mangod9 mangod9 removed the untriaged label Apr 1, 2021
@mangod9 mangod9 added this to the Future milestone Apr 1, 2021
@baal2000
Author

baal2000 commented Apr 1, 2021

@mangod9 No dump, and it does not happen often. This is probably the only instance where we had some data collected.
Could you please confirm whether the OOM comes from the suspected area in syncblock.cpp, or whether there is something else that may cause Monitor to fail like this?

@mangod9
Member

mangod9 commented Apr 1, 2021

It's unlikely that it's happening within the monitor code itself. Possibly some allocation failure on a different thread is triggering FailFast, leading to the failure. It might be worthwhile to check whether it happens on a specific machine or occurs on multiple physical machines.

@baal2000
Author

baal2000 commented Apr 1, 2021

This is the FailFast thread stack: the application invokes FailFast with the exception information.

System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
at System.Threading.Monitor.ReliableEnterTimeout(Object obj, Int32 timeout, Boolean& lockTaken)

@mangod9
Member

mangod9 commented Apr 1, 2021

  • cc @kouvel in case this is something specific to Monitor, but I suspect it's thrown on a different thread. A dump would be helpful in diagnosing further.

@baal2000
Author

baal2000 commented Apr 1, 2021

There are 2 threads:

  1. The first thread catches the OOM exception and places that error into an Application.Logging.Logger instance queue for processing. The call stack there is the one mentioned in the issue description:
   at System.Threading.Monitor.ReliableEnterTimeout(Object obj, Int32 timeout, Boolean& lockTaken)
   at Application.Profiling.SessionTimer.Session.DurationInstance.TryGetExpiredActionDurationInstance(ActionDurationInstance& instance)
   at Application.Profiling.SessionTimer.Session.ProcessExpiredDurationInstances()
   at Application.Profiling.SessionTimer.Session.RunFireElapsedEvent(Object state)
  2. The other thread reads the queued exception from the queue and calls FailFast with the exception information, because an OutOfMemoryException is assumed to be a non-recoverable, fatal state for the application:
   at System.Environment.FailFast(System.String)
   at Application.Logging.Logger.ExceptionLoggerAsync(System.Exception)
   at Application.Logging.LogProcessor`1+<<RunAsync>b__15_0>d
   at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1
   at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(System.Runtime.CompilerServices.IAsyncStateMachineBox, Boolean)
   at System.Threading.Tasks.Task.RunContinuations(System.Object)
   at System.Threading.Tasks.Task.TrySetResult()
   at System.Threading.Tasks.Task+DelayPromise.CompleteTimedOut()

@baal2000
Author

baal2000 commented Apr 1, 2021

Updated the description to clarify where FailFast comes into the picture.

@DwayneGeo

We have another server throwing the same OOM exceptions on a more frequent basis, where again the server has plenty of available memory to use. We are also collecting mini dumps. I cannot post the mini dumps here, but I could, within reason, pull more information from them to share here. I found this exception within a few mini dumps, similar to the original case that @baal2000 posted.

Here is the stack from our mini dump, whereas the other was from a PerfView trace. Both are within the same class, System.Threading.Monitor.

If someone would like more information from the mini dumps I can try to provide it here.

00007EC1B7FFD1E0 00007ff53812ccb9 [HelperMethodFrame_1OBJ: 00007ec1b7ffd1e0] System.Threading.Monitor.ReliableEnter(System.Object, Boolean ByRef)
00007EC1B7FFD330 00007FF4C5011D95 System.Threading.Tasks.Task.AddException(System.Object, Boolean) [/_/src/System.Private.CoreLib/shared/System/Threading/Tasks/Task.cs @ 1807]
00007EC1B7FFD380 00007FF4BDD34DD3 System.Threading.Tasks.Task.HandleException(System.Exception) [/_/src/System.Private.CoreLib/shared/System/Threading/Tasks/Task.cs @ 2476]
00007EC1B7FFD3B0 00007FF4C25B9CE4 System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.Task ByRef, System.Threading.Thread) [/_/src/System.Private.CoreLib/shared/System

Thanks

@mangod9
Member

mangod9 commented Jun 22, 2021

Couple of questions:

  • Is there any native code on the stack when the exception occurs (!dumpstack should show that in windbg)?
  • Are multiple threads waiting on the monitor, or just that one thread which throws OOM?

Thanks!

@baal2000
Author

@mangod9 We could share a production system dump file with the MS team if there is a secure way to do that and it makes it easier to move the case forward.

@mangod9
Member

mangod9 commented Jun 23, 2021

Hi @baal2000, yeah, you could send a pointer to my MS email (it's listed in my profile). Thanks!

@baal2000
Author

baal2000 commented Jul 16, 2021

Another instance of this (under Windows): a 300 GB RAM system with more than 100 GB of free system RAM available at the time of the crash.

CoreCLR Version: 5.0.721.25508 .NET Version: 5.0.7 
Exception of type 'System.OutOfMemoryException' was thrown. 
at System.Threading.Monitor.ReliableEnter(Object obj, Boolean& lockTaken) 
at System.ComponentModel.Component.Dispose(Boolean disposing) 
at Database.SqlDatabase.DisposeCommandCoreAsync(Boolean useAsync, DbCommand command)
...

@jkotas
Member

jkotas commented Jul 16, 2021

Could you please check how many OS handles the process has open when it crashes? The most likely reason for an OOM in this situation is a failure to create a new OS event handle, which can happen when the process has too many handles open.

@baal2000
Author

What is the best way to do that under Linux? Internally (Process.HandleCount after catching the OOM), or is there an external tool to trace the count continuously?

@jkotas
Member

jkotas commented Jul 17, 2021

ls /proc/PID/fd/ | wc -l is how to find the number of open handles from the command line. Process.HandleCount works too (it does the same thing).
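
For example, a minimal in-process sketch of recording it periodically (illustrative only; the helper name is made up):

using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

static class HandleCountMonitor
{
    // Logs the current process handle count every 30 seconds until cancelled.
    public static async Task RunAsync(CancellationToken ct)
    {
        while (!ct.IsCancellationRequested)
        {
            using (Process process = Process.GetCurrentProcess())
            {
                Console.WriteLine($"{DateTime.UtcNow:O} HandleCount={process.HandleCount}");
            }

            // Cancellation surfaces as an OperationCanceledException from Task.Delay.
            await Task.Delay(TimeSpan.FromSeconds(30), ct);
        }
    }
}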

@baal2000
Author

We'll record this value during future crash events and comment here if there is new data.

@mangod9
Member

mangod9 commented Jul 22, 2021

Yup sounds good

@baal2000
Author

baal2000 commented Aug 13, 2021

@jkotas

Update

We've set up handle count monitoring across all servers and have had more OOMs in Monitor methods since. Examples:

System.OutOfMemoryException
Message: Exception of type 'System.OutOfMemoryException' was thrown.
StackTrace: 
  at System.Threading.Monitor.ReliableEnterTimeout(Object obj, Int32 timeout, Boolean& lockTaken)
System.OutOfMemoryException
  at System.Threading.Monitor.ObjWait(Boolean exitContext, Int32 millisecondsTimeout, Object obj)
  at System.Threading.ManualResetEventSlim.Wait(Int32 millisecondsTimeout, CancellationToken cancellationToken)
  at System.Threading.Tasks.Task.SpinThenBlockingWait(Int32 millisecondsTimeout, CancellationToken cancellationToken)
  at System.Threading.Tasks.Task.InternalWaitCore(Int32 millisecondsTimeout, CancellationToken cancellationToken)
  at System.Runtime.TaskHelpers.WaitForCompletion(Task task)
  at System.ServiceModel.Channels.CommunicationObject.Open(TimeSpan timeout)
  at System.ServiceModel.Channels.CommunicationObject.Open()
  at System.ServiceModel.ChannelFactory.EnsureOpened()
  at System.ServiceModel.ChannelFactory`1.CreateChannel(EndpointAddress address, Uri via)
  at System.ServiceModel.ChannelFactory`1.CreateChannel()
  at System.ServiceModel.ClientBase`1.CreateChannel()
  at System.ServiceModel.ClientBase`1.CreateChannelInternal()
  at System.ServiceModel.ClientBase`1.get_Channel()

The process handle count looks normal before and after the crash:
[image]

This is speculative, but looking at the memory pattern on the server there is a series of heap size drops that look like gen 2 GC events, with the last one coinciding with the OOM:
[image]

The vertical axis scale is in GB.
There is plenty of available RAM at the moment of the crash.

What could be the next step?

@jkotas
Member

jkotas commented Aug 13, 2021

What could be the next step?

Collect dump (https://docs.microsoft.com/en-us/dotnet/core/diagnostics/dumps#collect-dumps-on-crash) and try to figure out what happened from the dump. We can help you with the latter part.

@baal2000
Author

We do have a minidump.

@jkotas
Member

jkotas commented Aug 14, 2021

Since you are collecting the trace, could you please find the number of active syncblocks from the GCHeapStats_V2 event? The system is limited to at most 67M syncblocks. In theory, it is possible to reach this limit with a 300 GB heap, and reaching it would also lead to an OOM exception. Looking at the number of active syncblocks will give us a clue whether that may be the case.

Just to clarify - is the minidump from Windows or Linux?

A minidump is unlikely to have enough clues to figure this out. You can pretty much only look at the stack in the minidump and try to guess where the OOM happened from the left-over data on the stack.

Here is where I would start looking for clues (with a full dump):

  • Look at SyncBlockCache::s_pSyncBlockCache. Does it have any free blocks (is either m_FreeSyncBlock or m_FreeCount non-zero)? That will tell us whether it is likely that the OOM is caused by attempting to allocate a new memory block for syncblocks.
  • Find the address of the object that caused the OOM. Does the object have a syncblock allocated (look at the 4 bytes before the object address; the BIT_SBLK_* constants describe the bits)?
  • Check both managed and unmanaged heaps for corruptions (!verifyheap and !heap -s -v)

@baal2000
Author

@jkotas

The minidumps are from Linux.

Can anything be done without having a full dump, or is there some middle-of-the-road type of dump we could take that would be more useful?

To put this into perspective, we are operating in a production environment where taking a full crash dump is not feasible due to the time it takes on an application with a large memory footprint.

Thank you.

@jkotas
Member

jkotas commented Aug 17, 2021

There is no middle-of-the-road dump.

Can anything be done without having a full dump?

  • Find the number of active syncblocks from the GCHeapStats_V2 event
  • You can try to look at the unwound portion of the stack in the minidump, but it will be real detective work
  • You can try to build your own libcoreclr.so that will have extra logging on the code path that leads to this OOM exception. If you are not able to take the full dump in production, I think that this is really the only path forward.

@baal2000
Author

@jkotas
GCHeapStats_V2 event option

  • If there is a low-overhead option of running perfcollect or dotnet-trace continuously with a filter set up to observe only this event, then please share it in a comment.

  • We do have a separate in-house monitoring tool that we could modify to record the values (like we did with the handle count). Is there a way to observe such an event continuously without having to run dotnet-trace or perfcollect until the app crashes?

Custom libcoreclr.so option
We could do that with some guidance.

@jkotas
Member

jkotas commented Aug 17, 2021

If there is a low-overhead option of running perfcollect or dotnet-trace continuously with a filter set up to observe only this event, then please share it in a comment.

You can write your own event listener. Here is an example of how to do that: https://devblogs.microsoft.com/dotnet/a-portable-way-to-get-gc-events-in-process-and-no-admin-privilege-with-10-lines-of-code-and-ability-to-dynamically-enable-disable-events/
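
For reference, a rough sketch of such a listener (the payload field name SinkBlockCount is assumed here, matching the GCHeapStats_V2 field mentioned later in this thread):

using System;
using System.Diagnostics.Tracing;

// Rough sketch: subscribes in-process to the runtime's GC events and prints the
// sync block count carried by GCHeapStats_V2, which fires at the end of each GC.
sealed class SyncBlockCountListener : EventListener
{
    protected override void OnEventSourceCreated(EventSource eventSource)
    {
        if (eventSource.Name == "Microsoft-Windows-DotNETRuntime")
        {
            // Keyword 0x1 enables the GC event group at Informational level.
            EnableEvents(eventSource, EventLevel.Informational, (EventKeywords)0x1);
        }
    }

    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        if (eventData.EventName != "GCHeapStats_V2" || eventData.PayloadNames == null)
            return;

        int index = eventData.PayloadNames.IndexOf("SinkBlockCount");
        if (index >= 0)
            Console.WriteLine($"{DateTime.UtcNow:O} SinkBlockCount={eventData.Payload[index]}");
    }
}

Keeping a single instance alive for the lifetime of the process (for example, in a static field) is enough to receive the events.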

@baal2000
Author

OK thanks
We'll try GCHeapStats_V2 first.

Let's say this proves that the number of active syncblocks is too high. What could lead the application to such a state? I'm asking because we also tried running load tests and could not get the app to crash. Maybe there is a definite scenario that leads to the OOM that we could start with and see whether it could apply to our environment in any way. The only common trait between all the crashes is that the system has a very large memory footprint on an even larger-footprint Windows or Linux server (only Linux these days).

Or, if there is interest on your side in using us as guinea pigs for a custom libcoreclr.so, we could start working through that option offline as well.

@jkotas
Member

jkotas commented Aug 17, 2021

Let's say this proves that the number of active syncblocks is too high. What could lead the application to such a state?

Lock taken on more than 65M objects. If you have 100s of GB of memory, it is certainly possible, at least in theory, to get more than 65M objects with syncblocks and run into this implementation limit. So I think it is a good idea to eliminate this possibility first.

@jkotas
Member

jkotas commented Aug 17, 2021

Or, if there is interest on your side in using us as guinea pigs for a custom libcoreclr.so, we could start working through that option offline as well.

Are you able to build libcoreclr.so from the release branch with no changes?

I would start with instrumenting the relevant paths in syncblock.cpp to try to pinpoint the one that is failing with the OOM.

EX_TRY
{
     SyncBlockArray* newsyncblocks = new(SyncBlockArray);
     if (!newsyncblocks)
           COMPlusThrowOM ();
}
EX_CATCH
{
     printf("new(SyncBlockArray) failed!"); // Or some other type of logging that can be captured by the rest of your system
     EX_RETHROW;
}
EX_END_CATCH(RethrowTerminalExceptions);

@baal2000
Author

Are you able to build libcoreclr.so from the release branch with no changes?
We haven't done that in a while.

I thought that the OOM call stack

System.OutOfMemoryException
  at System.Threading.Monitor.ObjWait(Boolean exitContext, Int32 millisecondsTimeout, Object obj)
  at System.Threading.ManualResetEventSlim.Wait(Int32 millisecondsTimeout, CancellationToken cancellationToken)

would tell exactly where it happened.

lock taken on more than 65M objects.
Under what scenario would an application that runs no more than a few thousand threads at any moment in time use a disproportionately higher number of synchronization blocks?

@jkotas
Member

jkotas commented Aug 17, 2021

The call stack says that the OOM happened somewhere inside the unmanaged runtime code. There are a number of potential places that can throw an OOM in the unmanaged runtime in this method. The first step is to identify the exact place where this OOM happens in the unmanaged runtime code.

Under what scenario would an application that runs no more than a few thousand threads at any moment in time use a disproportionately higher number of synchronization blocks?

Once the syncblock gets attached to the object, it stays attached to it. A high number of live objects with a syncblock attached (i.e. objects that ever had a lock taken) would lead to the problem. For example, try running this on the big machine with 100s of GB of memory:

using System.Threading;

var a = new object[70_000_000];
for (int i = 0; i < a.Length; i++)
{
    object o = new object();
    Monitor.Enter(o);
    Monitor.Wait(o, 0);
    a[i] = o;
}

It will fail with an OOM even though the program consumes less than 10 GB and there is a lot of memory available.

@baal2000
Author

Thanks for the example. It does produce an OOM, and I thought the information might help track our bug down.

Once the syncblock gets attached to the object, it stays attached to it.

If I add Monitor.Exit(o) after Monitor.Wait(o, 0):

var a = new object[70_000_000];
for (int i = 0; i < a.Length; i++)
{
    object o = new object();
    Monitor.Enter(o);
    Monitor.Wait(o, 0);
    Monitor.Exit(o);
    a[i] = o;
}

there is no OOM. Does Monitor.Exit(o) detach the syncblock from a live instance?

In our application, the most likely call stack where the OOM happens contains Monitor.TryEnter(o, 0) with Monitor.Exit in a finally block. It would be similar to:

var a = new object[70_000_000];
for (int i = 0; i < a.Length; i++)
{
    a[i] = new object();
}
            
for (int i = 0; i < a.Length; i++)
{
    object o = a[i];
    if (Monitor.TryEnter(o, 0))
    {
        try
        {
            // do something
        }
        finally
        {
            // When commented out no OOM either.
            Monitor.Exit(o);
        }
    }
    else
    {
        throw new InvalidOperationException();
    }
}

This does not OOM. Interestingly, there is no OOM even without the Monitor.Exit(o) call.

I am not sure how to proceed now.

@jkotas
Member

jkotas commented Aug 18, 2021

Does Monitor.Exit(o) detach the syncblock from a live instance?

Syncblocks that are not actively used for locks may get detached after the GC runs. (My earlier statement was an oversimplification.) The exact conditions under which a syncblock may get detached are quite complex. For example, the following program does Enter/Exit, but it will still hit the OOM:

using System.Threading;

var a = new object[70_000_000];
for (int i = 0; i < a.Length; i++)
{
    object o = new object();
    Monitor.Enter(o);
    o.GetHashCode();
    Monitor.Exit(o);
    a[i] = o;
}

I am not sure how to proceed now.

Find the number of active syncblocks in your app from the GCHeapStats_V2 event so that we can be sure we are chasing the right problem.

@baal2000
Author

Thanks for explaining.

"I am not sure how to proceed now" was related to the repro attempt. We are working on implementing GCHeapStats_V2.SinkBlockCount tracing.

@baal2000
Author

baal2000 commented Oct 5, 2021

@jkotas

Thank you for the guidance RE: syncblock count.

We've found that a higher-than-the-limit syncblock count was the cause of the OOM in the application. A possible scenario that could have led to such a state:

  • the application routinely allocates a large number of class instances with a Monitor-based synchronization strategy;
  • the instances are placed in a queue for later processing;
  • a slow queue-processing schedule leads to the class instances accumulating in the gen 2 heap, each (potentially) with a syncblock instance attached;
  • a large amount of free memory on the host system causes long periods with no gen 2 GC collections and, therefore, an insufficient rate of syncblock recycling;
  • the syncblock count hits the limit, which causes the OOM in Monitor.

Replacing Monitor locking with a simple Interlocked.CompareExchange-based spinlock has eliminated the syncblock count growth and completely removed the OOM risk. There is a potential performance cost to that, but it looks very minor at worst or not present at all at best. A rough sketch of the idea is below.
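
For illustration, a minimal sketch of such a spinlock (names are illustrative, not the actual application code):

using System.Threading;

// Keeps the lock state in a plain int field, so no syncblock is ever attached
// to the protected object. Not a general-purpose lock: no fairness, no recursion.
sealed class SimpleSpinLock
{
    private int _state; // 0 = free, 1 = held

    public void Enter()
    {
        // Atomically transition 0 -> 1; spin while another thread holds the lock.
        while (Interlocked.CompareExchange(ref _state, 1, 0) != 0)
        {
            Thread.SpinWait(16);
        }
    }

    public void Exit()
    {
        // Publish the release with a volatile write.
        Volatile.Write(ref _state, 0);
    }
}

Callers pair Enter and Exit in a try/finally block, just as with Monitor.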

Couple questions:

  • Could you confirm that the stock-standard SpinLock is safe from a syncblock-accumulation point of view?
  • Is there any documentation or a recommended blog that explains the nuts and bolts of the relationship between an object and its syncblock, and when developers should pay attention to it? I found some, but do not want to risk endorsing any. I think it is worth placing a bit of official information, or a link to an article, somewhere close to the Object.GetHashCode and Monitor class documentation.

@jkotas
Member

jkotas commented Oct 5, 2021

Yes, SpinLock does not consume syncblocks.

I agree that we need to make this better documented, easier to diagnose, and also look into implementing a first-class lock type (#34812) that does not consume syncblocks.
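
For completeness, the usual try/finally pattern for System.Threading.SpinLock (illustrative only):

using System.Threading;

sealed class Counter
{
    // SpinLock is a struct: keep it in a field and never copy it,
    // since each copy would be an independent lock.
    private SpinLock _lock = new SpinLock(enableThreadOwnerTracking: false);
    private long _value;

    public void Increment()
    {
        bool lockTaken = false;
        try
        {
            _lock.Enter(ref lockTaken);
            _value++;
        }
        finally
        {
            if (lockTaken)
                _lock.Exit();
        }
    }
}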

@ghost

ghost commented Oct 5, 2021

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.


@baal2000
Author

baal2000 commented Oct 6, 2021

@jkotas
Thanks for adding a custom message to the OOM exception. What would be the cause of this generic OOM, though?

        if ((indx = GetHeaderSyncBlockIndex()) == 0)
            COMPlusThrowOM();

@jkotas
Member

jkotas commented Oct 6, 2021

I think this COMPlusThrowOM() is unreachable. We can turn it into an assert.

@baal2000
Author

@jkotas Could this update be patched back to .NET 5 to save everyone time on developing custom solutions for telling one OOM case from another?

@jkotas
Member

jkotas commented Oct 18, 2021

I do not think I can get it to .NET 5 (it is going out of support soon). I will try to get it to .NET 6.

github-actions bot pushed a commit that referenced this issue Oct 18, 2021
Anipik pushed a commit that referenced this issue Oct 19, 2021
…#60592)

* Use custom error message when running out of syncblocks

Contributes to #49215

* Update src/coreclr/dlls/mscorrc/mscorrc.rc

update message

Co-authored-by: Jan Kotas <jkotas@microsoft.com>

Co-authored-by: Jan Kotas <jkotas@microsoft.com>
Co-authored-by: Manish Godse <61718172+mangod9@users.noreply.github.com>