OutOfMemoryException at Monitor.ReliableEnterTimeout with plenty of free memory available #49215
Comments
I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.
Tagging subscribers to this area: @dotnet/gc
Just pinging back: could someone review the symptoms and confirm whether this is a known issue (or a non-issue)?
Sorry to have missed this. Is this consistently reproducible? If so, could you please share a dump and/or repro?
@mangod9 No dump, and it does not happen often. This is probably the only instance where we had some data collected.
Unlikely that it's happening within the monitor code itself. Possibly some allocation failure on a different thread is triggering FailFast, leading to the failure. Might be worthwhile to check whether it happens on one specific machine or occurs on multiple physical machines.
This is the FailFast thread stack: the application invokes FailFast with the exception information.
There are 2 threads:
Updated the description to clarify where FailFast comes into the picture.
We have another server throwing the same OOM exceptions on a more frequent basis, where again the server has plenty of available memory. We are also collecting mini dumps. I cannot post the mini dumps here, but I could, within reason, pull more information from them to share. I found this exception in a few mini dumps, similar to the original case that @baal2000 posted. Here is the stack from our mini dump, whereas the other one was from a PerfView trace; both are within the same class. If someone would like more information from the mini dumps, I can try to provide it here.
Thanks
A couple of questions:
Thanks!
@mangod9 We could share a production system dump file with the MS team if there is a secure way to do that and it makes it easier to move the case forward.
Hi @baal2000, yeah, you could send a pointer to my MS email (it's listed in my profile). Thx!
Another instance of this (under Windows)
Could you please check how many OS handles the process has open when it crashes? The most likely reason for an OOM in this situation is a failure to create a new OS event handle, which can happen when the process has too many handles open.
What is the best way to do that under Linux? Internally (Process.HandleCount after catching the OOM), or is there an external tool to trace the count continuously?
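For what it's worth, a minimal sketch of the in-process option mentioned above (an illustration, not code from this application): log the handle count whenever an OutOfMemoryException is thrown. Note that what Process.HandleCount reports on Linux (open file descriptors, or possibly 0 on older runtimes) depends on the runtime version, so verify it on the target system:

```csharp
using System;
using System.Diagnostics;
using System.Runtime.ExceptionServices;

// Sketch: record the process handle count at the moment an OutOfMemoryException
// is first thrown, using the first-chance exception notification.
AppDomain.CurrentDomain.FirstChanceException += (sender, e) =>
{
    if (e.Exception is OutOfMemoryException)
    {
        using Process self = Process.GetCurrentProcess();
        Console.Error.WriteLine($"OOM thrown, handle count: {self.HandleCount}");
    }
};
```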
We'll record this value during future crash events and comment here if there is new data.
Yup, sounds good.
Collect a dump (https://docs.microsoft.com/en-us/dotnet/core/diagnostics/dumps#collect-dumps-on-crash) and try to figure out what happened from the dump. We can help you with the latter part.
We do have a minidump.
Since you are collecting the trace, could you please find the number of active syncblocks from the trace? Just to clarify: is the minidump from Windows or Linux? A minidump is unlikely to have enough clues to figure this out. You can pretty much only look at the stack in the minidump and try to guess where the OOM happened from the leftover data on the stack. Here is where I would start looking for clues (with a full dump):
The minidumps are from Linux. Can anything be done without a full dump, or is there some middle-of-the-road type of dump we can take that could be more useful? To put this into perspective, we are operating in a production environment where taking a full crash dump is not feasible due to the time it takes on an application with a large memory footprint. Thank you.
There is no middle-of-the-road dump.
@jkotas
Custom libcoreclr.so option
You can write your own event listener. Here is an example of how to do that: https://devblogs.microsoft.com/dotnet/a-portable-way-to-get-gc-events-in-process-and-no-admin-privilege-with-10-lines-of-code-and-ability-to-dynamically-enable-disable-events/
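For illustration, here is a minimal sketch (not the code from the linked post) of an in-process listener that enables the runtime's GC events; the GCHeapStats payload includes a SinkBlockCount field that can be logged over time to watch sync block usage. The provider name, keyword, and field names are taken from the public runtime event definitions and should be verified against the runtime version in use:

```csharp
using System;
using System.Diagnostics.Tracing;

// Sketch: enable the runtime GC events in-process and print the sync block count
// reported by the GCHeapStats event after each GC. "SinkBlockCount" is the field
// name used by the runtime's event definition (note the historical misspelling).
internal sealed class GcEventListener : EventListener
{
    private const EventKeywords GcKeyword = (EventKeywords)0x1; // GC keyword of the runtime provider

    protected override void OnEventSourceCreated(EventSource eventSource)
    {
        if (eventSource.Name == "Microsoft-Windows-DotNETRuntime")
        {
            EnableEvents(eventSource, EventLevel.Informational, GcKeyword);
        }
    }

    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        if (eventData.EventName != null &&
            eventData.EventName.StartsWith("GCHeapStats", StringComparison.Ordinal) &&
            eventData.PayloadNames != null)
        {
            int index = eventData.PayloadNames.IndexOf("SinkBlockCount");
            if (index >= 0)
            {
                Console.WriteLine($"Sync blocks in use: {eventData.Payload[index]}");
            }
        }
    }
}
```

Keeping one instance of the listener alive for the lifetime of the process (for example, in a static field) is enough to start receiving the events.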
OK, thanks. Let's say this proves that the number of active syncblocks is too high. What could lead the application to such a state? Asking because we also ran load testing and could not get the app to crash. Maybe there is a definite scenario that leads to the OOM that we could start with, and then see whether it could apply to our environment in any way. The only common trait between all the crashes is that the system has a very large memory footprint on an even larger-footprint Windows or Linux server (only Linux these days). Or, if there is interest on your side in using us as guinea pigs with a custom libcoreclr.so, we could start working through that option offline too.
A lock taken on more than 65M objects. If you have 100 GBs of memory, it is certainly possible - at least in theory - to get more than 65M objects with sync blocks and run into this implementation limit. So I think it is a good idea to eliminate this possibility first.
Are you able to build libcoreclr.so from the release branch with no changes? I would start with instrumenting the relevant paths in syncblock.cpp to try to pinpoint the one that is failing with the OOM.
I thought that the OOM call stack
would tell exactly the path where it happened.
The call stack says that the OOM happened somewhere inside the unmanaged runtime code. There are a number of potential places that can throw OOM in the unmanaged runtime in this method. A first step is to identify the exact place where this OOM happens in the unmanaged runtime code.
Once a syncblock gets attached to an object, it stays attached to it. A high number of live objects with a sync block attached (i.e. objects that ever had a lock taken on them) would lead to the problem. For example, try to run this on the big machine with 100s of GB of memory:

```csharp
using System.Threading;

var a = new object[70_000_000];
for (int i = 0; i < a.Length; i++)
{
    object o = new object();
    // Taking the lock and waiting forces a sync block to be allocated for the object,
    // and the array keeps every object (and its sync block) alive.
    Monitor.Enter(o);
    Monitor.Wait(o, 0);
    a[i] = o;
}
```

It will fail with OOM even though the program has consumed less than 10GB and there is a lot of memory available.
Thanks for the example. It does produce an OOM, and I thought the information might help to track our bug down.
If I add a Monitor.Exit to the example, there is no OOM. Does Monitor.Exit release the syncblock?
In our application, the most likely call stack where the OOM happens contains …
This does not OOM. Interestingly, there is no OOM even when no … is used. I am not sure how to proceed now.
Syncblocks that are not actively used for locks may get detached after the GC runs. (My earlier statement was an oversimplification.) The exact conditions under which a syncblock may get detached are quite complex. For example, the following program does Enter/Exit, but it will still hit the OOM:
Find the number of active syncblocks in your app from the GC events.
Thanks for explaining. "I am not sure how to proceed now" was related to the repro attempt. We are working on implementing the suggested event listener.
Thank you for the guidance re: the syncblock count. We've found that a syncblock count higher than the limit was indeed the cause of the OOM in the application. A possible scenario that could have led to such a state:
Replacing Monitor locking with a simple Interlocked.CompareExchange-based spinlock has eliminated the syncblock count growth and completely removed the OOM risk. There is a potential performance cost to that, but it looks very minor at worst, or not present at all at best. A couple of questions:
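For reference, a minimal sketch of the kind of Interlocked.CompareExchange-based spinlock described above (an illustration, not the application's actual implementation):

```csharp
using System.Threading;

// Sketch: a spinlock built on Interlocked.CompareExchange. It keeps its state in a
// plain int field, so taking the lock never allocates a syncblock.
public sealed class CasSpinLock
{
    private int _state; // 0 = free, 1 = held

    public void Enter()
    {
        SpinWait spinner = default;
        while (Interlocked.CompareExchange(ref _state, 1, 0) != 0)
        {
            spinner.SpinOnce(); // back off (and eventually yield) instead of burning CPU
        }
    }

    public void Exit()
    {
        Volatile.Write(ref _state, 0);
    }
}
```

Unlike Monitor, such a lock never touches the object header, so no syncblock is created; the trade-offs are that it is not reentrant and offers no Wait/Pulse support.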
Yes, SpinLock does not consume syncblocks. I agree that we need to make this better documented, easier to diagnose, and also look into implementing a first-class lock type (#34812) that does not consume syncblocks.
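A small usage sketch of the built-in System.Threading.SpinLock mentioned above, which keeps its state in a struct field rather than in the object header:

```csharp
using System.Threading;

// Sketch: guarding a short critical section with System.Threading.SpinLock.
var gate = new SpinLock(enableThreadOwnerTracking: false);

bool lockTaken = false;
try
{
    gate.Enter(ref lockTaken);
    // ... short critical section ...
}
finally
{
    if (lockTaken) gate.Exit();
}
```

Because SpinLock is a mutable struct, it should be stored in a non-readonly field and never copied.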
Tagging subscribers to this area: @mangod9
@jkotas
I think this
@jkotas Could this update be patched back to release 5 of the framework, to save everyone the time of developing custom solutions for telling one OOM case from another?
I do not think I can get it to .NET 5 (it is going out of support soon). I will try to get it to .NET 6.
…#60592)
* Use custom error message when running out of syncblocks. Contributes to #49215
* Update src/coreclr/dlls/mscorrc/mscorrc.rc: update message
Co-authored-by: Jan Kotas <jkotas@microsoft.com>
Co-authored-by: Manish Godse <61718172+mangod9@users.noreply.github.com>
Description
A process running under FW 3.1 threw OutOfMemoryException (OOM), then stopped by invoking FailFast
There are 2 threads:
Application.Logging.Logger
instance queue for processing. The call stack there is what is mentioned in the issue description.
The OOM came out of the blue, as the software was running on a system with 200GB of RAM still available.
The memory graph with the OOM moment:
To monitor the system performance we are running "continuous" PerfView on the box all the time as
The collected PerfView data set closest to the moment of the crash showed about 1% of the CPU activity in Monitor.Enter/TryEnter, likely seen as MonReliableEnter_Portable:
In turn, SyncBlockCache::GetNextFreeSyncBlock showed up (3.1 SyncBlock.cpp, 5.0 SyncBlock.cpp):
There are a few possibilities for an OOM there, for instance in SyncBlockCache::GetNextFreeSyncBlock
Configuration
Questions