# Memory Leak, Increasing CPU usage #6
Hey @bdrupieski, thanks for the report; I'll take a look into this over the next week at some point. This library makes use of a number of internal caches, and I suspect these aren't being evicted correctly, leading to higher memory consumption. I also suspect the increased CPU use could be due to more frequent/aggressive GC activity. If you could provide a screenshot of the number of GC collections for the impacted time period, that would help confirm this theory.
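For reference, those counts can be read with the standard `GC.CollectionCount` API; a minimal sketch (illustrative only, not how the library itself gathers them):

```csharp
using System;

// Illustrative only: snapshot the cumulative collection count per GC generation.
// Charting the rate of change of these numbers over time shows whether GC
// activity has actually increased.
Console.WriteLine($"gen0={GC.CollectionCount(0)} gen1={GC.CollectionCount(1)} gen2={GC.CollectionCount(2)}");
```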
Here's the past 14 days of the garbage collection rate. It's a little bit higher, but not too far outside the norm. CPU and memory remain stable with only the thread pool scheduling stats collector running. Here's a graph showing this for CPU usage from the past 4 days:
@bdrupieski I believe I've fixed the issue: I discovered that `EventPairTimer` wasn't removing items from its cache. I've pushed `v3.0.9-beta`.
Thanks for digging into this and for the quick turnaround. I'll give this a try: I'll turn them all back on, see what happens, and report back here with what I find.
I upgraded to v3.0.9-beta and tried to turn all the collectors back on. It seems to still be happening. Here's memory usage (the green box is when they were all on), and CPU: I want to try to dig into this myself with a CPU or memory profiler some weekend to figure out what's going on, or to set up remote debugging/profiling to peek into what it's doing when running in Kubernetes rather than only locally. I'll try to get to this over the next few days (maybe weeks...).
Hmm, interesting. Let me know what you find out. In the meantime, information around the following might help us understand what's going on:
So I'm reasonably confident that there's no memory leak in the strictest sense: cached items are being cleaned up correctly. I think the solution might be to offer a configurable sampling ratio. This would allow a certain percentage of events to be discarded and reduce cache storage requirements.
Yes, there are memory limits; this is what I have:

I'll look into getting a dump.
# Summary

Added event sampling for high-throughput stats collectors to reduce memory footprint; should hopefully resolve [issue #6](#6).

# Changes

- `JitStatsCollector`, `ThreadPoolSchedulingStatsCollector` and `ContentionStatsCollector` can now be configured to discard a ratio of start/stop events, reducing the number of events we need to store in a `Cache` at any one time
- A new default sampling rate has been configured for the above collectors (see README)
- Updated README with an example of how to configure sampling and details of the defaults for each stats collector
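For a rough idea of what that configuration looks like, here's a sketch (the exact parameter shapes and `SampleEvery` values used below are assumptions; the README has the authoritative example):

```csharp
using Prometheus.DotNetRuntime;

// Sketch: discard a proportion of start/stop events for the high-throughput
// collectors. Parameter positions and SampleEvery values are assumptions;
// the README documents the real defaults and signatures.
var collector = DotNetRuntimeStatsBuilder
    .Customize()
    .WithJitStats(SampleEvery.TenEvents)
    .WithThreadPoolSchedulingStats(SampleEvery.FiveEvents)
    .WithContentionStats(SampleEvery.TwoEvents)
    .StartCollecting();
```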
@bdrupieski I've pushed a new version to NuGet.
@bdrupieski, I've also included some optional debugging metrics; you can enable these when you set up the collectors:
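Something along these lines (a sketch; `WithDebuggingMetrics` is an assumed option name here, so check the release notes/README for the actual call):

```csharp
using Prometheus.DotNetRuntime;

// Sketch: turn on the optional debugging metrics when starting the collectors.
// "WithDebuggingMetrics" is assumed for illustration; the real option name
// may differ.
var collector = DotNetRuntimeStatsBuilder
    .Default()
    .WithDebuggingMetrics(true)
    .StartCollecting();
```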
I'm still seeing a very small memory leak: only the process working set grows, while the GC heap is stable. The app is running with very little to no load (no HTTP requests apart from health and metrics).
This seems to be related to CoreCLR 2.2 issue https://github.com/dotnet/coreclr/issues/23562. I'm trying to run the app with an environment variable workaround.
Can confirm @thirus's issue; it's the same in my app. Looking at memory dumps, there's a lot of XML reminiscent of the EventPipe XML.
@thirus - I think that issue perfectly explains why memory consumption gradually rises (thanks for bringing it to my attention):
When running on .NET Core 2.2 I saw the same gradual memory consumption issue, but on 3.0 memory use remained stable. Another issue chalked up to bugs in 2.2!
@TheAngryByrd When I was profiling memory to get an idea of what was going on, this is what I saw too. @djluck Good find that it's a 2.2 thing. I'm looking forward to upgrading soon, once it's GA or RC and no longer in preview.
@bdrupieski In the comments of the preview 7 release blog, Richard mentioned they're not doing an "RC" anymore and we can consider preview 7 as the RC.
Setting ENV
Hi all - I'm from the .NET runtime team at Microsoft and I work on the EventPipe/LTTng/ETW/counters parts of the runtime. I saw this library being used by many of our customers, so I decided to come over and see what we need to improve on, and also to clarify a few things going on in this thread.
This is probably the growth of the buffer that the runtime uses to store events internally. This library seems to use EventListener, which creates an EventPipe session with a buffer size limit of 1024 MB in the .NET Core 2.2 runtime. This behavior was fixed in 3.0 (by decreasing the size limit to 10 MB), which is why you won't see the constant increase in memory on 3.0. The constant growth in memory shown in some of the graphs above is due to this: the buffer hasn't hit the maximum limit it thinks it can have, so it keeps growing. However, if it grows beyond 1 GB, then we might be looking at a runtime bug. I currently have https://github.com/dotnet/coreclr/issues/23562 filed as a CoreCLR bug that I'm having a hard time reproducing, but if someone can repro this and send me a heap dump to debug, I would be more than happy to dig in and potentially file a servicing fix for it.
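For context, the EventListener-based approach described here boils down to something like the following (a minimal sketch, not this library's actual implementation; the provider name and GC keyword mask are the standard .NET runtime ones):

```csharp
using System.Diagnostics.Tracing;

// Minimal sketch of an EventListener that subscribes to .NET runtime events.
// Enabling a provider like this is what implicitly creates an EventPipe
// session inside the runtime, with the buffer behaviour described above.
public sealed class RuntimeEventListener : EventListener
{
    private const EventKeywords GcKeyword = (EventKeywords)0x1; // GC events

    protected override void OnEventSourceCreated(EventSource eventSource)
    {
        if (eventSource.Name == "Microsoft-Windows-DotNETRuntime")
        {
            // Verbose level + GC keyword: useful detail, but not cheap.
            EnableEvents(eventSource, EventLevel.Verbose, GcKeyword);
        }
    }

    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        // A real collector would inspect eventData.EventName / Payload here
        // and update its metrics accordingly.
    }
}
```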
This environment variable is only valid if you are also using the other environment variable mentioned above. Sorry that you are all running into issues in 2.2. While I can't guarantee that the fix will go into a 2.2 servicing release (that decision is above my pay grade), I'm hoping to come up with a reasonable fix that can be ported back to 2.2 servicing for those of you still planning to use 2.2 after the 3.0 GA. If you have any issues or suggestions for the EventPipe/counters parts of the runtime or its APIs, please feel free to reach out to me.
Hey @sywhang, thanks for taking the time to provide us with that level of detail; it's really appreciated. I also want to take this opportunity to thank you guys for providing these new telemetry hooks; they're definitely appreciated 👍
I have the same issue with the memory leak. Any updates?
This issue is now being tracked here: dotnet/coreclr#26344. But also know that the issue is not present in .NET Core 3.0, which is currently supported in production environments and will be fully released on September 23.
We're experiencing the same issue with CPU increase / memory leak on .NET Core 3.0-3.1. The memory supposedly leaks slowly over time, but the CPU usage increase is more apparent. Once we turned off the DotNetRuntime collectors, these problems went away. Here are the usage graphs of our service that uses the DotNetRuntime collectors on .NET Core 3.0. I made a small repro with a blank .NET Core 3.1 app that illustrates the issue here: https://github.com/semyon2105/leak-repro. Might be related to dotnet/runtime#31900.
Hi @semyon2105, thanks for the detailed repro steps. I tried to repro your issue and ran the app you shared, and here is what I have so far after running it for ~10 hours. The metric I used to draw the chart above was collected with a separate tool.

The increase in memory usage in the graph you posted may be related to an issue I fixed in the .NET Core 3.1.5 servicing release, where the native buffer allocation pattern in EventPipe was causing internal heap fragmentation in glibc's data structures used in the implementation of malloc. So it is possible to see unbounded growth if you are on .NET Core 3.1.4 or below.

Regarding the CPU usage increase: I haven't captured and analyzed a full CPU trace, but from my knowledge of how .NET runtime eventing works, the increase is probably caused by the way metrics are captured in this library (via EventListener turning on runtime event providers). These runtime event providers are normally used for in-depth tracing of runtime components such as the GC, JIT, ThreadPool, and Assembly Loader. They are not cheap: the runtime has to populate the event payload to send to the listener, and that alone can slow things down by quite a bit. The exact overhead will vary largely depending on the characteristics of your application and the specific metrics you turn on.

For instance, if your app allocates a lot of objects and you turn on the GcStats collector in this library, that enables GC eventing in the runtime at the verbose level. That gives you a lot of useful GC info, but it comes at a cost: the GC has to populate an event payload and fire an event every time it marks a Type for collection during its mark phase, every time it allocates a new chunk of managed memory, and so on. These are not things you want firing all the time in a performance-sensitive environment, since they slow down the GC considerably. The verbose keyword is usually used when you are actively diagnosing a GC issue in your app.

Changing the sampling rate could alleviate this a little, but I wouldn't expect it to change the behavior much, since much of the cost comes from the runtime populating the event payload (not parsing the payload on the consumer side doesn't mean the runtime isn't populating it). The performance impact would, as you might expect, be even greater on an app doing actual work (i.e. a heavily loaded server) than on a blank app, since runtime components like the GC or ThreadPool are stressed further under load and the cost of populating event payloads makes them run slower.

This is also why part of the effort we are working on for .NET 5 and .NET 6 is to convince area-owners of various runtime components in .NET to expose APIs for developers to get some of these "more in-depth" metrics without having to start expensive tracing sessions. For example, the GC has added the GCMemoryInfo API, which gives you a lot of useful GC-related metrics through an API call, and the ThreadPool has a few APIs you can invoke to get its metrics directly instead of starting a tracing session. We're hoping to increase the scope of metrics we cover by adding similar APIs for other runtime components, so if you have feedback on metrics you'd like to see, please feel free to come to the dotnet/runtime repo and submit feature requests.
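As a quick illustration of those API-based alternatives (a minimal sketch using the `GC.GetGCMemoryInfo` and `ThreadPool` members available since .NET Core 3.0; not part of this library):

```csharp
using System;
using System.Threading;

// Sketch: read GC and ThreadPool metrics directly through runtime APIs
// (available since .NET Core 3.0), with no tracing session involved.
GCMemoryInfo gcInfo = GC.GetGCMemoryInfo();
Console.WriteLine($"GC heap size:           {gcInfo.HeapSizeBytes} bytes");
Console.WriteLine($"Fragmented bytes:       {gcInfo.FragmentedBytes}");
Console.WriteLine($"Total available memory: {gcInfo.TotalAvailableMemoryBytes} bytes");

Console.WriteLine($"ThreadPool thread count: {ThreadPool.ThreadCount}");
Console.WriteLine($"Completed work items:    {ThreadPool.CompletedWorkItemCount}");
Console.WriteLine($"Pending work items:      {ThreadPool.PendingWorkItemCount}");
```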
If you have any more questions regarding the behavior of EventListener, EventPipe, or how the .NET runtime emits runtime events, I'm happy to explain more, so please let me know :)
I have the same issue after adding Prometheus monitoring with default metrics to a .NET Core 3.1 app. I can see that most of the thread pool threads wait for CPU about 10-15% of total time. After disabling Prometheus monitoring, threads wait about 5% of total time.
What version of .NET Core 3.1 is the container using? Versions earlier than 3.1.5 have a memory leak that can cause the CPU consumption you're experiencing (see dotnet/runtime#35924).
We saw the same issue with 3.1.7.
I think this comes down to what @sywhang was saying above: this library collects too many events from the .NET runtime. This isn't intentional, but to get at the few "interesting" events it has to enable a level of verbosity that can see hundreds or thousands of events generated per second.
I had 3.1.3 and have now upgraded to 3.1.5. It seems that threads are spinning on the EventPipe lock; I need to verify.
The issue I fixed resolved the memory issue but likely won't impact CPU usage. For CPU, the main issue is that this library uses EventListener to consume the events, which is not the mainline scenario for tracing; the CPU usage increase is due to the lock used by the runtime to emit the events in order. Apart from the verbosity issue I described above, EventListener also isn't particularly performant by design, since it wasn't meant to be used for a verbose, long-running session. That being said, EventPipe does have a scalability issue, especially when events are written from many threads (e.g. more than 30), and we're currently working on improving its scalability for the upcoming .NET 6 release. In general, there is a trade-off between the performance of the app and the verbosity of events/depth of diagnostics you get. Some of the events turned on by this library are very expensive, and those events weren't meant to be used in a long-running session.
@sywhang Thanks for the detailed answer. Could you please clarify: is disabling or reducing the number of metrics the only solution I can go with? Is it correct that the default settings for the prometheus-net library are not suitable for a long-running process? In other words, should I suppress the DotNetStats metrics by default in a production app and enable them only for limited time intervals when investigating specific problems?
@simbadltd I think that would be appropriate. I'm trying to reproduce the issue myself so I can investigate what impact briefly enabling/disabling the trace listeners might have.
I believe I have reproduced the issue and have a workaround. After (partially) listening to @sywhang's observation that event sessions were not meant to be long-running, I tried disposing of the collector and then re-enabling it shortly after. This was the result: the blue line marks the point where I disposed of and re-created the runtime stats collector, and CPU immediately drops back to the levels seen at the start of the application. I will incorporate periodic recycling of the collector into a later release, but for now you can always dispose of and re-create the collector yourself.
@djluck Thanks for your time. I can confirm the proposed workaround is working. CPU load has stabilized, along with service throughput and thread pool scheduling latency. The yellow line marks the point where the workaround was applied.
@simbadltd Thanks for confirming this fixes things for you; appreciate the report, mate. Obviously disabling metric collection is not ideal, but a quick disable/re-enable should cause minimal disruption to the metrics collected. I'm going on holiday for the next week but will be keen to get back to work on V4 and this fix once I'm back.
CPU is looking stable so far with the workaround after two days. I added the following to my ASP.NET Core app's startup:

```csharp
// Periodically recycle the runtime stats collector so the underlying
// event session never runs for too long.
Task.Run(async () =>
{
    var delay = TimeSpan.FromMinutes(5);
    while (!applicationLifetime.ApplicationStopping.IsCancellationRequested)
    {
        var collector = DotNetRuntimeStatsBuilder.Default().StartCollecting();
        await Task.Delay(delay);
        collector.Dispose();
    }
});
```
I've used the same solution, but wrapped it up as a hosted `BackgroundService`:

```csharp
public class CollectorService : BackgroundService
{
    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        // Recycle the collector every few hours to avoid a long-running event session.
        while (!stoppingToken.IsCancellationRequested)
        {
            using var collector = StartCollection();
            await Task.Delay(TimeSpan.FromHours(3), stoppingToken);
        }
    }

    private IDisposable StartCollection()
    {
        var builder = DotNetRuntimeStatsBuilder.Customize()
            .WithContentionStats()
            .WithJitStats()
            .WithThreadPoolSchedulingStats()
            .WithThreadPoolStats()
            .WithGcStats();

        return builder.StartCollecting();
    }
}
```
So I found a fairly serious bug that you should be aware of if you're using the "dispose after X duration" workaround: dotnet/runtime#49804. It can cause events to stop being produced and can even crash the application. This has delayed the release of V4, but I think I have a half-decent workaround for now.
I released V4 today, which should resolve all performance issues if you're using the new defaults. The option remains to enable the existing, more detailed collectors.
This is an awesome library. Thanks for building this.
I dropped this into two .NET Core 2.2 services running in Kubernetes yesterday morning, setting it up to run all of the collectors like this:
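A minimal sketch of what enabling all the collectors looks like with this library's builder (illustrative; assembled from the builder methods that appear later in this thread rather than the exact snippet used here):

```csharp
using Prometheus.DotNetRuntime;

// Sketch: enable every available collector. Based on the builder methods
// used elsewhere in this thread; treat as illustrative.
var collector = DotNetRuntimeStatsBuilder
    .Customize()
    .WithContentionStats()
    .WithJitStats()
    .WithThreadPoolSchedulingStats()
    .WithThreadPoolStats()
    .WithGcStats()
    .StartCollecting();
```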
Throughout the day, I saw memory and CPU usage keep climbing. It looks like there's a memory leak somewhere, and something is causing CPU usage to increase over time.
I already have a metric derived from `Process.GetCurrentProcess().WorkingSet64`. This is what that metric did yesterday:

The green box shows memory usage when I had all of this library's collectors running. This is for one of the two services, and there are three instances of each, which is why there are three lines on the graph. Each line is `Process.GetCurrentProcess().WorkingSet64` over time for that instance. Each time the colors change is from a deployment when the pod names change; the query I'm using for that chart treats them as a completely different metric name at that point.

Here's a graph showing the summed rate of `Process.GetCurrentProcess().PrivilegedProcessorTime.TotalSeconds` and `Process.GetCurrentProcess().UserProcessorTime.TotalSeconds` for all instances of each of the two services over the past 2 days:

Sorry for not including the legend -- it has names I don't want to share. The yellow and green lines on the bottom are privileged processor time. The other two are user processor time. One of them is so spiky because it's running a pretty intense scheduled job every 20 minutes. The other is only responding to HTTP requests.
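For illustration, here's roughly how such process-level metrics can be exported as Prometheus gauges with prometheus-net (a sketch; the metric names are made up, not the ones used in the service above):

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;
using Prometheus;

// Sketch: expose process working set and processor time as Prometheus gauges.
// Metric names here are illustrative only.
var workingSet = Metrics.CreateGauge("example_process_working_set_bytes", "Process working set in bytes");
var userCpu = Metrics.CreateGauge("example_process_user_cpu_seconds", "Total user processor time in seconds");
var privilegedCpu = Metrics.CreateGauge("example_process_privileged_cpu_seconds", "Total privileged processor time in seconds");

_ = Task.Run(async () =>
{
    while (true)
    {
        using var process = Process.GetCurrentProcess();
        workingSet.Set(process.WorkingSet64);
        userCpu.Set(process.UserProcessorTime.TotalSeconds);
        privilegedCpu.Set(process.PrivilegedProcessorTime.TotalSeconds);
        await Task.Delay(TimeSpan.FromSeconds(15));
    }
});
```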
I turned everything off except the thread pool scheduling stats last night. The most recent parts of the above charts are from only running this:
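A sketch of running only the thread pool scheduling stats collector (illustrative, using the same builder methods as above):

```csharp
using Prometheus.DotNetRuntime;

// Sketch: enable only the thread pool scheduling stats collector.
var collector = DotNetRuntimeStatsBuilder
    .Customize()
    .WithThreadPoolSchedulingStats()
    .StartCollecting();
```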
If there's a problem, then it's not with the thread pool scheduling stats.
I'll add the others back one at a time to try to find where the problem is. I'll report back to this issue with my findings.
Thanks again for this library. I didn't know about the CLR events it records; I'm grateful you wrote it, both because I learned about them and because it gives insight I didn't know was so readily available.