[2.2] Potential native memory leak from EventPipe #13309
From: https://github.com/dotnet/coreclr/issues/23562#issuecomment-523731589 Sorry, I do not think I can share the full memory dump, but here is some data from the memory dump, and I can get you more data upon request.
As you can see, at the time of that memory dump, there were 52,217 occurrences of these 102,400-byte objects, taking up a total of 5,347,020,800 bytes (5.3 GB) of memory. I spot-checked some of the same-size objects in a debug instance and they all had the same allocation stack trace listed here (sorry for the mix-up, my stack trace is different from the OP's stack trace). The repro was pretty simple. I made an ASP.NET Core 2.2.4-targeted application that was self-hosted as a console app and self-contained with win7-x64 as the runtime target. Then I added a NuGet reference to the latest prometheus-net.DotNetRuntime and called DotNetRuntimeStatsBuilder.Default().StartCollecting(); in Startup.Configure. I can create a project like that and attach a zip if you'd like. |
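For reference, a minimal sketch of that setup (the `Prometheus.DotNetRuntime` namespace and the exact project shape are assumptions; only the `StartCollecting()` call is quoted from the report above):

```csharp
// Sketch of the repro described above: an ASP.NET Core 2.2 app that starts the
// prometheus-net.DotNetRuntime collectors from Startup.Configure.
// The Prometheus.DotNetRuntime namespace is assumed; StartCollecting() is the
// call named in the report.
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Hosting;
using Microsoft.AspNetCore.Http;
using Prometheus.DotNetRuntime;

public class Startup
{
    public void Configure(IApplicationBuilder app, IHostingEnvironment env)
    {
        // Starting the collectors creates EventListener sessions against the
        // runtime's EventSources, which is what drives the EventPipe buffer
        // allocations discussed in this thread.
        DotNetRuntimeStatsBuilder.Default().StartCollecting();

        // The app does no real work; it just sits idle while memory is observed.
        app.Run(async context => await context.Response.WriteAsync("idle"));
    }
}
```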
From: https://github.com/dotnet/coreclr/issues/23562#issuecomment-524005521 Attached is a simple app that reproduces my issue: EventPipeMemoryLeak.zip. I just let the app sit idle and the memory usage is continuously increasing, and the heap stats show a regular increase in the quantity of these 0x19000 (102,400) byte objects.
From: https://github.com/dotnet/coreclr/issues/23562#issuecomment-524136956 @tylerohlsen I'm running the repro app and trying to see how it ends up in that state. So far I'm not seeing anything that's not expected - I'm waiting to see if the memory usage grows beyond 1GB, which is the max limit EventListener sessions have. That being said, the 0x19000-size events do correspond to the EventPipe buffer blocks that we allocate internally to store the events. The repro app's been running for about 1.5 hours now and it's at 380MB - I will come back in a few hours, see how things have changed, and take it from there.
Unfortunately I haven't been able to repro the issue using your repro. I've run it for more than 22 hours now and so far the memory usage is very stable at around 370MB. I believe there may be some machine-specific differences that cause the repro to not happen on my side - for example, if the buffer is being cleared fast enough that it never accumulates to a single buffer's size, it's possible that the thread never allocates another buffer (which is what's supposed to happen). But in your case it obviously grew beyond that size and is allocating more than the maximum limit. That being said, I can provide some guidance on how to debug this leak so you can diagnose the state when the leak is happening. Since the library you are using relies on EventListener, the memory usage (in 2.2) can grow up to 1GB but it shouldn't grow beyond that; usage below 1GB doesn't necessarily indicate a leak. When/if the native memory usage (the total size of 0x19000-sized blocks) has grown beyond 1GB, could you set a breakpoint in CoreCLR!EventPipeBufferManager::AllocateBufferForThread and step through to see which code path it takes? Specifically, we want to know how it gets to this part of that function:
The other thing you might want to check is if it ever goes into this path:
This if block:
May I suggest running the sample code in this Docker container: mcr.microsoft.com/dotnet/core/aspnet:2.2. We are experiencing what looks like the same issue. We see the issue when running containerized on Linux, and we do not have the issue running directly (non-containerized) on Windows. We haven't tested non-containerized Linux or containerized Windows.
@jenshenneberg Does the native memory usage go above 1 GB?
@sywhang |
@jenshenneberg That would be great. Thank you! I'll try out the repro myself too in the meantime.
Our app's non-heap memory grows ~15MB per hour, and we have a hard cap of 512MB set as the memory limit in k8s, so the container gets killed because of OOM; I could not check up to 1GB. @sywhang When I ran the app on OSX, without calling …
I now have my local environment set up with the coreclr source code to be able to set a breakpoint and debug into EventPipeBufferManager::AllocateBufferForThread.
@sywhang I'll start another pod with COMPlus_EventPipeCircularMB set to 10 to verify that this is not just the circular buffer filling up.
@thirus Please do let me know if it grows beyond 1GB. And if it does, a heap dump would be very helpful. @tylerohlsen That would be great. Thank you. @jenshenneberg I can wait for whenever it is ready - on a side note, how are you collecting the trace? If you're using EventListener (it's what the prometheus-dotnet library mentioned in the issue uses), setting …
@sywhang
After about 2.5 days, I'm well above 1GB of memory allocated from the buffer. I'm running with 2.2.4, and line 115 is what is setting … But 3.0.0 preview 8 added a little block on line 110. Could that be why it is not reproducible in 3.0? If so, can we get that back-ported to 2.2? Here's the new block in 3.0 that is not in 2.2...
Oh, it looks like there are quite a few more differences in 3.0 in that method than just the block I posted above. FYI, I'm still running my debugger in this repro app at above 1GB if you need me to set a breakpoint anywhere else. Please let me know if there's more you need.
Even if you fix the glitch that causes the buffer to go above 1GB, isn’t there still a problem that we are getting to 1GB in the first place? I’m doing absolutely nothing in this app. It’s just sitting idle. Doesn’t this also indicate that the deallocations are not happening?
Today I've set tracepoints on line 224 in the allocate block and on the deallocate methods. Allocate is getting called on a regular basis (approximately once every couple of minutes or so). Also, the destructor looks like it was compiled out; I cannot set a breakpoint in it, as it says "the function cannot be found".
Is this issue specific to 2.2? We seem to be able to repro a leak in 3.0 while using EventSource, so I wanted to check whether this is confirmed to not occur in 3.0.
@tylerohlsen Thanks for the information.
Yes, as you noted, there have been substantial changes in EventPipe between 2.2 and 3.0 :) which is why we're looking at a surgical fix with a minimal code change to 2.2 instead of porting the entire change back to 2.2.
I agree. This is why we changed the default to 10MB in 3.0. Changing that in 2.2 is a viable option and I will consider it, but I can't guarantee that it'll make the bar we have for 2.2 backports.
The deallocation in EventPipe 2.2 only happens if the thread has 2 or more buffers in its per-thread list of buffers it uses to store the event data. I have a couple of hypotheses I'd like to verify, but I'll come back with either a potential fix or some more breakpoints for you to try out. Thanks so much for all the help.
@msgodse, could you expand on the leak you're seeing in 3.0? Like Sung said, there have been significant changes to this code in the 3.0 time frame, so it's unlikely, but not impossible, that this is the same issue.
I've changed my test app to target 3.0 preview 8 and I'm not seeing this specific issue anymore. I've set tracepoints on line 192 (allocate) and on line 341 (deallocate). Over the past 30 minutes the alloc and dealloc counts have been exactly equal, and the process is reporting a very stable memory commit size. It is still allocating, so the buffer size limit is not what is keeping the process memory usage stable.
@mgodse It'd be helpful if you specified a little bit more about your scenario. It may / may not be related to the original issue we're discussing in this thread.
Oh and one more thing @tylerohlsen - could you send the stack trace when you hit the breakpoint?
@sywhang which breakpoint in which framework version?
The allocate path in 2.2. It'd be nice if you could also see which threads are hitting that path.
Thanks @josalem and @sywhang for your responses. We are using EventListener to listen for GC events (GCStart, GCHeapStats). With 2.2 and 3.0.100-preview8, it seems to be leaking EventWrittenEventArgs objects, from what we can see in the dump. Based on the discussion on this thread, that does seem like a separate issue. I should be able to provide a dump soon (I will try to create a standalone app with a repro).
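For context, a rough sketch of the kind of listener being described (the provider name and GC keyword shown are the standard runtime ones; the actual app's code is not shown in this thread):

```csharp
// Illustrative EventListener that subscribes to the runtime's GC events
// (GCStart, GCHeapStats), similar to the scenario described above.
using System;
using System.Diagnostics.Tracing;

public sealed class GcEventListener : EventListener
{
    protected override void OnEventSourceCreated(EventSource eventSource)
    {
        if (eventSource.Name == "Microsoft-Windows-DotNETRuntime")
        {
            // 0x1 is the GC keyword on the runtime provider.
            EnableEvents(eventSource, EventLevel.Informational, (EventKeywords)0x1);
        }
    }

    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        // Each callback hands over an EventWrittenEventArgs instance - the type
        // reported as accumulating in the dump mentioned above.
        if (eventData.EventName != null && eventData.EventName.StartsWith("GC"))
        {
            Console.WriteLine($"{eventData.EventName} at {DateTime.UtcNow:O}");
        }
    }
}
```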
Thanks, @mgodse! That does sound like a different issue. Feel free to @ mention myself and Sung when you create an issue for that. I'll keep my eye out for it. All of the information that Sung listed above would be very helpful in triaging the issue.
@sywhang Here are the allocation stack traces... Line 115 is hit for the vast majority of the allocations (~9 out of 10 times), on a new thread every time. There are two stack traces for this code path...
and
Line 133 is hit on occasion for a different kind of allocation condition (~1 out of 10 times). This is always on the same thread as a previous call that stopped on line 115.
Thanks @tylerohlsen - that confirms my hypothesis, and I think we've narrowed down the issue here. Let me see what a fix could look like and what we can do to send out a 2.2 servicing fix.
Is there any progress on this issue? Do we know if this only occurs on 2.2?
We recently upgraded from 2.2 to .Net Core 3.0 and are no longer seeing this memory leak. For a do-nothing app deployed in k8s with a memory limit of 512MB, memory hovers around ~160MB (which seems high for workstation GC mode) and has stayed constant for over 3 days.
I can confirm that .Net Core 3.0 also made the issue go away for us.
Closing as the original issue was resolved.
Our microservices were growing at a rate of 40MB per hour (each) because of this issue. We have turned off the Prometheus.Net collectors that hook up to the event sources as a temporary workaround (djluck/prometheus-net.DotNetRuntime#6).
This is definitely not using a circular buffer as I have an instance that's been running for 9 days and is now at an approximate 7GB commit size and still growing!
I have confirmed .Net Core 3 - Preview 8 does not experience the same issue. I have also confirmed that .Net Core 2.2.0 - 2.2.6 do experience the issue. I have the same stack trace for the allocations as mentioned in the first reply above.
Please back-port the fix to .Net Core 2.2.
Originally posted by @tylerohlsen in https://github.com/dotnet/coreclr/issues/23562#issuecomment-523592951