Native memory leak with ETW EventSource #12374

Closed
pharring opened this issue Mar 29, 2019 · 37 comments

Comments

@pharring
Contributor

Originally reported at microsoft/ApplicationInsights-dotnet#1102.

Repros on .NET Core 2.2 (x86 and x64) on Windows.

Build the following application:

using System.Diagnostics.Tracing;

namespace RPESLeak
{
    class MyEventSource : EventSource
    {
        private const EventKeywords Operations = (EventKeywords)0x1;

        private const string EventProviderName = "PSH-LeakTest";

        public MyEventSource() : base(
               EventProviderName,
               EventSourceSettings.EtwSelfDescribingEventFormat)
        {
        }

        public void ProcessOperationStart(long operation)
        {
            if (IsEnabled(EventLevel.Informational, Operations))
            {
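                // With EtwSelfDescribingEventFormat, this Write goes through self-describing
                // (TraceLogging-style) metadata generation; these are the native allocations
                // that leak while an ETW session is listening.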
                var payload = new { Leak = "leaky", Id = operation };
                Write(
                    "Request",
                    new EventSourceOptions { Keywords = Operations, Opcode = EventOpcode.Start, Level = EventLevel.Informational },
                    payload);
            }
        }
    }

    class Program
    {
        static void Main(string[] args)
        {
            var eventSource = new MyEventSource();
            for (long i = 0; ; i++)
            {
                eventSource.ProcessOperationStart(i);
            }
        }
    }
}

Run it on Windows and watch the memory usage of dotnet.exe in Task Manager. On my machine it settles down at around 10 MB.

Now enable an ETW listener for the event source. The easiest way to do that is to run PerfView: press Alt+C to start a collection.

The important thing is to add this in Additional Providers:
PSH-LeakTest::Informational
You can turn off all the other options.
Now, when you start logging, watch the memory usage of dotnet.exe climb like crazy (over 200 MB in 5 seconds).
When you stop logging, the memory isn't freed. If you take a dump of the dotnet.exe process, the GC heap is fine; most of the additional memory is in the native heap and appears to consist of the event manifests repeated over and over.

This does not reproduce on .NET Framework 4.7.2.
I haven't tried earlier .NET Core runtimes.

Please let me know if you need any more information, but I think this is a pretty minimal repro with clear instructions. One thing I could do is eliminate PerfView by writing a custom app based around TraceEvent or even a WPR profile.

To be clear about how this affects Application Insights customers: the Application Insights Profiler runs in Azure App Services. It captures ETW sessions for 2 minutes at a time, about once an hour. One of the providers it enables (listens to) is the Application Insights "Rich Payload Event Source" (see the original bug in the Application Insights repo). Customers are now encountering these memory leaks under load while profiling is running, leading to application restarts.

@pharring
Contributor Author

The thought occurred to me that PerfView is just the tool for tracking down the heap allocations. Indeed, here are the relevant allocating stacks:

Name Inc %
||      + ntdll!RtlpAllocateHeapInternal 98.6
||       + coreclr!EEHeapAllocInProcessHeap 98.6
||        + coreclr!operator new[] 50.2
||        |+ coreclr!EventPipeEvent::EventPipeEvent 50.2
||        ||+ coreclr!EventPipeProvider::AddEvent 50.2
||        || + coreclr!EventPipeInternal::DefineEvent 50.2
||        ||  + system.private.corelib!IEventProvider.DefineEventHandle 50.2
||        ||   + system.private.corelib!NameInfo.GetOrCreateEventHandle 50.2
||        ||    + system.private.corelib!EventSource.WriteImpl 50.2
||        ||     + ?!? 50.2
||        |+ coreclr!CStructArray::Grow 0.0
||        + coreclr!operator new 48.4
||         + coreclr!EventPipeProvider::AddEvent 48.4
||          + coreclr!EventPipeInternal::DefineEvent 33.8
||          |+ system.private.corelib!IEventProvider.DefineEventHandle 33.8
||          | + system.private.corelib!NameInfo.GetOrCreateEventHandle 33.8
||          |  + system.private.corelib!EventSource.WriteImpl 33.8
||          |   + ?!? 33.8
||          + coreclr!EventPipeProvider::AddEvent 14.6
||           + coreclr!EventPipeInternal::DefineEvent 14.6
||            + system.private.corelib!IEventProvider.DefineEventHandle 14.6
||             + system.private.corelib!NameInfo.GetOrCreateEventHandle 14.6
||              + system.private.corelib!EventSource.WriteImpl 14.6
||               + ?!? 14.6

@vancem
Contributor

vancem commented Mar 29, 2019

  • @jorive who owns EventSource and EventPipe
  • @noahfalk owns the whole area
  • @cshung has been fixing bugs associated with EventPipe (which this seems to be)

@pharring We should probably move this bug to the coreclr repo since it is almost certainly going to be fixed by modifying the .NET Core runtime.

@danmoseley danmoseley transferred this issue from dotnet/corefx Mar 29, 2019
@jorive
Member

jorive commented Mar 30, 2019

@pharring I'm taking a look.

@pharring
Contributor Author

pharring commented Apr 1, 2019

FYI, I was able to eliminate PerfView from the repro by building my own TraceEvent-based listener. I also got a simplified repro in Azure App Service. They all have the same symptoms, pointing to this as the root cause.
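
For reference, a TraceEvent-based listener that replaces PerfView here looks roughly like the following (a sketch, not my exact code; it assumes the Microsoft.Diagnostics.Tracing.TraceEvent NuGet package and must run from an elevated prompt):

using System;
using Microsoft.Diagnostics.Tracing;
using Microsoft.Diagnostics.Tracing.Session;

class EtwLeakListener
{
    static void Main()
    {
        // Real-time ETW session; the session name is arbitrary.
        using (var session = new TraceEventSession("PSH-LeakTest-Session"))
        {
            // Equivalent of PerfView's "PSH-LeakTest::Informational" additional provider.
            session.EnableProvider("PSH-LeakTest", TraceEventLevel.Informational);

            // Print each event; while this runs, watch dotnet.exe's native memory grow.
            session.Source.Dynamic.All += e => Console.WriteLine($"{e.ProviderName}/{e.EventName}");
            session.Source.Process();
        }
    }
}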

@jorive
Member

jorive commented Apr 1, 2019

Thanks @pharring, I am able to repro it locally with the code above.

@vancem
Contributor

vancem commented Apr 1, 2019

Including @brianrob who wrote this.

I took a quick look and the key place is the stack

Name
||  ||||||  |+ system.private.corelib!System.Diagnostics.Tracing.EventSource.Write[System.__Canon](System.String, System.Diagnostics.Tracing.EventSourceOptions, System.__Canon)
||  ||||||  ||+ system.private.corelib!EventSource.WriteImpl
||  ||||||  |||+ system.private.corelib!NameInfo.GetOrCreateEventHandle
||  ||||||  ||||+ system.private.corelib!IEventProvider.DefineEventHandle
||  ||||||  |||||+ coreclr!EventPipeInternal::DefineEvent
||  ||||||  ||||| + coreclr!EventPipeProvider::AddEvent

Which says that, starting with an event write, we try to create a handle for it (for EventPipe), and this seems to allocate ALL THE TIME. However, a handle to an event represents its meta-data, so there should be a lookup, and MOST of the time we should NOT be creating new event meta-data.

Thus the basic issue is that we are missing the 'lookup' that would find that we already have meta-data for this event.
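
For illustration only (this is not the runtime's actual code), the missing piece is the kind of get-or-create lookup sketched below, so the expensive "define" path runs roughly once per distinct event rather than on every write:

using System;
using System.Collections.Concurrent;

static class EventHandleCache
{
    private static readonly ConcurrentDictionary<string, IntPtr> s_handles =
        new ConcurrentDictionary<string, IntPtr>();

    // defineEvent (the expensive metadata/handle creation) runs at most ~once per name.
    public static IntPtr GetOrCreate(string eventName, Func<string, IntPtr> defineEvent)
        => s_handles.GetOrAdd(eventName, defineEvent);
}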

@brianrob - Can you quickly tell us where it SHOULD be happening?

There is a secondary issue that we should not need this meta-data at all unless EventPipe is turned on (which it is not), so it would be nice if we avoided even generating the handle if EventPipe is off.

@jorive - We should also check the scenario where we are collecting with EventPipe instead of ETW before we call this bug fixed.

@jorive
Member

jorive commented Apr 1, 2019

I'm unable to repro in netcoreapp3.0, and dotnet/coreclr#18355 seems to have fixed it.

@vancem
Contributor

vancem commented Apr 1, 2019

@jorive - I assume you are going to try back-porting that fix (it is not that big).

@pharring - assuming that the fix above fixes things, the main thing we need to figure out is the impact of the bug. Can you detail the impact so that @jorive can make the case to the triage group?

In particular, in a 'normal' use of Application Insights, how bad is the leak? I am assuming that it is big enough that it forces customers to turn off App-Insights. It will also affect any other data collection (thus things leak if you also collect with PerfView).

We should err on the side of fixing this. Looking at https://github.com/dotnet/core/blob/master/release-notes/download-archive.md, we have already released 3 updates to 2.2, and presumably we will have another within a month. 2.2 will likely live at least 6 more months before V3.0 becomes available, so I think it is worth it.

@brianrob
Member

brianrob commented Apr 1, 2019

@vancem, you're right that this is about metadata, and prior to dotnet/coreclr#18355, the cache was essentially busted and so we always created a new metadata blob. With dotnet/coreclr#18355 this should no longer happen and only a single blob per event should be created.

It is likely possible to avoid creating the metadata if EventPipe isn't enabled, but if I recall correctly, I did the metadata generation regardless because it was hard to get to a place outside of event initialization that had enough of the data required to produce the metadata.

@mattzink

mattzink commented Apr 1, 2019

FWIW, our team had to remove all usage of EventPipe used for diagnostics in our product, as the memory leak was extreme and would OOM kill our process every couple of hours. Backporting any fixes to 2.2 would be much appreciated.

@yahorsi

yahorsi commented Apr 1, 2019

as the memory leak was extreme and would OOM kill our process every couple of hours. Backporting any fixes to 2.2 would be much appreciated.

+1
Waiting for 3.0 to be fully shipped => fully tested for our app => available in Azure is a long path.

I am assuming that it is big enough that it forces customers to turn off App-Insights.

It actually depends on the load. The leak is about ~200 bytes per event, as I understand it.

@vancem
Contributor

vancem commented Apr 1, 2019

@jorive - it sounds like this bug is rather severe. I think we have a strong case to back-port it to V2.2. We should pursue that.

@noahfalk
Member

noahfalk commented Apr 1, 2019

Agreed

@jorive
Member

jorive commented Apr 1, 2019

I'm preparing the port to 2.2 right now.

@yahorsi

yahorsi commented Apr 2, 2019

Any idea whether this will be available on Azure at all? I mean deployed to Azure so it could be used in App Service. If so, how long does that usually take? :)

@avandych

avandych commented Jun 25, 2019

It seems the issue is still reproducible on 2.2.5. I'm using an EventListener to monitor events, and while it is active, memory usage in dotnet.exe climbs very fast.

Host dotnet --info:

Host (useful for support):
Version: 2.2.5
Commit: 0a3c9209c0

.NET Core SDKs installed:
No SDKs were found.

.NET Core runtimes installed:
Microsoft.AspNetCore.All 2.2.5 [/usr/share/dotnet/shared/Microsoft.AspNetCore.All]
Microsoft.AspNetCore.App 2.2.5 [/usr/share/dotnet/shared/Microsoft.AspNetCore.App]
Microsoft.NETCore.App 2.2.5 [/usr/share/dotnet/shared/Microsoft.NETCore.App]
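
The listener is essentially of this shape (a simplified sketch of the kind of EventListener in use, not the exact code). On .NET Core 2.2, an EventListener for runtime events is backed by EventPipe internally, which is where the buffer growth shows up:

using System;
using System.Diagnostics.Tracing;

sealed class RuntimeEventMonitor : EventListener
{
    protected override void OnEventSourceCreated(EventSource source)
    {
        if (source.Name == "Microsoft-Windows-DotNETRuntime")
        {
            // 0x1 is the GC keyword.
            EnableEvents(source, EventLevel.Informational, (EventKeywords)0x1);
        }
    }

    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        Console.WriteLine($"{eventData.EventSource.Name}/{eventData.EventName}");
    }
}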

@jenshenneberg

We are also still seeing this issue on 2.2.5. Exact same dotnet --info output as above.

@jorive
Member

jorive commented Jun 25, 2019

@SPANIOL Does it cap at around ~1 GB? (That is the default size of the internal circular buffer that gets allocated/used.)

@avandych

avandych commented Jun 25, 2019

@jorive Yeah, I believe that may be the case. I left the service for 3 days without any traffic and it used around 900 MB of memory, while it usually uses around 150 MB.

@jorive
Member

jorive commented Jun 25, 2019

That memory is reused to buffer events before writing them to the file, and it should not grow beyond the specified value. It's not a memory leak.

@mjsabby
Contributor

mjsabby commented Jun 25, 2019

Is this memory not used when you disable EventPipe?

@yahorsi

yahorsi commented Jun 25, 2019

That memory is reused to buffer events before writing to file, and it should not grow beyond the specified value. It's not a memory leak.

How big is the buffer in 2.2?

PS: Has this changed in 3.0, taking into account the Docker-related changes?

@avandych

@mjsabby The memory isn't used when the pipe is disabled; when it's enabled, memory usage climbs like crazy under traffic, while heap usage remains at the same level.
@jorive Is the buffer size configurable? 1 GB of memory seems like a bit too much if you just want to gather some basic metrics for dashboards and monitoring.

@jorive
Member

jorive commented Jun 25, 2019

@SPANIOL Yes, there are a few ways to configure the size of the buffer, depending on how you are enabling EventPipe tracing in your project (see the sketch after this list):

  • Before starting the process, set the environment variable COMPlus_EventPipeCircularMB (the default is COMPlus_EventPipeCircularMB=1024)
  • If you are enabling EventPipe tracing via an App.eventpipeconfig file drop, you can add a CircularMB entry to that file (if not provided, it defaults to CircularMB=1024)
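
For example, a hypothetical launcher that sets the variable for a child process before it starts might look like this (a sketch only; the application name and the 64 MB value are made up):

using System.Diagnostics;

class Launcher
{
    static void Main()
    {
        // COMPlus_EventPipeCircularMB is read at process start, so it has to be in the
        // child's environment before launch. "64" caps the circular buffer at 64 MB.
        var psi = new ProcessStartInfo("dotnet", "MyApp.dll") { UseShellExecute = false };
        psi.Environment["COMPlus_EventPipeCircularMB"] = "64";
        Process.Start(psi);
    }
}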

@jorive
Member

jorive commented Jun 25, 2019

@yahorsi On 3.0, the default value changed to 256MB

@avandych

@jorive Sounds promising! I'll try configuring it via the environment variable tomorrow and post the results here in case anyone else runs into this. I can see that the ticket for EventPipe documentation is still open, so it's difficult to figure this out from scratch, but using EventPipe for basic monitoring is going to be a common case.

@sywhang
Contributor

sywhang commented Jun 25, 2019

@SPANIOL For 3.0 we added runtime event counters, which are far lower-overhead for purposes like simple APMs or "basic monitoring" dashboards. Unfortunately backporting that to 2.2 is more costly than we can afford, but I just wanted to let you know that better alternatives do exist in 3.0 :)
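
For reference, consuming those counters in-process on 3.0 looks roughly like this (a sketch; the one-second interval and the counter handling are just examples):

using System;
using System.Collections.Generic;
using System.Diagnostics.Tracing;

sealed class RuntimeCounterListener : EventListener
{
    protected override void OnEventSourceCreated(EventSource source)
    {
        if (source.Name == "System.Runtime")
        {
            // Ask for counter snapshots once per second.
            EnableEvents(source, EventLevel.Informational, EventKeywords.All,
                new Dictionary<string, string> { ["EventCounterIntervalSec"] = "1" });
        }
    }

    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        // Counter payloads arrive as dictionaries on the "EventCounters" events.
        Console.WriteLine(eventData.EventName);
    }
}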

@jorive
Member

jorive commented Jun 25, 2019

Is this memory not used when you disable EventPipe?

This memory is deallocated when the tracing session is disabled.

@noahfalk
Member

I created dotnet/coreclr#25413 to track this. FWIW, I think 256MB is probably too big of a default and we should look into changing it.

@jorive
Member

jorive commented Jun 26, 2019

I believe 256MB is the default used by PerfView, but I agree we should consider lowering it.

@tylerohlsen

tylerohlsen commented Aug 21, 2019

Our microservices were growing at a rate of 40MB per hour (each) because of this issue. We have turned off the Prometheus.Net collectors that hook up to the event sources as a temporary workaround (djluck/prometheus-net.DotNetRuntime#6).

This is definitely not using a circular buffer as I have an instance that's been running for 9 days and is now at an approximate 7GB commit size and still growing!

I have confirmed .Net Core 3 - Preview 8 does not experience the same issue. I have also confirmed that .Net Core 2.2.0 - 2.2.6 do experience the issue. I have the same stack trace for the allocations as mentioned in the first reply above.

Please back-port the fix to .Net Core 2.2.

@sywhang
Contributor

sywhang commented Aug 21, 2019

cc @josalem

This is definitely not using a circular buffer as I have an instance that's been running for 9 days and is now at an approximate 7GB commit size and still growing!

That definitely doesn't sound normal to me either.

I have confirmed .Net Core 3 - Preview 8 does not experience the same issue.

This makes me less nervous. That being said, AFAIK @jorive already ported his fix (dotnet/coreclr#23661) to 2.2, and if that's not addressing the issue, we might be looking at a different issue from the one that was already ported.

Do you happen to have a memory dump and/or a small repro that I can use to repro the behavior?

@tylerohlsen

tylerohlsen commented Aug 22, 2019

Sorry, I don't think I can share the full memory dump, but here is some data from it; I can get you more upon request.

0:000> !heap -stat -h 000002b837d30000 
 heap @ 000002b837d30000
group-by: TOTSIZE max-display: 20
    size     #blocks     total     ( %) (percent of total busy bytes)
    19000 cbf9 - 13eb51000  (96.11)
    32000 3ae - b7fc000  (3.47)
    258000 2 - 4b0000  (0.09)
    38 d482 - 2e7c70  (0.05)
    30 eb62 - 2c2260  (0.05)
0:000> !heap -p -a 000001d85f2a5f80    
    address 000001d85f2a5f80 found in
    _HEAP @ 1d3c2ff0000
              HEAP_ENTRY Size Prev Flags            UserPtr UserSize - state
        000001d85f2a5f50 1903 0000  [00]   000001d85f2a5f80    19000 - (busy)
        7ff9014cff83 ntdll!RtlpCallInterceptRoutine+0x000000000000003f
        7ff901494423 ntdll!RtlpAllocateHeapInternal+0x0000000000078ba3
        7ff88c8755b9 coreclr!EEHeapAllocInProcessHeap+0x0000000000000029
        7ff88c875571 coreclr!operator new[]+0x0000000000000025
        7ff88caebd12 coreclr!EventPipeBuffer::EventPipeBuffer+0x0000000000000016
        7ff88ca7da1a coreclr!EventPipeBufferManager::AllocateBufferForThread+0x00000000000001e2
        7ff88ca7e2be coreclr!EventPipeBufferManager::WriteEvent+0x00000000000000e6
        7ff88ca0dd5e coreclr!EventPipe::WriteEventInternal+0x00000000000000be
        7ff88c8e3e9c coreclr!EventPipe::WriteEvent+0x000000000000003c
        7ff88c9643c5 coreclr!FireEtwThreadCreated+0x0000000000120415
        7ff88c843fa3 coreclr!ETW::ThreadLog::FireThreadCreated+0x0000000000000043
        7ff88c843ef7 coreclr!Thread::HasStarted+0x00000000000000ef
        7ff88c83a42e coreclr!ThreadNative::KickOffThread+0x00000000000000be
        7ff88c83a34a coreclr!Thread::intermediateThreadProc+0x000000000000008a
        7ff8ffa37bd4 kernel32!BaseThreadInitThunk+0x0000000000000014
        7ff90144ce71 ntdll!RtlUserThreadStart+0x0000000000000021

As you can see, at the time of that memory dump there were 52,217 occurrences of these 102,400-byte allocations, taking up a total of 5,347,020,800 bytes (5.3 GB) of memory. I spot-checked some of the same-size objects in a debug instance and they all had the allocation stack trace listed here (sorry for the mix-up; my stack trace is different from the OP's).

The repro was pretty simple. I made an ASP.NET Core 2.2.4-targeted application that was self-hosted as a console app and self-contained with win7-x64 as the runtime target. Then I added a NuGet reference to the latest prometheus-net.DotNetRuntime and called DotNetRuntimeStatsBuilder.Default().StartCollecting(); in Startup.Configure.
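
The wiring is roughly the following (a sketch of the setup just described, not the attached project; the using directives are my assumption about the package layout):

using Microsoft.AspNetCore.Builder;
using Prometheus.DotNetRuntime;

public class Startup
{
    public void Configure(IApplicationBuilder app)
    {
        // Subscribes to the runtime EventSources; on 2.2 this is what drives the
        // EventPipe buffer growth described above.
        DotNetRuntimeStatsBuilder.Default().StartCollecting();
        // ... rest of the pipeline ...
    }
}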

I can create a project like that and attach a zip if you'd like.

@tylerohlsen

tylerohlsen commented Aug 22, 2019

Attached is a simple app that reproduces my issue: EventPipeMemoryLeak.zip

I just let the app sit idle and the memory usage continuously increases, and the heap stats show a regular increase in the number of these 0x19000-size allocations.

@sywhang and @josalem ^^

@sywhang
Contributor

sywhang commented Aug 23, 2019

@tylerohlsen I'm running the repro app and trying to see how it ends up in that state. So far I'm not seeing anything unexpected; I'm waiting to see whether the memory usage grows beyond 1 GB, which is the maximum for EventListener sessions.

That being said, the 0x19000-size allocations do correspond to the EventPipe buffer blocks that we allocate internally to store events. The repro app has been running for about 1.5 hours now and it's at 380 MB; I will come back in a few hours, see how things have changed, and take it from there.

@sywhang
Contributor

sywhang commented Aug 23, 2019

@tylerohlsen I moved this to dotnet/coreclr#26344 because it is a different issue from the original one. I'll continue posting in that thread.

@sywhang
Contributor

sywhang commented Aug 27, 2019

Closing this issue since the original problem was fixed in both master and 2.2.

@sywhang sywhang closed this as completed Aug 27, 2019
@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the 2.2.x milestone Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 13, 2020