
[2.2] Potential native memory leak from EventPipe #13309

Closed

sywhang opened this issue Aug 23, 2019 · 33 comments


sywhang commented Aug 23, 2019

Our microservices' memory was growing at a rate of 40MB per hour (each) because of this issue. We have turned off the Prometheus.Net collectors that hook into the event sources as a temporary workaround (djluck/prometheus-net.DotNetRuntime#6).

This is definitely not using a circular buffer, as I have an instance that's been running for 9 days and is now at approximately 7GB commit size and still growing!

I have confirmed that .NET Core 3.0 Preview 8 does not experience the same issue. I have also confirmed that .NET Core 2.2.0 - 2.2.6 do experience the issue. I have the same stack trace for the allocations as mentioned in the first reply above.

Please back-port the fix to .NET Core 2.2.

Originally posted by @tylerohlsen in https://github.com/dotnet/coreclr/issues/23562#issuecomment-523592951

sywhang self-assigned this Aug 23, 2019

sywhang commented Aug 23, 2019

From: https://github.com/dotnet/coreclr/issues/23562#issuecomment-523731589

Sorry, I do not think I can share the full memory dump, but here is some data from it; I can get you more data upon request.

0:000> !heap -stat -h 000002b837d30000 
 heap @ 000002b837d30000
group-by: TOTSIZE max-display: 20
    size     #blocks     total     ( %) (percent of total busy bytes)
    19000 cbf9 - 13eb51000  (96.11)
    32000 3ae - b7fc000  (3.47)
    258000 2 - 4b0000  (0.09)
    38 d482 - 2e7c70  (0.05)
    30 eb62 - 2c2260  (0.05)
0:000> !heap -p -a 000001d85f2a5f80    
    address 000001d85f2a5f80 found in
    _HEAP @ 1d3c2ff0000
              HEAP_ENTRY Size Prev Flags            UserPtr UserSize - state
        000001d85f2a5f50 1903 0000  [00]   000001d85f2a5f80    19000 - (busy)
        7ff9014cff83 ntdll!RtlpCallInterceptRoutine+0x000000000000003f
        7ff901494423 ntdll!RtlpAllocateHeapInternal+0x0000000000078ba3
        7ff88c8755b9 coreclr!EEHeapAllocInProcessHeap+0x0000000000000029
        7ff88c875571 coreclr!operator new[]+0x0000000000000025
        7ff88caebd12 coreclr!EventPipeBuffer::EventPipeBuffer+0x0000000000000016
        7ff88ca7da1a coreclr!EventPipeBufferManager::AllocateBufferForThread+0x00000000000001e2
        7ff88ca7e2be coreclr!EventPipeBufferManager::WriteEvent+0x00000000000000e6
        7ff88ca0dd5e coreclr!EventPipe::WriteEventInternal+0x00000000000000be
        7ff88c8e3e9c coreclr!EventPipe::WriteEvent+0x000000000000003c
        7ff88c9643c5 coreclr!FireEtwThreadCreated+0x0000000000120415
        7ff88c843fa3 coreclr!ETW::ThreadLog::FireThreadCreated+0x0000000000000043
        7ff88c843ef7 coreclr!Thread::HasStarted+0x00000000000000ef
        7ff88c83a42e coreclr!ThreadNative::KickOffThread+0x00000000000000be
        7ff88c83a34a coreclr!Thread::intermediateThreadProc+0x000000000000008a
        7ff8ffa37bd4 kernel32!BaseThreadInitThunk+0x0000000000000014
        7ff90144ce71 ntdll!RtlUserThreadStart+0x0000000000000021

As you can see, at the time of that memory dump there were 52,217 occurrences of these 102,400-byte objects (the 0xcbf9 blocks of size 0x19000 in the !heap output above), taking up a total of 5,347,020,800 bytes (5.3 GB) of memory. I spot-checked some of the same-size objects in a debug instance and they all had the allocation stack trace listed here (sorry for the mix-up, my stack trace is different from the OP's stack trace).

The repro was pretty simple. I made an ASP.NET Core 2.2.4-targeted application that was self-hosted as a console app and self-contained with win7-x64 as the runtime target. Then I added a NuGet reference to the latest prometheus-net.DotNetRuntime and called DotNetRuntimeStatsBuilder.Default().StartCollecting(); in Startup.Configure.

I can create a project like that and attach a zip if you'd like.
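
In outline, the setup looks roughly like this - a minimal sketch only: the DotNetRuntimeStatsBuilder call is the one quoted above, while the hosting boilerplate and the Prometheus.DotNetRuntime namespace are assumed here, not taken from the attached project:

    using Microsoft.AspNetCore;
    using Microsoft.AspNetCore.Builder;
    using Microsoft.AspNetCore.Hosting;
    using Microsoft.AspNetCore.Http;
    using Prometheus.DotNetRuntime;

    public class Program
    {
        // Self-hosted console entry point for the ASP.NET Core 2.2 app.
        public static void Main(string[] args) =>
            WebHost.CreateDefaultBuilder(args).UseStartup<Startup>().Build().Run();
    }

    public class Startup
    {
        public void Configure(IApplicationBuilder app)
        {
            // Starts the EventListener-based runtime collectors; the app then just sits idle.
            DotNetRuntimeStatsBuilder.Default().StartCollecting();

            app.Run(context => context.Response.WriteAsync("idle"));
        }
    }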


sywhang commented Aug 23, 2019

From: https://github.com/dotnet/coreclr/issues/23562#issuecomment-524005521

Attached is a simple app that reproduces my issue. EventPipeMemoryLeak.zip

I just let the app sit idle and the memory usage continuously increases, and the heap stats show a regular increase in the number of these 0x19000-byte objects.

@sywhang and @josalem ^^


sywhang commented Aug 23, 2019

From: https://github.com/dotnet/coreclr/issues/23562#issuecomment-524136956

@tylerohlsen I'm running the repro app and trying to see how it ends up in that state. So far I'm not seeing anything unexpected - I'm waiting to see if the memory usage grows beyond 1GB, which is the max limit for EventListener sessions.

That being said, the 0x19000-byte blocks do correspond to the EventPipe buffers that we allocate internally to store the events. The repro app has been running for about 1.5 hours now and it's at 380MB - I will come back in a few hours, see how things have changed, and take it from there.


sywhang commented Aug 23, 2019

@tylerohlsen

Unfortunately I haven't been able to repro the issue using your repro app. I've run it for more than 22 hours now, and so far the memory usage is very stable at around 370MB. I believe there may be some machine-specific differences that cause the repro not to happen on my side - for example, if the buffer is being drained fast enough that the events never fill a single buffer, it's possible that the thread never allocates another buffer (which is what's supposed to happen). But in your case it obviously grew beyond that size and is allocating more than the maximum limit.

That being said, here is some guidance you can follow to diagnose the state when the leak is happening.

Since the library you are using relies on EventListener, the memory usage (in 2.2) can grow up to 1GB, but it shouldn't grow beyond that. If you are using less than 1GB, that doesn't necessarily indicate a leak. When/if the native memory usage (the total size of the 0x19000-sized blocks) has grown beyond 1GB, could you set a breakpoint in CoreCLR!EventPipeBufferManager::AllocateBufferForThread and step through to see which code path it takes? Specifically, we want to know how it gets to this part of that function:

    if(allocateNewBuffer)
    {
        // Pick a buffer size by multiplying the base buffer size by the number of buffers already allocated for this thread.
        unsigned int sizeMultiplier = pThreadBufferList->GetCount() + 1;

        // Pick the base buffer size based.  Debug builds have a smaller size to stress the allocate/steal path more.
        unsigned int baseBufferSize =
#ifdef _DEBUG
            30 * 1024; // 30K
#else
            100 * 1024; // 100K
#endif
        unsigned int bufferSize = baseBufferSize * sizeMultiplier;

        // Make sure that buffer size >= request size so that the buffer size does not
        // determine the max event size.
        if(bufferSize < requestSize)
        {
            bufferSize = requestSize;
        }

        // Don't allow the buffer size to exceed 1MB.
        const unsigned int maxBufferSize = 1024 * 1024;
        if(bufferSize > maxBufferSize)
        {
            bufferSize = maxBufferSize;
        }

        // EX_TRY is used here as opposed to new (nothrow) because
        // the constructor also allocates a private buffer, which
        // could throw, and cannot be easily checked
        EX_TRY
        {
            pNewBuffer = new EventPipeBuffer(bufferSize);
        }
        EX_CATCH
        {
            pNewBuffer = NULL;
        }
        EX_END_CATCH(SwallowAllExceptions);

        if (pNewBuffer == NULL)
        {
            return NULL;
        }

        m_sizeOfAllBuffers += bufferSize;
#ifdef _DEBUG
        m_numBuffersAllocated++;
#endif // _DEBUG
    }

The other thing you might want to check is if it ever goes into this path:

    // Determine if policy allows us to allocate another buffer, or if we need to steal one
    // from another thread.
    if(!allocateNewBuffer)
    {
        EventPipeConfiguration *pConfig = EventPipe::GetConfiguration();
        if(pConfig == NULL)
        {
            return NULL;
        }

        size_t circularBufferSizeInBytes = pConfig->GetCircularBufferSize();
        if(m_sizeOfAllBuffers < circularBufferSizeInBytes)
        {
            // We don't worry about the fact that a new buffer could put us over the circular buffer size.
            // This is OK, and we won't do it again if we actually go over.
            allocateNewBuffer = true;
        }
    }

This if block - if(m_sizeOfAllBuffers < circularBufferSizeInBytes) - is supposed to evaluate to false once your memory usage is greater than the maximum limit (1GB).

@jenshenneberg

@sywhang

May I suggest running the sample code in this Docker container: mcr.microsoft.com/dotnet/core/aspnet:2.2

We are experiencing what looks like the same issue.

We see the issue when running containerized on Linux, and we do not have the issue running directly (non-containerized) on Windows.

We haven't tested non-containerized Linux or containerized Windows.


sywhang commented Aug 27, 2019

@jenshenneberg Does the native memory usage go above 1 GB?

@jenshenneberg

@sywhang
We've disabled the offending code on our end due to this issue. I've re-enabled it and will get back to you when I have some concrete data on the memory usage. (As it grows rather slowly, it might be a day or two.)


sywhang commented Aug 27, 2019

@jenshenneberg That would be great. Thank you! I'll try out the repro myself too in the meantime.


thirus commented Aug 28, 2019

Our app's non-heap memory grows ~15MB per hour. We have a hard cap of 512MB set in k8s, so the container gets killed because of OOM; we could not let it reach 1GB.

@sywhang When I ran the app on OSX without calling the /metrics endpoint, the memory usage did not seem to increase (going by htop), but I didn't run it long enough; I will provide an update after running for 4-6 hours.

Sdk Version: 2.2.401
Asp Net Core Version: 2.2.6
GC Mode: Workstation
Docker runtime image: mcr.microsoft.com/dotnet/core/aspnet:2.2.6
prometheus-net.DotNetRuntime: 3.1.1-beta

(two images attached)

@tylerohlsen

I now have my local environment set up with the coreclr source code so I can set a breakpoint and debug into AllocateBufferForThread as you suggested, @sywhang. I'll report back in a couple of days when the buffer gets larger than 1 GB.


jenshenneberg commented Aug 29, 2019

@sywhang
Our apps grow at about 4.3MB per hour, so unfortunately it will be a long while before we reach the 1GB we discussed. We are also running on Kubernetes, and unfortunately pods are not likely to live for 10 days on our test cluster.

I'll start another pod with COMPlus_EventPipeCircularMB set to 10 to verify that this is not just the circular buffer filling up.


sywhang commented Aug 29, 2019

@thirus Please do let me know if it grows beyond 1GB. And if it does, a heap dump would be very helpful.

@tylerohlsen That would be great. Thank you.

@jenshenneberg I can wait for whenever it is ready - on a side note, how are you collecting the trace? If you're using EventListener (which is what the prometheus-net.DotNetRuntime library mentioned in this issue uses), setting COMPlus_EventPipeCircularMB does not affect the buffer size, since that environment variable only applies to out-of-proc collection enabled via the COMPlus_EnableEventPipe environment variable.

@jenshenneberg

@sywhang
I am indeed using EventListener. Is there an alternative to COMPlus_EventPipeCircularMB in this case?


tylerohlsen commented Aug 31, 2019

After about 2.5 days, I'm well above 1GB of memory allocated from the buffer. I'm running with 2.2.4, and line 115 is what sets allocateNewBuffer = true;. So pThread->GetEventPipeBufferList() is returning NULL and EventPipeBufferList *pThreadBufferList is being assigned NULL. And because allocateNewBuffer is true, it skips the block on line 120 that checks the max buffer size. It looks like 2.2.6 has identical code for the first part of this method, which is consistent with my being able to reproduce the issue on 2.2.6.

But 3.0.0 preview 8 added a little block on line 110. Could that be why it is not reproducible in 3.0? If so, can we get that back-ported to 2.2?

Here's the new block in 3.0 that is not in 2.2...

// if we are deallocating then give up, see the comments in SuspendWriteEvents() for why this is important.
if (m_writeEventSuspending.Load())
{
    writeSuspended = TRUE;
    return NULL;
}

@tylerohlsen

Oh, it looks like there are quite a few more differences in 3.0 in that method than just the block I posted above.

FYI, I still have my debugger attached to this repro app (now above 1GB) if you need me to set a breakpoint anywhere else. Please let me know if there's more you need.

@tylerohlsen

Even if you fix the glitch that causes the buffer to go above 1GB, isn’t there still a problem that we are getting to 1GB in the first place? I’m doing absolutely nothing in this app. It’s just sitting idle.

Doesn’t this also indicate that the deallocations are not happening?

@tylerohlsen

Today I've set tracepoints on line 224 in the allocate block and on the deallocate methods EventPipeBufferManager::DeAllocateBuffers(void) and EventPipeBufferManager::DeAllocateBuffer(EventPipeBuffer *pBuffer).

Allocate is getting called on a regular basis (approximately once every couple of minutes).
After an hour, neither of the deallocate methods had been called at all.

Also, the destructor looks like it was compiled out; I cannot set a breakpoint in it, as the debugger says "the function cannot be found".


mgodse commented Sep 3, 2019

Is this issue specific to 2.2? We seem to be able to repro a leak in 3.0 while using EventSource, so wanted to check if this is confirmed to not occur in 3.0.


sywhang commented Sep 3, 2019

@tylerohlsen Thanks for the information.

But 3.0.0 preview 8 added a little block on line 110. Could that be why it is not reproducible in 3.0? If so, can we get that back ported to 2.2?

Yes, as you noted, there have been substantial changes in EventPipe between 2.2 and 3.0 :) which is why we're looking at a surgical fix with minimal code change for 2.2 instead of porting the entire change back to 2.2.

isn’t there still a problem that we are getting to 1GB in the first place? I’m doing absolutely nothing in this app. It’s just sitting idle

I agree. This is why we changed the default to 10MB in 3.0. Changing that in 2.2 is a viable option and I will consider it but I can't guarantee that it'll make the bar we have for 2.2 backports.

Allocate is getting called on a regular bases (approximately once every couple minutes or so). After an hour, neither of the deallocate methods were called at all.

The deallocation in EventPipe 2.2 only happens if the thread has 2 or more buffers in its per-thread list of buffers it uses to store the event data.

I have a couple of hypotheses I'd like to verify, but I'll come back with either a potential fix or some more breakpoints for you to try out. Thanks so much for all the help.


josalem commented Sep 3, 2019

Is this issue specific to 2.2? We seem to be able to repro a leak in 3.0 while using EventSource, so wanted to check if this is confirmed to not occur in 3.0.

@mgodse, could you expand on the leak you're seeing in 3.0? Like Sung said, there have been significant changes to this code in the 3.0 time frame, so it's unlikely, but not impossible, that this is the same issue.

@tylerohlsen

I've changed my test app to target 3.0 preview 8 and I'm not seeing this specific issue anymore. I've set tracepoints on line 192 (allocate) and on line 341 (deallocate). Over the past 30 minutes the alloc and dealloc counts have been exactly the same, and the process is reporting a very stable memory commit size.

It is allocating, so the buffer size limit is not what is keeping the process memory usage stable.


sywhang commented Sep 3, 2019

@mgodse It'd be helpful if you shared a little bit more about your scenario. It may or may not be related to the original issue we're discussing in this thread.

  • What version of .NET Core 3 are you using?
  • What is your environment? (OS, arch, container?)
  • How are you consuming the events written through your EventSource? Are you using EventListener? Or are you using something else?
  • Do you have a repro app we can try out?
  • Last but most important - Do you have anything that made you think that this is a leak on the runtime side?


sywhang commented Sep 3, 2019

Oh and one more thing @tylerohlsen - Could you send the stack trace when you hit the breakpoint?

@tylerohlsen

@sywhang which breakpoint in which framework version?


sywhang commented Sep 3, 2019

The allocate path in 2.2. It'd also be nice if you could see which threads are hitting that allocateNewBuffer = true;. If it's a different thread each time, it could point to a cause of the leak where we never deallocate a buffer when it is the only buffer in the per-thread list of buffers. If it's the same thread repeatedly, then we're looking at a new issue.


mgodse commented Sep 3, 2019

Thanks @josalem and @sywhang for your responses. We are using an EventListener to listen for GC events (GCStart, GCHeapStats). With 2.2 and 3.0.100-preview8, it seems to be leaking EventWrittenEventArgs instances from what we can see in the dump. Based on the discussion in this thread, that does seem like a separate issue. I should be able to provide a dump soon (I will try to create a standalone app with a repro).
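
For context, the listener is roughly shaped like the following - a minimal sketch, using the usual runtime-provider name, GC keyword, and event names rather than an excerpt of our actual code:

    using System;
    using System.Diagnostics.Tracing;

    sealed class GcEventListener : EventListener
    {
        // 0x1 is the GC keyword of the runtime provider.
        private const EventKeywords GcKeyword = (EventKeywords)0x1;

        protected override void OnEventSourceCreated(EventSource eventSource)
        {
            if (eventSource.Name == "Microsoft-Windows-DotNETRuntime")
            {
                // In-proc subscription to runtime events; per the discussion above, for this
                // provider the data flows through an in-process EventPipe session.
                EnableEvents(eventSource, EventLevel.Informational, GcKeyword);
            }
        }

        protected override void OnEventWritten(EventWrittenEventArgs eventData)
        {
            // EventWrittenEventArgs (the type we see retained in the dump) is what gets
            // handed to this callback for each event.
            if (eventData.EventName == "GCStart_V2" || eventData.EventName == "GCHeapStats_V2")
            {
                Console.WriteLine($"{eventData.EventName} (id {eventData.EventId})");
            }
        }
    }

In the sketch, a single instance would be constructed at startup and kept rooted for the lifetime of the process.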


josalem commented Sep 3, 2019

Thanks, @mgodse! That does sound like a different issue. Feel free to @-mention me and Sung when you create an issue for it; I'll keep an eye out. All of the information that Sung listed above would be very helpful in triaging the issue.

@tylerohlsen

@sywhang Here are the allocation stack traces...

Line 115 is hit for the vast majority of the allocations (~9 out of 10 times). This is on a new thread every time.

There are two stack traces for this code path...

	coreclr.dll!EventPipeBufferManager::AllocateBufferForThread(EventPipeSession & session, Thread * pThread, unsigned int requestSize) Line 114	C++
 	coreclr.dll!EventPipeBufferManager::WriteEvent(Thread *) Line 377	C++
 	coreclr.dll!EventPipe::WriteEventInternal(EventPipeEvent & event, EventPipeEventPayload & payload, const _GUID * pActivityId, const _GUID * pRelatedActivityId) Line 821	C++
 	coreclr.dll!EventPipe::WriteEvent(EventPipeEvent & event, unsigned char * pData, unsigned int length, const _GUID * pActivityId, const _GUID * pRelatedActivityId) Line 765	C++
 	coreclr.dll!FireEtwThreadCreated(const unsigned __int64 ManagedThreadID, const unsigned __int64 AppDomainID, const unsigned int Flags, const unsigned int ManagedThreadIndex, const unsigned int OSThreadID, const unsigned short ClrInstanceID) Line 1327	C++
 	coreclr.dll!ETW::ThreadLog::FireThreadCreated(Thread * pThread) Line 2579	C++
 	coreclr.dll!Thread::HasStarted(int bRequiresTSL) Line 1988	C++
 	coreclr.dll!SetupThread(int) Line 695	C++
 	[Inline Frame] coreclr.dll!SetupThread() Line 615	C++
 	coreclr.dll!SetupThreadNoThrow(HRESULT * pHR) Line 544	C++
 	coreclr.dll!ThreadpoolMgr::WorkerThreadStart(void * lpArgs) Line 1935	C++
 	coreclr.dll!Thread::intermediateThreadProc(void * arg) Line 2255	C++
 	kernel32.dll!BaseThreadInitThunk()	Unknown
 	ntdll.dll!RtlUserThreadStart()	Unknown

and

	coreclr.dll!EventPipeBufferManager::AllocateBufferForThread(EventPipeSession & session, Thread * pThread, unsigned int requestSize) Line 114	C++
 	coreclr.dll!EventPipeBufferManager::WriteEvent(Thread *) Line 377	C++
 	coreclr.dll!EventPipe::WriteEventInternal(EventPipeEvent & event, EventPipeEventPayload & payload, const _GUID * pActivityId, const _GUID * pRelatedActivityId) Line 821	C++
 	coreclr.dll!EventPipe::WriteEvent(EventPipeEvent & event, unsigned char * pData, unsigned int length, const _GUID * pActivityId, const _GUID * pRelatedActivityId) Line 765	C++
 	coreclr.dll!FireEtwThreadCreated(const unsigned __int64 ManagedThreadID, const unsigned __int64 AppDomainID, const unsigned int Flags, const unsigned int ManagedThreadIndex, const unsigned int OSThreadID, const unsigned short ClrInstanceID) Line 1327	C++
 	coreclr.dll!ETW::ThreadLog::FireThreadCreated(Thread * pThread) Line 2579	C++
 	coreclr.dll!Thread::HasStarted(int bRequiresTSL) Line 1988	C++
 	coreclr.dll!SetupThread(int) Line 695	C++
 	[Inline Frame] coreclr.dll!SetupThread() Line 615	C++
 	coreclr.dll!SetupThreadNoThrow(HRESULT * pHR) Line 544	C++
 	coreclr.dll!ThreadpoolMgr::CompletionPortThreadStart(void * lpArgs) Line 3299	C++
 	coreclr.dll!Thread::intermediateThreadProc(void * arg) Line 2255	C++
 	kernel32.dll!BaseThreadInitThunk()	Unknown
 	ntdll.dll!RtlUserThreadStart()	Unknown

Line 133 is hit on occasion, for a different allocation condition (~1 out of 10 times). This is always on the same thread as a previous call that stopped on line 115.

	coreclr.dll!EventPipeBufferManager::AllocateBufferForThread(EventPipeSession & session, Thread * pThread, unsigned int requestSize) Line 128	C++
 	coreclr.dll!EventPipeBufferManager::WriteEvent(Thread *) Line 377	C++
 	coreclr.dll!EventPipe::WriteEventInternal(EventPipeEvent & event, EventPipeEventPayload & payload, const _GUID * pActivityId, const _GUID * pRelatedActivityId) Line 821	C++
 	coreclr.dll!EventPipe::WriteEvent(EventPipeEvent & event, unsigned char * pData, unsigned int length, const _GUID * pActivityId, const _GUID * pRelatedActivityId) Line 765	C++
 	coreclr.dll!ThreadpoolMgr::WorkerThreadStart(void * lpArgs) Line 2133	C++
 	coreclr.dll!Thread::intermediateThreadProc(void * arg) Line 2255	C++
 	kernel32.dll!BaseThreadInitThunk()	Unknown
 	ntdll.dll!RtlUserThreadStart()	Unknown


sywhang commented Sep 3, 2019

Thanks @tylerohlsen - that confirms my hypothesis, and I think we've narrowed down the issue here. Let me see what a fix could look like and what we can do to send out a 2.2 servicing fix.


mgodse commented Oct 3, 2019

Is there any progress on this issue? Do we know if it only occurs on 2.2?


thirus commented Oct 3, 2019

We recently upgraded from 2.2 to .NET Core 3.0 and are no longer seeing this memory leak. For a do-nothing app deployed in k8s with a memory limit of 512MB, it hovers around ~160MB (which seems high for workstation GC mode) and has stayed constant for over 3 days.

@jenshenneberg

I can confirm that .NET Core 3.0 also made the issue go away for us.

msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
msftgits added this to the 2.2.x milestone Jan 31, 2020

josalem commented Feb 7, 2020

Closing as the original issue was resolved

josalem closed this as completed Feb 7, 2020
ghost locked as resolved and limited conversation to collaborators Dec 12, 2020