
dotnet-dump makes process to double its used memory and fails #71472

Open
afilatov-st opened this issue Jun 30, 2022 · 53 comments · Fixed by #79853

Comments

@afilatov-st

Description

In a Kubernetes environment, we have a process that normally consumes around 3.8 Gi.
When we run dotnet-dump collect, it causes the process to increase memory usage up to around 7.2 Gi.
Since we have a 6 Gi memory limit for the Pod, dotnet-dump cannot finish dump generation and fails with a System.IO.EndOfStreamException: Unable to read beyond the end of the stream exception.

If we set a higher memory limit, dotnet-dump collect succeeds, but it still roughly doubles the memory used by the process.
Is this expected behavior? Is it possible to make it just save the dump to the file without consuming more memory?

Reproduction Steps

Run dotnet-dump collect --process-id 1

Expected behavior

A dump file is created

Actual behavior

Dump file generation fails and the target process may crash.

Regression?

No response

Known Workarounds

No response

Configuration

No response

Other information

No response

@ghost ghost added the untriaged New issue has not been triaged by the area owner label Jun 30, 2022
@afilatov-st afilatov-st changed the title dotnet-dump makes process to double its used memory and fail dotnet-dump makes process to double its used memory and fails Jun 30, 2022
@tommcdon
Member

@afilatov-st Thanks for the bug report! I do not believe that memory doubling is expected in this scenario (though some memory usage is expected). Dotnet-Dump sends an IPC command using a domain socket on Linux to the target process to collect a dump. The target process will then launch createdump as a child process to collect a dump of the parent process. When the memory is doubled - is it the target process's memory that increases, createdump, or dotnet-dump itself that uses the extra memory?
@mikem8361 @hoyosjs

@afilatov-st
Author

afilatov-st commented Jun 30, 2022

@tommcdon thanks for the prompt response!
The crash happens pretty quickly, so I had to run the following script in a parallel session:

while true                      
do 
  ps aux;
  sleep 0.5;
done

It shows that the RSS of the target dotnet process stays at its initial value of 3.6 GB for around 5 seconds, then quickly grows to 6.2 GB before Kubernetes kills it.
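
To answer which process grows, a variant of that loop which tracks the target app, createdump, and dotnet-dump separately might look like this (a sketch; the process-name filters are assumptions about how the processes show up in ps):

# Poll PID, RSS (in KB) and command name for the relevant processes only,
# so it is clear whether the target app, createdump, or dotnet-dump is the one growing.
while true
do
  ps -eo pid,rss,comm | grep -E 'dotnet|createdump'
  sleep 0.5
done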

@afilatov-st
Author

For context, the Docker image is based on mcr.microsoft.com/dotnet/aspnet:6.0.5-bullseye-slim-amd64.

@agocke agocke removed this from Runtime Infra Jul 1, 2022
@mikem8361 mikem8361 self-assigned this Jul 6, 2022
@tommcdon tommcdon removed the untriaged New issue has not been triaged by the area owner label Jul 7, 2022
@tommcdon tommcdon added this to the 7.0.0 milestone Jul 7, 2022
@mikem8361
Member

Can you try a heap dump by adding --type Heap to the dotnet-dump collect command line?

We think that when createdump reads memory in order to write pages to the dump file, it causes those pages to be "swapped" back in, or read from the module files, into the target process's memory. A heap dump doesn't touch/read most of the module pages.
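
For reference, the full command with the PID from the repro steps would be:

# Heap dump: includes the GC heap and metadata but skips most mapped module/image pages.
dotnet-dump collect --process-id 1 --type Heap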

@mikem8361 mikem8361 added the needs-author-action An issue or pull request that requires more info or actions from the author. label Jul 12, 2022
@ghost

ghost commented Jul 12, 2022

This issue has been marked needs-author-action and may be missing some important information.

@afilatov-st
Author

--type Heap behaves in the same way

@ghost ghost added needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration and removed needs-author-action An issue or pull request that requires more info or actions from the author. labels Jul 12, 2022
@tommcdon
Member

Our current working theory is that createdump reading memory from the target process causes pages to be "swapped" back in, or read from the module files, into the target process's memory. This requires a more involved investigation, so we are moving this to .NET 8.

@afilatov-st Can you provide details on how memory is being measured?

@tommcdon tommcdon modified the milestones: 7.0.0, 8.0.0 Jul 13, 2022
@afilatov-st
Author

@tommcdon I run ps aux and assume the RSS column shows the memory consumed.

@sakshamsaxena

I'm facing a very similar situation. The memory shoots up to almost double, and no dump is ever generated. I'm not sure about the exact error since that pod's shell is also killed.
@afilatov-st Were you able to figure out a workaround that didn't involve increasing the memory limit just so that the dump could be collected?

@afilatov-st
Author

@sakshamsaxena unfortunately not

@afilatov-st
Author

I also found that if you create an app that simply consumes managed byte arrays, you can create dumps of it without this problem.
In our application, however, I suspect the problem occurs because we use unmanaged libraries that allocate unmanaged buffers. Still, I could not reproduce it in synthetic tests that allocate unmanaged memory via Marshal.AllocHGlobal.

If the dotnet team can provide some guidance on the problem's root cause, I could try to reproduce it; that would be beneficial for everybody.

@mikem8361
Member

I've been investigating this and have figured out why createdump's memory usage is increasing so much, but I don't have a fix yet. I haven't come up with any workaround other than creating "full" dumps, nor any fix, especially one that would fit in our 7.0 schedule.

@mikem8361
Member

I put that comment in the wrong issue; it was supposed to be in issue #72148. The workaround of creating a full dump won't help with the target process's memory usage. It may even make it worse.

@tommcdon
Member

tommcdon commented Aug 9, 2022

However, in our application, I suppose we use unmanaged libraries which consume unmanaged buffers and this problem occurs. However, I could not reproduce it with the synthetic tests using unmanaged memory via Marshal.AllocHGlobal.

Thank you for the information. The "ps aux" command outputs the resident set size of the process; however, it does not count pages that have been swapped out. My hypothesis is that createdump is causing these swapped-out pages to be paged back into the process, causing RSS to increase. Createdump reads memory pages in the target process and writes them to a dump file. In order to write the dump, those pages must be read from the process, so if they were swapped out by the OS, it is reasonable to assume that the working set will increase while they are being read. I suggest using getrusage to output various statistics to determine whether the memory usage is actually increasing or is merely being swapped back into memory when createdump runs. It would be useful to track the maximum resident set size. Assuming that createdump is merely swapping the pages back into memory, I'm guessing that the max RSS metric should not increase. To fully understand what pages are getting pulled back in, we would need to track OS page faults. Since this issue does not appear to be a dotnet issue at this time, I'm moving it to the Future milestone.
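
For reference, peak RSS can also be watched from outside the process via /proc, which exposes the same high-water mark that getrusage reports as ru_maxrss (a sketch; PID 1 is assumed from the repro steps):

# VmRSS is the current resident set size; VmHWM is the peak ("high water mark"),
# equivalent to the ru_maxrss value returned by getrusage().
PID=1
while true
do
  grep -E 'VmRSS|VmHWM' /proc/$PID/status
  sleep 0.5
done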

@tommcdon tommcdon modified the milestones: 8.0.0, Future Aug 9, 2022
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Dec 22, 2022
@ezsilmar
Contributor

@afilatov-st Regarding a backport: I already did one for .NET 5 and will do one for .NET 6 today. Note that it's only Linux binaries, built with the CentOS 7 docker image following this instruction. Feel free to cherry-pick it and compile it yourself if you need something else :)

To check whether the fallback happened, run dotnet-dump collect -p <pid> --diag. It will print diagnostic messages into the application's output; search that output for FAILED.
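
A minimal sketch of that check from outside the pod (the pod name is a placeholder, and this assumes dotnet-dump is installed inside the container):

# Trigger the dump with diagnostic logging, then search the application's output for FAILED.
kubectl exec -it my-pod -- dotnet-dump collect -p <pid> --diag
kubectl logs my-pod | grep FAILED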

@ezsilmar
Contributor

Backport for .NET 6 based on 6.0.12:

@afilatov-st
Author

@ezsilmar thank you so much!

@hoyosjs
Member

hoyosjs commented Dec 26, 2022

Opening this for backport tracking

@hoyosjs
Member

hoyosjs commented Jan 12, 2023

We are pausing the port for a bit - some pages are not getting properly reported in dumps and will take a bit to get fixed.

@tommcdon tommcdon assigned hoyosjs and unassigned mikem8361 Feb 28, 2023
@FischlerA

Is there any news or a plan to move forward with the fix?

@tommcdon
Member

@FischlerA thanks for checking in on this issue. We plan on continuing the investigation but given our current backlog of issues this will likely move to .NET 9. Is this issue blocking for your scenario?

@FischlerA

@FischlerA thanks for checking in on this issue. We plan on continuing the investigation but given our current backlog of issues this will likely move to .NET 9. Is this issue blocking for your scenario?

Not anymore; we were able to increase the memory limit to more than double the initial setting and could then get a dump.

@ezsilmar
Contributor

ezsilmar commented Jul 5, 2023

Hi @tommcdon, could you please confirm whether you were talking about the backport being paused until .NET 9 or about the fix not being available in .NET 8? I thought PR #79853 was merged, so I hoped it would be part of the .NET 8 release this November.

Also, if you face any particular issue with the backport or the fix, please let me know the details; I may be able to look into it.

@tommcdon
Member

tommcdon commented Jul 5, 2023

@ezsilmar the dumps were incomplete, leading to command failures in SOS. The code change reads the kernel pagemap to determine which pages to write to the dump, but there seems to be some discrepancy between the documented kernel behavior and what we are observing. For example, we have found that some of the pages used by the GC seem to be marked as though they were not in use, yet they are indeed needed in the dump. While we didn't revert the change in .NET 8, we didn't backport it to .NET 6/7 for these reasons. @hoyosjs can provide further details.

@hoyosjs
Member

hoyosjs commented Jul 6, 2023

@ezsilmar the main issue is that there are some zero pages that don't get reported by the pagemap: essentially they get reserved by the GC but are lazily initialized. The gaps in the dump make heap verification algorithms fail, since, for example, array elements will reference memory that should be zeros but is missing from the dump.

@ezsilmar
Contributor

ezsilmar commented Jul 6, 2023

@hoyosjs thanks for the explanation! If I understand correctly, there's no issue from the OS or createdump perspective: the GC reserved some memory but hasn't committed or written to it yet, so it doesn't appear in the pagemap. Then, in the dump, the heap verification algorithm expects these pages to be available and zeroed out, but can't find them and crashes.

If it's only a heap verification problem (is that a part of dotnet-dump?), I wonder if we could fix it there directly, i.e. treat the unavailable pages as zeroed out.

Or perhaps we could somehow detect these pages in createdump and include them in the dump. I'm not sure it's possible to check whether these pages are reserved and zeroed out without actually reading and committing them.

@ezgambac
Contributor

@tommcdon @hoyosjs Is there an ETA for having the heap analyzers fixed?
This memory-doubling issue makes dotnet-dump unusable in the scenarios where it's needed, like debugging why memory is high, because k8s will kill the pod.

@hoyosjs
Member

hoyosjs commented Oct 31, 2023

@ezgambac Even if this gets improved (the change wasn't backed out; it's just flagged off since it makes commands like verifyheap in SOS fail), it will still force memory swapping and some growth, since the dumper itself runs in the container's cgroup. For OOM scenarios there are other options that could work, since they are started in the init process's context, if you have access to the host.

@ezgambac
Contributor

@hoyosjs What do you suggest doing, then, for the scenario where the pod is running at 80% memory?
We currently have dotnet monitor 6, which uses an older version of dotnet-dump, but from what you are saying, even if we moved to the latest version, the dumper would still generate enough extra memory usage that k8s would kill the process?
From following this thread, it seemed like @ezsilmar's change significantly reduced dotnet-dump's memory consumption while collecting a dump. Would that dump be analyzable by PerfView/Visual Studio?

@hoyosjs
Member

hoyosjs commented Oct 31, 2023

It's analyzable, but tooling might tell you that the heap is inconsistent. You need to deploy the app with DOTNET_DbgDisablePagemapUse=0, and this is only available in .NET 8. Do you have access to the host (node)?
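
For example, a minimal sketch of setting that variable on a Kubernetes deployment (the deployment name is a placeholder; the app must be running .NET 8 for the flag to have any effect):

# Adds DOTNET_DbgDisablePagemapUse=0 to the pod spec; the pods restart and pick it up.
kubectl set env deployment/my-deployment DOTNET_DbgDisablePagemapUse=0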

@tommcdon tommcdon modified the milestones: 9.0.0, 10.0.0 Jul 23, 2024