
dotnet-dump makes process to double its used memory and fails #71472

Open
afilatov-st opened this issue Jun 30, 2022 · 53 comments · Fixed by #79853

Comments

@afilatov-st

Description

In a Kubernetes environment, we have a process that normally consumes around 3.8 Gi.
When we run dotnet-dump collect, it causes the process to increase memory usage up to around 7.2 Gi.
Since we have a 6 Gi memory limit for the Pod, dotnet-dump cannot finish dump generation and fails with a System.IO.EndOfStreamException: Unable to read beyond the end of the stream exception.

If we set a higher memory limit, dotnet-dump collect succeeds, but it still roughly doubles the memory used by the process.
Is this expected behavior? Is it possible to make it just save the dump to the file without consuming more memory?

Reproduction Steps

Run dotnet-dump collect --process-id 1

Expected behavior

A dump file is created

Actual behavior

Dump file generation fails and the target process may crash.

Regression?

No response

Known Workarounds

No response

Configuration

No response

Other information

No response

@ghost ghost added the untriaged New issue has not been triaged by the area owner label Jun 30, 2022
@afilatov-st afilatov-st changed the title dotnet-dump makes process to double its used memory and fail dotnet-dump makes process to double its used memory and fails Jun 30, 2022
@tommcdon
Member

@afilatov-st Thanks for the bug report! I do not believe that memory doubling is expected in this scenario (though some memory usage is expected). Dotnet-Dump sends an IPC command using a domain socket on Linux to the target process to collect a dump. The target process will then launch createdump as a child process to collect a dump of the parent process. When the memory is doubled - is it the target process's memory that increases, createdump, or dotnet-dump itself that uses the extra memory?
@mikem8361 @hoyosjs

@afilatov-st
Author

afilatov-st commented Jun 30, 2022

@tommcdon thanks for the prompt response!
The crash happens pretty quickly, so I had to run the following script in a parallel session:

while true                      
do 
  ps aux;
  sleep 0.5;
done

It shows that the RSS of the target dotnet process stays at its initial value of 3.6 GB for around 5 seconds, then quickly grows to 6.2 GB before Kubernetes kills it.
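
To answer which process grows, a variant of that loop which tracks the target app, createdump, and dotnet-dump separately might look like this (a sketch; the process-name filters are assumptions about how the processes show up in ps):

# Poll PID, RSS (in KB) and command name for the relevant processes only,
# so it is clear whether the target app, createdump, or dotnet-dump is the one growing.
while true
do
  ps -eo pid,rss,comm | grep -E 'dotnet|createdump'
  sleep 0.5
done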

@afilatov-st
Author

For context, the Docker image is based on mcr.microsoft.com/dotnet/aspnet:6.0.5-bullseye-slim-amd64.

@agocke agocke removed this from Runtime Infra Jul 1, 2022
@mikem8361 mikem8361 self-assigned this Jul 6, 2022
@tommcdon tommcdon removed the untriaged New issue has not been triaged by the area owner label Jul 7, 2022
@tommcdon tommcdon added this to the 7.0.0 milestone Jul 7, 2022
@mikem8361
Member

Can you try a heap dump by adding --type Heap to the dotnet-dump collect command line?

We think that when createdump reads memory in order to write pages to the dump file, it causes those pages to be "swapped" back in, or read from the module files, into the target process's memory. A heap dump doesn't touch/read most of the module pages.
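
For reference, the full command with the PID from the repro steps would be:

# Heap dump: includes the GC heap and metadata but skips most mapped module/image pages.
dotnet-dump collect --process-id 1 --type Heap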

@mikem8361 mikem8361 added the needs-author-action An issue or pull request that requires more info or actions from the author. label Jul 12, 2022
@ghost

ghost commented Jul 12, 2022

This issue has been marked needs-author-action and may be missing some important information.

@afilatov-st
Author

--type Heap behaves in the same way

@ghost ghost added needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration and removed needs-author-action An issue or pull request that requires more info or actions from the author. labels Jul 12, 2022
@tommcdon
Member

Our current working theory is that createdump reading memory from the target process causes pages to be "swapped" back in, or read from the module files, into the target process's memory. This requires a more involved investigation, so we are moving this to .NET 8.

@afilatov-st Can you provide details on how memory is being measured?

@tommcdon tommcdon modified the milestones: 7.0.0, 8.0.0 Jul 13, 2022
@afilatov-st
Author

@tommcdon I run ps aux and assume the RSS column shows the memory consumed.

@sakshamsaxena

I'm facing a very similar situation. The memory shoots up to almost double, and no dump is ever generated. I'm not sure about the exact error since that pod's shell is also killed.
@afilatov-st Were you able to figure out a workaround that didn't involve increasing the memory limit just so that the dump could be collected?

@afilatov-st
Author

@sakshamsaxena unfortunately not

@afilatov-st
Author

I also found that if you create an app that simply consumes managed byte arrays, you can create dumps of it without this problem.
In our application, however, I suspect the problem occurs because we use unmanaged libraries that allocate unmanaged buffers. Still, I could not reproduce it in synthetic tests that allocate unmanaged memory via Marshal.AllocHGlobal.

If the dotnet team can provide some guidance on the problem's root cause, I could try to reproduce it; that would be beneficial for everybody.

@mikem8361
Member

I've been investigating this and have figured out why createdump's memory usage is increasing so much, but I don't have a fix yet. I haven't come up with any workaround other than creating "full" dumps, nor any fix, especially one that would fit in our 7.0 schedule.

@mikem8361
Member

I put that comment in the wrong issue; it was supposed to be in issue #72148. The workaround of creating a full dump won't help with the target process's memory usage. It may even make it worse.

@tommcdon
Member

tommcdon commented Aug 9, 2022

However, in our application, I suppose we use unmanaged libraries which consume unmanaged buffers and this problem occurs. However, I could not reproduce it with the synthetic tests using unmanaged memory via Marshal.AllocHGlobal.

Thank you for the information. The "ps aux" command outputs the resident set size of the process; however, it does not count pages that have been swapped out. My hypothesis is that createdump is causing these swapped-out pages to be paged back into the process, causing RSS to increase. Createdump reads memory pages in the target process and writes them to a dump file. In order to write the dump, those pages must be read from the process, so if they were swapped out by the OS, it is reasonable to assume that the working set will increase while they are being read. I suggest using getrusage to output various statistics to determine whether the memory usage is actually increasing or is merely being swapped back into memory when createdump runs. It would be useful to track the maximum resident set size. Assuming that createdump is merely swapping the pages back into memory, I'm guessing that the max RSS metric should not increase. To fully understand what pages are getting pulled back in, we would need to track OS page faults. Since this issue does not appear to be a dotnet issue at this time, I'm moving it to the Future milestone.
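
For reference, peak RSS can also be watched from outside the process via /proc, which exposes the same high-water mark that getrusage reports as ru_maxrss (a sketch; PID 1 is assumed from the repro steps):

# VmRSS is the current resident set size; VmHWM is the peak ("high water mark"),
# equivalent to the ru_maxrss value returned by getrusage().
PID=1
while true
do
  grep -E 'VmRSS|VmHWM' /proc/$PID/status
  sleep 0.5
done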

@tommcdon tommcdon modified the milestones: 8.0.0, Future Aug 9, 2022
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Dec 22, 2022
@ezsilmar
Contributor

@afilatov-st Regarding a backport: I already did one for .NET 5 and will do one for .NET 6 today. Note that it's only Linux binaries, built with the CentOS 7 docker image following this instruction. Feel free to cherry-pick it and compile it yourself if you need something else :)

To check whether the fallback happened, run dotnet-dump collect -p <pid> --diag. It will print diagnostic messages into the application's output; search that output for FAILED.
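
A minimal sketch of that check from outside the pod (the pod name is a placeholder, and this assumes dotnet-dump is installed inside the container):

# Trigger the dump with diagnostic logging, then search the application's output for FAILED.
kubectl exec -it my-pod -- dotnet-dump collect -p <pid> --diag
kubectl logs my-pod | grep FAILED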

@ezsilmar
Contributor

Backport for .NET 6 based on 6.0.12:

@afilatov-st
Author

@ezsilmar thank you so much!

@hoyosjs
Member

hoyosjs commented Dec 26, 2022

Opening this for backport tracking

@hoyosjs
Member

hoyosjs commented Jan 12, 2023

We are pausing the port for a bit - some pages are not getting properly reported in dumps and will take a bit to get fixed.

@tommcdon tommcdon assigned hoyosjs and unassigned mikem8361 Feb 28, 2023
@FischlerA

Is there any news or a plan to move forward with the fix?

@tommcdon
Member

@FischlerA thanks for checking in on this issue. We plan on continuing the investigation but given our current backlog of issues this will likely move to .NET 9. Is this issue blocking for your scenario?

@FischlerA

@FischlerA thanks for checking in on this issue. We plan on continuing the investigation but given our current backlog of issues this will likely move to .NET 9. Is this issue blocking for your scenario?

Not anymore; we were able to increase the memory limit to more than double the initial setting and could then get a dump.

@ezsilmar
Contributor

ezsilmar commented Jul 5, 2023

Hi @tommcdon, could you please confirm whether you were talking about the backport being paused until .NET 9 or about the fix not being available in .NET 8? I thought PR #79853 was merged, so I hoped it would be part of the .NET 8 release this November.

Also, if you face any particular issue with the backport or the fix, please let me know the details; I may be able to look into it.

@tommcdon
Member

tommcdon commented Jul 5, 2023

@ezsilmar the dumps were incomplete, leading to command failures in SOS. The code change reads the kernel pagemap to determine which pages to write to the dump, but there seems to be some discrepancy between the documented kernel behavior and what we are observing. For example, we have found that some of the pages used by the GC seem to be marked as though they were not in use, yet they are indeed needed in the dump. While we didn't revert the change in .NET 8, we didn't backport it to .NET 6/7 for these reasons. @hoyosjs can provide further details.

@hoyosjs
Member

hoyosjs commented Jul 6, 2023

@ezsilmar the main issue is that there are some zero pages that don't get reported by the pagemap: essentially they get reserved by the GC but are lazily initialized. The gaps in the dump make heap verification algorithms fail, since, for example, array elements will reference memory that should be zeros but is missing from the dump.

@ezsilmar
Contributor

ezsilmar commented Jul 6, 2023

@hoyosjs thanks for the explanation! If I understand correctly, there's no issue from the OS or createdump perspective: the GC reserved some memory but hasn't committed or written to it yet, so it doesn't appear in the pagemap. Then, in the dump, the heap verification algorithm expects these pages to be available and zeroed out, but can't find them and crashes.

If it's only a heap verification problem (is that a part of dotnet-dump?), I wonder if we could fix it there directly, i.e. treat the unavailable pages as zeroed out.

Or perhaps we could somehow detect these pages in createdump and include them in the dump. I'm not sure it's possible to check whether these pages are reserved and zeroed out without actually reading and committing them.

@ezgambac
Contributor

@tommcdon @hoyosjs Is there an ETA for having the heap analyzers fixed?
This memory-doubling issue makes dotnet-dump unusable in the scenarios where it's needed, like debugging why memory is high, because k8s will kill the pod.

@hoyosjs
Member

hoyosjs commented Oct 31, 2023

@ezgambac Even if this gets improved (the change wasn't backed out; it's just flagged off since it makes commands like verifyheap in SOS fail), it will still force memory swapping and some growth, since the dumper itself runs in the container's cgroup. For OOM scenarios there are other options that could work, since they are started in the init process's context, if you have access to the host.

@ezgambac
Contributor

@hoyosjs What do you suggest doing, then, for the scenario where the pod is running at 80% memory?
We currently have dotnet monitor 6, which uses an older version of dotnet-dump, but from what you are saying, even if we moved to the latest version, the dumper would still generate enough extra memory usage that k8s would kill the process?
From following this thread, it seemed like @ezsilmar's change significantly reduced dotnet-dump's memory consumption while collecting a dump. Would that dump be analyzable by PerfView/Visual Studio?

@hoyosjs
Member

hoyosjs commented Oct 31, 2023

It's analyzable, but tooling might tell you that the heap is inconsistent. You need to deploy the app with DOTNET_DbgDisablePagemapUse=0, and this is only available in .NET 8. Do you have access to the host (node)?
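
For example, a minimal sketch of setting that variable on a Kubernetes deployment (the deployment name is a placeholder; the app must be running .NET 8 for the flag to have any effect):

# Adds DOTNET_DbgDisablePagemapUse=0 to the pod spec; the pods restart and pick it up.
kubectl set env deployment/my-deployment DOTNET_DbgDisablePagemapUse=0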

@tommcdon tommcdon modified the milestones: 9.0.0, 10.0.0 Jul 23, 2024