dotnet-dump makes the process double its used memory and fails #71472
Comments
@afilatov-st Thanks for the bug report! I do not believe that memory doubling is expected in this scenario (though some memory usage is expected). dotnet-dump sends an IPC command over a domain socket on Linux to the target process to collect a dump. The target process then launches createdump as a child process to collect a dump of the parent process. When the memory doubles, is it the target process, createdump, or dotnet-dump itself that uses the extra memory?
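One way to answer that question is to sample all three processes while the collection is in progress. A rough sketch of such a polling loop is below; the process names are assumptions and may differ depending on how the app and the tool are launched in the container:

```sh
# Sample the RSS of the target app, createdump, and dotnet-dump once per second
# while `dotnet-dump collect` is running. Adjust the names in -C to match what
# actually shows up in your container (e.g. the target may run under its app name).
while true; do
  ps -o pid,ppid,rss,comm -C dotnet,createdump,dotnet-dump
  sleep 1
done
```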
@tommcdon thanks for the prompt response!
It shows that the RSS of the target dotnet process stays at its initial value of 3.6 GB for around 5 seconds, then quickly grows to 6.2 GB before Kubernetes kills it.
For context, the Docker image is based on
Can you try a heap dump? We think that when createdump reads memory to write pages to the dump file, it causes those pages to be "swapped" back in, or read from the module files into the target process memory. A heap dump doesn't touch/read most of the module pages.
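For reference, the dump type is selected with dotnet-dump's `--type` option on the `collect` command; a heap-only dump would look roughly like this (the process id below is just a placeholder matching the repro steps):

```sh
# Heap-type dump: GC heap plus the metadata needed to walk it, skipping mapped module images.
dotnet-dump collect --process-id 1 --type Heap
```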
Our current working theory is that createdump is reading memory from the target process, which causes pages to be "swapped" back in or read from the module files into the target process memory. This requires more involved investigation, so we are moving this to .NET 8. @afilatov-st Can you provide details on how memory is being measured?
@tommcdon I run `ps aux`.
I'm facing a very similar situation. The memory shoots up to almost double, and no dump is ultimately generated. I'm not sure about the exact error, since that pod's shell is also killed.
@sakshamsaxena unfortunately not
I also found that if you create an app that simply consumes managed byte arrays, you can create dumps of it without this problem. If the dotnet team can provide some guidance on the problem's root cause, I could try to reproduce it; it would be beneficial for everybody.
I've been investigating this and have figured out why createdump's memory usage increases so much, but I don't have a fix yet. I haven't come up with any workaround other than creating "full" dumps, nor any fix, especially one that will fit in our 7.0 schedule.
I put that comment in the wrong issue. This was supposed to be in issue #72148. The workaround of creating a full dump won't help with the target process's memory usage. It may even make it worse.
Thank you for the information. The "ps aux" command outputs the resident set size of the process; however, it does not count pages that have been swapped out. My hypothesis is that createdump is causing these swapped-out pages to be paged back into the process, causing RSS to increase. Createdump reads memory pages in the target process and writes them to a dump file. In order to write a dump, these pages must be read from the process, and so if they have been swapped out by the OS, it is reasonable to assume that the working set will increase while they are being read. I suggest using getrusage to output various statistics to determine whether the memory usage is actually increasing or is being swapped back into memory when createdump runs. It would be useful to track the maximum resident set size. Assuming that createdump is merely swapping the pages back into memory, I'm guessing that the max RSS metric should not increase. To fully understand what pages are getting pulled back in, we would need to track OS page faults. Since this issue does not appear to be a dotnet issue at this time, I'm moving it to the Future milestone.
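To make that suggestion concrete, here is a minimal sketch (my own illustration, not part of the original report) of how the target application could periodically log its peak RSS and page-fault counters via `getrusage` on 64-bit Linux with glibc. If the peak barely moves while the current RSS jumps during collection, that is consistent with previously resident pages merely being swapped back in; rising major-fault counts would show pages being read back from swap or disk:

```csharp
using System;
using System.Runtime.InteropServices;
using System.Threading;

// Sketch: periodically print peak RSS and page-fault counters from inside the target app.
// Assumes 64-bit Linux/glibc; the struct layout and the KB unit of ru_maxrss match that platform.
static class MemoryProbe
{
    [StructLayout(LayoutKind.Sequential)]
    private struct Timeval { public long Sec; public long Usec; }

    [StructLayout(LayoutKind.Sequential)]
    private struct Rusage
    {
        public Timeval Utime;              // user CPU time
        public Timeval Stime;              // system CPU time
        public long MaxRss;                // peak resident set size, in KB
        public long Ixrss, Idrss, Isrss;
        public long Minflt, Majflt;        // page faults without / with disk I/O
        public long Nswap;
        public long Inblock, Oublock;
        public long Msgsnd, Msgrcv;
        public long Nsignals;
        public long Nvcsw, Nivcsw;
    }

    private const int RUSAGE_SELF = 0;

    [DllImport("libc.so.6", SetLastError = true)]
    private static extern int getrusage(int who, out Rusage usage);

    public static void StartLogging(TimeSpan period) =>
        new Thread(() =>
        {
            while (true)
            {
                if (getrusage(RUSAGE_SELF, out var u) == 0)
                    Console.WriteLine($"peak RSS: {u.MaxRss} KB, major faults: {u.Majflt}");
                Thread.Sleep(period);
            }
        }) { IsBackground = true }.Start();
}
```

Calling `MemoryProbe.StartLogging(TimeSpan.FromSeconds(1));` at startup would log one line per second while the dump is collected.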
@afilatov-st Regarding a backport: I already did one for .NET 5 and will do one for .NET 6 today. Note it's only the Linux binaries, built with the CentOS 7 docker image following this instruction. Feel free to cherry-pick it and compile it yourself if you need something else :) To check that the fallback happened you should run
Backport for .NET 6 based on 6.0.12:
@ezsilmar thank you so much!
Opening this for backport tracking
We are pausing the port for a bit - some pages are not getting properly reported in dumps, and it will take a while to get that fixed.
Is there any news, or plans to move forward with the fix?
@FischlerA thanks for checking in on this issue. We plan on continuing the investigation, but given our current backlog of issues this will likely move to .NET 9. Is this issue blocking for your scenario?
Not anymore; we were able to increase the max memory to more than double the initial setting and were able to get a dump.
Hi @tommcdon, could you please confirm whether you were talking about the backport being paused until .NET 9, or about the fix not being available in .NET 8? I thought PR #79853 was merged, so I hoped it would be part of the .NET 8 release this November. Also, if you face any particular issue with the backport or the fix, please let me know the details; I may be able to look into it.
@ezsilmar the dumps were incomplete, leading to command failures in SOS. The code change reads the kernel pagemap to determine which pages to write to the dump, but there seems to be some discrepancy between the documented kernel behavior and what we are observing. For example, we have found that some of the pages used by the GC seem to be marked as though they were not in use, but are indeed needed in the dump. While we didn't revert the change in .NET 8, we didn't backport it to .NET 6/7 for these reasons. @hoyosjs can provide further details.
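As background on what "reading the kernel pagemap" means here, this is a small sketch (my own illustration, not createdump's actual code) of how the per-page bits are read: /proc/&lt;pid&gt;/pagemap holds one 64-bit entry per virtual page, with bit 63 meaning the page is present in RAM and bit 62 meaning it has been swapped out.

```csharp
using System;
using System.IO;

// Sketch only: querying the /proc/<pid>/pagemap interface for a single address.
// Reading another process's pagemap requires the same access rights as ptrace-ing it.
static class PagemapProbe
{
    public static (bool Present, bool Swapped) Query(int pid, ulong address)
    {
        long pageSize = Environment.SystemPageSize;
        long offset = (long)(address / (ulong)pageSize) * 8;   // one 8-byte entry per page

        using var fs = File.OpenRead($"/proc/{pid}/pagemap");
        fs.Seek(offset, SeekOrigin.Begin);

        var buf = new byte[8];
        if (fs.Read(buf, 0, 8) != 8)
            throw new IOException("Short read from pagemap");

        ulong entry = BitConverter.ToUInt64(buf, 0);
        return ((entry & (1UL << 63)) != 0,   // bit 63: page present in RAM
                (entry & (1UL << 62)) != 0);  // bit 62: page in swap
    }
}
```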
@ezsilmar the main issue is that there are some zero pages that don't get reported by pagemap - essentially they get reserved by the GC, but they are lazily initialized. The gaps in the dump make heap verification algorithms fail, since, for example, elements of arrays will find memory missing that should be zeros.
@hoyosjs thanks for the explanation! If I understand correctly, there's no issue from the OS or createdump perspective: the GC reserved some memory but didn't commit or write to it yet, so it doesn't appear in the pagemap. Then, in the dump, the heap verification algorithm expects these pages to be available and zeroed out, but can't find them and crashes. If it's only a heap verification problem (is that part of dotnet-dump?), I wonder if we could fix it there directly, i.e. treat the unavailable pages as zeroed out. Or whether we can somehow detect these pages in createdump and include them in the dump. I'm not sure it's possible to check whether these pages are reserved and zeroed out without actually reading and committing them.
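To illustrate the "reserved but lazily initialized" behavior discussed above, here is a hedged sketch. The exact numbers depend on the GC configuration and the OS, but a freshly allocated large array often does not become resident until its pages are first touched, and those untouched zero pages are exactly the kind a pagemap-based dump can end up skipping:

```csharp
using System;
using System.IO;
using System.Linq;

class LazyZeroPagesDemo
{
    // Current resident set size of this process in KB, read from /proc/self/status.
    static long RssKb() =>
        long.Parse(File.ReadLines("/proc/self/status")
                       .First(l => l.StartsWith("VmRSS:"))
                       .Split((char[])null, StringSplitOptions.RemoveEmptyEntries)[1]);

    static void Main()
    {
        Console.WriteLine($"before allocation: {RssKb()} KB");

        // Large arrays land on the Large Object Heap; memory the GC obtains fresh from the
        // OS is zero-filled on demand, so these pages may not be resident yet.
        var buffer = new byte[512 * 1024 * 1024];
        Console.WriteLine($"after allocation:  {RssKb()} KB");

        // Touch one byte per 4 KB page to fault everything in.
        for (int i = 0; i < buffer.Length; i += 4096)
            buffer[i] = 1;
        Console.WriteLine($"after touching:    {RssKb()} KB");

        GC.KeepAlive(buffer);
    }
}
```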
@ezgambac Even in the case this got improved (the change wasn't backed out; it's just flagged off, since it will make commands like
@hoyosjs What do you suggest doing, then, for the scenario where the pod is running at 80% memory?
It's analyzable, but you might get tooling telling you that the heap is inconsistent. You need to deploy the app with
Description
In a Kubernetes environment, we have a process that normally consumes around 3.8 Gi.
When we run `dotnet-dump collect`, it causes the process to increase its memory usage to around 7.2 Gi. Since we have a 6 Gi memory limit for the pod, `dotnet-dump` cannot finish dump generation and fails with a `System.IO.EndOfStreamException: Unable to read beyond the end of the stream` exception. If we set a higher memory limit, `dotnet-dump collect` succeeds, approximately doubling the used memory. Is this expected behavior? Is it possible to make it just save the dump to a file without consuming more memory?
Reproduction Steps
Run `dotnet-dump collect --process-id 1`
Expected behavior
A dump file is created
Actual behavior
Dump file generation fails and the process may crash
Regression?
No response
Known Workarounds
No response
Configuration
No response
Other information
No response