Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Address Diagnostic impact of Pinned Object Heap #1408

Closed
15 of 18 tasks
noahfalk opened this issue Apr 28, 2020 · 12 comments
Closed
15 of 18 tasks

Address Diagnostic impact of Pinned Object Heap #1408

noahfalk opened this issue Apr 28, 2020 · 12 comments
Assignees
Milestone

Comments

@noahfalk
Copy link
Member

noahfalk commented Apr 28, 2020

Recently the GC team added a new Pinned Object Heap. This new heap adds a conceptual 5th generation to the existing four: gen0, gen1, gen2, LOH (large object heap), and now POH (pinned object heap). The fact there were exactly four generations has been broadly exposed to diagnostic tools so changing to five has far reaching ramifications. At a minimum we don't want changes to the runtime to break critical tools which requires testing. More ideally we also want any necessary API changes, implementation changes, documentation changes, and coordination with the owners of external libraries/tools so that the .NET diagnostic tools ecosystem remains effective at memory investigations.

Work items:

  • .Net runtime team owned code - investigation + design/coding/testing/docs as needed
  • Other tools - this is primarily notification and guidance they may need for testing
    • Visual Studio Debugger
    • Visual Studio Profiler
    • Application Insights
    • bperf
    • mex/sosex/psscor?
    • WPA?
    • 3rd party profilers (update our announcement issue)
    • windbg/!analyze (probably no relevance but better safe than sorry)
@ghost
Copy link

ghost commented Apr 28, 2020

Tagging subscribers to this area: @tommcdon
Notify danmosemsft if you want to be subscribed.

@tommcdon tommcdon self-assigned this Apr 28, 2020
@leculver
Copy link
Contributor

leculver commented May 5, 2020

DAC implementation/APIs (#13735, dotnet/runtime#13736 and dotnet/runtime#35258)

@tommcdon : I would like to strongly emphasize we need to fix this in particular before 5.0 RTM. As long as the dac produces correct answers out of its dac/sos interface we can light up a lot of "diagnostic tools of last resort" (sos/clrmd/mex) to solve any issues in servicing that arise from the original feature. Concrete support in SOS and other tools can wait as long as we are sure the dac implementation is solid.

I'd be happy to help test this part of the change via clrmd when work is getting close to finished in the dac side of things.

PerfView - confirm with @VSadov, @Maoni0, and @brianrob that we are good
mex/sosex/psscor?

mex and PerfView uses ClrMD for diagnostics. If ClrMD lights up this scenario PerfView will light up too. ClrMD is blocked on this item:

DAC implementation/APIs (#13735, dotnet/runtime#13736 and dotnet/runtime#35258)

psscor likely is blocked on that too, not sure of the status of sosex these days.

@Maoni0
Copy link
Member

Maoni0 commented May 5, 2020

@leculver just making sure I understand - you are saying that in order to "light up" POH in ClrMD you need those items done, but today ClrMD is not gonna crash if it's run against 5.0 builds with POH in them, correct? it's important to distinguish those 2 scenarios - it would be bad if a diagnostics tool actually crashes if there are 5 generations instead of 4; but it's not bad if it simply doesn't recognize the existence of the 5th gen therefore cannot give you diagnostics info on it (but you still get info on the previous 4 gens) 'cause that support can be added incrementally.

@leculver
Copy link
Contributor

leculver commented May 5, 2020

you are saying that in order to "light up" POH in ClrMD you need those items done, but today ClrMD is not gonna crash if it's run against 5.0 builds with POH in them, correct?

I haven't checked the behavior of 1.1, but at minimum I can tell you that ClrMD 2.0 would fail here during ClrRuntime creation: https://github.com/microsoft/clrmd/blob/a89a02c02c8ffa7da542753e844a67478e2b4def/src/Microsoft.Diagnostics.Runtime/src/Builders/HeapBuilder.cs#L44-L45. That's a simple fix, but I'm not sure if we would fall over elsewhere when we encounter valid objects, which don't lie on the gc heap as far as ClrMD knows. I'd also worry that !verifyheap (and tools similar to it) will fail/report errors when the heap is valid and intact.

My overall takeaway here is "we have no idea how diagnostics tools will react to this change, we should probably test them to make sure there's nothing that's going to fail/throw exceptions".

it's not bad if it simply doesn't recognize the existence of the 5th gen therefore cannot give you diagnostics info on it (but you still get info on the previous 4 gens) 'cause that support can be added incrementally.

@Maoni0: No, my position is the opposite of that. We must fix the dac piece of this before shipping 5.0. Specifically I'm referring to adding support through new ISOSDacInterfaces. I'm fine if some tools (perfview, sos, etc) are incrementally updated, but fixing the dac to enumerate correct values is critical before we ship.

Here's a few reasons why:

  1. Once we ship a build with a change we actually can't incrementally add support for anything that requires dac changes. Shiproom has little patience even for diagnostic only changes. If we don't add dac support for this feature before shipping we will never be able to debug builds in the wild with this generation change.
  2. We need to take time to test functionality to make sure we didn't break existing tools even if we did no diagnostics work. ClrMD 2.0 is one example, !verifyheap is another very important thing that needs testing.
  3. Making diagnostics worse than a set baseline of functionality makes other's devs lives harder. We've had "simple" diagnostics issues cause pri-0 "drop everything and work through the weekend" escalations to CLR. As a rule, we should never ship a build of .Net Core with regressed diagnostic functionality unless we have a really good reason to.

As an example of 3, a few weeks back I got pulled into an Azure IOT escalation because WinDbg, PerfView, and Visual Studio all couldn't root cause answer a simple "managed OOM" crash. None of those tools were working due to 3 unrelated bugs, one in each tool. This was an issue where "!dumpheap -stat" would have solved it, but instead it was a pri-0 escalation to CLR. I got called in on a Friday afternoon to root cause their issue, and thankfully ClrMD processed the dump correctly.
If we didn't have at least one diagnostics tool working in their scenario, I'd have had to work through the weekend.

If their problem happened to be large chunks of memory on the pinned object heap we'd be back in that situation of heroics to solve basic problems. Even absent a pri-0 escalation, bad tools means more developers (internal and external) have to ask CLR for help debugging when before they would otherwise be self sufficient.

To be clear, I am fine with incrementally adding support to tools as long as the dac piece works and is well tested. SOS, perfview, clrmd, etc can be fixed later as long as we are 100% confident we can implement support for this feature after we ship without needing to service mscordaccore.

@leculver
Copy link
Contributor

leculver commented May 5, 2020

I've marked this issue blocking-release for all of the reasons listed above. I specifically mean the dac portion of this work, and not the entire enumerated list in the top. If sufficiently informed folks disagree, please feel free to remove that tag. (cc @jkotas)

@Maoni0
Copy link
Member

Maoni0 commented May 5, 2020

@leculver I agree we should fix DAC since that's a change in the runtime. my comment about "incremental changes" was also for tooling (which includes ClrMd in this case...I just meant everything that's not in the runtime since we can't ship runtime OOB).

@tommcdon
Copy link
Member

@davmason is there any work left for this issue?

@tommcdon
Copy link
Member

tommcdon commented Aug 5, 2020

Confirmed with David that all known runtime issues are resolved, moving to the diagnostics repo to track documentation and out of band tooling changes

@tommcdon tommcdon transferred this issue from dotnet/runtime Aug 5, 2020
@Maoni0
Copy link
Member

Maoni0 commented Aug 5, 2020

thank you very much, @davmason!

@tommcdon
Copy link
Member

@davmason @brianrob
@cshung has reported that POH sizes in perfview are not being reported correctly. This may be related to microsoft/perfview/issues/1146

@cshung
Copy link
Member

cshung commented Oct 28, 2020

See microsoft/perfview#1295 that fixes TraceEvent and PerfView with respect to the heap sizes reporting.

@tommcdon
Copy link
Member

@davmason can this issue be closed or is there outstanding work?

@tommcdon tommcdon removed their assignment Dec 18, 2020
@tommcdon tommcdon added this to the 6.0 milestone Dec 18, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Jun 27, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants