Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ICorProfiler] Add new API to enumerate GC Heap objects #103735

Merged
merged 35 commits into from
Jul 11, 2024

Conversation

mdh1418
Copy link
Member

@mdh1418 mdh1418 commented Jun 19, 2024

This PR aims to add a new ICorProfilerInfo API that extends the ability to enumerate all GC Heap objects (GC's GCHeap::DiagWalkHeap) to a profiler. Within a GC there are preparation steps that are performed before+after walking the GC Heap via DiagWalkHeap, so this PR introduces a new IGCHeap API that facilitates a DiagWalkHeap outside of a GC.

Additionally, this PR modifies ProfToEEInterfaceImpl::SuspendRuntime + ProfToEEInterfaceImpl::ResumeRuntime to both be Asynchronous APIs, invokable directly through the profiler rather than only through callbacks.


Testing

Verification with !dumpheap

Invoking the new EnumerateGCHeapObjects with a walk function like

static BOOL STDMETHODCALLTYPE heap_walk_fn(ObjectID object, void* callbackState)
{
    CallbackState* state = static_cast<CallbackState*>(callbackState);
    state->objectCount++;
    ClassID classId{0};
    HRESULT hr = CorProfiler::GetInstance()->pCorProfilerInfo15->GetClassFromObject(object, &classId);
    String typeName = CorProfiler::GetInstance()->GetClassIDName(classId);
    printf("(%ls)\n", typeName.ToWString().c_str());
    return TRUE;
}

observed 9999 objects from a sample app.
Attaching windbg to that same sample app and calling !dumpheap at the same spot where EnumerateGCHeapObjects was invoked led to 10082 objects, 83 of which were free.

Furthermore, the Class types counts from both the heap walk function and !dumpheap matched up.

Profiler tests

Added unit tests for the following behaviors:

  • Custom object type shows up in the GC Heap walk
  • Profiler can invoke EnumerateGCHeapObjects within the scope of a profiler requested runtime suspension with SuspendRuntime/ResumeRuntime
  • Multiple threads racing to suspend EE will have the slower thread waiting until the faster thread resumes the runtime

mdh1418 added 4 commits June 13, 2024 15:59
After making changes to corprof.idl and building the clr subset
the generated corprof.h needs to be manually copied over from
the artifacts obj directory into the prebuilt directory.
Copy link
Contributor

Tagging subscribers to this area: @tommcdon
See info in area-owners.md if you want to be subscribed.

src/coreclr/inc/corprof.idl Outdated Show resolved Hide resolved

// Walk the GC Heap
unsigned max_generation = hp->GetMaxGeneration();
hp->DiagWalkHeap((walk_fn)callback, callbackState, max_generation, TRUE);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wiktork does the full stop the world gen2 GC meet your performance criteria? I thought you were looking for something lighter weight, although I haven't been super involved in the design process

Copy link
Member

@wiktork wiktork Jul 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how this compares to dump collection, which is basically what we are trying to be faster than. My previous discussions with Noah suggested that this would be significantly faster. I was also under the impression that we would be able to heap walk after a suspend, without the need to GC.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this API we solely suspend EE + fix allocation contexts + walk the gc heap + repair allocation contexts + resume EE, all without performing a GC.

mdh1418 added 3 commits June 21, 2024 00:24
Add parameter check
Add SuspendEE akin to SuspendRuntime
Add DiagWalkHeap
Add RestartEE akin to ResumeRuntime
@mdh1418 mdh1418 force-pushed the async_heap_enumeration_icorprof_shim branch from d849af2 to c40ecbc Compare June 21, 2024 04:29
@mdh1418 mdh1418 marked this pull request as ready for review June 21, 2024 15:42
Copy link
Member

@noahfalk noahfalk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good so far, I think the main thing remaining is to add a test case for the new API.

src/coreclr/pal/prebuilt/inc/corprof.h Show resolved Hide resolved
src/coreclr/vm/proftoeeinterfaceimpl.cpp Show resolved Hide resolved
@mdh1418 mdh1418 force-pushed the async_heap_enumeration_icorprof_shim branch from 5cc754a to 69ddad7 Compare July 8, 2024 18:49
@mdh1418
Copy link
Member Author

mdh1418 commented Jul 9, 2024

Still having a bit of trouble reproducing the CI failures, tried to follow the build steps through the different legs via
./build.cmd -arch x64 -os windows -s clr+libs -c Debug -rc Checked
./src/tests/build.cmd skipmanaged skipgeneratelayout Checked x64
./src/tests/build.cmd copynativeonly log Native checked x64 /p:LibrariesConfiguration=Debug
./src/tests/build.cmd buildtestwrappersonly log Wrappers checked x64 /p:LibrariesConfiguration=Debug
./src/tests/build.cmd generatelayoutonly log Layout checked x64 /p:LibrariesConfiguration=Debug
Maybe I'm missing an intermediate step, but I attempt to run the dotnet command shown in the console log but I get

BEGIN: coreclr_initialize failed - Error: 0x80070057
END: coreclr_initialize failed - Error: 0x80070057

and I think when I attempted to run the test command script directly .\artifacts\tests\coreclr\windows.x64.Checked\profiler\gcheapenumeration\gcheapenumeration\gcheapenumeration.cmd there wasn't a gcheapenumeration.dll until I built the test manually with .\src\tests\build.cmd test profiler/gcheapenumeration/gcheapenumeration.csproj /p:Configuration=Checked /p:LibrariesConfiguration=Debug which seems to deviate from how its built on CI.

And building manually/running the test command, I haven't been able to repro the error on CI, which seems to be unable to find the custom GCHeap object for all checked no_tiered_compilation test legs.

Any other ideas to repro @noahfalk ? Where might the dumps be for the windows/linux runtime tests legs?
I can also continue to try to push up changes to the tests just to see if our suspected fixes might work.

src/tests/profiler/gcheapenumeration/gcheapenumeration.cs Outdated Show resolved Hide resolved
Thread enumerateThread = new Thread(() =>
{
startEvent.WaitOne();
Thread.Sleep(1000);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is testing what you intend it to. The GC.Collect() keeps managed code suspended which means this sleep isn't waking up until the GC.Collect() is finished. By the time EnumerateGCHeapObjects() runs there is no concurrent suspension happening.

To test concurrent suspends you will need to coordinate it from native code. For example you might do this:

GCHeapEnumerationProfiler::GarbageCollectionStarted()
{
    _gcStartSleeping = TRUE;
    // if this new call to EnumerateGCHeapObjects doesn't wait for the GC then it is going to
    // observe _gcStartSleeping == TRUE
    thread_list.emplace_back(std::thread([instance]()
                {
                    instance->ValidateEnumerateGCHeapObjects(S_OK);
                }));
   std::this_thread::sleep_for(std::chrono::seconds(1));
   _gcStartSleeping = FALSE;
}

static BOOL STDMETHODCALLTYPE heap_walk_fn(ObjectID object, void* callbackState)
{
    CallbackState* pState = (CallbackState*)callbackState;
    if(pState->instance._gcStartSleeping)
    {
        //Error, no callbacks expected while the GC collection was going on
    }
    ...
}

HRESULT GCHeapEnumerationProfiler::Shutdown()
{
    // make sure async operations have finished
    for (auto& t : threadList) {
        t.join();
    }
}

Copy link
Member Author

@mdh1418 mdh1418 Jul 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For clarification, this would be testing invoking ProfToEEInterfaceImpl::EnumerateGCHeapObjects during a non-profiler requested runtime suspension? As in we are expecting it to hit ProfToEEInterfaceImpl::EnumerateGCHeapObjects's else if in

    if (!ThreadSuspend::SysIsSuspendInProgress() && (ThreadSuspend::GetSuspensionThread() == 0))
    {
        g_profControlBlock.fProfilerRequestedRuntimeSuspend = TRUE;
        ThreadSuspend::SuspendEE(ThreadSuspend::SUSPEND_REASON::SUSPEND_FOR_PROFILER);
        ownEESuspension = TRUE;
    }
    else if (!g_profControlBlock.fProfilerRequestedRuntimeSuspend)
    {
        return CORPROF_E_SUSPENSION_IN_PROGRESS;
    }
    else
    {
        // The profiler requested a runtime suspension before invoking this API,
        // so the responsibility of resuming the runtime is outside the scope of this API.
        // Given that the profiler owns the suspension, walking the GC Heap is safe.
    }

and throw the error CORPROF_E_SUSPENSION_IN_PROGRESS

Originally, I was intending the 3rd test
to be the "background native thread invoking EnumerateGCHeapObjects and rejoining a thread that has an ongoing non-profiler requested runtime suspension" test, and for this (4th test) to be multiple managed threads racing for ThreadSuspend::SuspendEE, in case the above if block

    if (!ThreadSuspend::SysIsSuspendInProgress() && (ThreadSuspend::GetSuspensionThread() == 0))
    {
        g_profControlBlock.fProfilerRequestedRuntimeSuspend = TRUE;
        ThreadSuspend::SuspendEE(ThreadSuspend::SUSPEND_REASON::SUSPEND_FOR_PROFILER);
        ownEESuspension = TRUE;
    }

is entered before a different native thread invokes ThreadSuspend::SuspendEE (which would have toggled the condition to be false)

Copy link
Member

@noahfalk noahfalk Jul 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For clarification, this would be testing invoking ProfToEEInterfaceImpl::EnumerateGCHeapObjects during a non-profiler requested runtime suspension?

There are two different variations of this:

  1. Thread A suspends in the runtime code, Thread B calls EnumerateGCHeapObjects, Thread A resumes
  2. Thread A suspends in the runtime code, Thread A calls EnumerateGCHeapObjects, Thread A resumes.

The first case should work but thread B will block until Thread A resumes the runtime. The 2nd case would usually return CORPROF_E_UNSUPPORTED_CALL_SEQUENCE, but there are corner cases where it might make it past that first check and return CORPROF_E_SUSPENSION_IN_PROGRESS instead.

As in we are expecting it to hit ProfToEEInterfaceImpl::EnumerateGCHeapObjects's else if in

I proposed some alternate logic for how you would check for suspensions. The test case I suggested in my comment above does the EnumerateGCHeapObjects on a different thread than the one which suspended the runtime so I'd expect it to enumerate successfully after waiting for the GC to complete. It should run through this part of the code I recommended above:

else
{
    _ASSERTE(!suspendedByThisThread);
    // Its possible some background threads are suspending and resuming the runtime
    // concurrently. We need to suspend the runtime on this thread to be certain the heap

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I missed the above recommendations for checking thread suspension within ProfToEEInterfaceImpl::EnumerateGCHeapObjects, after implementing a test as you suggested in the first comment in this thread, I attached a debugger and the background thread that invokes EnumerateGCHeapObjects does indeed get blocked on ThreadSuspend::SuspendEE until GC completes

00 0000001b`0657f218 00007ffb`06d640e3     ntdll!ZwWaitForAlertByThreadId+0x14
01 0000001b`0657f220 00007ffb`06d51ce4     ntdll!TpWorkOnBehalfClearTicket+0x2c3
02 0000001b`0657f330 00007ffb`06d51ad2     ntdll!RtlEnterCriticalSection+0x254
03 0000001b`0657f3b0 00007ffa`5b26652e     ntdll!RtlEnterCriticalSection+0x42
04 0000001b`0657f3e0 00007ffa`5b51372b     coreclr!CrstBase::Enter+0x27e [C:\Users\mihw\source\repos\mdh1418\runtime\src\coreclr\vm\crst.cpp @ 325] 
05 0000001b`0657f440 00007ffa`5b8abd8b     coreclr!ThreadStore::Enter+0x1bb [C:\Users\mihw\source\repos\mdh1418\runtime\src\coreclr\vm\threads.cpp @ 5092] 
06 0000001b`0657f560 00007ffa`5b8b18ad     coreclr!ThreadSuspend::LockThreadStore+0x25b [C:\Users\mihw\source\repos\mdh1418\runtime\src\coreclr\vm\threadsuspend.cpp @ 1882] 
07 0000001b`0657f690 00007ffa`5b808f13     coreclr!ThreadSuspend::SuspendEE+0x12d [C:\Users\mihw\source\repos\mdh1418\runtime\src\coreclr\vm\threadsuspend.cpp @ 5559] 
08 0000001b`0657f770 00007ffa`672cd542     coreclr!ProfToEEInterfaceImpl::EnumerateGCHeapObjects+0x473 [C:\Users\mihw\source\repos\mdh1418\runtime\src\coreclr\vm\proftoeeinterfaceimpl.cpp @ 7707] 
09 0000001b`0657f8f0 00007ffa`672cdda8     Profiler!GCHeapEnumerationProfiler::EnumerateGCHeapObjects+0xa2 [C:\Users\mihw\source\repos\mdh1418\runtime\src\tests\profiler\native\gcheapenumerationprofiler\gcheapenumerationprofiler.cpp @ 186] 
0a 0000001b`0657f960 00007ffa`672cd335     Profiler!GCHeapEnumerationProfiler::ValidateEnumerateGCHeapObjects+0x18 [C:\Users\mihw\source\repos\mdh1418\runtime\src\tests\profiler\native\gcheapenumerationprofiler\gcheapenumerationprofiler.cpp @ 190] 
0b 0000001b`0657f9a0 00007ffa`672ccab4     Profiler!<lambda_e2419102f21665b3332b03f113586c25>::operator()+0x25 [C:\Users\mihw\source\repos\mdh1418\runtime\src\tests\profiler\native\gcheapenumerationprofiler\gcheapenumerationprofiler.cpp @ 74] 
0c 0000001b`0657f9d0 00007ffa`672cbe19     Profiler!std::invoke<<lambda_e2419102f21665b3332b03f113586c25> >+0x14 [C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.40.33807\include\type_traits @ 1705] 
0d 0000001b`0657fa00 00007ffa`673b63a0     Profiler!std::thread::_Invoke<std::tuple<<lambda_e2419102f21665b3332b03f113586c25> >,0>+0x59 [C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.40.33807\include\thread @ 61] 
0e 0000001b`0657fa60 00007ffb`055f257d     Profiler!thread_start<unsigned int (__cdecl*)(void *),1>+0xb0 [minkernel\crts\ucrt\src\appcrt\startup\thread.cpp @ 97] 
0f 0000001b`0657fac0 00007ffb`06d8af28     KERNEL32!BaseThreadInitThunk+0x1d
10 0000001b`0657faf0 00000000`00000000     ntdll!RtlUserThreadStart+0x28

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modified the test case to simulate scenario 1, where the thread suspending runtime through GC spins off another native thread to invoke EnumerateGCHeapObjects.

Will add the test case to simulate scenario 2 if I can find the test attribute that bypasses contract violation assert and hits the expected errors.

SHUTDOWNGUARD();
printf("GCHeapEnumerationProfiler::GarbageCollectionStarted\nSleeping for 10 seconds\n");
std::this_thread::sleep_for(std::chrono::seconds(10));
return S_OK;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets validate that if you try to call EnumerateGCHeapObjects from inside CollectionStart/CollectionFinished that it doesn't work. I believe these callbacks are both no-trigger callbacks so I'd expect the calls to fail with CORPROF_E_UNSUPPORTED_CALL_SEQUENCE if you've added the additional kP2EETriggers check above.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling EnumerateGCHeapObjects from inside GarbageCollectionStarted throws a Contract violation assert from the CONTRACTL/CONTRACTL_END scope before hitting the PROFILER_TO_CLR_ENTRYPOINT_ASYNC_EX macro and additional !IsCalledAsynchronously() && !AreCallbackStateFlagsSet(COR_PRF_CALLBACKSTATE_IN_TRIGGERS_SCOPE) check

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I expect that test case would only work in a retail build. I thought we had some test attribute that did that but don't worry about including it if it isn't straigtforward.

Copy link
Member

@noahfalk noahfalk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple minor suggestions inline, overall looks good! I'll also respond back to some of the existing comments in just a moment.

src/coreclr/inc/corprof.idl Outdated Show resolved Hide resolved
src/coreclr/vm/proftoeeinterfaceimpl.cpp Outdated Show resolved Hide resolved
@mdh1418
Copy link
Member Author

mdh1418 commented Jul 11, 2024

The dotnet-linker-tests failure should be fixed by #104606

@mdh1418 mdh1418 merged commit 39b6512 into dotnet:main Jul 11, 2024
93 of 95 checks passed
@mdh1418 mdh1418 deleted the async_heap_enumeration_icorprof_shim branch July 11, 2024 17:14
@github-actions github-actions bot locked and limited conversation to collaborators Aug 11, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants