Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use canonical method in TransitionFrames whenever we parse signatures #33733

Merged
merged 3 commits into from
Apr 11, 2020

Conversation

fadimounir
Copy link
Contributor

I've been investigating the crash dump from #32848, and the main issue is that the GC is calling into the TypeLoader to construct new types. This should never be allowed, and what made it worse in that crash dump is the fact that server GC was being used, so the call to GetThread() was returning NULL, causing an AV during type construction.

The AV needs the following conditions to repro:

  1. The use of server GC. I also believe it should be possible to use regular GC, and craft a repro test that would hit an assert in the type loader, but may not necessarily AV.
  2. Some luck with inlining heuristics
  3. Virtual calls into generic methods (or static methods on generic types)
  4. The generic method must be using reference types as instantiation arguments (i.e. have a valid canonical method)
  5. For that generic method to be executing its PreStub for the first time.
  6. And finally, for GC to trigger on another thread, and enumerate that method that is currently executing its Prestub.

cc @dotnet/crossgen-contrib

@Dotnet-GitSync-Bot
Copy link
Collaborator

I couldn't add an area label to this PR.

Checkout this page to find out which area owner to ping, or please add exactly one area label to help train me in the future.

@fadimounir fadimounir added area-GC-coreclr tenet-reliability Reliability/stability related issue (stress, load problems, etc.) labels Mar 18, 2020
src/coreclr/src/vm/frames.h Outdated Show resolved Hide resolved
@jkotas
Copy link
Member

jkotas commented Mar 18, 2020

I believe that this is expected to be handled by passing in CLASS_LOAD_APPROXPARENTS. See https://github.com/dotnet/runtime/blob/master/src/coreclr/src/vm/siginfo.cpp#L1057 and related code. Do you see CLASS_LOAD_APPROXPARENTS not being passed in when you see the crash?

@fadimounir
Copy link
Contributor Author

Do you see CLASS_LOAD_APPROXPARENTS not being passed in when you see the crash?

CLASS_LOAD_APPROXPARENTS is passed correctly. I'm also able to repro this locally pretty easily with gcstress and server gc (see attached regression test)

@fadimounir
Copy link
Contributor Author

@jkotas to make sure I understand the issue correctly, is this supposed to be handled on the TypeLoader side, to ensure we do not construct new types?

@fadimounir fadimounir changed the title Use canonical method in TransitionFrames whenever we parse signatures [WIP] Use canonical method in TransitionFrames whenever we parse signatures Mar 18, 2020
@davidwrighton
Copy link
Member

@fadimounir I understand that approxparents is being passed correctly, but why isn't it preventing the exact type load? It seems that should be fixed instead of this.

@fadimounir
Copy link
Contributor Author

I understand that approxparents is being passed correctly, but why isn't it preventing the exact type load?

This is a type that has never been loaded before, and approxparents doesn't help in this case of server GC (and AFAIK, not even with regular GC). The issue with server GC is that it doesn't have a valid Thread object associated with its thread, and the TypeLoader uses that Thread object in its locking logic. Without one, we AV right now. Even without server GC, in a checked build, we will hit this debug only code here: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/vm/siginfo.cpp#L1104, which uses the DontLoadTypes flag, and we'll hit the assert right after this call because we'll get back a null TypeHandle.

My understanding is that approxparents won't help if the type has not been previously loaded to at least approxparents, because we should never be allowed to create types (not even to the level of approxparents) during GC. Please correct me if my understanding is wrong.

One possible easy way to fix this is to handle the null Thread* case, to avoid the AV, but I believe this would be a hack to mask one of the symptoms, and the bottom line is, we should still never load any type during GC. So I don't think this would be a correct fix.

I currently see 2 patterns here where we could be calling into a MethodDesc's entry point without potentially loading the valuetypes in the signature:

  1. Entry points returned from dictionary lookups that we directly call into (Ex: delegates. I haven't verified that the bug exists in this scenario though).
  2. Entry points returned by the JIT_VirtualFunctionPointer helper (the case of this bug here)

In these two cases, when jitting the callers of such methods, the JIT is dealing with the canonical version of things, so all the places where the JIT/JIT interface calls into the classMustBeLoadedBeforeCodeIsRun won't help.

The possible fix I see here is to call the MethodDesc::WalkValueTypeParameters() on these MethodDescs for which we return an entry point out of the two lookups I mentioned above, and load valuetypes in the signature to at least approxparents level.

If this doesn't feel like the right thing to do, this means that when we load these MethodDescs on the TypeLoader side, we should perform this step of walking the parameters and loading them, but this could be an overkill, and I think it might be better to perform that kind of loading only when we're about to execute the methods.

@jkotas, @davidwrighton I'd like to get your thoughts on this proposed fix. If you'd like, I can post an updated PR with what I'm proposing for clarification.

@davidwrighton
Copy link
Member

@fadimounir My understanding from our offline conversations is that there is some attempt to load a type like SomeGenericStruct<object>, and we are forcing that actual type to be loaded. The dropGenericArgumentLevel stuff is supposed to force us to attempt to load SomeGenericStruct<__Canon>, which was already loaded. If there isn't a canonical type already loaded, that's a different problem, but if the issue is parsing through a signature that specifies SomeGenericStruct<object> and we're actually attempting to load that type (at all), we shouldn't be doing that, and the fix isn't to use a canonical method, its to have the signature loader trigger the attempted load of the __Canon variant of the struct.

@fadimounir
Copy link
Contributor Author

The dropGenericArgumentLevel stuff is supposed to force us to attempt to load SomeGenericStruct<__Canon>, which was already loaded

@davidwrighton thanks for bringing that to my attention. Let me examine this further and see why we're still attempting to load the fully instantiated type even though the dropGenericArgumentLevel is set to true

@fadimounir
Copy link
Contributor Author

Looks like the issue here is that dropGenericArgumentLevel is not handled correctly. We only attempt to canonicalize some variants of the signature (ex: if the signature contains ref type arguments), but we're not handling things like ELEMENT_TYPE_MVAR correctly.

First of all, is there a reason why ELEMENT_TYPE_VAR is treated differently than ELEMENT_TYPE_MVAR or is that just a bug that hasn't been noticed before? See: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/vm/siginfo.cpp#L1238

Also, would it make sense to add something like this for each instantiation type argument we load, before attempting to load the instantiation itself (to be on the safe side)?:

if (dropGenericArgumentLevel && level == CLASS_LOAD_APPROXPARENTS)
{
    typeHnd = ClassLoader::CanonicalizeGenericArg(typeHnd);
}

@davidwrighton
Copy link
Member

@fadimounir that sounds like the right fix. I'm wresting with 2 questions as I struggle to give you advice.

  1. Should this cover both VAR and MVAR cases?
  2. Should this change be restricted to only the stackwalking and GC scenarios for dropGenericArgumentLevel or not? dropGenericArgumentLevel is used in a bunch of places in the typeloader, and affecting those cases might not be desireable.

I'd like to see the results of a full outerloop (including some GCSTRESS) with the change applied to VAR and MVAR, and not protected to only the stackwalking/gc scenarios.

@jkotas
Copy link
Member

jkotas commented Mar 20, 2020

is there a reason why ELEMENT_TYPE_VAR is treated differently than ELEMENT_TYPE_MVAR

I do not think you can ever get substitutions with MVARs today. Substituations are only used for method overriding that makes them applicable to type VARs only.

@fadimounir
Copy link
Contributor Author

fadimounir commented Mar 20, 2020

Should this cover both VAR and MVAR cases?

We don't have the same problem with VARs. I couldn't repro any failures there, not even with static methods on generic types.

If there's a risk of breaking change here, we can limit it to MVAR and only to the GC/stackwalk scenario:

thRet = (psig.GetTypeVariableThrowing(pModule, typ, fLoadTypes, pTypeContext));
if (fLoadTypes == ClassLoader::LoadTypes)
    ClassLoader::EnsureLoaded(thRet, level);

if ((IsGCThread() || IsStackWalkerThread()) && dropGenericArgumentLevel && level == CLASS_LOAD_APPROXPARENTS)
    thRet = ClassLoader::CanonicalizeGenericArg(thRet);

However, i'll submit a new commit to this PR where the changes are not just scoped to GC/stackwalking, and run an outerloop GCSTRESS job to see if that causes any failures

@davidwrighton
Copy link
Member

The reason I wanted the run on the VAR cases is that we have much better coverage for odd things happening to generic types than to methods. I'm not entirely confident it won't break anything, and I'm not even sure that if it doesn't we shouldn't only fix MVAR, but having some confidence that this change doesn't negatively impact even when applied to both VAR and MVAR would be good.

@fadimounir
Copy link
Contributor Author

but having some confidence that this change doesn't negatively impact even when applied to both VAR and MVAR would be good.

Sounds good. I pushed a commit that applies canonicalization after the call to GetTypeHandleThrowing() returns, so this covers VAR, MVAR, and more. I'm running GCSTRESS with outerloop right now, and we'll see how this goes

@fadimounir
Copy link
Contributor Author

The gcstress results are not as clean as baseline. I'll investigate further

@fadimounir
Copy link
Contributor Author

fadimounir commented Apr 6, 2020

@davidwrighton I reran the gcstress tests after submitting a fix around pinvokes, and the results are now comparable to the baseline (modulo the failures related to #33366).

I believe the changes should be ready to merge, although I have seen the regression test that I added fail non-deterministically on Windows arm configurations. It's not an AV or anything, but for some reason the test script returns 3 as an exit code, and the test does not emit any stdout. No idea what's going on there. I've been running the test constantly in a loop on a separate Windows arm device, using the same payload from the CI, and haven't seen any crashes. Not sure how to investigate this issue further. There are no dumps from the CI, and it doesn't seem like the test crashed really...
cc @jkotas

Here's the test output:

    GC\Regressions\Github\runtime_32848\runtime_32848\runtime_32848.cmd [FAIL]
      
      Return code:      1
      Raw output file:      D:\h\w\B2610973\w\AF970944\e\GC\Regressions\Reports\GC.Regressions\Github\runtime_32848\runtime_32848\runtime_32848.output.txt
      Raw output:
      BEGIN EXECUTION
       "D:\h\w\B2610973\p\corerun.exe" runtime_32848.dll 
      Expected: 100
      Actual: 3
      END EXECUTION - FAILED
      FAILED
      Test Harness Exitcode is : 1
      To run the test:
      > set CORE_ROOT=D:\h\w\B2610973\p
      > D:\h\w\B2610973\w\AF970944\e\GC\Regressions\Github\runtime_32848\runtime_32848\runtime_32848.cmd
      Expected: True
      Actual:   False
      Stack Trace:
        F:\workspace\_work\1\s\artifacts\tests\coreclr\Windows_NT.arm.Checked\TestWrappers\GC.Regressions\GC.Regressions.XUnitWrapper.cs(140,0): at GC_Regressions._Github_runtime_32848_runtime_32848_runtime_32848_._Github_runtime_32848_runtime_32848_runtime_32848_cmd()
      Output:
        
        Return code:      1
        Raw output file:      D:\h\w\B2610973\w\AF970944\e\GC\Regressions\Reports\GC.Regressions\Github\runtime_32848\runtime_32848\runtime_32848.output.txt
        Raw output:
        BEGIN EXECUTION
         "D:\h\w\B2610973\p\corerun.exe" runtime_32848.dll 
        Expected: 100
        Actual: 3
        END EXECUTION - FAILED
        FAILED
        Test Harness Exitcode is : 1
        To run the test:
        > set CORE_ROOT=D:\h\w\B2610973\p
        > D:\h\w\B2610973\w\AF970944\e\GC\Regressions\Github\runtime_32848\runtime_32848\runtime_32848.cmd

@fadimounir fadimounir changed the title [WIP] Use canonical method in TransitionFrames whenever we parse signatures Use canonical method in TransitionFrames whenever we parse signatures Apr 6, 2020
@davidwrighton
Copy link
Member

I suspect something very fishy is happening with test startup somewhere. Could you modify the test to always print out a message at the start of Main? That might help show the problem.
The other detail is that error code 3 in windows is ERROR_PATH_NOT_FOUND which indicates to me that something might be wrong with distributing binaries to this test hardware.

@fadimounir
Copy link
Contributor Author

@davidwrighton Same issue with the test failure here. I added a Console.WriteLine at the beginning of Main, but still that is not showing in the stdout.
@MattGal Would you have any suggestions on how to fix this, or what the underlying issue is?

@MattGal
Copy link
Member

MattGal commented Apr 7, 2020

@davidwrighton Same issue with the test failure here. I added a Console.WriteLine at the beginning of Main, but still that is not showing in the stdout.
@MattGal Would you have any suggestions on how to fix this, or what the underlying issue is?

Taking a look

@MattGal
Copy link
Member

MattGal commented Apr 7, 2020

I had a quick phone chat with Fadi. I'm going to take one of our machines that ran this earlier offline and try out these bits on it, and synch up with Fadi later on. My naive guess here is that the precommands being included might be messing up the current path.

@MattGal
Copy link
Member

MattGal commented Apr 7, 2020

I had a quick phone chat with Fadi. I'm going to take one of our machines that ran this earlier offline and try out these bits on it, and synch up with Fadi later on. My naive guess here is that the precommands being included might be messing up the current path.

The issue is a CHK assert coming out of the runtime, which when dismissed by the dialog handler on the machine manifests as exit code 3. I'm working with Fadi on ideas but it's clear the machines we have have a newer OS and different hardware, so that is likely the source of the difference between local and remote repro of the bug.

@fadimounir
Copy link
Contributor Author

I'm working with Fadi on ideas

I changed all calls to assert() in the problematic file to _ASSERTE, which first of all does not show the assert GUI dialog, and has a better chance of causing the infrastructure to save a crash dump and associate that with the test results. Hopefully we get a crash dump this time, so we're able to figure out what's going on.

@fadimounir
Copy link
Contributor Author

I changed all calls to assert() in the problematic file to _ASSERTE

No crash dumps with _ASSERTE. I'll add some dummy code for now that will cause an AV, so I can get a dump for the failure

@fadimounir
Copy link
Contributor Author

With Matt's help, I was able to get my hands on a crash dump and look at the callstack:

0:000> !clrstack -f
OS Thread Id: 0x2b3c (0)
Child SP       IP Call Site
04E0C998 68446D20 coreclr!common_assert_to_message_box + 120 at minkernel\crts\ucrt\src\appcrt\startup\assert.cpp:389
04E0CE50 68446C9F coreclr!common_assert + 79 at minkernel\crts\ucrt\src\appcrt\startup\assert.cpp:424
04E0CE70 68447BEB coreclr!_wassert + 27 at minkernel\crts\ucrt\src\appcrt\startup\assert.cpp:444
04E0CE90 68359225 coreclr!GCToOSInterface::GetCurrentProcessorNumber + 57 at F:\workspace\_work\1\s\src\coreclr\src\vm\gcenv.os.cpp:274
04E0CEA8 6839D0C3 coreclr!SVR::heap_select::select_heap + 19 at F:\workspace\_work\1\s\src\coreclr\src\gc\gc.cpp:5267
04E0CED0 683866CF coreclr!SVR::gc_heap::balance_heaps + 31 at F:\workspace\_work\1\s\src\coreclr\src\gc\gc.cpp:14174
04E0CF28 6838281B coreclr!SVR::gc_heap::allocate_more_space + 31 at F:\workspace\_work\1\s\src\coreclr\src\gc\gc.cpp:14533
04E0CF50 6837D731 coreclr!SVR::GCHeap::Alloc + 273 at F:\workspace\_work\1\s\src\coreclr\src\gc\gc.cpp:37197
04E0CF90 68236FC5 coreclr!Alloc + 205 at F:\workspace\_work\1\s\src\coreclr\src\vm\gchelpers.cpp:242
04E0CFB8 682377B1 coreclr!AllocateObject + 157 at F:\workspace\_work\1\s\src\coreclr\src\vm\gchelpers.cpp:1024
04E0CFE8 681CF3F1 coreclr!JIT_New + 209 at F:\workspace\_work\1\s\src\coreclr\src\vm\jithelpers.cpp:2313
04E0D024          [HelperMethodFrame: 04e0d024] 
04E0D0C0 299ad68a system.console.dll!System.Console..cctor() + 30
04E0DBFC          [HelperMethodFrame: 04e0dbfc] 
04E0DC98 299ad590 system.console.dll!System.Console.get_Out() + 40
04E0DCC0 299ad140 system.console.dll!System.Console.WriteLine(System.String) + 24
04E0DCD8 299acdbe runtime_32848.dll!Test.Main() + 34

There's nothing special here: it's a simple Console.WriteLine that is triggering the assert, and the only special thing is that server GC is enabled. I also couldn't find any evidence of heap corruption. There is a pretty large number of threads though (27 threads). It could be the case that this is just hitting a corner case bug in the OS. The processor number returned by the GetCurrentProcessorNumberEx() API is 0x70, which doesn't make sense.

So here's what i'm going to do:

  1. Package a repro and as much info and send that to some contact on the Windows team (@davidwrighton you mentioned you knew someone there, could you please send me some contact info)?
  2. Remove complus_gcserver from the test. The bug exists with non-server gc, and we have asserts in place on debug builds that would catch the issue (no crashes like the server GC case).

Also changing a bunch of assert() calls to _ASSERTE. Usually when _ASSERTE fails in CI lab runs, we tend to get crash dumps associated with test results, unlike assert() which shows a GUI dialog that DHandler dismisses by clicking on the Abort button.
The test will run with gcstress as part of the gcstress lab runs
@fadimounir fadimounir merged commit 7fe4e5b into dotnet:master Apr 11, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 10, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-GC-coreclr tenet-reliability Reliability/stability related issue (stress, load problems, etc.)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants