
Test failed: System.Text.RegularExpressions.Tests.RegexCacheTests.Ctor_Cache_Promote_entries fails with Timeout #13610

Closed
ahsonkhan opened this issue Oct 18, 2019 · 97 comments

Comments

@ahsonkhan
Member

From dotnet/corefx#41753

System.Text.RegularExpressions.Tests.RegexCacheTests.Ctor_Cache_Promote_entries
netcoreapp-Windows_NT-Debug-x64-(Windows.Nano.1809.Amd64.Open)

https://dev.azure.com/dnceng/public/_build/results?buildId=393033&view=ms.vss-test-web.build-test-results-tab&runId=12213282&resultId=145033&paneView=debug

https://helix.dot.net/api/2019-06-17/jobs/421e5238-fb5c-4d4f-8d24-a8e79abd46eb/workitems/System.Text.RegularExpressions.Tests/console

===========================================================================================================

C:\helix\work\workitem>"C:\helix\work\correlation\dotnet.exe" exec --runtimeconfig System.Text.RegularExpressions.Tests.runtimeconfig.json xunit.console.dll System.Text.RegularExpressions.Tests.dll -xml testResults.xml -nologo -nocolor -notrait category=nonnetcoreapptests -notrait category=nonwindowstests -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing  
  Discovering: System.Text.RegularExpressions.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  System.Text.RegularExpressions.Tests (found 135 test cases)
  Starting:    System.Text.RegularExpressions.Tests (parallel test collections = on, max threads = 2)
    System.Text.RegularExpressions.Tests.RegexCacheTests.Ctor_Cache_Promote_entries [FAIL]
      Timed out at 10/18/2019 11:24:33 AM after 60000ms waiting for remote process.
      Wrote mini dump to: C:\helix\work\workitem\uploads\4708.wrrjax4w.c3b.dmp
      	Process ID: 4708
      	Handle: 912
      
      Stack Trace:
        /_/src/Microsoft.DotNet.RemoteExecutor/src/RemoteInvokeHandle.cs(152,0): at Microsoft.DotNet.RemoteExecutor.RemoteInvokeHandle.Dispose(Boolean disposing)
        /_/src/Microsoft.DotNet.RemoteExecutor/src/RemoteInvokeHandle.cs(55,0): at Microsoft.DotNet.RemoteExecutor.RemoteInvokeHandle.Dispose()
        /_/src/System.Text.RegularExpressions/tests/Regex.Cache.Tests.cs(71,0): at System.Text.RegularExpressions.Tests.RegexCacheTests.Ctor_Cache_Promote_entries()
  Finished:    System.Text.RegularExpressions.Tests
=== TEST EXECUTION SUMMARY ===
   System.Text.RegularExpressions.Tests  Total: 1454, Errors: 0, Failed: 1, Skipped: 0, Time: 118.055s
----- end Fri 10/18/2019 11:25:06.32 ----- exit code 1 ----------------------------------------------------------

System.Text.RegularExpressions.Tests.RegexCacheTests.Ctor_Cache_Uses_dictionary_linked_list_switch_does_not_throw
netcoreapp-Linux-Debug-x64-RedHat.7.Amd64.Open

https://dev.azure.com/dnceng/public/_build/results?buildId=393033&view=ms.vss-test-web.build-test-results-tab&runId=12213340&paneView=debug

https://helix.dot.net/api/2019-06-17/jobs/82ca2546-9df2-4b32-ab1f-d00835fcbe01/workitems/System.Text.RegularExpressions.Tests/console

===========================================================================================================
~/work/82ca2546-9df2-4b32-ab1f-d00835fcbe01/Work/93c6bbb0-68cc-493d-8eb4-85722fafeac6/Exec ~/work/82ca2546-9df2-4b32-ab1f-d00835fcbe01/Work/93c6bbb0-68cc-493d-8eb4-85722fafeac6/Exec
  Discovering: System.Text.RegularExpressions.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  System.Text.RegularExpressions.Tests (found 135 test cases)
  Starting:    System.Text.RegularExpressions.Tests (parallel test collections = on, max threads = 2)
    System.Text.RegularExpressions.Tests.RegexCacheTests.Ctor_Cache_Uses_dictionary_linked_list_switch_does_not_throw [FAIL]
      Timed out at 10/18/2019 11:07:51 AM after 60000ms waiting for remote process.
      	Process ID: 2283
      	Handle: 892
      
      Stack Trace:
        /_/src/Microsoft.DotNet.RemoteExecutor/src/RemoteInvokeHandle.cs(152,0): at Microsoft.DotNet.RemoteExecutor.RemoteInvokeHandle.Dispose(Boolean disposing)
        /_/src/Microsoft.DotNet.RemoteExecutor/src/RemoteInvokeHandle.cs(55,0): at Microsoft.DotNet.RemoteExecutor.RemoteInvokeHandle.Dispose()
        /_/src/System.Text.RegularExpressions/tests/Regex.Cache.Tests.cs(106,0): at System.Text.RegularExpressions.Tests.RegexCacheTests.Ctor_Cache_Uses_dictionary_linked_list_switch_does_not_throw()
  Finished:    System.Text.RegularExpressions.Tests
=== TEST EXECUTION SUMMARY ===
   System.Text.RegularExpressions.Tests  Total: 1454, Errors: 0, Failed: 1, Skipped: 0, Time: 77.472s
~/work/82ca2546-9df2-4b32-ab1f-d00835fcbe01/Work/93c6bbb0-68cc-493d-8eb4-85722fafeac6/Exec
----- end Fri Oct 18 11:07:57 UTC 2019 ----- exit code 1 ----------------------------------------------------------

cc @ViktorHofer

@stephentoub
Member

stephentoub commented Oct 18, 2019

@ahsonkhan, it should have uploaded a dump from the remote process:
"Wrote mini dump to: C:\helix\work\workitem\uploads\4708.wrrjax4w.c3b.dmp"
Were you able to grab that? It looks like you restarted the legs.

@danmoseley
Member

Previously this was blocked by https://github.com/dotnet/core-eng/issues/7950 (https://github.com/dotnet/corefx/issues/41528). If dumps work now, that's awesome.

@stephentoub
Member

@danmosemsft, the one with the dump is Windows.

@danmoseley
Member

@stephentoub ah, I just saw the RedHat above.

@danmoseley
Member

It looks like https://github.com/dotnet/core-eng/issues/7950 was recently closed, so I would hope to get a dump from that one too.

@stephentoub
Member

so I would hope to get a dump from that one too.

The dump that's relevant here is from the RemoteExecutor.Invoke child process; on Windows we now explicitly P/Invoke to MiniDumpWriteDump in order to save out a dump. That doesn't happen on Linux.
dotnet/arcade#4085
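
For illustration, a minimal sketch of what that parent-side dump capture can look like: P/Invoking MiniDumpWriteDump from DbgHelp.dll against the child's process handle. This is not the actual RemoteExecutor code; the dump type, output path, and helper names here are assumptions.

using System;
using System.ComponentModel;
using System.Diagnostics;
using System.IO;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

internal static class MiniDump
{
    // 0x2 == MiniDumpWithFullMemory in the native MINIDUMP_TYPE enumeration.
    private const int MiniDumpWithFullMemory = 0x2;

    [DllImport("DbgHelp.dll", SetLastError = true)]
    private static extern bool MiniDumpWriteDump(
        IntPtr hProcess, uint processId, SafeFileHandle hFile, int dumpType,
        IntPtr exceptionParam, IntPtr userStreamParam, IntPtr callbackParam);

    // Writes a minidump of 'process' (e.g. the hung RemoteExecutor child) to 'path'.
    public static void Write(Process process, string path)
    {
        using FileStream stream = File.Create(path);
        if (!MiniDumpWriteDump(process.Handle, (uint)process.Id, stream.SafeFileHandle,
                               MiniDumpWithFullMemory, IntPtr.Zero, IntPtr.Zero, IntPtr.Zero))
        {
            throw new Win32Exception(Marshal.GetLastWin32Error());
        }
    }
}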

@danmoseley
Member

Should we have an issue to find a way to do it on Linux? Presumably deploying and invoking createdump?
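
A rough sketch of that idea, assuming createdump is deployed next to the runtime and accepts the target PID as its argument (the default output location varies by runtime version); this is not an existing helper:

using System.Diagnostics;

internal static class LinuxDumps
{
    // Launches the runtime's createdump utility against a hung child process.
    // 'createdumpPath' is assumed to point at the tool deployed with the shared framework.
    public static void TryCollect(string createdumpPath, int childPid)
    {
        using Process createdump = Process.Start(new ProcessStartInfo
        {
            FileName = createdumpPath,
            Arguments = childPid.ToString(), // by default the tool chooses its own output file name
            RedirectStandardOutput = true,
            RedirectStandardError = true,
        });
        createdump.WaitForExit(30_000); // don't hang the test run if dump collection itself stalls
    }
}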

@stephentoub
Member

Sure :)

@danmoseley
Member

dotnet/arcade#4153

@ahsonkhan
Member Author

Were you able to grab that? It looks like you restarted the legs.

No, sorry. Will remember to do that next time.

@ViktorHofer
Member

ViktorHofer commented Oct 19, 2019

Updated the link above (edited the post above, as the link was wrong).

0:000> k
 # Child-SP          RetAddr           Call Site
00 00000005`6597c310 00007ffc`becda9b5 coreclr!memset_repmovs+0x9 [d:\agent\_work\2\s\src\vctools\crt\vcruntime\src\string\amd64\memset.asm @ 66] 
01 00000005`6597c320 00007ffc`becda417 coreclr!WKS::gc_heap::adjust_limit_clr+0x155 [f:\workspace\_work\1\s\src\gc\gc.cpp @ 12087] 
02 (Inline Function) --------`-------- coreclr!WKS::gc_heap::a_fit_segment_end_p+0x134 [f:\workspace\_work\1\s\src\gc\gc.cpp @ 12864] 
03 (Inline Function) --------`-------- coreclr!WKS::gc_heap::soh_try_fit+0x1ad [f:\workspace\_work\1\s\src\gc\gc.cpp @ 13002] 
04 00000005`6597c390 00007ffc`becda148 coreclr!WKS::gc_heap::allocate_small+0x247 [f:\workspace\_work\1\s\src\gc\gc.cpp @ 13067] 
05 (Inline Function) --------`-------- coreclr!WKS::gc_heap::try_allocate_more_space+0xcb [f:\workspace\_work\1\s\src\gc\gc.cpp @ 13939] 
06 (Inline Function) --------`-------- coreclr!WKS::gc_heap::allocate_more_space+0xcb [f:\workspace\_work\1\s\src\gc\gc.cpp @ 14369] 
07 (Inline Function) --------`-------- coreclr!WKS::gc_heap::allocate+0x116 [f:\workspace\_work\1\s\src\gc\gc.cpp @ 14400] 
08 00000005`6597c490 00007ffc`bec54ee7 coreclr!WKS::GCHeap::Alloc+0x148 [f:\workspace\_work\1\s\src\gc\gc.cpp @ 35826] 
09 00000005`6597c4f0 00007ffc`bec54732 coreclr!Alloc+0x16f [f:\workspace\_work\1\s\src\vm\gchelpers.cpp @ 251] 
0a 00000005`6597c550 00007ffc`bec6603c coreclr!AllocateSzArray+0xee [f:\workspace\_work\1\s\src\vm\gchelpers.cpp @ 543] 
0b (Inline Function) --------`-------- coreclr!AllocateSzArray+0x19 [f:\workspace\_work\1\s\src\vm\gchelpers.cpp @ 421] 
0c 00000005`6597c5f0 00007ffc`bec64d55 coreclr!AllocateObjectArray+0x48 [f:\workspace\_work\1\s\src\vm\gchelpers.cpp @ 964] 
0d 00000005`6597c630 00007ffc`bec54020 coreclr!GCHeapHash<KeyToValuesGCHeapHashTraits<InliningInfoTrackerHashTraits> >::CheckGrowth+0xc9 [f:\workspace\_work\1\s\src\vm\gcheaphashtable.inl @ 248] 
0e (Inline Function) --------`-------- coreclr!GCHeapHash<KeyToValuesGCHeapHashTraits<InliningInfoTrackerHashTraits> >::Add+0x9 [f:\workspace\_work\1\s\src\vm\gcheaphashtable.inl @ 438] 
0f 00000005`6597c660 00007ffc`bec5399e coreclr!CrossLoaderAllocatorHash<InliningInfoTrackerHashTraits>::Add+0x314 [f:\workspace\_work\1\s\src\vm\crossloaderallocatorhash.inl @ 389] 
10 (Inline Function) --------`-------- coreclr!JITInlineTrackingMap::AddInliningDontTakeLock+0x5f3 [f:\workspace\_work\1\s\src\vm\inlinetracking.cpp @ 664] 
11 (Inline Function) --------`-------- coreclr!JITInlineTrackingMap::AddInlining+0xa46 [f:\workspace\_work\1\s\src\vm\inlinetracking.cpp @ 648] 
12 (Inline Function) --------`-------- coreclr!Module::AddInlining+0xa6e [f:\workspace\_work\1\s\src\vm\ceeload.cpp @ 142] 
13 00000005`6597c740 00007ffc`d4649e8a coreclr!CEEInfo::reportInliningDecision+0x110e [f:\workspace\_work\1\s\src\vm\jitinterface.cpp @ 8319] 
14 00000005`6597cd10 00007ffc`d45c6b13 clrjit!InlineResult::Report+0xde [f:\workspace\_work\1\s\src\jit\inline.cpp @ 749] 
15 (Inline Function) --------`-------- clrjit!InlineResult::{dtor}+0x9 [f:\workspace\_work\1\s\src\jit\inline.h @ 420] 
16 00000005`6597cd50 00007ffc`d45c5041 clrjit!Compiler::fgInline+0x273 [f:\workspace\_work\1\s\src\jit\flowgraph.cpp @ 22027] 
17 00000005`6597ced0 00007ffc`d4622d38 clrjit!Compiler::fgMorph+0x131 [f:\workspace\_work\1\s\src\jit\morph.cpp @ 16632] 
18 00000005`6597d1b0 00007ffc`d4631726 clrjit!Compiler::compCompile+0x368 [f:\workspace\_work\1\s\src\jit\compiler.cpp @ 4560] 
19 00000005`6597d2f0 00007ffc`d462fd7d clrjit!Compiler::compCompileHelper+0x296 [f:\workspace\_work\1\s\src\jit\compiler.cpp @ 6099] 
1a 00000005`6597d3a0 00007ffc`d462fa33 clrjit!Compiler::compCompile+0x21d [f:\workspace\_work\1\s\src\jit\compiler.cpp @ 5427] 
1b 00000005`6597d450 00007ffc`d462ecc2 clrjit!jitNativeCode+0x273 [f:\workspace\_work\1\s\src\jit\compiler.cpp @ 6727] 
1c 00000005`6597d610 00007ffc`bec79699 clrjit!CILJit::compileMethod+0x92 [f:\workspace\_work\1\s\src\jit\ee_il_dll.cpp @ 319] 
1d (Inline Function) --------`-------- coreclr!invokeCompileMethodHelper+0x63 [f:\workspace\_work\1\s\src\vm\jitinterface.cpp @ 12464] 
1e 00000005`6597d680 00007ffc`bec7952c coreclr!invokeCompileMethod+0xdd [f:\workspace\_work\1\s\src\vm\jitinterface.cpp @ 12529] 
1f 00000005`6597d720 00007ffc`bec78dbf coreclr!CallCompileMethodWithSEHWrapper+0x50 [f:\workspace\_work\1\s\src\vm\jitinterface.cpp @ 12583] 
20 00000005`6597d7c0 00007ffc`bec77d91 coreclr!UnsafeJitFunction+0xe1f [f:\workspace\_work\1\s\src\vm\jitinterface.cpp @ 13070] 
21 00000005`6597dbc0 00007ffc`bec77889 coreclr!MethodDesc::JitCompileCodeLocked+0x2e1 [f:\workspace\_work\1\s\src\vm\prestub.cpp @ 974] 
22 00000005`6597ddb0 00007ffc`bec771f4 coreclr!MethodDesc::JitCompileCodeLockedEventWrapper+0x33d [f:\workspace\_work\1\s\src\vm\prestub.cpp @ 830] 
23 00000005`6597def0 00007ffc`becbbdca coreclr!MethodDesc::JitCompileCode+0x194 [f:\workspace\_work\1\s\src\vm\prestub.cpp @ 769] 
24 00000005`6597df90 00007ffc`bed595a5 coreclr!MethodDesc::PrepareILBasedCode+0xfa [f:\workspace\_work\1\s\src\vm\prestub.cpp @ 398] 
25 (Inline Function) --------`-------- coreclr!MethodDesc::PrepareCode+0xc [f:\workspace\_work\1\s\src\vm\prestub.cpp @ 330] 
26 00000005`6597dfc0 00007ffc`becb9d06 coreclr!CodeVersionManager::PublishVersionableCodeIfNecessary+0x121 [f:\workspace\_work\1\s\src\vm\codeversion.cpp @ 1800] 
27 00000005`6597e140 00007ffc`becb9815 coreclr!MethodDesc::DoPrestub+0x2b6 [f:\workspace\_work\1\s\src\vm\prestub.cpp @ 1972] 
28 00000005`6597e210 00007ffc`bed4db95 coreclr!PreStubWorker+0x455 [f:\workspace\_work\1\s\src\vm\prestub.cpp @ 1798] 
29 00000005`6597e420 00007ffc`5f264362 coreclr!ThePreStub+0x55 [F:\workspace\_work\1\s\src\vm\amd64\ThePreStubAMD64.asm @ 22] 
2a 00000005`6597e4d0 000001c6`00010fa8 0x00007ffc`5f264362
2b 00000005`6597e4d8 00000005`6597e518 0x000001c6`00010fa8
2c 00000005`6597e4e0 000001c6`0000cc90 0x00000005`6597e518
2d 00000005`6597e4e8 00000000`ffffffff 0x000001c6`0000cc90
2e 00000005`6597e4f0 00000000`0000003e 0xffffffff
2f 00000005`6597e4f8 00000000`00000003 0x3e
30 00000005`6597e500 00000000`00000000 0x3

@ViktorHofer
Member

cc @jkotas

@ViktorHofer
Member

And for learning purposes, to find the uploaded files (i.e. after a retry) you can use the Helix API directly: https://helix.dot.net/api/2019-06-17/jobs/421e5238-fb5c-4d4f-8d24-a8e79abd46eb/workitems/System.Text.RegularExpressions.Tests/files

This will list all uploaded files; in this case I searched for the ".dmp" file and appended it to the URL: "/4708.wrrjax4w.c3b.dmp".
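
For example, a small sketch (not the official Helix client) that lists a work item's uploaded files and downloads any .dmp; the assumption that the endpoint returns a JSON array with "Name" entries is mine:

using System;
using System.IO;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;

internal static class HelixFiles
{
    public static async Task DownloadDumpsAsync()
    {
        const string filesUrl =
            "https://helix.dot.net/api/2019-06-17/jobs/421e5238-fb5c-4d4f-8d24-a8e79abd46eb" +
            "/workitems/System.Text.RegularExpressions.Tests/files";

        using var http = new HttpClient();
        string json = await http.GetStringAsync(filesUrl);

        // Assumption: the endpoint returns a JSON array of objects that include a "Name" field.
        using JsonDocument doc = JsonDocument.Parse(json);
        foreach (JsonElement file in doc.RootElement.EnumerateArray())
        {
            string name = file.GetProperty("Name").GetString();
            if (name is not null && name.EndsWith(".dmp", StringComparison.OrdinalIgnoreCase))
            {
                // The dump itself is reachable by appending the file name to the files URL.
                byte[] dump = await http.GetByteArrayAsync(filesUrl + "/" + name);
                File.WriteAllBytes(name, dump);
            }
        }
    }
}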

@stephentoub
Member

Thanks, @ViktorHofer.

This is happening early on in the invocation of the remote process, as we're looking up the method to invoke:
[screenshot: Visual Studio view of the managed frames and their arguments]

@stephentoub stephentoub transferred this issue from dotnet/corefx Oct 19, 2019
@jkotas
Member

jkotas commented Oct 19, 2019

coreclr!memset_repmovs+0x9

This seems to be stuck on a page-in request in the OS. I do not see anything wrong on the runtime side.

The GC is trying to clear memory from 0x000001c600011398 to 0x000001c6000139d8. The clearing is stuck at address 0x000001c600013000 (i.e. a page boundary).

We need to get more dumps for this to see the pattern.

@jkotas
Member

jkotas commented Oct 19, 2019

From just looking at this dump, my bet would be on infrastructure problem: The physical machine is over-subscribed and super slow.

@jkotas
Member

jkotas commented Oct 19, 2019

we're looking up the method to invoke

Note that VS is hiding the part of the stack in coreclr.dll. We are in the middle of GC that happened to be triggered while we're looking up the method to invoke.

@stephentoub
Member

Note that VS is hiding the part of the stack in coreclr.dll. We are in the middle of GC that happened to be triggered while we're looking up the method to invoke.

Right, understood. I was pasting the image from VS to show the managed frames that were obscured in the windbg stack trace as well as the arguments to those methods.

@danmoseley
Member

Odd that we've had 3 hangs (in RemoteExecutor, in the Regex cache tests). I'd have thought a slow machine might give a more distributed set of hangs...

@jkotas
Member

jkotas commented Oct 21, 2019

They could each have a different root cause. Do you happen to have dumps for the other 2 hangs?

@ViktorHofer
Member

We don't have more dumps yet, and we only produce them on Windows for these kinds of failures.

@jeffschwMSFT jeffschwMSFT changed the title System.Text.RegularExpressions.Tests.RegexCacheTests.Ctor_Cache_Promote_entries fails with Timeout Test failed: System.Text.RegularExpressions.Tests.RegexCacheTests.Ctor_Cache_Promote_entries fails with Timeout Oct 21, 2019
@safern
Member

safern commented Oct 21, 2019

@jkotas
Member

jkotas commented Oct 21, 2019

https://helix.dot.net/api/2019-06-17/jobs/f20c2934-638e-4637-809b-b960d0287af0/workitems/System.Text.RegularExpressions.Tests/files/6112.xqszggzs.qit.dmp

We are in the middle of JITing System.Text.RegularExpressions.Regex.InitDefaultMatchTimeout, called from the Regex static constructor. The thread is not blocked for any obvious reason.

So the common theme is:

  • Windows Nano
  • Early during startup, in the middle of the JIT

Keep more dumps coming...

@jkotas
Member

jkotas commented Oct 23, 2019

Another hit in dotnet/coreclr#27375 (System.Transactions tests). The process is stuck early during startup inside System.Transactions.TransactionsEtwProvider. This hit was on a checked build of the runtime, which allows us to see what happened in the past using the stress log.

The full stress log is here:
log.txt

@jkotas
Member

jkotas commented Oct 23, 2019

The interesting bits of the stress log are:

THREAD  TIMESTAMP     FACILITY                              MESSAGE
  ID  (sec from start)
--------------------------------------------------------------------------------------
173c 153.830905200 : `CLASSLOADER`        MethodTableBuilder: finished method table for module 00007FFD947D4020 token 20006be = 00007FFD94C9C678 (System.Globalization.UnicodeCategory) 
173c 152.449501000 : `CLASSLOADER`        MethodTableBuilder: finished method table for module 00007FFD947D4020 token 200037c = 00007FFD94AB3200 (System.Runtime.Intrinsics.Vector128`1) 

1108 110.031347100 : `GCROOTS`            Scanning Frameless method 00007FFD94A7A918 (System.AppContext.Setup(Char**, Char**, Int32)) ControlPC = 00007FFE16EC2706
1108 110.031328100 : `GCROOTS`            Scanning Frameless method 00007FFD949F64C8 (System.String.Ctor(Char*)) ControlPC = 00007FFE16E93EC9
1108 110.031286600 : `GCROOTS`            Scanning ExplicitFrame 00000082F1F7E3E8 AssocMethod = 00007FFD94AB2468 (System.SpanHelpers.IndexOf(Char ByRef, Char, Int32)) frameVTable = 00007FFDF467BEB0 (coreclr!PrestubMethodFrame::`vftable')
1108 110.031261900 : `GC`GCALLOC`GCROOTS` } =========== BEGINGC 3, (requested generation = 2, collect_classes = 9) ==========

1108 110.031259000 : `GC`                 (GC stack)00007FFDF3E0A07C (coreclr!FinalizerThread::FinalizerThreadWorker+0x15c)
1108 110.031258700 : `GC`                 (GC stack)00007FFDF3E0A84F (coreclr!FinalizerThread::WaitForFinalizerEvent+0x10f)
1108 110.031258500 : `GC`                 (GC stack)00007FFDF42A2546 (coreclr!WKS::GCHeap::GarbageCollect+0x106)
1108 110.031258300 : `GC`                 (GC stack)00007FFDF42A2829 (coreclr!WKS::GCHeap::GarbageCollectGeneration+0x295)
1108 110.031258000 : `GC`                 (GC stack)00007FFDF42B644F (coreclr!WKS::gc_heap::garbage_collect+0x213)
1108 110.031257700 : `GC`                 Start of GC stack 

1108  61.804306200 : `GCROOTS`            Scanning Frameless method 00007FFD94A7A918 (System.AppContext.Setup(Char**, Char**, Int32)) ControlPC = 00007FFE16EC2706
1108  61.804288500 : `GCROOTS`            Scanning Frameless method 00007FFD949F64C8 (System.String.Ctor(Char*)) ControlPC = 00007FFE16E93EC9
1108  61.804249300 : `GCROOTS`            Scanning ExplicitFrame 00000082F1F7E3E8 AssocMethod = 00007FFD94AB2468 (System.SpanHelpers.IndexOf(Char ByRef, Char, Int32)) frameVTable = 00007FFDF467BEB0 (coreclr!PrestubMethodFrame::`vftable')
1108  61.804225600 : `GC`GCALLOC`GCROOTS` } =========== BEGINGC 2, (requested generation = 1, collect_classes = 9) ==========

1108  61.804222700 : `GC`                 (GC stack)00007FFDF3E0A07C (coreclr!FinalizerThread::FinalizerThreadWorker+0x15c)
1108  61.804222500 : `GC`                 (GC stack)00007FFDF3E0A84F (coreclr!FinalizerThread::WaitForFinalizerEvent+0x10f)
1108  61.804222300 : `GC`                 (GC stack)00007FFDF42A2546 (coreclr!WKS::GCHeap::GarbageCollect+0x106)
1108  61.804222000 : `GC`                 (GC stack)00007FFDF42A2829 (coreclr!WKS::GCHeap::GarbageCollectGeneration+0x295)
1108  61.804221800 : `GC`                 (GC stack)00007FFDF42B644F (coreclr!WKS::gc_heap::garbage_collect+0x213)
1108  61.804221600 : `GC`                 Start of GC stack 

1108  13.291604900 : `GCROOTS`            Scanning Frameless method 00007FFD94A7A918 (System.AppContext.Setup(Char**, Char**, Int32)) ControlPC = 00007FFE16EC2706
1108  13.291578000 : `GCROOTS`            Scanning Frameless method 00007FFD949F64C8 (System.String.Ctor(Char*)) ControlPC = 00007FFE16E93EC9
1108  13.291504100 : `GCROOTS`            Scanning ExplicitFrame 00000082F1F7E3E8 AssocMethod = 00007FFD94AB2468 (System.SpanHelpers.IndexOf(Char ByRef, Char, Int32)) frameVTable = 00007FFDF467BEB0 (coreclr!PrestubMethodFrame::`vftable')
1108  13.291459400 : `GC`GCALLOC`GCROOTS` } =========== BEGINGC 1, (requested generation = 1, collect_classes = 9) ==========

1108  13.291456300 : `GC`                 (GC stack)00007FFDF3E0A07C (coreclr!FinalizerThread::FinalizerThreadWorker+0x15c)
1108  13.291456000 : `GC`                 (GC stack)00007FFDF3E0A84F (coreclr!FinalizerThread::WaitForFinalizerEvent+0x10f)
1108  13.291455800 : `GC`                 (GC stack)00007FFDF42A2546 (coreclr!WKS::GCHeap::GarbageCollect+0x106)
1108  13.291455500 : `GC`                 (GC stack)00007FFDF42A2829 (coreclr!WKS::GCHeap::GarbageCollectGeneration+0x295)
1108  13.291455300 : `GC`                 (GC stack)00007FFDF42B644F (coreclr!WKS::gc_heap::garbage_collect+0x213)
1108  13.291455100 : `GC`                 Start of GC stack 
1108  13.291396100 : `GC`                 GC is triggered 

173c   0.335993200 : `CLASSLOADER`        MethodTableBuilder: finished method table for module 00007FFD947D4020 token 2000175 = 00007FFD94AB2D30 (System.SpanHelpers) 

@jkotas
Member

jkotas commented Oct 23, 2019

  • The process was making good progress for the first 0.3s, but then the progress stopped
  • Several GCs were explicitly triggered by LowResourceNotifications at 13s, 61s and 110s.
  • The process started making progress again at 152s and was killed by the timeout shortly after that

@jkotas
Member

jkotas commented Oct 23, 2019

This confirms my theory that the machine is oversubscribed when this hang/timeout happens. We need to find out what else is going on on the machine that is taking all the resources.

@safern
Member

safern commented Apr 22, 2020

@jkotas I've seen that we are hitting this again in some PRs. What thread count and parallelism would you recommend setting if memory is < 1 GB? -parallel collections -maxthread 1?

@jkotas
Member

jkotas commented Apr 22, 2020

-maxthread 1 sounds reasonable to me to start with.

Does -parallel collections have any effect when you specify -maxthread 1?

There is still a chance that we find the machine in a fine state when the test is starting, and then the expensive background services kick in while the test is running. It may be better to throttle this in xunit itself to be more agile.

It makes me think: We have seen these hangs with remote executor. Maybe we should add the throttling to the remote executor (see the sketch after the list below):

  • When the machine is oversubscribed, wait until other remote executor processes finish before starting a new one (i.e. do not have more than one running)
  • Otherwise, if the machine is still oversubscribed, wait 1 second before starting the remote executor process
  • Make sure to log diagnostics in any of these cases
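
A minimal sketch of that throttling idea, assuming a machine-wide named semaphore to bound concurrent remote processes and GlobalMemoryStatusEx's dwMemoryLoad as the oversubscription signal; none of these names come from the actual RemoteExecutor code.

using System;
using System.Runtime.InteropServices;
using System.Threading;

internal static class RemoteProcessGate
{
    [StructLayout(LayoutKind.Sequential)]
    private struct MEMORYSTATUSEX
    {
        public uint dwLength;
        public uint dwMemoryLoad;      // percentage of physical memory in use
        public ulong ullTotalPhys, ullAvailPhys, ullTotalPageFile, ullAvailPageFile;
        public ulong ullTotalVirtual, ullAvailVirtual, ullAvailExtendedVirtual;
    }

    [DllImport("kernel32.dll", SetLastError = true)]
    private static extern bool GlobalMemoryStatusEx(ref MEMORYSTATUSEX buffer);

    private static uint MemoryLoad()
    {
        var status = new MEMORYSTATUSEX { dwLength = (uint)Marshal.SizeOf<MEMORYSTATUSEX>() };
        return GlobalMemoryStatusEx(ref status) ? status.dwMemoryLoad : 0;
    }

    // Call before launching a remote process; dispose the returned handle after it exits.
    public static IDisposable Acquire()
    {
        // Machine-wide semaphore so only one remote process runs while memory is tight.
        var gate = new Semaphore(1, 1, "Global\\RemoteExecutorGate");
        uint load = MemoryLoad();
        if (load > 90)
        {
            Console.WriteLine($"Memory load {load}%: waiting for other remote processes to finish.");
            gate.WaitOne();
            if (MemoryLoad() > 90)
            {
                Thread.Sleep(1000); // still oversubscribed: back off before starting the process
            }
            return new Releaser(gate, held: true);
        }
        return new Releaser(gate, held: false);
    }

    private sealed class Releaser : IDisposable
    {
        private readonly Semaphore _gate;
        private readonly bool _held;
        public Releaser(Semaphore gate, bool held) { _gate = gate; _held = held; }
        public void Dispose() { if (_held) _gate.Release(); _gate.Dispose(); }
    }
}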

@ViktorHofer
Member

It makes me think: We have seen these hangs with remote executor. Maybe we should add the throttling to the remote executor

@jkotas sounds good. Do you have cycles to submit a PR with the proposed changes?

@jkotas
Member

jkotas commented Apr 27, 2020

Sorry, I do not have cycles to deal with this.

I've seen we are hitting this again in some PRs

Do you have the list of recent PRs that are hitting this? The problem may have shifted somewhat compared to what was identified earlier in this issue.

The one that I have seen recently (#35451) has Memory load: 14, which is pretty reasonable. The signature of the problem identified earlier in this issue was a very high memory load (like Memory load: 98).

@ViktorHofer
Member

Sorry, I do not have cycles to deal with this.

No problem at all. Thought it was worth asking. @Anipik this might be something worth picking up, as it directly contributes to the CI pass rate. cc @ericstj

@ericstj
Member

ericstj commented Apr 29, 2020

When the machine is oversubscribed, wait until other remote executor processes finish before starting new one (ie do not have more than one running)
Otherwise, if the machine is still oversubscribed, wait 1 second before starting the remote executor process
Make sure to log diagnostics in any of these cases

I think it's reasonable to try to address this by throttling the number of concurrent RemoteExecutor processes when resources are low. @jkotas's suggestion above seems like a good starting point. @ViktorHofer / @jkotas are we only running a single test process, such that handling in the static RemoteExecutor would be sufficient, or do we need to consider multiple processes launching RemoteExecutors on the machine?

We'd want to measure the prevalence of this problem before and after a change.

Here's the command I ran to measure:
runfo search-helix -d runtime -c 100 --pr -v "after 60000ms waiting for remote process."

Build Kind Console Log
623148 PR #35566 console.log
622638 Rolling console.log
622949 PR #35600 console.log
623077 PR #35573 console.log
622787 PR #35566 console.log
621930 PR #35421 console.log
622225 PR #35483 console.log
622286 PR #35590 console.log
622860 PR #35285 console.log
622455 PR #35593 console.log
622684 PR #35573 console.log
622514 PR #35592 console.log
621713 PR #35565 console.log
621834 PR #35579 console.log
619501 PR #34973 console.log
621362 PR #31874 console.log
621635 Rolling console.log
622808 PR #35538 console.log
621573 PR #35573 console.log
622206 PR #35589 console.log
619954 PR #35013 console.log
622502 PR #35584 console.log
621439 PR #35567 console.log
620597 PR #35383 console.log
620431 PR #35368 console.log
621318 PR #35316 console.log
621273 PR #35569 console.log
622077 Rolling console.log
622279 PR #35570 console.log
622156 PR #35183 console.log
622081 PR #35246 console.log
621261 PR #35568 console.log
621799 PR #35582 console.log
621334 Rolling console.log
619918 PR #35353 console.log
621680 PR #35549 console.log
615697 PR #31874 console.log
621675 PR #35522 console.log
621600 PR #35575 console.log
621354 PR #32552 console.log
620050 PR #35233 console.log
621188 PR #35567 console.log

@jkotas
Member

jkotas commented Apr 29, 2020

such that handling in the static RemoteExecutor would be sufficient

Yes, handling this in the static RemoteExecutor should be sufficient.

However, I am not convinced that this will address the problem we are seeing recently. If the problem was really caused by an oversubscribed machine, we would see crashes with a variety of stacks, but that's not the case. We are always seeing the crash with EventUnregister on the stack. It means that there is something wrong happening at this specific stack, and it is unlikely to be related to an oversubscribed machine.

Trying to gather more data about this failure, as @noahfalk suggested in #35451 (comment), may be a better course of action.

@ericstj
Member

ericstj commented Apr 29, 2020

I see what you mean. All these logs show the following callstack hanging:

[InlinedCallFrame] (Interop+Advapi32.EventUnregister)
      System.Diagnostics.Tracing.EtwEventProvider.System.Diagnostics.Tracing.IEventProvider.EventUnregister(Int64)
      System.Diagnostics.Tracing.EventProvider.EventUnregister(Int64)
      System.Diagnostics.Tracing.EventProvider.Dispose(Boolean)
      System.Diagnostics.Tracing.EventProvider.Dispose()
      System.Diagnostics.Tracing.EventSource.Dispose(Boolean)
      System.Diagnostics.Tracing.EventListener.DisposeOnShutdown(System.Object, System.EventArgs)
      System.AppContext.OnProcessExit()

@noahfalk do you think the number of occurrences we are seeing here indicates that progress is not being made?

@ericstj
Member

ericstj commented Apr 29, 2020

The frequency of this issue seems too high for it to be a pre-existing issue; I suspect a recent regression. It also seems to be observed in WinForms starting around 4/20, when they took a runtime update from 4.20218.3 to 4.20220.15. So we should look in this time window for a commit that could have introduced this hang.

@noahfalk
Member

@noahfalk do you think the number of occurrences we are seeing here indicate that progress is not being made?

If we are seeing many occurrences of that stack, I agree it raises suspicion that something else is going on. In terms of progress, I don't yet have a good explanation, and we've got contradictory evidence. On the one hand, the dump doesn't indicate that we are waiting on anything other than the OS scheduler to assign a core; however, if that were truly the only issue, then I wouldn't expect to see this particular callstack at high volume. Right now I don't think we've got the data to draw a conclusion in either direction. Additional dumps after a time delay and/or an ETW trace could provide that evidence.

@safern
Member

safern commented Apr 29, 2020

The frequency of this issue seems too high to be an existing issue. I suspect a recent regression. Also it seems to be observed in WinForms as well starting around 4/20 when the took a runtime update from 4.20218.3 to 4.20220.15. So we should look in this time window for a commit that could have introduced this hang.

Based on this info, we were able to track down that the SHAs between those two package versions are c409bd0 and 44ed52a.

Then we looked at the Kusto data to find the first build that hit this issue: https://dev.azure.com/dnceng/public/_build/results?buildId=608569&view=results -- which used 629dba5 merged with master to check out the repo for the build. The builds started failing on April 21st.

So based on this, I can only think of 2 commits in question: 629dba5 and aa5b204.

@jkotas @noahfalk do you see anything that pops up there?

@jkotas
Member

jkotas commented Apr 30, 2020

The diagnostic server PR 629dba5 is the obvious culprit. It is closely coupled with the code where this hang occurs.

It is likely that the diagnostic server eventpipe code is doing something bad during shutdown, and that corruption is leading to this hang. We had similar intermittent shutdown bugs in eventpipe before, so it is not that surprising that a new one popped up after a significant eventpipe change.

Let's use this issue to track the original problem with machine oversubscription, and reactivate #35451 to track the hang in EventUnregister.

cc @josalem

@danmoseley
Member

Kudos to @safern and @ericstj for quickly helping narrow it to 2 commits.

@noahfalk
Member

Indeed, thanks for narrowing to a few possible commits. I still have little idea what exactly the server change would do to produce the particular result we saw in #35451, but it narrows the context enough that we can investigate.

@josalem
Contributor

josalem commented Apr 30, 2020

Taking a look today! Thanks @safern and @ericstj for triaging.

@ViktorHofer
Member

Triage: will need to trigger runfo to get current data to decide if this issue should be closed or further action is necessary.

@josalem
Contributor

josalem commented Jul 20, 2020

@ViktorHofer I think there are a few different issues being discussed in this thread. The last one in the thread was resolved with the same fixes as #35451, I believe. I'm not sure about the other issues discussed here, though.

@danmoseley
Member

@safern is this test now stable? (Also, is there info in the OneNote to remind me how to make such queries myself?)

If the test is stable we can close this issue.

@safern
Member

safern commented Jul 21, 2020

I think we could still hit this issue; it just depends on the available memory at test execution time, but we don't hit it that often.

Also is there info on the OneNote to remind me how to make such queries myself

I don't think so, we can add it.

@Anipik Anipik modified the milestones: 5.0.0, Future Aug 11, 2020
@ViktorHofer
Member

Runfo doesn't list any failures in the last 14 days for the described remote executor hang: https://runfo.azurewebsites.net/search/tests/?bq=definition%3Aruntime++started%3A%7E14&tq=name%3A%22after+60000ms+waiting+for+remote+process%22.

Feel free to open an issue for following up on improvements to RemoteExecutor or the test runner.

@ghost ghost locked as resolved and limited conversation to collaborators Dec 12, 2020