Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tailcall_v4/hijacking/hijacking.cmd Timed Out #106134

Open
lewing opened this issue Aug 8, 2024 · 11 comments
Open

tailcall_v4/hijacking/hijacking.cmd Timed Out #106134

lewing opened this issue Aug 8, 2024 · 11 comments
Assignees
Labels
area-VM-coreclr Known Build Error Use this to report build issues in the .NET Helix tab
Milestone

Comments

@lewing
Copy link
Member

lewing commented Aug 8, 2024

Build Information

Build: https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_build/results?buildId=765968
Build error leg or test failing: JIT/Methodical/tailcall_v4/hijacking/hijacking.cmd
Pull request: #105868

Error Message

Fill the error message using step by step known issues guidance.

{
  "ErrorMessage": "tailcall_v4/hijacking/hijacking.cmd Timed Out",
  "ErrorPattern": "",
  "BuildRetry": false,
  "ExcludeConsoleLog": false
}

Known issue validation

Build: 🔎 https://dev.azure.com/dnceng-public/public/_build/results?buildId=765968
Error message validated: [tailcall_v4/hijacking/hijacking.cmd Timed Out]
Result validation: ✅ Known issue matched with the provided build.
Validation performed at: 8/8/2024 3:07:27 PM UTC

Report

Build Definition Test Pull Request
888563 dotnet/runtime JIT/Methodical/tailcall_v4/hijacking/hijacking.cmd #110484
888417 dotnet/runtime JIT/Methodical/tailcall_v4/hijacking/hijacking.cmd #110484

Summary

24-Hour Hit Count 7-Day Hit Count 1-Month Count
0 0 2
@lewing lewing added blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms' Known Build Error Use this to report build issues in the .NET Helix tab labels Aug 8, 2024
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Aug 8, 2024
@ericstj ericstj added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed area-Infrastructure-libraries labels Aug 9, 2024
@JulieLeeMSFT JulieLeeMSFT added this to the 9.0.0 milestone Aug 12, 2024
@dotnet-policy-service dotnet-policy-service bot removed the untriaged New issue has not been triaged by the area owner label Aug 12, 2024
@jakobbotsch
Copy link
Member

Removing blocking from this since we haven't seen other failures.

@jakobbotsch jakobbotsch removed the blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms' label Aug 13, 2024
@jakobbotsch
Copy link
Member

This looks like some failure to suspend the EE. The test runs two threads for 30 seconds: one that continuously performs helper-based tailcalls, and another that does GCs. The stack trace seems to indicate that most threads are captured and waiting on the one thread doing portable tailcalls to be suspended.
Interestingly it seems like windbg is not able to unwind fully out of it, but unsure if that is related:

0:005> k
 # Child-SP          RetAddr               Call Site
00 000000c5`36f7ef98 00007ffc`a26e70c6     coreclr!memset+0x15b [D:\a\_work\1\s\src\vctools\crt\vcruntime\src\string\amd64\memset.asm @ 277] 
01 000000c5`36f7efa0 00007ffc`a29a2f5f     coreclr!Thread::ObjectRefFlush+0x96 [D:\a\_work\1\s\src\coreclr\vm\threads.cpp @ 5785] 
02 000000c5`36f7efd0 00007ffc`a1367f20     coreclr!TailCallHelp::GetTailCallInfo+0x7f [D:\a\_work\1\s\src\coreclr\vm\tailcallhelp.cpp @ 34] 
03 000000c5`36f7f040 00007ffc`433622e7     System_Private_CoreLib+0x507f20
04 000000c5`36f7f0b0 00000000`00000004     0x00007ffc`433622e7
05 000000c5`36f7f0b8 00007ffc`a2f0bd50     0x4
06 000000c5`36f7f0c0 0000023c`edad66a0     coreclr!EEContract::`vftable'
07 000000c5`36f7f0c8 000000c5`36f7f4f0     0x0000023c`edad66a0
08 000000c5`36f7f0d0 000000c5`36f7f130     0x000000c5`36f7f4f0
09 000000c5`36f7f0d8 000000c5`36f7f128     0x000000c5`36f7f130
0a 000000c5`36f7f0e0 00000000`47525440     0x000000c5`36f7f128
0b 000000c5`36f7f0e8 00007ffc`43362269     0x47525440
0c 000000c5`36f7f0f0 000000c5`36f7f130     0x00007ffc`43362269
0d 000000c5`36f7f0f8 000000c5`36f7f128     0x000000c5`36f7f130
0e 000000c5`36f7f100 00000000`4752543f     0x000000c5`36f7f128
0f 000000c5`36f7f108 00007ffc`433623e8     0x4752543f
10 000000c5`36f7f110 000000c5`36f7f130     0x00007ffc`433623e8
11 000000c5`36f7f118 000000c5`36f7f128     0x000000c5`36f7f130
12 000000c5`36f7f120 00000000`4752543f     0x000000c5`36f7f128
13 000000c5`36f7f128 000000c5`36f7f130     0x4752543f
14 000000c5`36f7f130 000000c5`36f7f128     0x000000c5`36f7f130
15 000000c5`36f7f138 00000000`47525440     0x000000c5`36f7f128
16 000000c5`36f7f140 00007ffc`43750468     0x47525440
17 000000c5`36f7f148 00007ffc`a1367f5e     0x00007ffc`43750468
18 000000c5`36f7f150 00007ffc`43361fb7     System_Private_CoreLib+0x507f5e
19 000000c5`36f7f1c0 000000c5`36f7f1f8     0x00007ffc`43361fb7
1a 000000c5`36f7f1c8 00007ffc`43750360     0x000000c5`36f7f1f8
1b 000000c5`36f7f1d0 00000000`00000000     0x00007ffc`43750360

Another thing that strikes me as odd after looking at the dump for a bit is that the thread doing portable tailcalls has Thread::TS_GCSuspendRedirected thread state set even though it is in cooperative mode. If I'm reading the code right, it means that all the hijacking attempts on it are not doing anything.
@VSadov can you take a look?

@VSadov
Copy link
Member

VSadov commented Aug 20, 2024

Was this on win-x86?

@jakobbotsch
Copy link
Member

Was this on win-x86?

No, the failures are from win-x64. I haven't been able to repro this locally, so I've only looked at a dump.

@VSadov
Copy link
Member

VSadov commented Aug 20, 2024

TS_GCSuspendRedirected means we changed the thread context to jump to suspension routine and released.
Somehow the thread ignored that and continues running app code?

This style of suspension is used when no signal-like API is available. So likely win-x86 specific.

@VSadov
Copy link
Member

VSadov commented Aug 20, 2024

The other possibility is the state was set by accident, maybe some uninitialized data was involved.

I’ll take a look.

@jakobbotsch jakobbotsch assigned VSadov and unassigned jakobbotsch Aug 20, 2024
@jakobbotsch jakobbotsch added area-VM-coreclr and removed area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI labels Aug 20, 2024
Copy link
Contributor

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

@jakobbotsch
Copy link
Member

This style of suspension is used when no signal-like API is available. So likely win-x86 specific.

In the dump I can see that Thread::s_pfnQueueUserAPC2Proc == 0, so presumably that's why it is using this suspension method? Perhaps the problem can be reproduced by forcing this mode.

@v-wenyuxu
Copy link

Failed in: runtime-coreclr outerloop 20240913.3

Failed tests:

R2R-CG2 windows x64 Checked no_tiered_compilation @ Windows.10.Amd64.Open
    - JIT/Methodical/tailcall_v4/hijacking/hijacking.cmd

Error message:

 
cmdLine:C:hw909107FEwAD530939eJITMethodicalMethodical_others../tailcall_v4/hijacking/hijacking.cmd Timed Out (timeout in milliseconds: 1800000 from variable __TestTimeout, start: 9/13/2024 8:10:56 PM, end: 9/13/2024 8:40:56 PM)

Return code:      -100
Raw output file:      C:hw909107FEwAD530939uploads	ailcall_v4hijackingoutput.txt
Raw output:
BEGIN EXECUTION
hijacking.dll
        1 file(s) copied.
20:10:56.30
Response file: C:hw909107FEwAD530939eJITMethodical	ailcall_v4hijackinghijacking.dll.rsp
C:hw909107FEwAD530939eJITMethodical	ailcall_v4hijackingIL-CG2hijacking.dll
-o:C:hw909107FEwAD530939eJITMethodical	ailcall_v4hijackinghijacking.dll
--targetarch:x64
--targetos:windows
--verify-type-and-field-layout
--method-layout:random
-r:C:hw909107FEpSystem.*.dll
-r:C:hw909107FEpMicrosoft.*.dll
-r:C:hw909107FEp\xunit.*.dll
-r:C:hw909107FEpmscorlib.dll
-r:C:hw909107FEp
etstandard.dll
" "C:hw909107FEpcrossgen2crossgen2.exe" @"C:hw909107FEwAD530939eJITMethodical	ailcall_v4hijackinghijacking.dll.rsp"   -r:C:hw909107FEwAD530939eJITMethodicalMethodical_othersIL-CG2*.dll  -r:C:hw909107FEwAD530939eJITMethodical	ailcall_v4hijackingIL-CG2*.dll"
Emitting R2R PE file: C:hw909107FEwAD530939eJITMethodical	ailcall_v4hijackinghijacking.dll
" "dotnet" "C:hw909107FEp
2rdump
2rdump.dll" --header --sc --in C:hw909107FEwAD530939eJITMethodical	ailcall_v4hijackinghijacking.dll --out C:hw909107FEwAD530939eJITMethodical	ailcall_v4hijackinghijacking.dll.r2rdump --val"
20:10:56.80
 "C:hw909107FEpcorerun.exe" -p "System.Reflection.Metadata.MetadataUpdater.IsSupported=false" -p "System.Runtime.Serialization.EnableUnsafeBinaryFormatterSerialization=true"  hijacking.dll 

cmdLine:C:hw909107FEwAD530939eJITMethodicalMethodical_others../tailcall_v4/hijacking/hijacking.cmd Timed Out (timeout in milliseconds: 1800000 from variable __TestTimeout, start: 9/13/2024 8:10:56 PM, end: 9/13/2024 8:40:56 PM)
Invoking: C:Program Files (x86)Windows Kits10Debuggersdcdb.exe -c "$<C:hw909107FE		mpkrpkmd.tmp" -z "C:corescrashdump_2080.dmp"
stdout: 
Microsoft (R) Windows Debugger Version 10.0.18362.1 AMD64
Copyright (c) Microsoft Corporation. All rights reserved.


Loading Dump File [C:corescrashdump_2080.dmp]
User Mini Dump File with Full Memory: Only application data is available


************* Path validation summary **************
Response                         Time (ms)     Location
OK                                             C:hw909107FEpPDB
Symbol search path is: C:hw909107FEpPDB
Executable search path is: 
Windows 10 Version 14393 MP (4 procs) Free x64
Product: Server, suite: TerminalServer DataCenter SingleUserTS
10.0.14393.6343 (rs1_release.230913-1727)
Machine Name:
Debug session time: Fri Sep 13 20:40:56.000 2024 (UTC + 0:00)
System Uptime: 0 days 0:50:23.232
Process Uptime: 0 days 0:30:00.000
............................

************* Symbol Loading Error Summary **************
Module name            Error
ntdll                  The system cannot find the file specified

You can troubleshoot most symbol related issues by turning on symbol loading diagnostics (!sym noisy) and repeating the command that caused symbols to be loaded.
You should also verify that your symbol search path (.sympath) is correct.
For analysis of this file, run !analyze -v
ntdll!ZwWaitForSingleObject+0x14:
00007ffe`14196124 c3              ret
0:000> cdb: Reading initial command '$<C:hw909107FE		mpkrpkmd.tmp'
0:000> .load C:Users
unner.dotnetsossos.dll
0:000> ~*k

.  0  Id: 820.13c0 Suspend: 0 Teb: 00000018`1924b000 Unfrozen
Child-SP          RetAddr           Call Site
00000018`1957dd08 00007ffe`106d6d1f ntdll!ZwWaitForSingleObject+0x14
*** WARNING: Unable to verify checksum for coreclr.dll
00000018`1957dd10 00007ffd`e58063ef KERNELBASE!WaitForSingleObjectEx+0x8f
00000018`1957ddb0 00007ffd`e56d764b coreclr!Thread::RareDisablePreemptiveGC+0x27f
*** WARNING: Unable to verify checksum for System.Private.CoreLib.dll
00000018`1957de80 00007ffd`e4134e3f coreclr!JIT_PInvokeEndRarePath+0x10b
*** W

Stack trace:

   at Xunit.Assert.True(Nullable`1 condition, String userMessage) in /_/src/Microsoft.DotNet.XUnitAssert/src/BooleanAsserts.cs:line 146
   at Xunit.Assert.True(Boolean condition, String userMessage) in /_/src/Microsoft.DotNet.XUnitAssert/src/BooleanAsserts.cs:line 128
   at TestLibrary.OutOfProcessTest.RunOutOfProcessTest(String assemblyPath, String testPathPrefix)
   at Program.<<Main>$>g__TestExecutor55|2_56(StreamWriter tempLogSw, StreamWriter statsCsvSw, <>c__DisplayClass2_0&)

@VSadov
Copy link
Member

VSadov commented Sep 19, 2024

This did not reproduce in more than a month. It is possible it was a sideeffect of some other issue, which is now fixed.

@VSadov
Copy link
Member

VSadov commented Sep 19, 2024

I think this we can keep this open - in case it reproduces, but put it into 10.0 as it does not seem blocking.

@VSadov VSadov modified the milestones: 9.0.0, 10.0.0 Sep 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-VM-coreclr Known Build Error Use this to report build issues in the .NET Helix tab
Projects
None yet
Development

No branches or pull requests

6 participants