Test failure: readytorun/coreroot_determinism/coreroot_determinism/coreroot_determinism.cmd #101060
Comments
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch |
This only failed once, so I'm going to unmark it as blocking (it also doesn't look like it repros easily). |
From the dump it definitely looks OSR related. The stack trace looks like the following:
(not sure why managed names are missing. Opened dotnet/diagnostics#4637.) The previous valid return address is:
So seems like something went wrong while transitioning from the patchpoint. @AndyAyersMS do you have any thoughts? I don't suppose there are some diagnostics available to e.g. obtain the particular |
The
That method code corresponds to
seen above. Unclear whether the issue here is OSR related or whether something bad just happened after transitioning. |
This is the code for the OSR version:
(lldb) clru 72f7b0bdb4c0
Normal JIT generated code
ILCompiler.LazyGenericsSupport+GraphBuilder.<.ctor>g__AddProcessType|0_1(Internal.TypeSystem.TypeDesc, <>c__DisplayClass0_0 ByRef)
ilAddr is 000072F7AC0B796C pImport is 00007ECC5801F550
Begin 000072F7B0BDB4C0, size 147
>>> 000072f7b0bdb4c0 488b4500 mov rax, qword ptr [rbp]
000072f7b0bdb4c4 50 push rax
000072f7b0bdb4c5 4883ec20 sub rsp, 0x20
000072f7b0bdb4c9 4c89bc24c8000000 mov qword ptr [rsp + 0xc8], r15
000072f7b0bdb4d1 4c89b424c0000000 mov qword ptr [rsp + 0xc0], r14
000072f7b0bdb4d9 48899c24b8000000 mov qword ptr [rsp + 0xb8], rbx
000072f7b0bdb4e1 488d6c2420 lea rbp, [rsp + 0x20]
000072f7b0bdb4e6 33c0 xor eax, eax
000072f7b0bdb4e8 488945e0 mov qword ptr [rbp - 0x20], rax
000072f7b0bdb4ec 488b9d80000000 mov rbx, qword ptr [rbp + 0x80]
000072f7b0bdb4f3 4c8b7d78 mov r15, qword ptr [rbp + 0x78]
000072f7b0bdb4f7 4885db test rbx, rbx
000072f7b0bdb4fa 0f84e7000000 je 0x72f7b0bdb5e7
000072f7b0bdb500 498b3f mov rdi, qword ptr [r15]
000072f7b0bdb503 40383f cmp byte ptr [rdi], dil
000072f7b0bdb506 488d55e0 lea rdx, [rbp - 0x20]
000072f7b0bdb50a 488bf3 mov rsi, rbx
000072f7b0bdb50d ff153d8f61fe call qword ptr [rip - 0x19e70c3]
000072f7b0bdb513 85c0 test eax, eax
000072f7b0bdb515 742b je 0x72f7b0bdb542
000072f7b0bdb517 498b7f08 mov rdi, qword ptr [r15 + 0x8]
000072f7b0bdb51b ff4714 inc dword ptr [rdi + 0x14]
000072f7b0bdb51e 488b5708 mov rdx, qword ptr [rdi + 0x8]
000072f7b0bdb522 8b7710 mov esi, dword ptr [rdi + 0x10]
000072f7b0bdb525 397208 cmp dword ptr [rdx + 0x8], esi
000072f7b0bdb528 0f8693000000 jbe 0x72f7b0bdb5c1
000072f7b0bdb52e 8d4601 lea eax, [rsi + 0x1]
000072f7b0bdb531 894710 mov dword ptr [rdi + 0x10], eax
000072f7b0bdb534 4863f6 movsxd rsi, esi
000072f7b0bdb537 488bfa mov rdi, rdx
000072f7b0bdb53a 488bd3 mov rdx, rbx
000072f7b0bdb53d e8fe8c15fe call 0x72f7aed34240 (System.Runtime.CompilerServices.CastHelpers.StelemRef(System.Object[], IntPtr, System.Object), mdToken: 0000000006006843)
000072f7b0bdb542 4c8bf3 mov r14, rbx
000072f7b0bdb545 48be10a8fcaff7720000 movabs rsi, 0x72f7affca810
000072f7b0bdb54f 493936 cmp qword ptr [r14], rsi
000072f7b0bdb552 7556 jne 0x72f7b0bdb5aa
000072f7b0bdb554 4533f6 xor r14d, r14d
000072f7b0bdb557 4d85f6 test r14, r14
000072f7b0bdb55a 0f8595000000 jne 0x72f7b0bdb5f5
000072f7b0bdb560 48bf10a8fcaff7720000 movabs rdi, 0x72f7affca810
000072f7b0bdb56a 48393b cmp qword ptr [rbx], rdi
000072f7b0bdb56d 7560 jne 0x72f7b0bdb5cf
000072f7b0bdb56f 48837b7000 cmp qword ptr [rbx + 0x70], 0x0
000072f7b0bdb574 740b je 0x72f7b0bdb581
000072f7b0bdb576 4c8b7370 mov r14, qword ptr [rbx + 0x70]
000072f7b0bdb57a bbffffffff mov ebx, 0xffffffff
000072f7b0bdb57f eb1d jmp 0x72f7b0bdb59e
000072f7b0bdb581 488bfb mov rdi, rbx
000072f7b0bdb584 ff15864237ff call qword ptr [rip - 0xc8bd7a]
000072f7b0bdb58a ebea jmp 0x72f7b0bdb576
000072f7b0bdb58c 3bdf cmp ebx, edi
000072f7b0bdb58e 7351 jae 0x72f7b0bdb5e1
000072f7b0bdb590 498b7cde10 mov rdi, qword ptr [r14 + 8*rbx + 0x10]
000072f7b0bdb595 498bf7 mov rsi, r15
000072f7b0bdb598 ff15b2b5e3ff call qword ptr [rip - 0x1c4a4e]
000072f7b0bdb59e ffc3 inc ebx
000072f7b0bdb5a0 418b7e08 mov edi, dword ptr [r14 + 0x8]
000072f7b0bdb5a4 3bfb cmp edi, ebx
000072f7b0bdb5a6 7fe4 jg 0x72f7b0bdb58c
000072f7b0bdb5a8 eb3d jmp 0x72f7b0bdb5e7
000072f7b0bdb5aa 488bf3 mov rsi, rbx
000072f7b0bdb5ad 48bff064cfaff7720000 movabs rdi, 0x72f7afcf64f0
000072f7b0bdb5b7 e8e47513fe call 0x72f7aed12ba0 (System.Runtime.CompilerServices.CastHelpers.IsInstanceOfClass(Void*, System.Object), mdToken: 0000000006006838)
000072f7b0bdb5bc 4c8bf0 mov r14, rax
000072f7b0bdb5bf eb96 jmp 0x72f7b0bdb557
000072f7b0bdb5c1 488bf3 mov rsi, rbx
000072f7b0bdb5c4 ff158ead4dfe call qword ptr [rip - 0x1b25272]
000072f7b0bdb5ca e973ffffff jmp 0x72f7b0bdb542
000072f7b0bdb5cf 488bfb mov rdi, rbx
000072f7b0bdb5d2 488b03 mov rax, qword ptr [rbx]
000072f7b0bdb5d5 488b4048 mov rax, qword ptr [rax + 0x48]
000072f7b0bdb5d9 ff5038 call qword ptr [rax + 0x38]
000072f7b0bdb5dc 4c8bf0 mov r14, rax
000072f7b0bdb5df eb99 jmp 0x72f7b0bdb57a
000072f7b0bdb5e1 e8ca6eee7b call 0x72f82cac24b0 (JitHelp: CORINFO_HELP_RNGCHKFAIL)
000072f7b0bdb5e6 cc int3
000072f7b0bdb5e7 4881c4b8000000 add rsp, 0xb8
000072f7b0bdb5ee 5b pop rbx
000072f7b0bdb5ef 415e pop r14
000072f7b0bdb5f1 415f pop r15
000072f7b0bdb5f3 5d pop rbp
000072f7b0bdb5f4 c3 ret
000072f7b0bdb5f5 498b7e28 mov rdi, qword ptr [r14 + 0x28]
000072f7b0bdb5f9 498bf7 mov rsi, r15
000072f7b0bdb5fc ff154eb5e3ff call qword ptr [rip - 0x1c4ab2]
000072f7b0bdb602 e959ffffff jmp 0x72f7b0bdb560
I'm not sure what the
Both frame 7 and frame 8 have the same frame pointer:
(lldb) fr sel 7
frame #7: 0x0000000000000000
error: core file does not contain 0x0
(lldb) register read
General Purpose Registers:
rax = 0x000072f82d6376a0
rbx = 0x000072b799d213b8
rcx = 0x0000000000000602
rdx = 0x000072f82d638160
rdi = 0x00007fff40434170
rsi = 0x000072f7b0bda404
rbp = 0x00007fff40434e70
rsp = 0x00007fff40434dc8
r8 = 0x00007fff40434138
r9 = 0x000072f7ac145008
r10 = 0x00007fff40434170
r11 = 0x0000000000000000
r12 = 0x000072b796044770
r13 = 0x000072b7962a1390
r14 = 0x000072f7afcf4fe0
r15 = 0x000072b799659e40
rip = 0x0000000000000000
...
(lldb) fr sel 8
frame #8: 0x000072f7b0bda404
-> 0x72f7b0bda404: mov rdi, qword ptr [rbp - 0x30]
0x72f7b0bda408: xor esi, esi
0x72f7b0bda40a: call qword ptr [rip - 0xf03158]
0x72f7b0bda410: test eax, eax
(lldb) register read
General Purpose Registers:
rbx = 0x000072b799d213b8
rbp = 0x00007fff40434e70
rsp = 0x00007fff40434dd0
r12 = 0x000072b796044770
r13 = 0x000072b7962a1390
r14 = 0x000072f7afcf4fe0
r15 = 0x000072b799659e40
rip = 0x000072f7b0bda404
16 registers were unavailable.
This probably indicates that we're returning. Indeed, it appears we tried to return to 0:
However, it's quite odd that RBP appears to be the same as frame 8. That means we should be about to return to the tier0 version of a method we transitioned out of. That shouldn't be possible.
|
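To make the stack bookkeeping of the OSR listing above easier to follow, here is a minimal sketch of the arithmetic implied by its prolog (`push rax`, `sub rsp, 0x20`) and epilog (`add rsp, 0xb8`, four pops, `ret`). It is read purely off the disassembly and does not model the runtime's actual frame layout; the function names are hypothetical.

```cpp
#include <cstdint>

// Constants taken from the disassembly of the OSR method shown above.
constexpr std::uint64_t kPrologPushBytes = 8;     // push rax
constexpr std::uint64_t kPrologSubBytes  = 0x20;  // sub rsp, 0x20
constexpr std::uint64_t kEpilogAddBytes  = 0xb8;  // add rsp, 0xb8
constexpr std::uint64_t kEpilogPops      = 4;     // pop rbx, r14, r15, rbp

// RSP once the OSR prolog has finished adjusting the stack.
constexpr std::uint64_t RspAfterOsrProlog(std::uint64_t entryRsp) {
    return entryRsp - kPrologPushBytes - kPrologSubBytes;
}

// Slot that `pop rbx` reads in the epilog; the prolog stored rbx at [rsp + 0xb8].
constexpr std::uint64_t SavedRbxSlot(std::uint64_t rspAfterProlog) {
    return rspAfterProlog + kEpilogAddBytes;
}

// Slot from which the final `ret` loads its return address.
constexpr std::uint64_t ReturnAddressSlot(std::uint64_t rspAfterProlog) {
    return rspAfterProlog + kEpilogAddBytes + kEpilogPops * 8;
}

// RSP immediately after the `ret` has executed.
constexpr std::uint64_t RspAfterRet(std::uint64_t rspAfterProlog) {
    return ReturnAddressSlot(rspAfterProlog) + 8;
}
```

Under these numbers the epilog releases 0xe0 bytes while the prolog only allocated 0x28, so the `ret` reads its return address 0xb0 bytes above the RSP the OSR method was entered with, i.e. from stack that was set up before the transition. If that slot held 0, the `ret` would jump to address zero, but nothing in the listing alone shows that this is what happened.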
Another pointer: |
It looks like all other threads are suspended for GC:
and everyone is waiting for this transitioning thread to suspend. Perhaps some edge case around EE suspension/hijacking during OSR transitions? |
Yeah I have similar looking crashes where we end up trying to execute at address zero. Haven't gotten very far trying to debug. As far as I know nothing has changed in the OSR prolog generation and/or the patchpoint helper, so I think the problem lies elsewhere, but am not sure. |
We've definitely called Sadly the |
@janvorli this is similar to the crash dump I was asking you about. Also seems to be correlated with either unusual patchpoint placement (via I have never been able to repro this locally. I was going to try running with both OSR stress and GC stress to see if perhaps that would lead to a more consistent repro. |
Actually it looks like the signal handler is called on a separate stack, so I am not sure why the
* thread #1, name = 'corerun', stop reason = signal SIGSEGV
frame #0: sp=0x000072f82d64afa0 fp=0x000072f82d64d240 pc=0x000072f82d077613 libc.so.6`___lldb_unnamed_symbol3325 + 115
frame #1: sp=0x000072f82d64d100 fp=0x000072f82d64d240 pc=0x000072f82d0606ca libc.so.6`fprintf + 154
frame #2: sp=0x000072f82d64d1e0 fp=0x000072f82d64d240 pc=0x000072f82ceb4787 libcoreclr.so`PROCCreateCrashDump(argv=size=16, errorMessageBuffer=0x0000000000000000, cbErrorMessageBuffer=0, serialize=<unavailable>) + 1111 at process.cpp:2318 [opt]
frame #3: sp=0x000072f82d64d250 fp=0x000072f82d64d2d0 pc=0x000072f82ceb5a8e libcoreclr.so`::PROCCreateCrashDumpIfEnabled(signal=11, siginfo=0x000072f82d64d4b0, serialize=true) + 2942 at process.cpp:2524 [opt]
frame #4: sp=0x000072f82d64d2e0 fp=0x000072f82d64d320 pc=0x000072f82ce533a9 libcoreclr.so`invoke_previous_action(action=0x000072f82cfa6e98, code=11, siginfo=0x000072f82d64d4b0, context=0x000072f82d64d380, signalRestarts=true) + 377 at signal.cpp:394 [opt]
frame #5: sp=0x000072f82d64d330 fp=0x000072f82d64d370 pc=0x000072f82ce525f5 libcoreclr.so`sigsegv_handler(code=11, siginfo=0x000072f82d64d4b0, context=0x000072f82d64d380) + 341 at signal.cpp:630 [opt]
frame #6: sp=0x000072f82d64d380 fp=0x00007fff40434e70 pc=0x000072f82d042520 libc.so.6`___lldb_unnamed_symbol3237 + 1
* frame #7: sp=0x00007fff40434dc8 fp=0x00007fff40434e70 pc=0x0000000000000000
frame #8: sp=0x00007fff40434dd0 fp=0x00007fff40434e70 pc=0x000072f7b0bda404
...
(lldb) p (CONTEXT*)$r10
(CONTEXT *) $60 = 0x00007fff40434170
(lldb) p (CONTEXT*)($rbp-0xA8-8-0xC50)
(CONTEXT *) $61 = 0x00007fff40434170
(lldb) p *(CONTEXT*)$r10
(CONTEXT) $62 = {
Rax = 0
Rcx = 71776119061217280
Rdx = 0
Rbx = 4294967295
Rsp = 4294967295
Rbp = 4294967295
Rsi = 4294967295
Rdi = 0
R8 = 0
R9 = 34
R10 = 4294967290
R11 = 4294967299993
R12 = 0
R13 = 0
R14 = 0
R15 = 0
Rip = 0
Maybe we should try adding some debug-only code to get some more context the next time we get a dump... have you only seen this issue on linux-x64 @AndyAyersMS? |
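The lldb session above locates the CONTEXT with `$rbp-0xA8-8-0xC50` and then prints its fields in decimal, which hides their bit patterns. A small sketch that redoes the address arithmetic and renders a few of the values in hex (the helper names are hypothetical; the inputs are copied from the dump):

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Recompute the CONTEXT address from the frame pointer, using the same
// offsets as the lldb expression `$rbp-0xA8-8-0xC50`.
constexpr std::uint64_t ContextAddressFromRbp(std::uint64_t rbp) {
    return rbp - 0xA8 - 8 - 0xC50;
}

// Format a 64-bit value as zero-padded lowercase hex.
inline std::string ToHex(std::uint64_t value) {
    char buf[19];  // "0x" + 16 hex digits + terminating NUL
    std::snprintf(buf, sizeof buf, "0x%016llx",
                  static_cast<unsigned long long>(value));
    return buf;
}
```

With rbp = 0x00007fff40434e70 the computed address is 0x00007fff40434170, matching r10. The first subtraction alone gives 0x00007fff40434dc8, which is the sp reported for frame 7. In hex, Rbx/Rsp/Rbp/Rsi are 0x00000000ffffffff, R10 is 0x00000000fffffffa and Rcx is 0x00ff000000000000. None of the stack-related fields resembles the 0x00007fff... addresses seen in the register dump, so whatever this memory holds, it does not look like a snapshot of this thread's registers.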
One thing I wonder is if there should be a volatile read on
runtime/src/coreclr/vm/jithelpers.cpp Lines 5160 to 5163 in 5fe1a56
to
PCODE osrMethodCode = ppInfo->m_osrMethodCode;
if (ppInfo->m_osrMethodCode == NULL)
{
    ...
}
// use osrMethodCode that can still be null
This is more of a theoretical problem though, I do not see this codegen in my dump. |
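For illustration, the hazard described above goes away if the field is loaded exactly once and every later decision uses that local copy. The sketch below is an assumption-laden stand-in, not the runtime's code: it uses standard C++ `std::atomic` rather than whatever mechanism the runtime would use, and `PatchpointInfoSketch`, `CompileOsrMethodSketch` and `GetOrCreateOsrCode` are hypothetical names.

```cpp
#include <atomic>
#include <cstdint>

using PCODE = std::uintptr_t;  // stand-in for the runtime's PCODE typedef

// Hypothetical, simplified stand-in for the per-patchpoint data.
struct PatchpointInfoSketch {
    std::atomic<PCODE> m_osrMethodCode{0};
};

// Hypothetical placeholder for "JIT the OSR method"; returns a fake address.
inline PCODE CompileOsrMethodSketch() { return 0x1000; }

inline PCODE GetOrCreateOsrCode(PatchpointInfoSketch& info) {
    // Single atomic load: the compiler may not split this into two reads
    // of the field, so the null check and the later use see the same value.
    PCODE code = info.m_osrMethodCode.load(std::memory_order_acquire);
    if (code == 0) {
        PCODE fresh = CompileOsrMethodSketch();
        PCODE expected = 0;
        // Publish our result unless another thread got there first; on
        // failure `expected` is updated to the value that thread stored.
        if (info.m_osrMethodCode.compare_exchange_strong(
                expected, fresh,
                std::memory_order_acq_rel, std::memory_order_acquire)) {
            code = fresh;
        } else {
            code = expected;
        }
    }
    return code;  // non-null as long as compilation produced an address
}
```

Whether the real helper needs this is exactly the open question in the comment above; the sketch only shows what a single-read version looks like.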
Yeah, I believe it is linux only (though perhaps also on arm64) so changes in clang behavior might be a possibility. |
Can you link some of the other reports/dumps? |
This is the one I was looking at: https://dev.azure.com/dnceng-public/public/_build/results?buildId=650270&view=ms.vss-test-web.build-test-results-tab |
This is likely the same problem but hasn't happened recently: #98292 |
The other dump similarly has all threads but the transitioning one waiting for the EE to be suspended. |
Hopefully will help with diagnosing dotnet#101060 once we get a new dump.
This other dump has a 0 entry for the OptimizedTier1 code which seems surprising:
The
At least
|
FYI this is the code versioning issue fix: #94542. Roughly corresponds to when I recall this OSR issue starting to pop up. |
There's a new dump over in https://dev.azure.com/dnceng-public/public/_build/results?buildId=656632&view=ms.vss-test-web.build-test-results-tab. The crashing registers look like:
The stored
Everything is reflected properly except In the dump these are the first instructions of that OSR function that is being resumed at:
Notice that in particular we can see that the last instruction there has not been executed yet since |
Tagging subscribers to this area: @mangod9 |
@jakobbotsch which dump did you look at? I cannot make any progress with the LibraryImportGenerator.Unit.Tests crashdumps... |
I was looking at the |
I wonder if the xml test stack overflow is related? It is also using OSR stress: https://helixre107v0xdcypoyl9e7f.blob.core.windows.net/dotnet-runtime-refs-heads-main-3756f90697b746dfba/System.Runtime.Serialization.Xml.ReflectionOnly.Tests/1/console.f8de9632.log?helixlogtype=result |
…otnet#101537) Hopefully will help with diagnosing dotnet#101060 once we get a new dump.
Failed in runtime-jit-experimental on Linux x64.
Error message:
cc @dotnet/jit-contrib.