Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update MicrosoftNETCoreCoreDisToolsVersion to 1.4.0 #96291

Merged
merged 1 commit into from
Jan 3, 2024

Conversation

BruceForstall
Copy link
Member

No description provided.

@ghost ghost assigned BruceForstall Dec 23, 2023
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Dec 23, 2023
@ghost
Copy link

ghost commented Dec 23, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

null

Author: BruceForstall
Assignees: BruceForstall
Labels:

area-CodeGen-coreclr

Milestone: -

@BruceForstall
Copy link
Member Author

/azp run runtime-coreclr outerloop, runtime-coreclr gcstress0x3-gcstress0xc, runtime-coreclr superpmi-diffs, runtime-coreclr r2r

Copy link

Azure Pipelines successfully started running 4 pipeline(s).

@BruceForstall
Copy link
Member Author

/azp run runtime-coreclr outerloop, runtime-coreclr gcstress0x3-gcstress0xc, runtime-coreclr superpmi-diffs, runtime-coreclr r2r

Copy link

Azure Pipelines successfully started running 4 pipeline(s).

@BruceForstall
Copy link
Member Author

@janvorli Can you help me figure out how to debug a problem with this change?

This PR updates coredistools.dll/so/dylib to a newly built version. This PR includes #96286, which has various related changes but doesn't bump the coredistools package version. That PR passes GCStress.

This PR fails on linux x64 GCStress (maybe others), apparently during tests which are run out of process, so are spawned. Swapping in an older coredistools succeeds.

For example (in my local WSL build/run):

export DOTNET_TieredCompilation=0
export DOTNET_GCStress=4
...
brucefo@Megatron:~/gh/runtime/artifacts/tests/coreclr/linux.x64.Checked/Regressions/Regressions$ export CORE_ROOT=/home/brucefo/gh/runtime/artifacts/tests/coreclr/linux.x64.Checked/Tests/Core_Root
brucefo@Megatron:~/gh/runtime/artifacts/tests/coreclr/linux.x64.Checked/Regressions/Regressions$ ./Regressions.sh

BEGIN EXECUTION
/home/brucefo/gh/runtime/artifacts/tests/coreclr/linux.x64.Checked/Tests/Core_Root/corerun -p System.Reflection.Metadata.MetadataUpdater.IsSupported=false -p System.Runtime.Serialization.EnableUnsafeBinaryFormatterSerialization=true Regressions.dll ''
13:02:44.759 Running test: Regressions/coreclr/0041/expl_double_1/expl_double_1.cmd
Fatal error. System.Runtime.InteropServices.SEHException (0x80004005): External component has thrown an exception.
   at System.Diagnostics.Process.ForkAndExecProcess(System.Diagnostics.ProcessStartInfo, System.String, System.String[], System.String[], System.String, Boolean, UInt32, UInt32, UInt32[], Int32 ByRef, Int32 ByRef, Int32 ByRef, Boolean, Boolean)
   at System.Diagnostics.Process.StartCore(System.Diagnostics.ProcessStartInfo)
   at CoreclrTestLib.CoreclrTestWrapperLib.RunTest(System.String, System.String, System.String, System.String, System.String, System.String)
   at TestLibrary.OutOfProcessTest.RunOutOfProcessTest(System.String)
   at Program.<<Main>$>g__TestExecutor1|0_2(System.IO.StreamWriter, System.IO.StreamWriter, <>c__DisplayClass0_0 ByRef)
   at Program.<Main>$(System.String[])
./Regressions.sh: line 444: 36140 Aborted                 $LAUNCHER $ExePath "${CLRTestExecutionArguments[@]}"
Expected: 100
Actual: 134
END EXECUTION - FAILED

The question is: how do I track down what is causing the 0x80004005 error? It seems I need to use lldb to get SOS functionality, and I'm not sure how to work with LLDB with multiple spawned processes as well as GCStress (so lots of "illegal" instructions). It seems that process handle --pass true --stop false --notify false SIGILL helps with GCStress, but I'm not sure how and where to stop when the actual interesting exception occurs.

Any suggestions?

@janvorli
Copy link
Member

janvorli commented Jan 2, 2024

@BruceForstall I've just found that starting with LLDB 14, there is a setting to debug child processes - at fork, the debugger switches to the spawned process and continues debugging it. You can use the following LLDB command settings set target.process.follow-fork-mode child to enable that.

However, when debugging stuff in child process, I usually try to figure out the command line to run the child standalone so that I don't have to do anything special.

There is also an env variable BuildAsStandalone that you can set to true before building the tests. That causes the tests to be built the "old way" without multiple tests being merged". So, each test gets its own .sh. I find it useful in cases like this.

@janvorli
Copy link
Member

janvorli commented Jan 2, 2024

@BruceForstall I have tried to build your branch and build the tests with env var BuildAsStandalone=true. The baseservices/exceptions/unhandled/unhandledTester/unhandledTester.sh reproduces the issue in the main process, so you don't need to fiddle with child process debugging. Btw, with the GC stress, you'll also need to treat the SIGSEGV the same way as SIGILL, both are coming from the instrumentation.

@BruceForstall
Copy link
Member Author

@janvorli Thanks! I'll go build with BuildAsStandalone=true.

@janvorli
Copy link
Member

janvorli commented Jan 2, 2024

I've also noticed that the issue occured on the first SIGILL, there were SIGSEGVs before that and they were processed ok. So just using process handle --pass true --stop false --notify false SIGSEGV will get the debugger break at the problematic spot.

@janvorli
Copy link
Member

janvorli commented Jan 2, 2024

It seems that the problem is actually that we only handle the STATUS_PRIVILEGED_INSTRUCTION as the GC stress instrumentation instruction exceptions. But the SIGILL generates STATUS_ILLEGAL_INSTRUCTION, which we don't process in any way, which explains the behavior.
The code looks like this:

    0x7fff792dcc2c: lock
    0x7fff792dcc2d: hlt
    0x7fff792dcc2e: movb   $0x11, %cl
    0x7fff792dcc30: hlt

My guess is that the "lock" is not expected there and may somehow stem from incorrect disassembly of the previous instruction when we instrument it.

@janvorli
Copy link
Member

janvorli commented Jan 2, 2024

Hmm, I guess this makes it clear:

(lldb) clru 0x7fff792dcc2c
Error: Failed to find runtime directory
Normal JIT generated code
System.Net.Sockets.SocketAsyncEventArgs.SetBuffer(System.Memory`1<Byte>)
ilAddr is 00007FFFF40DE8A8 pImport is 000000000140F570
Begin 00007FFF792DCC00, size 15e
00007fff792dcc00 55                   push    rbp
00007fff792dcc01 53                   push    rbx
00007fff792dcc02 4883ec28             sub     rsp, 0x28
00007fff792dcc06 488d6c2430           lea     rbp, [rsp + 0x30]
00007fff792dcc0b 488965d0             mov     qword ptr [rbp - 0x30], rsp
00007fff792dcc0f 48897de0             mov     qword ptr [rbp - 0x20], rdi
00007fff792dcc13 488975e8             mov     qword ptr [rbp - 0x18], rsi
00007fff792dcc17 488955f0             mov     qword ptr [rbp - 0x10], rdx
00007fff792dcc1b 40383f               cmp     byte ptr [rdi], dil
00007fff792dcc1e 488d8fac000000       lea     rcx, [rdi + 0xac]
00007fff792dcc25 baffffffff           mov     edx, 0xffffffff
00007fff792dcc2a 33c0                 xor     eax, eax
>>> 00007fff792dcc2c f0                   lock
00007fff`792dcc2d 0fb111               cmpxchg dword ptr [rcx], edx (gcstress)

Seems like we have skipped the lock and instrumented just the cmpxchg that the lock is part of.

@BruceForstall
Copy link
Member Author

Ah, interesting. The change here is that I've rebuilt coredistools with a current LLVM. That's the disassembler we're using for instrumenting during GCStress. Perhaps it has a bug disassembling "lock", or perhaps the VM needs to handle "lock" differently with the new disassembler?

@janvorli
Copy link
Member

janvorli commented Jan 2, 2024

One or the other may be true. Depends on whether the change in the coredistool / LLVM was intentional or not.

@BruceForstall BruceForstall force-pushed the UpdateCoredistoolsVersion branch from 5d807c0 to 235df7e Compare January 3, 2024 04:34
@BruceForstall
Copy link
Member Author

/azp run runtime-coreclr outerloop, runtime-coreclr gcstress0x3-gcstress0xc, runtime-coreclr superpmi-diffs, runtime-coreclr r2r

Copy link

Azure Pipelines successfully started running 4 pipeline(s).

@BruceForstall
Copy link
Member Author

GCStress failures are #94393, #96364

@BruceForstall
Copy link
Member Author

@janvorli Thanks again for your help investigating.

There was code in coredistools to specially handle x86 prefixes, to work around an LLVM bug that didn't treat the prefixes as part of the prefixed instruction. I removed this code because it was written 7-8 years ago, and various information on the internet made it sound like the LLVM bug had been fixed. Apparently it has not been fixed, at least for lock. It is surprising that the problem only manifested on Linux and not Windows -- maybe we don't use lock on Windows?

In any case, I restored the removed code and everything looks good now.

@BruceForstall BruceForstall changed the title Update MicrosoftNETCoreCoreDisToolsVersion to 1.3.0 Update MicrosoftNETCoreCoreDisToolsVersion to 1.4.0 Jan 3, 2024
@BruceForstall BruceForstall force-pushed the UpdateCoredistoolsVersion branch from 235df7e to 5628601 Compare January 3, 2024 17:34
@BruceForstall
Copy link
Member Author

@jakobbotsch @kunalspathak @dotnet/jit-contrib PTAL

Copy link
Member

@kunalspathak kunalspathak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@BruceForstall BruceForstall merged commit 9459844 into dotnet:main Jan 3, 2024
193 of 197 checks passed
@BruceForstall BruceForstall deleted the UpdateCoredistoolsVersion branch January 3, 2024 20:31
@github-actions github-actions bot locked and limited conversation to collaborators Feb 3, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants