Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[tests][eventpipe] Port over Dotnet/Diagnostics to enable eventpipe tests on Android #64358

Merged
merged 27 commits into from
Mar 2, 2022

Conversation

mdh1418
Copy link
Member

@mdh1418 mdh1418 commented Jan 26, 2022

In effort to enable diagnostics tests for mobile (specifically Android) in #60879, it was realized that some changes to Microsoft.Diagnostics.NETCore.Client are needed in order for the eventpipe runtime tests to work (bigevent, buffersize, eventsourceerror). However, some of the changes (including a need to update the runtime version used in Dotnet/Diagnostics to detect Android as the platform) are expected to lead to a significant amount of changes for everything to build properly. As such, we look to port over the barebone necessities from Dotnet/Diagnostics in order to get the eventpipe runtime tests working after #60879.

This PR has the following changes/commits:

  1. Ports over the complete set of files from Dotnet/Diagnostics/src/Microsoft.Diagnostics.NETCore.Client/

  2. Modify some files for TCPIP support and wrap with a runtime defined preprocessor directive so that the changes would not surface in the upstream (Dotnet/Diagnostics/src/Microsoft.Diagnostics.NETCore.Client), disable CS1591, CS8073, and CS0162 warnings in the previously mentioned preprocessor directive, Avoids CS0108 by wrapping with NET6_0 preprocessor directive,

  3. Removes Microsoft.Diagnostics.NETCore.Client package dependency from tests as the local files are an exact copy and avoids package version conflict.

  4. Ports over a couple of IpcTraceTest changes from Dotnet/Diagnostics/src/eventpipe/common/IpcTraceTest.cs, including DiagnosticsClient adaptation needed for diagnostics to work on Android.

  5. Hooks in TCP/IP support into the eventpipe tests through IpcTraceTest by leveraging the DiagnosticsClient internals made accessible to src/tests/tracing/eventpipe/common/common.csproj through AssemblyInfo.cs

  6. Disables eventpipe tests requiring subprocesses on Android as there is no current functionality to run multiple apps at the same time.

  7. Updates the 5 eventpipe tests (bigevent, buffersize, eventsourceerror, processinfo, processinfo2) that now run on Android, and updates remaining tests (rundownvalidation, providervalidation, gcdump, debuginfo/tester) to use local Microsoft.Diagnostics.NETCore.Client source files.

  8. Makes sure collect trace 2 command parsing logic won't end up in reading its provider name and config filter strings from an unaligned address

Testing

The diagnostics tests pass locally on Android x64, Windows x64, MacCatalyst x64.

@ghost ghost assigned mdh1418 Jan 26, 2022
@mdh1418 mdh1418 changed the title WIP diagnostics port [tests][eventpipe] Port over minimal code from Dotnet/Diagnostics to enable eventpipe tests on Android Jan 28, 2022
@mdh1418 mdh1418 marked this pull request as ready for review January 28, 2022 00:05
src/tests/tracing/eventpipe/common/IpcUtils.cs Outdated Show resolved Hide resolved
src/tests/tracing/eventpipe/common/EventPipeProvider.cs Outdated Show resolved Hide resolved
src/tests/tracing/eventpipe/common/IpcHeader.cs Outdated Show resolved Hide resolved
src/tests/tracing/eventpipe/common/IpcUnixDomain.cs Outdated Show resolved Hide resolved
src/tests/tracing/eventpipe/common/IpcTransport.cs Outdated Show resolved Hide resolved
src/tests/tracing/eventpipe/common/IpcSocket.cs Outdated Show resolved Hide resolved
src/tests/tracing/eventpipe/common/IpcUnixDomain.cs Outdated Show resolved Hide resolved
@josalem
Copy link
Contributor

josalem commented Jan 28, 2022

If we are going to maintain a copy of Microsoft.Diagnostics.NetCore.Client in the runtime repository for testing, I would prefer if it were wholesale or not at all. Having code differences or drift over time will make things harder to maintain eventually.

I'd suggest we create a new directory (perhaps src/tests/tracing/eventpipe/common/Microsoft.Diagnostics.NetCore.Client) and copy the entirety of the library from the diagnostics repo. From there we can edit tests as needed to depend on the local copy rather than the external one. This minimizes changes to existing tests while allowing you to light up mobile tests one-by-one. I'd also make sure to leave sufficient documentation of what the last commit and date were when the copy took place, so that maintainers don't need to guess when code is from.

I would also suggest that any changes specific to testing in the runtime repo that are not going to be merged back into the diagnostics repo be put in their own library that depends on the local copy of M.D.NetCore.Client. If you need access to internals of the libraries, you can use the InternalsVisibleTo property to expose them to arbitrary projects.

CC @dotnet/dotnet-diag

@mikem8361
Copy link
Member

mikem8361 commented Jan 28, 2022 via email

@josalem
Copy link
Contributor

josalem commented Jan 28, 2022

Why not just reference the package we publish on the dotnet-tools feed? It is a little old @hoyosjs could probably tell us why.

There are things we do in the tests that require us to inspect parts of the IPC protocol etc. that the library doesn't expose. We also don't want to get into a cyclic dependency where we need to update the library in dotnet/diagnostics before we can test a feature that is being added in dotnet/runtime. This leaves us room to use the internals of the library, e.g., inspecting IPC message payloads and headers, using InternalsVisibleTo, while building tests around functionality the library doesn't support yet.

@hoyosjs
Copy link
Member

hoyosjs commented Jan 28, 2022

The decision to keep a local version for agility was from the perspective of runtime devs. I'd say we should really get the tests to use the library now that it's matured enough. Diagnostics builds nightly at the very least, and there's no major dependencies here where I'd foresee a blocked insertion.

@josalem
Copy link
Contributor

josalem commented Jan 28, 2022

I'd say we should really get the tests to use the library now that it's matured enough

Certain tests can, but some tests are doing things at a lower level than what's exposed by the library.

If this is solely to light up BigEvent, EventProvider, EventSourceError, BufferSize, and similar on mobile, then we should just use the external package for those tests. Other tests should remain unchanged to make sure they're validating at the lower level.

@lateralusX
Copy link
Member

lateralusX commented Jan 28, 2022

As previously discussed in meetings, we will need to do TCP/IP extensions to the library in order to run test on mobile platforms, we would need to extend constructors accepting TCP/IP based URI's, so instead of extending the diagnostic client in diagnostic repro, it was decided to cut the circular dependency between runtime repro and diagnostic repro and extend a local version of diagnostic client in dotnet runtime repro. Current tests are also depending on nuget reference that is an old diagnostic client, where we only have the nuget packages available but no source code.

@josalem, @hoyosjs before we spend more time on this we will need to make up our minds on how to proceed (since we have been pointed in different directions). If we believe we still want to extend the diagnostic client in diagnostic repro with TCP/IP support and use that when testing, we will need to update the tests to use a new diagnostic repro builds. That means we will need support to configure TCP/IP as an alternative diagnostic protocol and since it will be accessible through a public constructor it might leak into the public (maybe we could make it protected and use reflection in dotnet runtime as an alternative). We would also need to get changes merged into the diagnostic repro and then upgrade tests in dotnet runtime to use new diagnostic client including the extended TCP/IP support.

@noahfalk
Copy link
Member

The tests work locally because a modified manual build of dotnet/diagnostics was used. (#60879 (comment))

Is this modified version in a branch somewhere we could look at? I'm hoping to understand what changes were necessary to get the tests running. Based on the description at the top of this PR it sounded like the M.D.N.C code added here is a subset of the public API/functionality available in the official version. However if I understood that part correctly then I didn't understand why the original NuGet package would not also be workable.

@noahfalk
Copy link
Member

As previously discussed in meetings, we will need to do TCP/IP extensions to the library in order to run test on mobile platforms

Apologies to ask for the repeat since I wasn't at the meeting, was there a plan in place for how we would release this functionality to customers, or was TCP/IP purely an internal feature for testing that customers would never have access to?

@lateralusX
Copy link
Member

As previously discussed in meetings, we will need to do TCP/IP extensions to the library in order to run test on mobile platforms

Apologies to ask for the repeat since I wasn't at the meeting, was there a plan in place for how we would release this functionality to customers, or was TCP/IP purely an internal feature for testing that customers would never have access to?

This is currently for internal testing in dotnet runtime CI only, running EventPipe runtime tests on mobile (Android/iOS) that needs TCP/IP support in diagnostic client and should currently not be available to customers.

@noahfalk
Copy link
Member

should currently not be available to customers

Is the plan that it will eventually be available to them or they will use EventPipe on Android some other way?

@lateralusX
Copy link
Member

The tests work locally because a modified manual build of dotnet/diagnostics was used. (#60879 (comment))

Is this modified version in a branch somewhere we could look at? I'm hoping to understand what changes were necessary to get the tests running. Based on the description at the top of this PR it sounded like the M.D.N.C code added here is a subset of the public API/functionality available in the official version. However if I understood that part correctly then I didn't understand why the original NuGet package would not also be workable.

@mdh1418 can reference the changes he originally did to get diagnostic client in diagnostic repro working together with runtime tests. We need an extension to the library and current nuget package used by dotnet/runtime won't work, it is an old nuget, that we don't even have any source code hooked up to. Since we will need to enable TCP/IP support in upstream diagnostic client, merge that, upgrade dotnet/runtime to take dependency on new nuget package from diagnostic repro for tests to run on mobile in dotnet/runtime CI. @josalem suggested that we should cut the dependency between repository, put a copy of diagnostic client into dotnet/runtime repository, add the extensions needed into that copy and make dotnet/runtime CI independent of diagnostic repro in order to run the low level CoreCLR runtime tests on CI.

For us, we can go either way, but if we think we should keep dependency on diagnostic repro from dotnet/runtime repro we would need to include code to enable TCP/IP in the diagnostic client, get it merged, update EventPipe tests on dotnet/runtime to use new diagnostic client (instead of old nuget that we don't have any source code for anymore).

@lateralusX
Copy link
Member

lateralusX commented Jan 28, 2022

should currently not be available to customers

Is the plan that it will eventually be available to them or they will use EventPipe on Android some other way?

It depends if we think we should include TCP/IP support into all our diagnostic client tooling or not. In .net6 we only added support into dsrouter, reducing the need to do compliance work for all diagnostic client tools that continued to use named pipes or unix domain sockets only (via dsrouter in mobile scenarios). Reduced the amount of needed compliance work a lot.

@noahfalk
Copy link
Member

Thanks @lateralusX, now the pieces are clicking better.

I can see pros and cons of both the source copy or the binary reference so I'm fine with whichever you guys decide. I agree with @josalem above that if we are going to do the source copy route then we should aim to make resyncing the repos as easy as possible - ideally it would just be a direct file copy from one repo to the other. If you want to add tcp support to the upstream version of the library that seems fine to me. I think it could be done using an internal API which could either be invoked via reflection, InternalsVisibleTo, or the client source is directly compiled into the test assembly so that there is no assembly boundary to cross.

@mdh1418
Copy link
Member Author

mdh1418 commented Jan 28, 2022

The tests work locally because a modified manual build of dotnet/diagnostics was used. (#60879 (comment))

Is this modified version in a branch somewhere we could look at? I'm hoping to understand what changes were necessary to get the tests running. Based on the description at the top of this PR it sounded like the M.D.N.C code added here is a subset of the public API/functionality available in the official version. However if I understood that part correctly then I didn't understand why the original NuGet package would not also be workable.

dotnet/diagnostics#2833
I originally withheld pushing up a pull request before the decision was made to port over files to dotnet/runtime.
I apologize for making this pull request confusing by porting over files from the local branch (It completely slipped my mind that I was pulling files from my branch and not main) I'll rebase and make it so that the commits show the original files, then my diagnostics changes, and then footprint reduction later if we decide to follow through with porting over to runtime.

It doesn't fully work as there needs to be a way to identify Android as the running platform. I originally made it work locally by forcing the Android options when building the diagnostics repo locally, but I didnt manage to detect the Android platform as OperatingSystem used by diagnostics didnt recognize Android

@josalem
Copy link
Contributor

josalem commented Jan 28, 2022

Apologies for making the plan a little murky! My main point was mostly that we should make a full copy of the diagnostics library to make maintenance easier.

As for any other changes, I want to avoid the following workflow:

  1. Add new feature in runtime
  2. make change to support new feature in diagnostics
  3. do release of diagnostics repo
  4. update dependency version in runtime
  5. add test to runtime

I want to be able to still do the following:

  1. Add new feature + test for feature in runtime
  2. Update library in diagnostics to support new feature (hopefully the code for the test in step 1 is effectively the same as this)

@noahfalk
Copy link
Member

I apologize for making this pull request confusing

No worries and no apologies needed. Minor miscommunications happen all the time and are trivially worked out : )

@lateralusX
Copy link
Member

lateralusX commented Jan 31, 2022

@noahfalk @josalem @hoyosjs, so do we agree that we should continue down the path started by this PR, making a local copy of diagnostic client in dotnet/runtime repro and extend that, making it clear what is upstream and what is extension to the library in dotnet/runtime, making future sync of the code bases simple, to support mobile platforms in EventPipe runtime tests in dotnet/runtime?

@noahfalk
Copy link
Member

noahfalk commented Feb 1, 2022

do we agree that we should continue down the path started by this PR ... making future sync of the code bases simple

I agree, and @josalem is OOF but as best I understand he agreed too. I think both he and I were primarily worried about making the sync of source simple and we are happy as long as they stay more closely aligned.
@hoyosjs your last suggestion sounded like you were advocating to use a binary dependency. Not sure if you had any further chance to discuss with @josalem before he left, or if anything in the discussion changed your opinion?

@mdh1418
Copy link
Member Author

mdh1418 commented Feb 2, 2022

@noahfalk @hoyosjs @lateralusX Can I have another set of 👀 to look over the changes?
I modified the PR with the first commit being a direct copy of Dotnet/Diagnostics/Microsoft.Diagnostics.NETCore.Client at dotnet/diagnostics@99fab30
The second commit extends with changes from dotnet/diagnostics#2833 and I will also address some of the comments from that PR here in later commits.
The 3rd 4th and 5th commits are temporary to avoid the compiler warnings, and I would appreciate advice in how to address all of the public methods/variables and the best way to handle the hides inherited member error.

@mdh1418 mdh1418 changed the title [tests][eventpipe] Port over minimal code from Dotnet/Diagnostics to enable eventpipe tests on Android [tests][eventpipe] Port over Dotnet/Diagnostics to enable eventpipe tests on Android Feb 2, 2022
@mdh1418
Copy link
Member Author

mdh1418 commented Feb 22, 2022

@josalem @noahfalk @hoyosjs @lateralusX
It appears that segmenting the tests at a test file level is not possible for the runtime tests, as they will all be compiled on the target-agnostic managed test leg (and therefore ignore the preprocessor directives route for segmenting the tests) to be then downloaded by each test leg to run.

There is an option of having platform-variant source code by splitting the projects into two pairs (perhaps bigevent.csproj and bigevent_legacy.csproj ?) and then marking the project with CLRTestTargetUnsupported depending on the platform/architecture or disabling the project in issues.targets. Exemplified by https://github.com/dotnet/runtime/blob/main/src/tests/JIT/Methodical/ELEMENT_TYPE_IU/i_array_merge_Target_32BIT_il_d.ilproj and https://github.com/dotnet/runtime/blob/main/src/tests/JIT/Methodical/ELEMENT_TYPE_IU/i_array_merge_Target_64BIT_il_d.ilproj

I believe our options to push this PR to the finish line are:

  1. Almost duplicating the 7 test projects to run only on Linux arm CoreCLR (No dropped test coverage, but somewhat superfluous projects)
  2. Disabling the 7 test projects only on Linux arm CoreCLR (Quickest, but drops some test coverage)
  3. Making the tests work on Linux arm CoreCLR (Feels the most correct, but I am unable to do this and even if I had linux arm machine, it would take the longest)

To me, option 1 looks the best until someone is able to test on Linux arm.

@noahfalk
Copy link
Member

@hoyosjs - I think you have more expertise with what is possible in our runtime tests than I. I think (1) is a fine option unless you have a better suggestion?

Thanks @mdh1418 👍 I'd say give Juan 24 hours to respond, and if he is busy with other things and can't get to it then assume option 1 is good : )

@hoyosjs
Copy link
Member

hoyosjs commented Feb 23, 2022

Option 1 is ok here if we can't figure out 3. I am looking at the dumps to see what happened with 3.

@hoyosjs
Copy link
Member

hoyosjs commented Feb 24, 2022

Finally managed to get enough parts for the dump:

The main testing thread is here:

(lldb) clrstack -f
OS Thread Id: 0x3df (1)
Child SP       IP Call Site
FFB72AF8 F28CAF04 libpthread.so.0!__libc_do_syscall + 4 at /build/glibc-lMNZxQ/glibc-2.27/sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
FFB72B00 F28C6072 libpthread.so.0!__pthread_cond_wait + 342 at /build/glibc-lMNZxQ/glibc-2.27/sysdeps/unix/sysv/linux/futex-internal.h:88
FFB72B00 F28C605A libpthread.so.0!__pthread_cond_wait + 318 at /build/glibc-lMNZxQ/glibc-2.27/sysdeps/unix/sysv/linux/futex-internal.h:87
FFB72B00 F28C5FF8 libpthread.so.0!__pthread_cond_wait + 220 at /build/glibc-lMNZxQ/glibc-2.27/nptl/pthread_cond_wait.c:480
FFB72B80 F2547872 libcoreclr.so!CorUnix::CPalSynchronizationManager::ThreadNativeWait(CorUnix::_ThreadNativeWaitData*, unsigned int, CorUnix::ThreadWakeupReason*, unsigned int*) + 426 at /__w/1/s/src/coreclr/pal/src/synchmgr/synchmanager.cpp:478
FFB72BD8 F254704E libcoreclr.so!CorUnix::CPalSynchronizationManager::BlockThread(CorUnix::CPalThread*, unsigned int, bool, bool, CorUnix::ThreadWakeupReason*, unsigned int*) + 538 at /__w/1/s/src/coreclr/pal/src/synchmgr/synchmanager.cpp:301
FFB72C30 F254E6B2 libcoreclr.so!CorUnix::InternalWaitForMultipleObjectsEx(CorUnix::CPalThread*, unsigned int, void* const*, int, unsigned int, int, int) + 2266 at /__w/1/s/src/coreclr/pal/src/synchmgr/wait.cpp:646
FFB72D20 F254EEAC libcoreclr.so!WaitForMultipleObjectsEx + 164 at /__w/1/s/src/coreclr/pal/src/synchmgr/wait.cpp:201
FFB72D68 F21BF924 libcoreclr.so!Thread::DoAppropriateWaitWorker(int, void**, int, unsigned int, WaitMode) + 604 at /__w/1/s/src/coreclr/vm/threads.cpp:0
FFB72D68 F21BF916 libcoreclr.so!Thread::DoAppropriateWaitWorker(int, void**, int, unsigned int, WaitMode) + 590 at /__w/1/s/src/coreclr/vm/threads.cpp:3398
FFB72DE8 F21BA36A libcoreclr.so!Thread::DoAppropriateWait(int, void**, int, unsigned int, WaitMode, PendingSync*) + 202 at /__w/1/s/src/coreclr/vm/threads.cpp:3247
FFB72DE8 F21BA35A libcoreclr.so!Thread::DoAppropriateWait(int, void**, int, unsigned int, WaitMode, PendingSync*) + 186 at /__w/1/s/src/coreclr/vm/threads.cpp:3247
FFB72E58 F22FE0D4 libcoreclr.so!CLREventBase::WaitEx(unsigned int, WaitMode, PendingSync*) + 148 at /__w/1/s/src/coreclr/vm/synch.cpp:459
FFB72EC0 F21C0BD8 libcoreclr.so!Thread::Block(int, PendingSync*) + 88 at /__w/1/s/src/coreclr/vm/threads.cpp:4010
FFB72EC0 F21C0BBA libcoreclr.so!Thread::Block(int, PendingSync*) + 58 at /__w/1/s/src/coreclr/vm/threads.cpp:4004
FFB72ED8 F21B7284 libcoreclr.so!SyncBlock::Wait(int) + 628 at /__w/1/s/src/coreclr/vm/syncblk.cpp:0
FFB72F70 F2490AAC libcoreclr.so!ObjectNative::WaitTimeout(int, Object*) + 192 at /__w/1/s/src/coreclr/classlibnative/bcltype/objectnative.cpp:0
FFB72F70 F2490AA4 libcoreclr.so!ObjectNative::WaitTimeout(int, Object*) + 184 at /__w/1/s/src/coreclr/vm/object.h:212
FFB72FC4          [HelperMethodFrame_1OBJ: ffb72fc4] System.Private.CoreLib.dll!System.Threading.Monitor.ObjWait(Int32, System.Object)
FFB73060 EA8B2A68 System.Private.CoreLib.dll!System.Threading.Monitor.Wait(System.Object, Int32) + 40
FFB73080 EA8C60CE System.Private.CoreLib.dll!System.Threading.ManualResetEventSlim.Wait(Int32, System.Threading.CancellationToken) + 622
FFB730E4 EA8E07FE
FFB730E8 EA8E07FE System.Private.CoreLib.dll!System.Threading.Tasks.Task.SpinThenBlockingWait(Int32, System.Threading.CancellationToken) + 174
FFB7311C EA8E0612
FFB73120 EA8E0612 System.Private.CoreLib.dll!System.Threading.Tasks.Task.InternalWaitCore(Int32, System.Threading.CancellationToken) + 322
FFB73154 EA8E0184
FFB73158 EA8E0184 System.Private.CoreLib.dll!System.Threading.Tasks.Task.Wait(Int32, System.Threading.CancellationToken) + 52
FFB7316C EA8E36F6
FFB73170 EA8E36F6 System.Private.CoreLib.dll!System.Threading.Tasks.Task.WaitAnyCore(System.Threading.Tasks.Task[], Int32, System.Threading.CancellationToken) + 150
FFB7319C EA8E34EA
FFB731A0 EA8E34EA System.Private.CoreLib.dll!System.Threading.Tasks.Task.WaitAny(System.Threading.Tasks.Task[]) + 42
FFB731C0 E7D62CB6 common.dll!Tracing.Tests.Common.IpcTraceTest.Validate() + 1162 [/__w/1/s/src/tests/tracing/eventpipe/common/IpcTraceTest.cs @ 290]
FFB73270 E9B594EE common.dll!Tracing.Tests.Common.IpcTraceTest.RunAndValidateEventCounts(System.Collections.Generic.Dictionary`2<System.String,Tracing.Tests.Common.ExpectedEventCount>, System.Action, System.Collections.Generic.List`1<Microsoft.Diagnostics.NETCore.Client.EventPipeProvider>, Int32, System.Func`2<Microsoft.Diagnostics.Tracing.EventPipeEventSource,System.Func`1<Int32>>) + 130 [/__w/1/s/src/tests/tracing/eventpipe/common/IpcTraceTest.cs @ 401]
FFB732A0 EBFEB420 buffersize.dll!Tracing.Tests.BufferValidation.BufferValidation.Main(System.String[]) + 460 [/__w/1/s/src/tests/tracing/eventpipe/buffersize/buffersize.cs @ 45]

Which is just waiting for the reads to happen:

Task.WaitAny(readerTask, waitSentinelEventTask);

The faulting thread is in:

* thread #3, stop reason = signal SIGABRT
  * frame #0: 0xf28caf04 libpthread.so.0`__libc_do_syscall at libc-do-syscall.S:46
    frame #1: 0xf28c9caa libpthread.so.0`__waitpid at waitpid.c:30
    frame #2: 0xf28c9c94 libpthread.so.0`__waitpid(pid=1062, stat_loc=0xf15fe550, options=0)
    frame #3: 0xf2555b8c libcoreclr.so`PROCCreateCrashDump(argv=<unavailable>) at process.cpp:3260:22
    frame #4: 0xf25563de libcoreclr.so`::PROCCreateCrashDumpIfEnabled(signal=<unavailable>) at process.cpp:3425:9
    frame #5: 0xf25535ee libcoreclr.so`::PROCAbort(signal=6) at process.cpp:3451:5
    frame #6: 0xf25534ae libcoreclr.so`::RaiseFailFastException(pExceptionRecord=<unavailable>, pContextRecord=<unavailable>, dwFlags=<unavailable>) at process.cpp:1266:5
    frame #7: 0xf23d902a libcoreclr.so`::_DbgBreakCheck(LPCSTR, int, LPCSTR, BOOL) [inlined] TerminateOnAssert() at debug.cpp:189:5
    frame #8: 0xf23d901c libcoreclr.so`::_DbgBreakCheck(szFile=0xf2025ee0, iLine=43, szExpr=0xf2035f6b, fConstrained=<unavailable>) at debug.cpp:424
    frame #9: 0xf23d94c2 libcoreclr.so`_DbgBreakCheckNoThrow(szFile=0xf2025ee0, iLine=43, szExpr=0xf2035f6b, fConstrained=NO) at debug.cpp:534:18
    frame #10: 0xf23d978c libcoreclr.so`::DbgAssertDialog(szFile=0xf2025ee0, iLine=43, szExpr=<unavailable>) at debug.cpp:0
    frame #11: 0xf24e92e4 libcoreclr.so`EESocketCleanupHelper(bool) [inlined] GetThread() at threads.inl:43:5
    frame #12: 0xf24e92ca libcoreclr.so`EESocketCleanupHelper(isExecutingOnAltStack=<unavailable>) at ceemain.cpp:583
    frame #13: 0xf2501c38 libcoreclr.so`invoke_previous_action(action=0xf25d5fa0, code=7, siginfo=0xf15fe790, context=0xf15fe810, signalRestarts=<unavailable>) at signal.cpp:429:5
    frame #14: 0xf25010fc libcoreclr.so`sigbus_handler(code=7, siginfo=0xf15fe790, context=0xf15fe810) at signal.cpp:690:5
    frame #15: 0xf262c760 libc.so.6 at sigrestorer.S:77
    frame #16: 0xf2515ebc libcoreclr.so`UTF8Encoding::GetByteCount(this=0xf15feb44, chars=u"SentinelEventSource", count=<unavailable>) at utf8.cpp:2748:17
    frame #17: 0xf2515392 libcoreclr.so`::UnicodeToUTF8(lpSrcStr=u"SentinelEventSource", cchSrc=20, lpDestStr=0x00000000, cchDest=0) at utf8.cpp:2915:19
    frame #18: 0xf2515006 libcoreclr.so`::WideCharToMultiByte(CodePage=65001, dwFlags=<unavailable>, lpWideCharStr=u"SentinelEventSource", cchWideChar=20, lpMultiByteStr=0x00000000, cbMultiByte=0, lpDefaultChar=0x00000000, lpUsedDefaultChar=NO) at unicode.cpp:349:18
    frame #19: 0xf24e1b20 libcoreclr.so`ep_rt_utf16_to_utf8_string(str=0xf0c0483f, len=4294967295) at ep-rt-coreclr.h:2617:22
    frame #20: 0xf24e1890 libcoreclr.so`eventpipe_collect_tracing_command_try_parse_config(buffer=0xf15fed00, buffer_len=0xf15fecfc, result=<unavailable>) at ds-eventpipe-protocol.c:182:24
    frame #21: 0xf24dd502 libcoreclr.so`ds_eventpipe_protocol_helper_handle_ipc_message(_DiagnosticsIpcMessage*, _DiagnosticsIpcStream*) [inlined] eventpipe_collect_tracing2_command_try_parse_payload(buffer=<unavailable>, buffer_len=<unavailable>) at ds-eventpipe-protocol.c:292:4
    frame #22: 0xf24dd404 libcoreclr.so`ds_eventpipe_protocol_helper_handle_ipc_message(_DiagnosticsIpcMessage*, _DiagnosticsIpcStream*) [inlined] ds_ipc_message_try_parse_payload(message=0xf15fed68)(unsigned char*, unsigned short)) at ds-protocol.c:478
    frame #23: 0xf24dd3ec libcoreclr.so`ds_eventpipe_protocol_helper_handle_ipc_message(_DiagnosticsIpcMessage*, _DiagnosticsIpcStream*) [inlined] eventpipe_protocol_helper_collect_tracing_2(message=0xf15fed68, stream=0xf0c047d0) at ds-eventpipe-protocol.c:462
    frame #24: 0xf24dd3ec libcoreclr.so`ds_eventpipe_protocol_helper_handle_ipc_message(message=0xf15fed68, stream=0xf0c047d0) at ds-eventpipe-protocol.c:535
    frame #25: 0xf24dc74a libcoreclr.so`server_thread(data=<unavailable>) at ds-server.c:158:4
    frame #26: 0xf2559460 libcoreclr.so`CorUnix::CPalThread::ThreadEntry(pvParam=0x0e65b8e0) at thread.cpp:1829:16
    frame #27: 0xf28c1614 libpthread.so.0`start_thread(arg=0x04805997) at pthread_create.c:463
    frame #28: 0xf269f7fc libc.so.6 at clone.S:73

We get a SIGBUS for reading at 0xf0c04845, which is part of the ep_rt_utf16_to_utf8_string requested by eventpipe_collect_tracing_command_try_parse_config; This shuts down the PAL, and we hit the assert because there's no managed thread. This can only happen if isExecutingOnAltStack==true, but g_enable_alternate_stack_check is false when we don't have mach exceptions, so we are always on AltStack mode. That means that for SigBus, we'll always AV further down in PAL shutdown in EESocketCleanupHelper. cc: @janvorli

I am not sure what caused the sigbus. [r1] and [r1]+4 are available in the memory I see that's the Sentinel provider name string.

    0xf2515eba <+498>: bhs    0xf2515f3c                ; <+628> at utf8.cpp:2748:24
->  0xf2515ebc <+500>: ldrd   r3, r2, [r1]

I can't get this to reproduce locally - it repros relatively consistenty in CI though. I've requested a CI machine to try to look if it reproduces more easily there. That might be a while, so I'd say split works fine for now. I also can't explain why it only happens in Checked (I'd expect such shutdown path to happen any time, so it makes little sense unless something in checked is what's causing the SIGBUS).

@mdh1418
Copy link
Member Author

mdh1418 commented Feb 24, 2022

/azp run runtime-extra-platforms

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@lateralusX
Copy link
Member

lateralusX commented Feb 24, 2022

@hoyosjs Could the signal be due to an alignment issue on that string (looks like it by just looking at the address in the callstack)?

ARM architectures handle unaligned reads, but it is possible to configure them to trigger faults on unaligned data access. I have run EventPipe and dotnet-trace on ARM64 hardware without issues, and we also have pass on CI on several ARM64 configurations, maybe we are doing something on this specific ARM configuration that disable unaligned data access that in turn will trigger the fault or the specific instruction selection in this case, LDRD?

Looking into the parsing rules of eventpipe_collect_tracing2_command_try_parse_payload I see that it ends up with an unaligned string buffer, eventpipe_collect_tracing_command_try_parse_rundown_requested will only read 1 byte, this happens only for the collect tracing 2 command, since collect tracing command didn't have the additional rundown request parameter, and I guess the old IPC library didn't use collect tracing 2 command, meaning CI never hit this code path in the past.

Looks like this has been an issue all along since we introduced the collect tracing 2 command, since the protocol fields are unaligned by design, but code would try to reuse unicode strings directly from buffer (without copy) but in the past it was probably never run on an ARM configuration that would trigger a SIGBUS on unaligned reads in the past.

I believe a fix to this issue would mean making temp alloc/memcpy of the string from the buffer and then convert it to utf8 and free the temp allocated string. I also believe we don't need to perform this for all our string parsing, just the once in collect tracing 2 parsing, or it will introduce additional temporary heap allocations for all our string parsing of IPC buffers.

@mdh1418 How can we leverage this PR to try out the fix? I guess we will need to get back so we have at least one test that reproduce the issue on ARM Linux, once that is done, I could push a fix towards this branch and we can see how it plays out.

@mdh1418
Copy link
Member Author

mdh1418 commented Feb 24, 2022

@lateralusX The projects are all duplicated, with one copy being disabled in issues.targets for Linux arm coreclr, and the other copy being disabled in issues.targets for not Linux arm coreclr. The only one not disabled in issues.targets is JIT/Directed/debugging/debuginfo/tester.cs because using a regex wildcard will also match the legacy shell script or whichever exact files are being excluded, so that is utilizing the CLRTestTargetUnsupported property in the test file itself.

I just removed one of the ExcludeList Items for bigevent on Linux arm CoreCLR if you'd want to try out the fix for the underlying issue.

Collect tracing 2 EventPipe command triggers an unaligned UTF16 string
read that could cause a SIGBUS on platforms not supporting unalinged
reads of UTF16 strings (so far only seen on 32-bit ARM Linux CI machine).

On CoreCLR this could even cause a unalinged int read due to
optimizations used in UTF8Encoding::GetByteCount.
@lateralusX
Copy link
Member

Pushed a fix that makes sure collect trace 2 command parsing logic won't end up in reading its provider name and config filter strings from an unaligned address. This happens since collect trace 2 payload added a bool (1 byte) in its payload and the parsing code would read the configuration like it did for regular collect trace command (reusing UTF16 string directly from payload). Due to the structure of the payload in collect trace 2 we will need to make a temporary copy of the payload strings into allocated memory (ending up in a correctly aligned buffer) and then do the conversion from UTF16 to UTF8 using a correctly aligned UTF16 buffer.

@lateralusX
Copy link
Member

lateralusX commented Feb 28, 2022

Looks like this is mainly triggered by ARM 32-bit optimization of:

int GetByteCount(WCHAR *chars, int count)

On 32-bit ARM, word size is 32-bit and it looks like its probably optimize this code:

ch = *(int*)pSrc;
into a LDRD instruction and that can't use an unaliged address and will trigger a SIGBUS. On ARM 64-bit word is 64-bit so the same code is not compiled into a LDRD instruction, so won't trigger a SIGBUS.

@lateralusX
Copy link
Member

lateralusX commented Feb 28, 2022

@mdh1418 Looks like fixing the unaligned read made CoreCLR Linux ARM test lane to pass on CI, https://github.com/dotnet/runtime/pull/64358/checks?check_run_id=5358747816.

So if that fix take care of your Linux ARM issues in this PR, then maybe we should remove the workaround to special handle ARM Linux using the old IPC client, cb2cacd

@mdh1418
Copy link
Member Author

mdh1418 commented Feb 28, 2022

Thanks @lateralusX !! I reverted the few commits that special cased duplicated the test projects for Linux arm CoreCLR. We'll see how CI goes 🤞

@mdh1418
Copy link
Member Author

mdh1418 commented Feb 28, 2022

/azp run runtime-extra-platforms

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@@ -306,6 +314,50 @@ ipc_message_flatten (
ep_exit_error_handler ();
}

static
bool
ipc_message_try_parse_string_utf16_t_byte_array (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lateralusX Is this mostly meant as a targeted fix? Feels odd we are only safekeeping this one case. I wonder if we can statically test for alignment on these reads or check dynamically if the platform supports it and perform copies when necessary. This also feels like a potential pain point for our portable builds (pretty much would for reads to always be aligned). The fix itself (copying) LGTM, just feels like we might be chasing bugs in this area for a bit.

Copy link
Member

@lateralusX lateralusX Mar 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First, fix was evaluating if it was the root cause of the ARM 32-bit issue seen on CoreCLR. Second I didn't want to switch to copy semantics in all cases, since it has been like that for a long time (inherited from C++ codebase) and its an optimization that can be used in most cases of current parsing logic (all except in the collect trace 2). Switching to copy semantics would also change the ownership rules of parsed UTF16 strings meaning we would need to do more adjustments in parsing logic artifacts that currently don't have ownership of the UTF16 strings. I added a runtime assert to make sure the version not doing a byte copy, ds_ipc_message_try_parse_string_utf16_t, will enforce that the passed in buffer is property aligned, in order to catch any future unaligned use. I also looked through all existing calls to ds_ipc_message_try_parse_string_utf16_t and they all currently parsed on aligned buffer address (except the fixed collect trace 2 parsing logic) The above function, ipc_message_try_parse_string_utf16_t_byte_array is also made static, so only called from within that source. Doing a copy if necessary logic would even complicate the ownership rules of these strings more.

Since the alignment is based on the parsing logic implemented in functions like eventpipe_collect_tracing2_command_try_parse_payload there is no static test, but as I said above, I added a runtime validation on debug builds, so as long as we have tests covering this (that we apparently didn't have until we updated the Diagnostic Client), we should hit assert on debug builds, if we add new payloads that could trigger the same issue in future.

The actual SIGBUS happening in this case is also rather interesting since you could argue that it is a compiler optimization bug. ARM supports unaligned reads for word and halfword instructions, but not for double word. The code does two word size reads inside UTF, that would have worked if optimizer didn't bundle them in to a LDRD instruction that will need address to be aligned, so optimization moved from a program that would have worked to one that triggers a SIGBUS.

So to sum up, current fix take care of the scenario currently hit with unaligned UTF16 string in IPC message payload. I added an assert to prevent future incorrect use of UTF16 string parsing function using unaligned addresses. Alternative would be to abandon idea to reuse UTF16 string from payload and always take copy, totally doable, but will add more changes to all current *CommandPayload structs, since they will need to switch to owning copy of UTF16 string and make sure they are freed as part of their fini/free implementations.

Copy link
Member

@hoyosjs hoyosjs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now the test exclusions and everything looks sane to me. Thanks for pushing through all the oddness here :)

@mdh1418
Copy link
Member Author

mdh1418 commented Mar 2, 2022

The failure on Libraries Test Run release coreclr windows x86 Release should be #65890
The failures on Build Browser wasm windows Release LibraryTests is #65978
The failures on Build Android x86 Release AllSubsets_Mono is #65958
The failures on Libraries Test Run release coreclr Linux arm Release is logged #66032
The failures on Libraries Test Run release coreclr windows arm64 Release are unnecessary runs, but the sheer amount of tests hits the timeout error, which leads to dead-lettering - https://github.com/dotnet/core-eng/issues/15700
The intended tests to run on Android (in the description 7.) are indeed passing with exit code 100

Thanks for all of your help! @hoyosjs @noahfalk @lateralusX

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants