Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix ELT profiler tests #44285

Merged
merged 4 commits into from
Nov 6, 2020
Merged

Fix ELT profiler tests #44285

merged 4 commits into from
Nov 6, 2020

Conversation

davmason
Copy link
Member

@davmason davmason commented Nov 5, 2020

Fixes #44055

Recently profiler/elt/slowpatheltenter/slowpatheltenter started failing. On investigation it was because a timer firing was triggering managed code to run as the runtime was shutting down, and we would occasionally try to use pCorProfilerInfo after it was set to NULL.

This PR add some synchronization so we bail on the ELT hooks once Shutdown is called. It also changes the output when a test fails, the previous wording was confusing because it implied PROFILER_TEST_PASSES in the output was the only thing necessary for a test to pass when we check the exit code of the process as well.

@davmason davmason added this to the 6.0.0 milestone Nov 5, 2020
@davmason davmason requested a review from a team November 5, 2020 10:22
@davmason davmason self-assigned this Nov 5, 2020
@ghost
Copy link

ghost commented Nov 5, 2020

Tagging subscribers to this area: @tommcdon
See info in area-owners.md if you want to be subscribed.

@stephentoub
Copy link
Member

Recently profiler/elt/slowpatheltenter/slowpatheltenter started failing

What changed? You mention a timer firing; it wasn't before?

I'm asking because we saw some other diagnostic failures start frequently around the same time, e.g. #44037, and I'm wondering if there might be a link.

Copy link
Contributor

@sdmaclea sdmaclea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

LogTestFailure($"Profilee returned exit code {process.ExitCode} instead of expected exit code 100.");
}

return 100;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you said in an earlier PR that the exit code didn't matter for profiler tests. It still would be nice to have a comment to that effect.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't remember exactly what I said, but I think I was trying to say that the exit code doesn't matter for that particular test. We do check the exit code, it's just that a lot of profiler tests don't use it to indicate failure. Some of them do, however, and crashes will return a non 100 exit code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I just wanted to make it clear to someone other than you that return 100; can happen even though the test failed. I just feels confusing to me. If we hadn't talked about it, I would have asked again.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I added a comment

Copy link
Member

@noahfalk noahfalk Nov 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that logging a failure message and then returning 100 feels counter intuitive. If the exit code matters I was expecting there would be some conditional logic or if it doesn't matter then perhaps we should return 0?

EDIT: nm, I see that LogTestFailure() also throws an exception so 100 is only what happens in the success case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I misunderstood what Steve was saying. As Noah points out in the edit above LogTestFailure intentionally throws an unhandled exception to fail the test. I can see how that is unclear to others though, since it just says log in the name

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Precisely unclear

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just changed it from LogTestFailure to FailFastWithMessage so hopefully its more clear just from the name

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

@davmason
Copy link
Member Author

davmason commented Nov 5, 2020

Recently profiler/elt/slowpatheltenter/slowpatheltenter started failing

What changed? You mention a timer firing; it wasn't before?

I shouldn't have worded it so strongly. I was inferring something changed due to it only recently starting to fail, but have no proof. These tests register enter/leave hooks through the profiler APIs, and recently they started failing where after the profiler shut down and cleaned up resources some background thread would still be calling managed code and the test would AV trying to use pCorProfilerInfo, which is now null.

When I debugged, it consistently hit in a managed timer callback but I never actually verified what code path was hitting it. For the purposes of this test the exact code that causes the failure is irrelevant, it has always been legal and possible for other threads to continue running during shutdown. This is one of those "I'm surprised this worked until now" moments.

I will take a look to see if there is an obvious change causing this, but I would be pretty surprised if this was related to the other failing tests you linked.

@davmason
Copy link
Member Author

davmason commented Nov 5, 2020

It fails before it gets to anything that actually tells me what timer is firing:

(lldb) clrstack
OS Thread Id: 0x1abfd9 (10)
        Child SP               IP Call Site
00007FFFF40624E0 00007ffff4684bfa [HelperMethodFrame: 00007ffff40624e0]
00007FFFF4062800 00007FFF7DFDD6A9 System.Threading.PortableThreadPool+GateThread.EnsureRunning(System.Threading.PortableThreadPool)
00007FFFF4062830 00007FFF7DFD57D9 System.Threading.PortableThreadPool.RequestWorker()
00007FFFF4062860 00007FFF7DFD121E System.Threading.ThreadPool.RequestWorkerThread()
00007FFFF40628F0 00007FFF7DFD06D6 System.Threading.ThreadPoolWorkQueue.EnqueueTimeSensitiveWorkItem(System.Threading.IThreadPoolWorkItem)
00007FFFF4062920 00007FFF7DFD05F5 System.Threading.ThreadPool.UnsafeQueueTimeSensitiveWorkItemInternal(System.Threading.IThreadPoolWorkItem)
00007FFFF4062950 00007FFF7DFBC850 System.Threading.ThreadPool.UnsafeQueueUnmanagedWorkItem(IntPtr, IntPtr)

But, although it mainly fails in the timer path it also randomly fails in other codepaths too. I don't know what changed but it's very unlikely that this is anything but a test bug.

@davmason
Copy link
Member Author

davmason commented Nov 6, 2020

It occurs to me that the reason we're seeing managed code executing at times when it wasn't before is the portable thread pool work that @kouvel enabled recently. Earlier I said I didn't know what was causing the timer, but I think the new thread pool implementation itself is causing it.

Copy link
Member

@noahfalk noahfalk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kouvel
Copy link
Member

kouvel commented Nov 6, 2020

It occurs to me that the reason we're seeing managed code executing at times when it wasn't before is the portable thread pool work that @kouvel enabled recently. Earlier I said I didn't know what was causing the timer, but I think the new thread pool implementation itself is causing it.

Tiered compilation's background work currently uses the thread pool, a timer is used to get a thread pool thread and some new managed code is triggered.

@davmason davmason merged commit ba52589 into dotnet:master Nov 6, 2020
tqiu8 pushed a commit to tqiu8/runtime that referenced this pull request Nov 9, 2020
author Stephen Toub <stoub@microsoft.com> 1604601164 -0500
committer Tammy Qiu <tammy.qiu@yahoo.com> 1604960878 -0500

Add stream conformance tests for TranscodingStream (dotnet#44248)

* Add stream conformance tests for TranscodingStream

* Special-case 0-length input buffers to TranscodingStream.Write{Async}

The base implementation of Encoder.Convert doesn't like empty inputs.  Regardless, if the input is empty, we can avoid a whole bunch of unnecessary work.

JIT: minor inliner refactoring (dotnet#44215)

Extract out the budget check logic so it can vary by inlining policy.
Use this to exempt the FullPolicy from budget checking.

Fix inline xml to dump the proper (full name) hash for inlinees.

Update range dumper to dump ranges in hex.

Remove unused QCall for WinRTSupported (dotnet#44278)

ConcurrentQueueSegment allows spinning threads to sleep. (dotnet#44265)

* Allow threads to sleep when ConcurrentQueue has many enqueuers/dequeuers.

* Update src/libraries/System.Private.CoreLib/src/System/Collections/Concurrent/ConcurrentQueueSegment.cs

Co-authored-by: Stephen Toub <stoub@microsoft.com>

* Apply suggestions from code review

Co-authored-by: Stephen Toub <stoub@microsoft.com>

Co-authored-by: AMD DAYTONA EPYC <amd@amd-DAYTONA-X0.com>
Co-authored-by: Stephen Toub <stoub@microsoft.com>

File.Exists() is not null when true (dotnet#44310)

* File.Exists() is not null when true

* Fix compile

* Fix compile 2

[master][watchOS] Add simwatch64 support (dotnet#44303)

Xcode 12.2 removed 32 bits support for watchOS simulators, this PR helps to fix xamarin/xamarin-macios#9949, we have tested the new binaries and they are working as expected

![unknown](https://user-images.githubusercontent.com/204671/98253709-64413200-1f49-11eb-9774-8c5aa416fc57.png)

Co-authored-by: dalexsoto <dalexsoto@users.noreply.github.com>

Implementing support to Debugger::Break. (dotnet#44305)

Set fgOptimizedFinally flag correctly (dotnet#44268)

- Initialize to 0 at compiler startup
- Set flag when finally cloning optimization kicks in

Fixes non-deterministic generation of nop opcodes into ARM32 code

Forbid `- byref cnst` -> `+ (byref -cnst)` transformation. (dotnet#44266)

* Add a repro test.

* Forbid the transformation for byrefs.

* Update src/coreclr/src/jit/morph.cpp

Co-authored-by: Andy Ayers <andya@microsoft.com>

* Update src/coreclr/src/jit/morph.cpp

* Fix the test return value.

WriteLine is just to make sure we don't delete the value.

* improve the test.

avoid a possible overflow and don't waste time on printing.

Co-authored-by: Andy Ayers <andya@microsoft.com>

Pick libmonosgen-2.0.so from cmake install directory instead of .libs (dotnet#44291)

This aligns Linux with what we already do for all the other platforms.

Update SharedPerformanceCounter assert (dotnet#44333)

Remove silly ToString in GetCLRInstanceString (dotnet#44335)

Use targetPlatformMoniker for net5.0 and newer tfms (dotnet#43965)

* Use targetPlatformMoniker for net5.0 and newer tfms

* disabling analyzer, update version to 0.0, and use new format.

* update the targetFramework.sdk

* removing supportedOS assembly level attribute

* fix linker errors and addressing feedback

* making _TargetFrameworkWithoutPlatform as private

[sgen] Add Ward annotations to sgen_get_total_allocated_bytes (dotnet#43833)

Attempt to fix https://jenkins.mono-project.com/job/test-mono-mainline-staticanalysis/

Co-authored-by: lambdageek <lambdageek@users.noreply.github.com>

[tests] Re-enable tests fixed by dotnet#44081 (dotnet#44212)

Fixes
mono/mono#15030 and
fixes mono/mono#15031 and
fixes mono/mono#15032

Add an implicit argument coercion check. (dotnet#43386)

* Add `impCheckImplicitArgumentCoercion`.

* Fix tests with type mismatch.

* Try to fix VM signature.

* Allow to pass byref as native int.

* another fix.

* Fix another IL test.

[mono] Change CMakelists.txt "python" -> Python3_EXECUTABLE (dotnet#44340)

Debian doesn't install a "python" binary for python3.

Tweak StreamConformanceTests for cancellation (dotnet#44342)

- Avoid unnecessary timers
- Separate tests for precancellation, ReadAsync(byte[], ...) cancellation, and ReadAsync(Memory, ...) cancellation

Use Dictionary for underlying cache of ResourceSet (dotnet#44104)

Simplify catch-rethrow logic in NetworkStream (dotnet#44246)

A follow-up on dotnet#40772 (comment), simplifies and harmonizes the way we wrap exceptions into IOException. Having one catch block working with System.Exception seems to be enough here, no need for specific handling of SocketException.

Simple GT_NEG optimization for dotnet#13837 (dotnet#43921)

* Simple arithmetic optimization with GT_NEG

* Skip GT_NEG optimization when an operand is constant. Revert bitwise rotation pattern

* Fixed Value Numbering assert

* Cleaned up code and comments for simple GT_NEG optimization

* Formatting

Co-authored-by: Julie Lee <jeonlee@microsoft.com>

[master] Update dependencies from mono/linker (dotnet#44322)

* Update dependencies from https://github.com/mono/linker build 20201105.1

Microsoft.NET.ILLink.Tasks
 From Version 6.0.0-alpha.1.20527.2 -> To Version 6.0.0-alpha.1.20555.1

* Update dependencies from https://github.com/mono/linker build 20201105.2

Microsoft.NET.ILLink.Tasks
 From Version 6.0.0-alpha.1.20527.2 -> To Version 6.0.0-alpha.1.20555.2

* Disable new optimization for libraries mode (it cannot work in this mode)

Co-authored-by: dotnet-maestro[bot] <dotnet-maestro[bot]@users.noreply.github.com>
Co-authored-by: Marek Safar <marek.safar@gmail.com>

Tighten argument validation in StreamConformanceTests (dotnet#44326)

Add threshold on number of files / partition in SPMI collection (dotnet#44180)

* Add check for files count

* Fix the OS check

* decrese file limit to 1500:

* misc fix

* Do not upload to azure if mch files are zero size

Fix ELT profiler tests (dotnet#44285)

[master] Update dependencies from dotnet/arcade dotnet/llvm-project dotnet/icu (dotnet#44336)

[master] Update dependencies from dotnet/arcade dotnet/llvm-project dotnet/icu

 - Merge branch 'master' into darc-master-2211df94-2a02-4c3c-abe1-e3534e896267

Fix Send_TimeoutResponseContent_Throws (dotnet#44356)

If the client times out too quickly, the server may never have a connection to accept and will hang forever.

Match CoreCLR behaviour on thread start failure (dotnet#44124)

Co-authored-by: Aleksey Kliger (λgeek) <akliger@gmail.com>

Add slash in Windows SoD tool build (dotnet#44359)

* Add slash in Windows SoD tool build

* Update SoD search path to match output dir

* Fixup dotnet version

* Remove merge commit headers

* Disable PRs

Co-authored-by: Drew Scoggins <andrew.g.scoggins@gmail>

Reflect test path changes in .gitattributes; remove nonexistent files (dotnet#44371)

Bootstrapping a test for R2RDump (dotnet#42150)

Improve performance of Enum's generic IsDefined / GetName / GetNames (dotnet#44355)

Eliminates the boxing in IsDefined/GetName/GetValues, and in GetNames avoids having to go through RuntimeType's GetEnumNames override.

clarify http version test (dotnet#44379)

Co-authored-by: Geoffrey Kizer <geoffrek@windows.microsoft.com>

Update dependencies from https://github.com/mono/linker build 20201106.1 (dotnet#44367)

Microsoft.NET.ILLink.Tasks
 From Version 6.0.0-alpha.1.20555.2 -> To Version 6.0.0-alpha.1.20556.1

Co-authored-by: dotnet-maestro[bot] <dotnet-maestro[bot]@users.noreply.github.com>

Disable RunThreadLocalTest8_Values on Mono (dotnet#44357)

* Disable RunThreadLocalTest8_Values on Mono

It's failing on SLES

* fix typo

LongProcessNamesAreSupported: make test work on distros where sleep is a symlink/script (dotnet#44299)

* LongProcessNamesAreSupported: make test work on distros where sleep is a symlink/script

* PR feedback

Co-authored-by: Stephen Toub <stoub@microsoft.com>

* fix compilation

Co-authored-by: Stephen Toub <stoub@microsoft.com>

add missing constructor overloads (dotnet#44380)

Co-authored-by: Geoffrey Kizer <geoffrek@windows.microsoft.com>

change using in ConnectCallback_UseUnixDomainSocket_Success (dotnet#44366)

Clean up the samples (dotnet#44293)

Update dotnet/roslyn issue link

Delete stale comment about dotnet/roslyn#30797

Fix/remove TODO-NULLABLEs (dotnet#44300)

* Fix/remove TODO-NULLABLEs

* remove redundant !

* apply Jozkee's feedback

* address feedback

Update glossary (dotnet#44274)

Co-authored-by: Juan Hoyos <juan.hoyos@microsoft.com>
Co-authored-by: Stephen Toub <stoub@microsoft.com>
Co-authored-by: Günther Foidl <gue@korporal.at>

Add files need for wasm executable relinking/aot to the wasm runtime pack. (dotnet#43785)

Co-authored-by: Alexander Köplinger <alex.koeplinger@outlook.com>

Move some more UnmanagedCallersOnly tests to IL now that they're invalid C# (dotnet#43366)

Fix C++ build for mono/metadata/threads.c (dotnet#44413)

`throw` is a reserved keyword in C++.

Disable a failing test. (dotnet#44404)

Change async void System.Text.Json test to be async Task (dotnet#44418)

Improve crossgen2 comparison jobs (dotnet#44119)

- Fix compilation on unix platforms
  - Wrap use of wildcard in quotes
- Print better display name into log
- Fix X86 constant comparison handling
- Add ability to compile specific overload via single method switches

Remove some unnecessary GetTypeInfo usage (dotnet#44414)

Fix MarshalTypedArrayByte and re-enable it. Re-enable TestFunctionApply
@ghost ghost locked as resolved and limited conversation to collaborators Dec 6, 2020
@davmason davmason deleted the osx_failure branch January 20, 2021 08:58
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Test failure: profiler/elt/slowpatheltenter/slowpatheltenter.sh
5 participants