Skip to content
This repository has been archived by the owner on Dec 18, 2018. It is now read-only.

Flaky Test: Hang in Libuv and Sockets FunctionalTests/netcoreapp2.1 #2503

Closed
mikeharder opened this issue Apr 18, 2018 · 10 comments
Closed

Flaky Test: Hang in Libuv and Sockets FunctionalTests/netcoreapp2.1 #2503

mikeharder opened this issue Apr 18, 2018 · 10 comments
Assignees
Milestone

Comments

@mikeharder
Copy link
Contributor

mikeharder commented Apr 18, 2018

Kestrel FunctionalTests are hanging frequently on CI:

KestrelHttpServer

[17:46:13] :	 [Step 4/7]   Starting test group: Libuv.FunctionalTests/netcoreapp2.1
[17:46:13] :	 [Step 4/7]   Microsoft (R) Test Execution Command Line Tool Version 15.7.0-preview-20180221-13
[17:46:13] :	 [Step 4/7]   Copyright (c) Microsoft Corporation.  All rights reserved.
[17:46:13] :	 [Step 4/7]   
[17:46:14] :	 [Step 4/7]   Starting test execution, please wait...
[17:46:14] :	 [Step 4/7]   Logging Vstest Diagnostics in file: C:\b\w\1e8cd6334ef22651\artifacts\logs\Libuv.FunctionalTests-netcoreapp2.1-17674.diag
[17:46:15] :	 [Step 4/7]   [xUnit.net 00:00:00.7203468]     RegisterAddresses_IPv6_Success [SKIP]
[17:46:15] :	 [Step 4/7]   Skipped  RegisterAddresses_IPv6_Success
[17:46:15] :	 [Step 4/7]   [xUnit.net 00:00:01.2412585]     ConnectionCountingReturnsToZero [SKIP]
[17:46:15] :	 [Step 4/7]   Skipped  ConnectionCountingReturnsToZero
[17:46:15] :	 [Step 4/7]   [xUnit.net 00:00:01.3837697]     CanListenToOpenTcpSocketHandle [SKIP]
[17:46:15] :	 [Step 4/7]   Skipped  CanListenToOpenTcpSocketHandle
[17:46:16] :	 [Step 4/7]   [xUnit.net 00:00:02.0159475]     RegisterAddresses_IPv4Port80_Success(addressInput: "http://*", testUrl: "http://127.0.0.1") [SKIP]
[17:46:16] :	 [Step 4/7]   [xUnit.net 00:00:02.0164664]     RegisterAddresses_IPv4Port80_Success(addressInput: "http://127.0.0.1", testUrl: "http://127.0.0.1") [SKIP]
[17:46:16] :	 [Step 4/7]   [xUnit.net 00:00:02.0178708]     RegisterAddresses_IPv4Port80_Success(addressInput: "http://localhost", testUrl: "http://127.0.0.1") [SKIP]
[17:46:16] :	 [Step 4/7]   Skipped  RegisterAddresses_IPv4Port80_Success(addressInput: "http://*", testUrl: "http://127.0.0.1")
[17:46:16] :	 [Step 4/7]   Skipped  RegisterAddresses_IPv4Port80_Success(addressInput: "http://127.0.0.1", testUrl: "http://127.0.0.1")
[17:46:16] :	 [Step 4/7]   Skipped  RegisterAddresses_IPv4Port80_Success(addressInput: "http://localhost", testUrl: "http://127.0.0.1")
[17:46:16] :	 [Step 4/7]   [xUnit.net 00:00:02.2575826]     ConnectionClosedEvenIfAppSwallowsException [SKIP]
[17:46:16] :	 [Step 4/7]   Skipped  ConnectionClosedEvenIfAppSwallowsException
[17:46:18] :	 [Step 4/7]   [xUnit.net 00:00:03.6803851]     RegisterAddresses_IPv6Port80_Success(addressInput: "http://localhost", testUrls: ["http://127.0.0.1/", "http://[::1]/"]) [SKIP]
[17:46:18] :	 [Step 4/7]   [xUnit.net 00:00:03.6808460]     RegisterAddresses_IPv6Port80_Success(addressInput: "http://*", testUrls: ["http://[::1]/"]) [SKIP]
[17:46:18] :	 [Step 4/7]   [xUnit.net 00:00:03.6812519]     RegisterAddresses_IPv6Port80_Success(addressInput: "http://[::1]", testUrls: ["http://[::1]/"]) [SKIP]
[17:46:18] :	 [Step 4/7]   Skipped  RegisterAddresses_IPv6Port80_Success(addressInput: "http://localhost", testUrls: ["http://127.0.0.1/", "http://[::1]/"])
[17:46:18] :	 [Step 4/7]   Skipped  RegisterAddresses_IPv6Port80_Success(addressInput: "http://*", testUrls: ["http://[::1]/"])
[17:46:18] :	 [Step 4/7]   Skipped  RegisterAddresses_IPv6Port80_Success(addressInput: "http://[::1]", testUrls: ["http://[::1]/"])
[18:39:33]E:	 [Step 4/7] The build Projects::KestrelHttpServer #17674 {buildId=449083} has been running for more than 60 minutes. Terminating...

http://aspnetci/viewLog.html?buildId=449083&buildTypeId=Lite_KestrelHttpServer&tab=buildLog&_focus=39538

[14:35:15] :	 [Step 4/7]   Starting test group: Libuv.FunctionalTests/netcoreapp2.1
[14:35:15] :	 [Step 4/7] Microsoft (R) Test Execution Command Line Tool Version 15.7.0-preview-20180221-13
[14:35:15] :	 [Step 4/7] Copyright (c) Microsoft Corporation.  All rights reserved.
[14:35:15] :	 [Step 4/7] 
[14:35:15] :	 [Step 4/7] Starting test execution, please wait...
[14:35:15] :	 [Step 4/7] Logging Vstest Diagnostics in file: D:\b\w\1e8cd6334ef22651\artifacts\logs\Libuv.FunctionalTests-netcoreapp2.1-17658.diag
[14:35:17] :	 [Step 4/7] [xUnit.net 00:00:00.9797507]     CanListenToOpenTcpSocketHandle [SKIP]
[14:35:17] :	 [Step 4/7] Skipped  CanListenToOpenTcpSocketHandle
[14:35:17] :	 [Step 4/7] [xUnit.net 00:00:01.1545963]     ConnectionClosedEvenIfAppSwallowsException [SKIP]
[14:35:17] :	 [Step 4/7] Skipped  ConnectionClosedEvenIfAppSwallowsException
[14:35:18] :	 [Step 4/7] [xUnit.net 00:00:01.9128535]     ConnectionCountingReturnsToZero [SKIP]
[14:35:18] :	 [Step 4/7] Skipped  ConnectionCountingReturnsToZero
[14:35:19] :	 [Step 4/7] [xUnit.net 00:00:03.0442866]     RegisterAddresses_IPv6Port80_Success(addressInput: "http://[::1]", testUrls: ["http://[::1]/"]) [SKIP]
[14:35:19] :	 [Step 4/7] [xUnit.net 00:00:03.0448370]     RegisterAddresses_IPv6Port80_Success(addressInput: "http://*", testUrls: ["http://[::1]/"]) [SKIP]
[14:35:19] :	 [Step 4/7] Skipped  RegisterAddresses_IPv6Port80_Success(addressInput: "http://[::1]", testUrls: ["http://[::1]/"])
[14:35:19] :	 [Step 4/7] Skipped  RegisterAddresses_IPv6Port80_Success(addressInput: "http://*", testUrls: ["http://[::1]/"])
[14:35:19] :	 [Step 4/7] [xUnit.net 00:00:03.0461888]     RegisterAddresses_IPv6Port80_Success(addressInput: "http://localhost", testUrls: ["http://127.0.0.1/", "http://[::1]/"]) [SKIP]
[14:35:19] :	 [Step 4/7] Skipped  RegisterAddresses_IPv6Port80_Success(addressInput: "http://localhost", testUrls: ["http://127.0.0.1/", "http://[::1]/"])
[14:35:19] :	 [Step 4/7] [xUnit.net 00:00:03.0852291]     RegisterAddresses_IPv4Port80_Success(addressInput: "http://127.0.0.1", testUrl: "http://127.0.0.1") [SKIP]
[14:35:19] :	 [Step 4/7] [xUnit.net 00:00:03.0856344]     RegisterAddresses_IPv4Port80_Success(addressInput: "http://*", testUrl: "http://127.0.0.1") [SKIP]
[14:35:19] :	 [Step 4/7] [xUnit.net 00:00:03.0858760]     RegisterAddresses_IPv4Port80_Success(addressInput: "http://localhost", testUrl: "http://127.0.0.1") [SKIP]
[14:35:19] :	 [Step 4/7] Skipped  RegisterAddresses_IPv4Port80_Success(addressInput: "http://127.0.0.1", testUrl: "http://127.0.0.1")
[14:35:19] :	 [Step 4/7] Skipped  RegisterAddresses_IPv4Port80_Success(addressInput: "http://*", testUrl: "http://127.0.0.1")
[14:35:19] :	 [Step 4/7] Skipped  RegisterAddresses_IPv4Port80_Success(addressInput: "http://localhost", testUrl: "http://127.0.0.1")
[14:35:20] :	 [Step 4/7] [xUnit.net 00:00:03.6085756]     RegisterAddresses_IPv6_Success [SKIP]
[14:35:20] :	 [Step 4/7] Skipped  RegisterAddresses_IPv6_Success
[15:28:52]E:	 [Step 4/7] The build Projects::KestrelHttpServer #17658 {buildId=445522} has been running for more than 60 minutes. Terminating...
[15:28:52]E:	 [Step 4/7] Execution timeout

http://aspnetci/viewLog.html?buildId=445522&buildTypeId=Lite_KestrelHttpServer&tab=buildLog

Win2012-Kestrel

[23:37:09] :     [Step 8/9]   Skipped  RegisterAddresses_IPv6Port80_Success(addressInput: "http://localhost", testUrls: ["http://127.0.0.1/", "http://[::1]/"])
[23:37:09] :     [Step 8/9]   Skipped  RegisterAddresses_IPv6Port80_Success(addressInput: "http://[::1]", testUrls: ["http://[::1]/"])
[00:50:45]E:     [Step 8/9] The build Test::Win2012-Kestrel #2.2.0-preview1-34007 {buildId=449994} has been running for more than 120 minutes. Terminating...
[00:50:45]E:     [Step 8/9] Execution timeout

http://aspnetci/viewLog.html?buildId=449994&tab=buildLog&buildTypeId=Lite_KestrelStressTest&logTab=tail

@mikeharder mikeharder added this to the 2.2.0-mq milestone Apr 18, 2018
@mikeharder mikeharder self-assigned this Apr 18, 2018
@mikeharder
Copy link
Contributor Author

PR to enable diagnostic messages, which should tell us which specific test is hanging: #2505

mikeharder added a commit that referenced this issue Apr 19, 2018
- Required to log long-running tests
- Will help investigate #2503
@mikeharder mikeharder changed the title Flaky Test: Hang during FunctionalTests Flaky Test: Hang in Libuv.FunctionalTests/netcoreapp2.1 Apr 19, 2018
@mikeharder mikeharder added the blocked Blocked label Apr 19, 2018
@mikeharder
Copy link
Contributor Author

I was able to get a local repro and capture dumps, and the root cause appears to be in the CLR (specifically the GC). Blocked waiting for assistance from the CLR.

@mikeharder
Copy link
Contributor Author

We believe the root cause is managed heap corruption. The ASP.NET team will attempt to find the root cause, which could be in either ASP.NET or CLR code.

A separate issue is that once the managed heap is corrupted, fatal error reporting can trigger a deadlock in the GC: https://github.com/dotnet/coreclr/issues/17679.

Repros

Kestrel.Transport.Libuv.FunctionalTests\bin\Release\netcoreapp2.1

http://aspnetci/viewLog.html?buildId=450255&buildTypeId=Lite_KestrelStressTest&tab=buildLog

Kestrel.Transport.Sockets.FunctionalTests\bin\Release\netcoreapp2.1

http://aspnetci/viewLog.html?buildId=451473&buildTypeId=Lite_KestrelStressTest&tab=buildLog
http://aspnetci/viewLog.html?buildId=450918&buildTypeId=Lite_KestrelStressTest&tab=buildLog

@mikeharder mikeharder changed the title Flaky Test: Hang in Libuv.FunctionalTests/netcoreapp2.1 Flaky Test: Hang in Libuv and Sockets FunctionalTests/netcoreapp2.1 Apr 19, 2018
@mikeharder
Copy link
Contributor Author

I have captured two dumps from hung tests (both in Libuv.FunctionalTests), and both were in the MaxRequestBufferSizeTests.LargeUpload test case. I haven't yet captured a dump from hung tests in Sockets.FunctionalTests.

@mikeharder mikeharder modified the milestones: 2.2.0-mq, 2.1.0-rc1 Apr 19, 2018
@mikeharder
Copy link
Contributor Author

According to CI logs, the first instance of this hang was 4/3/18 2:40 PM:

http://aspnetci/viewLog.html?buildId=439512&tab=buildResultsDiv&buildTypeId=Lite_KestrelStressTest

Earlier this day, the NETCoreApp dependency was updated from NETCoreApp 2.1.0-preview2-26330-03 to 2.1.0-preview3-26331-01:

aspnet/Universe@f828c9d#diff-30dcc9789f8ea3c7632e4f054e60c072

It's likely the hang is related to the dependency update, either as a bug in the dependency itself, or a bug in ASP.NET that is exposed by the new dependency.

@jkotas
Copy link

jkotas commented Apr 21, 2018

This issue was moved to dotnet/coreclr#17716

@mikeharder
Copy link
Contributor Author

Keeping this issue open until the fix is verified in a prodcon build by ASP.NET.

@sywhang
Copy link

sywhang commented Apr 23, 2018

@mikeharder Is there any way to run the CI run with private builds of coreclr?
The coreclr side issue was fixed with dotnet/coreclr#17710, but is not merged into 2.1 yet because we need to verify the fix addresses this CI hang issue. It would be helpful to get the fix approved for 2.1 if you can verify this fix addresses this issue using the latest master branch of coreclr.

@mikeharder
Copy link
Contributor Author

@sywhang: I verified the Kestrel test hang is fixed when using a private build of the GC with the fix (added you to mail thread). Do you need any further verification from ASP.NET?

@mikeharder
Copy link
Contributor Author

Verified the issue repros in build 2.1.300-preview3-008646 and is fixed in build 2.1.300-rc1-008662.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

4 participants