Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GC crash in Kestrel Libuv and Sockets FunctionalTests/netcoreapp2.1 #10215

Closed
jkotas opened this issue Apr 21, 2018 · 6 comments
Closed

GC crash in Kestrel Libuv and Sockets FunctionalTests/netcoreapp2.1 #10215

jkotas opened this issue Apr 21, 2018 · 6 comments
Assignees
Milestone

Comments

@jkotas
Copy link
Member

jkotas commented Apr 21, 2018

From @mikeharder on April 18, 2018 18:42

Kestrel FunctionalTests are hanging frequently on CI:

KestrelHttpServer

[17:46:13] :	 [Step 4/7]   Starting test group: Libuv.FunctionalTests/netcoreapp2.1
[17:46:13] :	 [Step 4/7]   Microsoft (R) Test Execution Command Line Tool Version 15.7.0-preview-20180221-13
[17:46:13] :	 [Step 4/7]   Copyright (c) Microsoft Corporation.  All rights reserved.
[17:46:13] :	 [Step 4/7]   
[17:46:14] :	 [Step 4/7]   Starting test execution, please wait...
[17:46:14] :	 [Step 4/7]   Logging Vstest Diagnostics in file: C:\b\w\1e8cd6334ef22651\artifacts\logs\Libuv.FunctionalTests-netcoreapp2.1-17674.diag
[17:46:15] :	 [Step 4/7]   [xUnit.net 00:00:00.7203468]     RegisterAddresses_IPv6_Success [SKIP]
[17:46:15] :	 [Step 4/7]   Skipped  RegisterAddresses_IPv6_Success
[17:46:15] :	 [Step 4/7]   [xUnit.net 00:00:01.2412585]     ConnectionCountingReturnsToZero [SKIP]
[17:46:15] :	 [Step 4/7]   Skipped  ConnectionCountingReturnsToZero
[17:46:15] :	 [Step 4/7]   [xUnit.net 00:00:01.3837697]     CanListenToOpenTcpSocketHandle [SKIP]
[17:46:15] :	 [Step 4/7]   Skipped  CanListenToOpenTcpSocketHandle
[17:46:16] :	 [Step 4/7]   [xUnit.net 00:00:02.0159475]     RegisterAddresses_IPv4Port80_Success(addressInput: "http://*", testUrl: "http://127.0.0.1") [SKIP]
[17:46:16] :	 [Step 4/7]   [xUnit.net 00:00:02.0164664]     RegisterAddresses_IPv4Port80_Success(addressInput: "http://127.0.0.1", testUrl: "http://127.0.0.1") [SKIP]
[17:46:16] :	 [Step 4/7]   [xUnit.net 00:00:02.0178708]     RegisterAddresses_IPv4Port80_Success(addressInput: "http://localhost", testUrl: "http://127.0.0.1") [SKIP]
[17:46:16] :	 [Step 4/7]   Skipped  RegisterAddresses_IPv4Port80_Success(addressInput: "http://*", testUrl: "http://127.0.0.1")
[17:46:16] :	 [Step 4/7]   Skipped  RegisterAddresses_IPv4Port80_Success(addressInput: "http://127.0.0.1", testUrl: "http://127.0.0.1")
[17:46:16] :	 [Step 4/7]   Skipped  RegisterAddresses_IPv4Port80_Success(addressInput: "http://localhost", testUrl: "http://127.0.0.1")
[17:46:16] :	 [Step 4/7]   [xUnit.net 00:00:02.2575826]     ConnectionClosedEvenIfAppSwallowsException [SKIP]
[17:46:16] :	 [Step 4/7]   Skipped  ConnectionClosedEvenIfAppSwallowsException
[17:46:18] :	 [Step 4/7]   [xUnit.net 00:00:03.6803851]     RegisterAddresses_IPv6Port80_Success(addressInput: "http://localhost", testUrls: ["http://127.0.0.1/", "http://[::1]/"]) [SKIP]
[17:46:18] :	 [Step 4/7]   [xUnit.net 00:00:03.6808460]     RegisterAddresses_IPv6Port80_Success(addressInput: "http://*", testUrls: ["http://[::1]/"]) [SKIP]
[17:46:18] :	 [Step 4/7]   [xUnit.net 00:00:03.6812519]     RegisterAddresses_IPv6Port80_Success(addressInput: "http://[::1]", testUrls: ["http://[::1]/"]) [SKIP]
[17:46:18] :	 [Step 4/7]   Skipped  RegisterAddresses_IPv6Port80_Success(addressInput: "http://localhost", testUrls: ["http://127.0.0.1/", "http://[::1]/"])
[17:46:18] :	 [Step 4/7]   Skipped  RegisterAddresses_IPv6Port80_Success(addressInput: "http://*", testUrls: ["http://[::1]/"])
[17:46:18] :	 [Step 4/7]   Skipped  RegisterAddresses_IPv6Port80_Success(addressInput: "http://[::1]", testUrls: ["http://[::1]/"])
[18:39:33]E:	 [Step 4/7] The build Projects::KestrelHttpServer dotnet/runtime#10196 {buildId=449083} has been running for more than 60 minutes. Terminating...

http://aspnetci/viewLog.html?buildId=449083&buildTypeId=Lite_KestrelHttpServer&tab=buildLog&_focus=39538

[14:35:15] :	 [Step 4/7]   Starting test group: Libuv.FunctionalTests/netcoreapp2.1
[14:35:15] :	 [Step 4/7] Microsoft (R) Test Execution Command Line Tool Version 15.7.0-preview-20180221-13
[14:35:15] :	 [Step 4/7] Copyright (c) Microsoft Corporation.  All rights reserved.
[14:35:15] :	 [Step 4/7] 
[14:35:15] :	 [Step 4/7] Starting test execution, please wait...
[14:35:15] :	 [Step 4/7] Logging Vstest Diagnostics in file: D:\b\w\1e8cd6334ef22651\artifacts\logs\Libuv.FunctionalTests-netcoreapp2.1-17658.diag
[14:35:17] :	 [Step 4/7] [xUnit.net 00:00:00.9797507]     CanListenToOpenTcpSocketHandle [SKIP]
[14:35:17] :	 [Step 4/7] Skipped  CanListenToOpenTcpSocketHandle
[14:35:17] :	 [Step 4/7] [xUnit.net 00:00:01.1545963]     ConnectionClosedEvenIfAppSwallowsException [SKIP]
[14:35:17] :	 [Step 4/7] Skipped  ConnectionClosedEvenIfAppSwallowsException
[14:35:18] :	 [Step 4/7] [xUnit.net 00:00:01.9128535]     ConnectionCountingReturnsToZero [SKIP]
[14:35:18] :	 [Step 4/7] Skipped  ConnectionCountingReturnsToZero
[14:35:19] :	 [Step 4/7] [xUnit.net 00:00:03.0442866]     RegisterAddresses_IPv6Port80_Success(addressInput: "http://[::1]", testUrls: ["http://[::1]/"]) [SKIP]
[14:35:19] :	 [Step 4/7] [xUnit.net 00:00:03.0448370]     RegisterAddresses_IPv6Port80_Success(addressInput: "http://*", testUrls: ["http://[::1]/"]) [SKIP]
[14:35:19] :	 [Step 4/7] Skipped  RegisterAddresses_IPv6Port80_Success(addressInput: "http://[::1]", testUrls: ["http://[::1]/"])
[14:35:19] :	 [Step 4/7] Skipped  RegisterAddresses_IPv6Port80_Success(addressInput: "http://*", testUrls: ["http://[::1]/"])
[14:35:19] :	 [Step 4/7] [xUnit.net 00:00:03.0461888]     RegisterAddresses_IPv6Port80_Success(addressInput: "http://localhost", testUrls: ["http://127.0.0.1/", "http://[::1]/"]) [SKIP]
[14:35:19] :	 [Step 4/7] Skipped  RegisterAddresses_IPv6Port80_Success(addressInput: "http://localhost", testUrls: ["http://127.0.0.1/", "http://[::1]/"])
[14:35:19] :	 [Step 4/7] [xUnit.net 00:00:03.0852291]     RegisterAddresses_IPv4Port80_Success(addressInput: "http://127.0.0.1", testUrl: "http://127.0.0.1") [SKIP]
[14:35:19] :	 [Step 4/7] [xUnit.net 00:00:03.0856344]     RegisterAddresses_IPv4Port80_Success(addressInput: "http://*", testUrl: "http://127.0.0.1") [SKIP]
[14:35:19] :	 [Step 4/7] [xUnit.net 00:00:03.0858760]     RegisterAddresses_IPv4Port80_Success(addressInput: "http://localhost", testUrl: "http://127.0.0.1") [SKIP]
[14:35:19] :	 [Step 4/7] Skipped  RegisterAddresses_IPv4Port80_Success(addressInput: "http://127.0.0.1", testUrl: "http://127.0.0.1")
[14:35:19] :	 [Step 4/7] Skipped  RegisterAddresses_IPv4Port80_Success(addressInput: "http://*", testUrl: "http://127.0.0.1")
[14:35:19] :	 [Step 4/7] Skipped  RegisterAddresses_IPv4Port80_Success(addressInput: "http://localhost", testUrl: "http://127.0.0.1")
[14:35:20] :	 [Step 4/7] [xUnit.net 00:00:03.6085756]     RegisterAddresses_IPv6_Success [SKIP]
[14:35:20] :	 [Step 4/7] Skipped  RegisterAddresses_IPv6_Success
[15:28:52]E:	 [Step 4/7] The build Projects::KestrelHttpServer dotnet/coreclr#17658 {buildId=445522} has been running for more than 60 minutes. Terminating...
[15:28:52]E:	 [Step 4/7] Execution timeout

http://aspnetci/viewLog.html?buildId=445522&buildTypeId=Lite_KestrelHttpServer&tab=buildLog

Win2012-Kestrel

[23:37:09] :     [Step 8/9]   Skipped  RegisterAddresses_IPv6Port80_Success(addressInput: "http://localhost", testUrls: ["http://127.0.0.1/", "http://[::1]/"])
[23:37:09] :     [Step 8/9]   Skipped  RegisterAddresses_IPv6Port80_Success(addressInput: "http://[::1]", testUrls: ["http://[::1]/"])
[00:50:45]E:     [Step 8/9] The build Test::Win2012-Kestrel dotnet/coreclr#2.2.0-preview1-34007 {buildId=449994} has been running for more than 120 minutes. Terminating...
[00:50:45]E:     [Step 8/9] Execution timeout

http://aspnetci/viewLog.html?buildId=449994&tab=buildLog&buildTypeId=Lite_KestrelStressTest&logTab=tail

Copied from original issue: aspnet/KestrelHttpServer#2503

@jkotas
Copy link
Member Author

jkotas commented Apr 21, 2018

From @mikeharder on April 18, 2018 23:47

PR to enable diagnostic messages, which should tell us which specific test is hanging: aspnet/KestrelHttpServer#2505

@jkotas
Copy link
Member Author

jkotas commented Apr 21, 2018

From @mikeharder on April 19, 2018 19:25

I was able to get a local repro and capture dumps, and the root cause appears to be in the CLR (specifically the GC). Blocked waiting for assistance from the CLR.

@jkotas
Copy link
Member Author

jkotas commented Apr 21, 2018

From @mikeharder on April 19, 2018 21:44

We believe the root cause is managed heap corruption. The ASP.NET team will attempt to find the root cause, which could be in either ASP.NET or CLR code.

A separate issue is that once the managed heap is corrupted, fatal error reporting can trigger a deadlock in the GC: https://github.com/dotnet/coreclr/issues/17679.

Repros

Kestrel.Transport.Libuv.FunctionalTests\bin\Release\netcoreapp2.1

http://aspnetci/viewLog.html?buildId=450255&buildTypeId=Lite_KestrelStressTest&tab=buildLog

Kestrel.Transport.Sockets.FunctionalTests\bin\Release\netcoreapp2.1

http://aspnetci/viewLog.html?buildId=451473&buildTypeId=Lite_KestrelStressTest&tab=buildLog
http://aspnetci/viewLog.html?buildId=450918&buildTypeId=Lite_KestrelStressTest&tab=buildLog

@jkotas
Copy link
Member Author

jkotas commented Apr 21, 2018

From @mikeharder on April 19, 2018 21:49

I have captured two dumps from hung tests (both in Libuv.FunctionalTests), and both were in the MaxRequestBufferSizeTests.LargeUpload test case. I haven't yet captured a dump from hung tests in Sockets.FunctionalTests.

@jkotas
Copy link
Member Author

jkotas commented Apr 21, 2018

From @mikeharder on April 19, 2018 23:28

According to CI logs, the first instance of this hang was 4/3/18 2:40 PM:

http://aspnetci/viewLog.html?buildId=439512&tab=buildResultsDiv&buildTypeId=Lite_KestrelStressTest

Earlier this day, the NETCoreApp dependency was updated from NETCoreApp 2.1.0-preview2-26330-03 to 2.1.0-preview3-26331-01:

aspnet/Universe@f828c9d#diff-30dcc9789f8ea3c7632e4f054e60c072

It's likely the hang is related to the dependency update, either as a bug in the dependency itself, or a bug in ASP.NET that is exposed by the new dependency.

@jkotas jkotas changed the title Flaky Test: Hang in Libuv and Sockets FunctionalTests/netcoreapp2.1 GC crash Kestrel Libuv and Sockets FunctionalTests/netcoreapp2.1 Apr 21, 2018
@jkotas jkotas changed the title GC crash Kestrel Libuv and Sockets FunctionalTests/netcoreapp2.1 GC crash in Kestrel Libuv and Sockets FunctionalTests/netcoreapp2.1 Apr 21, 2018
@jkotas jkotas assigned jkotas and unassigned mikeharder Apr 21, 2018
@jkotas
Copy link
Member Author

jkotas commented Apr 21, 2018

It is a GC bug caused by C++ compiler optimizations. The following code in find_first_object compiles into:

            if ((brick_entry =  brick_table [ prev_brick ]) >= 0)
            {
                break;
            }
00007ffe`f89ea625 480fbf0c42      movsx   rcx,word ptr [rdx+rax*2]
00007ffe`f89ea62a 6644391442      cmp     word ptr [rdx+rax*2],r10w
00007ffe`f89ea62f 7d08            jge     coreclr!SVR::gc_heap::find_first_object+0x71 (00007ffe`f89ea639)

Notice that the code reads the brick_table [ prev_brick ] value twice: Once for the comparison and second time to get the value to be value later. This is a valid optimization in C++ that has bitten us many times.

So what happens is that we read one value of the brick table, some other thread updates the brick table in the meantime, and the comparison uses the new value; but the rest of the method still uses the old value.

jkotas referenced this issue in jkotas/coreclr Apr 21, 2018
jkotas referenced this issue in jkotas/coreclr Apr 21, 2018
jkotas referenced this issue in dotnet/coreclr Apr 21, 2018
@jkotas jkotas reopened this Apr 21, 2018
jkotas referenced this issue in dotnet/coreclr Apr 21, 2018
@jkotas jkotas closed this as completed Apr 21, 2018
@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the 2.1.0 milestone Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 17, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

3 participants