SafeSocketHandle.CloseAsIs hanging in finalizer thread #40301

jaredpar · 2020-08-04T05:11:00Z

This issue comes from investigating test failures on this Roslyn PR dotnet/roslyn#46510

At the conclusion of running the unit tests for VBCSCompiler server the xUnit process will refuse to exit. The xUnit output will indicate that the tests have completed running but the process itself will not exit. Attaching the debugger to the xUnit process and there are two threads of note that are still running:

GC Finalizer

System.Private.CoreLib.dll!System.Threading.SpinWait.SpinOnceCore(int sleep1Threshold) (Unknown Source:0)
System.Net.Sockets.dll!System.Net.Sockets.SafeSocketHandle.CloseAsIs(bool abortive) (Unknown Source:0)
System.Net.Sockets.dll!System.Net.Sockets.Socket.Dispose(bool disposing) (Unknown Source:0)
System.Net.Sockets.dll!System.Net.Sockets.Socket.~Socket() (Unknown Source:0)
[Native to Managed Transition] (Unknown Source:0)

.NET Sockets

System.Net.Sockets.dll!System.Net.Sockets.SocketAsyncEngine.EventLoop() (Unknown Source:0)
System.Net.Sockets.dll!System.Net.Sockets.SocketAsyncEngine..ctor.AnonymousMethod__14_0(object s) (Unknown Source:0)
System.Private.CoreLib.dll!System.Threading.ThreadHelper.ThreadStart(object obj) (Unknown Source:0)
[Native to Managed Transition] (Unknown Source:0)

The VBCSCompiler server makes heavy use of named pipes. Looking through the Socket on the finalizer thread I can confirm it's a Unix Domain socket related to the named pipes the compiler is creating (the path in the end point matches the paths we create in the tests).

Unfortunately after a day of debugging I have not been able to narrow this problem down any further:

The problem only repros when running the entire test assembly. I ran each test class in the assembly and none of them individually reproduce the issue. Have to run them as a group.
Went through a cycle of causing the hang, identifying the test which created the socket that was hung in the finalizer, disable that test, re-run the assembly. Did this for about five different tests and it had no impact on the hang.
Thought this may be related to WaitForConnectionAsync does not respond to cancellation #40289 so I tried aggressively calling NamedPipeServerStream.Dispose on any instance that was hung in a WaitForConnectionAsync call.

None of these has had any impact though. I've also been unsuccessful in constructing a more concise repro. 😦

More than happy to provide any info to make tracking this down easier.

Repro Information:

Runtime: .NET 5 Preview 7
OS: Ubuntu 18.04, Ubuntu 18.04 via WSL2, OSX

The text was updated successfully, but these errors were encountered:

ghost · 2020-08-04T05:11:04Z

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.

stephentoub · 2020-08-04T11:02:12Z

cc: @tmds

Smaller repro for dotnet/runtime#40301

jaredpar · 2020-08-06T17:34:04Z

Did my best to narrow this down to a smaller problem. Created a branch where you can see the delta between our unit tests passing and hanging. This commit shows that it's simply enabling one new test to run on Linux that causes this hang.

To repro this do the following:

> git clone https://github.com/jaredpar/roslyn -o jaredpar
> git checkout -B repro jaredpar/repro/pipe-hang
> cd src/Compilers/Server/VBCSCompilerServerTests
> dotnet build
> dotnet msbuild /t:Test

Couple notes:

About 1/2 of the time this will cause the tests to fail, the other half it will hang in the finalizer
Running the new test alone is not sufficient to demonstrate the hang, have to run the entire suite (ensure why).

A brief description of what this particular test is doing:

Opening four NamedPipeServerStream instances on the same pipe name
Creates one NamedPipeClientStream instance that fully connects
Dispose three of the NamedPipeServerStream instances, does not dispose the connected one
Disposes the NamedPipeClientStream
That means after the test completes one of the NamedPipeServerStream instances remains undisposed and it appears this is the one that ends up stuck in the finalizer thread.

tmds · 2020-08-24T10:00:26Z

System.Private.CoreLib.dll!System.Threading.SpinWait.SpinOnceCore(int sleep1Threshold) (Unknown Source:0)
System.Net.Sockets.dll!System.Net.Sockets.SafeSocketHandle.CloseAsIs(bool abortive) (Unknown Source:0)

Since .NET 5, SafeHandles are ref'ed for on-going Socket operations (instead of calling DangerousGetHandle). If we are spinning here, there is a on-going operation against the Socket that keeps the handle from being released. It would be interesting to identify the operation that is on-going on some other thread. It may help put together a reproducer.

Aborting the on-going Socket operations is attempted by the TryUnblockSocket. If the Socket doesn't have CLOEXEC this method no-ops. This method is best-effort. There are tests that verify abortive behavior for UDP/TCP, but no tests for domain sockets.

runtime/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SafeSocketHandle.Unix.cs

Lines 196 to 235 in be197ad

    
           private unsafe bool TryUnblockSocket(bool abortive) 
        
           { 
        
               // Calling 'close' on a socket that has pending blocking calls (e.g. recv, send, accept, ...) 
        
               // may block indefinitely. This is a best-effort attempt to not get blocked and make those operations return. 
        
               // We need to ensure we keep the expected TCP behavior that is observed by the socket peer (FIN vs RST close). 
        
               // What we do here isn't specified by POSIX and doesn't work on all OSes. 
        
               // On Linux this works well. 
        
               // On OSX, TCP connections will be closed with a FIN close instead of an abortive RST close. 
        
               // And, pending TCP connect operations and UDP receive are not abortable. 
        
               // Unless we're doing an abortive close, don't touch sockets which don't have the CLOEXEC flag set. 
        
               // These may be shared with other processes and we want to avoid disconnecting them. 
        
               if (!abortive) 
        
               { 
        
                   int fdFlags = Interop.Sys.Fcntl.GetFD(handle); 
        
                   if (fdFlags == 0) 
        
                   { 
        
                       return false; 
        
                   } 
        
               } 
        
               int type = 0; 
        
               int optLen = sizeof(int); 
        
               Interop.Error err = Interop.Sys.GetSockOpt(handle, SocketOptionLevel.Socket, SocketOptionName.Type, (byte*)&type, &optLen); 
        
               if (err == Interop.Error.SUCCESS) 
        
               { 
        
                   // For TCP (SocketType.Stream), perform an abortive close. 
        
                   // Unless the user requested a normal close using Socket.Shutdown. 
        
                   if (type == (int)SocketType.Stream && !_hasShutdownSend) 
        
                   { 
        
                       Interop.Sys.Disconnect(handle); 
        
                   } 
        
                   else 
        
                   { 
        
                       Interop.Sys.Shutdown(handle, SocketShutdown.Both); 
        
                   } 
        
               } 
        
               return true; 
        
           }

Note that async Socket operations always nicely abort because there are no on-going Socket operations that keep the SafeHandle ref'ed.

antonfirsov · 2020-08-24T14:11:05Z

It would be interesting to identify the operation that is on-going on some other thread.

I'm currently trying to figure that out, but my gut feel is that spinning in the finalizer thread is a bad idea in any case, and we should probably rethink the way CloseAsIs works, regardless of the outcome of the investigation.

jaredpar · 2020-08-24T17:16:37Z

@tmds

It would be interesting to identify the operation that is on-going on some other thread. It may help put together a reproducer.

Really wish I could've found a simpler repro here that would more clearly identify the operation occurring. Spent several hours on this but was unable to do so.

More than happy to help out with any questions on the current repro.

antonfirsov · 2020-08-26T15:25:58Z

Looks like the issue is not caused by outstanding blocking calls. I have a smaller repro now:
https://gist.github.com/antonfirsov/ce0cb4992e115bb4d6e8dd6862fd6780

Here is what is happening in my understanding:

SafePipeHandle.Unix increases the reference counter on SocketSafeHandle:

runtime/src/libraries/System.IO.Pipes/src/Microsoft/Win32/SafeHandles/SafePipeHandle.Unix.cs

Line 36 in c6da68c

_namedPipeSocketHandle.DangerousAddRef(ref ignored);
NamedPipeClientConnectionHost seems to leak (= not dispose) some NamedPipeServerStream instances. (@jaredpar I'm wondering if this only happens in the repro branch or is this a bug in the compiler server?)
A Socket finalizer is called before the finalizers of the "owner" SafePipeHandle and NamedPipeServerStream.

Since the reference count is not 0, SocketSafeHandle is not released. CloseAsIs will keep spinning, blocking the finalizer thread, preventing the finalization of SafePipeHandle (therefore the release of SocketSafeHandle):

runtime/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SafeSocketHandle.cs

Lines 106 to 114 in c6da68c

    
           while (!_released) 
        
           { 
        
               // The socket was not released due to the SafeHandle being used. 
        
               // Try to make those on-going calls return. 
        
               // On Linux, TryUnblockSocket will unblock current operations but it doesn't prevent 
        
               // a new one from starting. So we must call TryUnblockSocket multiple times. 
        
               canceledOperations |= TryUnblockSocket(abortive); 
        
               sw.SpinOnce(); 
        
           }

I have an idea for a PR that would remove the spinning.

For some reason the issue does not happen with 3.1.

jaredpar · 2020-08-26T16:26:50Z

NamedPipeClientConnectionHost seems to leak (= not dispose) some NamedPipeServerStream instances. (@jaredpar I'm wondering if this only happens in the repro branch or is this a bug in the compiler server?)

This should not be happening in the product anymore. After filing the bug I was able to find a few other places these were leaking, made a push to get rid of them, and I believe I got them all at this point. It's possible it's still happening in the tests though.

tmds · 2020-08-26T19:15:25Z

Since the reference count is not 0, SocketSafeHandle is not released.

👍 Makes sense.

I have an idea for a PR that would remove the spinning.

The spinning deals with a race condition that occurs as follows:

Thread A refs the SafeHandle
Thread B calls TryUnblockSocket
Thread A uses the handle for a Socket operation

If Thread B does not call TryUnblockSocket again (which it does now by spinning), the operation on thread A is not aborted.

antonfirsov · 2020-08-26T19:41:22Z

@tmds I may miss something important but:
do we really need to abort the operation when we are doing an abortive close from a finalizer? Isn't it better to ignore it, and enforce the abortive close?

tmds · 2020-08-27T02:51:48Z

do we really need to abort the operation when we are doing an abortive close from a finalizer

When the Socket gets closed (by the user explicitly, or on the finalizer thread) we want all threads that are performing a blocking operation on the Socket to unblock (and return something like SocketError.OperationAborted).

tmds · 2020-08-27T03:02:32Z

Anton, I think we could dup the SafeSocketHandle so the SafePipeHandle has its own kernel-level reference. wdyt?

antonfirsov · 2020-08-27T09:07:31Z

we want all threads that are performing a blocking operation on the Socket to unblock (and return something like SocketError.OperationAborted).

@tmds what will happen to those ongoing operations if we close the underlying handle?

I think we really need to distinguish between the finalizer and a normal Dispose. It already indicates a user bug, if a finalizer is being executed. Instead of being 100% correct handling blocking operations I would only aim for preventing handle leak in those cases.

I believe the way CloseAsIs works now in a very unsafe way:

We assume that !_released indicates an ongoing blocking operation. This is not always true.
Currently any user code relying on DangerousRelease in a finalizer is risking to block the finalizer thread (a fatal bug!)

I would prefer to find a way to solve this in Sockets instead of just changing the way NamedPipes are utilizing sockets. As an alternative, we can document that user code should never DangerousAddRef / DangerousRelease on socket.SafeHandle, but that sounds weird to me.

tmds · 2020-08-27T09:32:16Z

@tmds what will happen to those ongoing operations if we close the underlying handle?

With 3.1 on-going operations didn't ref the SafeHandle. With 5.0 they do, and as a consequence they prevent the SafeHandle from getting released (that is: actually closing). So we need something to unblock the operations.

I think we really need to distinguish between the finalizer and a normal Dispose

In case of a finalizer, we may assume there are no on-going operations, because those would prevent the SafeHandle from being finalized, right? So we may skip the TryUnblockSocket and prevent blocking the finalizer thread.

antonfirsov · 2020-08-27T10:25:45Z

With 5.0 they do, and as a consequence they prevent the SafeHandle from getting released (that is: actually closing)

What I mean is that in a finalizer (instead of waiting for the handle to be released) we may want force-close the handle, independently of the ref-count/release mechanism.

In case of a finalizer, we may assume there are no on-going operations, because those would prevent the SafeHandle from being finalized, right?

Unfortunately, this is not true. In this particular case we are in the finalizer of the Socket instance, so ongoing P/Invoke-s have no effect on whether it's executed or not.

Edit: on the other hand, this should be a very rare corner-case. We expect blocking socket operations to be triggered from managed Socket calls, I think it's unlikely the Socket gets finalized in the meanwhile.

antonfirsov · 2020-08-27T11:38:23Z

I just noticed this comment in TryUnblockSocket:

// Calling 'close' on a socket that has pending blocking calls (e.g. recv, send, accept, ...)
// may block indefinitely. This is a best-effort attempt to not get blocked and make those operations return.

What does "indefinitely" mean here? Is it only for the time the blocking socket call is executing? Is it guaranteed to eventually return? Since it's a rare corner-case, in a finalizer it might be acceptable to wait for a blocking call. (As an alternative of of keeping the behavior where SafeSocketHandle.DangerousAddRef is risking a finalizer thread hang.)

Edit: I found #37873 now, which answers my question. Still hoping we can find a way to remove the spinning instead of working around this in SafePipeHandle. It's not good that we don't distinguish between blocking syscalls, and other operations changing SocketSafeHandle's refcount.
For example, we can set/unset a flag on SafeSocketHandle when doing blocking syscalls.

antonfirsov · 2020-08-27T17:30:03Z

~Socket() was a NOP before the following PR:
dotnet/corefx#38499

I wonder if the concern of indefinite spinning (or blocking close) was discussed in that PR? SafeSocketHandle.DangerousAddRef was probably safe before that.
/cc @wfurt

antonfirsov · 2020-08-27T20:25:43Z

Just an illustration of the bigger issue we have here. It's this easy to create a finalizer thread hang now:

static void Main(string[] args)
{
    CreateSocket();
    Console.WriteLine("Dying!");
    GC.Collect();
}

[MethodImpl(MethodImplOptions.NoInlining)]
private static void CreateSocket()
{
    Socket socket = new Socket(SocketType.Stream, ProtocolType.Tcp);
    bool dummy = false;
    socket.SafeHandle.DangerousAddRef(ref dummy);
}

geoffkizer · 2020-08-28T01:00:02Z

This is definitely related to the fix we made in dotnet/corefx#38499. Before that we didn't really have a socket finalizer, so we would not have hit this issue.

That said, we cannot undo the change in dotnet/corefx#38499 (at least not without making other, extensive changes). The reasoning is explained here: #29327 (comment)

I think the best fix for 5.0 is to simply not try to cancel outstanding operations when the Socket finalizer is called, as was suggested above. It's possible that this could lead to some unexpected behavior in weird corner cases, but I think those corner cases are restricted to the case where you do a DangerousAddRef without a corresponding DangerousRelease. Since we are currently hanging the finalizer thread in that case, we can probably live with a few quirky corner cases for now.

I do think we should revisit all of this in the future. Ideally we shouldn't need to have a finalizer on Socket. But changing that would take some effort and research etc.

tmds · 2020-08-28T04:06:20Z

The spinning was introduced as part of dotnet/corefx#38804. In that PR, Socket PInvoke changed to use SafeHandles instead of DangerousAddRef.

Because previously there were no out-standing references, the handle was released as soon as the Socket got Closed. On Windows, this caused on-going operations to abort, but on Linux those operations would just keep on blocking.

Due to the use of SafeHandle, the handle release no longer occurs on Close when there are on-going operations. That became the responsibility of TryUnblockSocket.

The spinning was added to deal with the race described in #40301 (comment). But I did not consider someone other than Socket operations could DangerousAddRef, which leads to infinite spinning.

Based on what was said I think we are safe to not call TryUnblockSocket when Closing on a finalizer thread, because there should be no on-going Socket operations that need to be aborted.

jaredpar · 2020-09-01T23:45:23Z

🎉

karelz · 2020-09-10T15:23:01Z

Fixed also in 5.0 in RC1 (PR #41688) and in RC2 (PR #41747)

Dotnet-GitSync-Bot added area-System.Net.Sockets untriaged New issue has not been triaged by the area owner labels Aug 4, 2020

scalablecory added this to the 5.0.0 milestone Aug 6, 2020

scalablecory removed the untriaged New issue has not been triaged by the area owner label Aug 6, 2020

antonfirsov self-assigned this Aug 6, 2020

jaredpar added a commit to jaredpar/roslyn that referenced this issue Aug 6, 2020

Smaller repro

b1bc273

Smaller repro for dotnet/runtime#40301

karelz added the bug label Aug 10, 2020

jaredpar mentioned this issue Aug 11, 2020

Keep alive tests failing on Mac dotnet/roslyn#46447

Open

antonfirsov mentioned this issue Aug 18, 2020

[NO REVIEW] [TEST ONLY] repro Linux/mac pipe issue dotnet/roslyn#46937

Closed

tmds mentioned this issue Aug 28, 2020

SafeSocketHandle: avoid potential blocking of finalizer thread #41508

Merged

antonfirsov closed this as completed in #41508 Sep 1, 2020

github-actions bot mentioned this issue Sep 1, 2020

[release/5.0] SafeSocketHandle: avoid potential blocking of finalizer thread #41668

Merged

marklio added the tenet-reliability Reliability/stability related issue (stress, load problems, etc.) label Sep 1, 2020

github-actions bot mentioned this issue Sep 2, 2020

[release/5.0-rc2] SafeSocketHandle: avoid potential blocking of finalizer thread #41747

Merged

ghost locked as resolved and limited conversation to collaborators Dec 7, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SafeSocketHandle.CloseAsIs hanging in finalizer thread #40301

SafeSocketHandle.CloseAsIs hanging in finalizer thread #40301

jaredpar commented Aug 4, 2020

ghost commented Aug 4, 2020

stephentoub commented Aug 4, 2020

jaredpar commented Aug 6, 2020 •

edited

Loading

tmds commented Aug 24, 2020

antonfirsov commented Aug 24, 2020 •

edited

Loading

jaredpar commented Aug 24, 2020

antonfirsov commented Aug 26, 2020 •

edited

Loading

jaredpar commented Aug 26, 2020

tmds commented Aug 26, 2020

antonfirsov commented Aug 26, 2020

tmds commented Aug 27, 2020

tmds commented Aug 27, 2020 •

edited

Loading

antonfirsov commented Aug 27, 2020

tmds commented Aug 27, 2020

antonfirsov commented Aug 27, 2020 •

edited

Loading

antonfirsov commented Aug 27, 2020 •

edited

Loading

antonfirsov commented Aug 27, 2020 •

edited

Loading

antonfirsov commented Aug 27, 2020

geoffkizer commented Aug 28, 2020

tmds commented Aug 28, 2020 •

edited

Loading

jaredpar commented Sep 1, 2020

karelz commented Sep 10, 2020

SafeSocketHandle.CloseAsIs hanging in finalizer thread #40301

SafeSocketHandle.CloseAsIs hanging in finalizer thread #40301

Comments

jaredpar commented Aug 4, 2020

ghost commented Aug 4, 2020

stephentoub commented Aug 4, 2020

jaredpar commented Aug 6, 2020 • edited Loading

tmds commented Aug 24, 2020

antonfirsov commented Aug 24, 2020 • edited Loading

jaredpar commented Aug 24, 2020

antonfirsov commented Aug 26, 2020 • edited Loading

jaredpar commented Aug 26, 2020

tmds commented Aug 26, 2020

antonfirsov commented Aug 26, 2020

tmds commented Aug 27, 2020

tmds commented Aug 27, 2020 • edited Loading

antonfirsov commented Aug 27, 2020

tmds commented Aug 27, 2020

antonfirsov commented Aug 27, 2020 • edited Loading

antonfirsov commented Aug 27, 2020 • edited Loading

antonfirsov commented Aug 27, 2020 • edited Loading

antonfirsov commented Aug 27, 2020

geoffkizer commented Aug 28, 2020

tmds commented Aug 28, 2020 • edited Loading

jaredpar commented Sep 1, 2020

karelz commented Sep 10, 2020

jaredpar commented Aug 6, 2020 •

edited

Loading

antonfirsov commented Aug 24, 2020 •

edited

Loading

antonfirsov commented Aug 26, 2020 •

edited

Loading

tmds commented Aug 27, 2020 •

edited

Loading

antonfirsov commented Aug 27, 2020 •

edited

Loading

antonfirsov commented Aug 27, 2020 •

edited

Loading

antonfirsov commented Aug 27, 2020 •

edited

Loading

tmds commented Aug 28, 2020 •

edited

Loading