Socket.ConnectAsync(EndPoint, CT) does not pass CT to DNS query, enabling code with many concurrent connections to the same DNS name to reach a state where no further connections get established because all DNS queries time out #92054

Closed
sandersaares opened this issue Sep 14, 2023 · 6 comments · Fixed by #104420
Labels: area-System.Net.Sockets, bug, in-pr (there is an active PR which will close this issue when it is merged)

Comments

@sandersaares

sandersaares commented Sep 14, 2023

Consider some code that is trying to establish 1000s of TCP connections to the same URL (e.g. a load test tool trying to simulate 8000 users, each with its own connection).

The internet is not perfect, so some (or even many) of these connections may fail and need to be retried. For example, the OS may fail to resolve a DNS query for whatever reason (maybe it got lost in the network, maybe the OS itself just has a 16-entry buffer and drops requests if too many are concurrently executing).

Socket.ConnectAsync(EndPoint, CancellationToken) takes a CancellationToken:

public ValueTask ConnectAsync(EndPoint remoteEP, CancellationToken cancellationToken)

One would expect to be able to use this to cancel connection attempts on timeout. Indeed, this is where the CancellationToken constructed from SocketsHttpHandler.ConnectTimeout ends up when we try to establish HTTP connections.
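For illustration, here is roughly the pattern one would expect to work; the host name and timeout below are placeholders, not values from the repro:

// using System.Net; using System.Net.Sockets; using System.Threading;
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(5)); // stand-in for ConnectTimeout
using var socket = new Socket(SocketType.Stream, ProtocolType.Tcp);

// The expectation: cancelling the token aborts the whole attempt,
// including the DNS resolution triggered by the DnsEndPoint.
await socket.ConnectAsync(new DnsEndPoint("example.com", 443), cts.Token);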

But let's zoom in on the DNS part of the connection attempt:

>	System.Net.NameResolution.dll!System.Net.Dns.GetHostEntryOrAddressesCoreAsync(string hostName, bool justReturnParsedIp, bool throwOnIIPAny, bool justAddresses, System.Net.Sockets.AddressFamily family, System.Threading.CancellationToken cancellationToken) Line 501	C#
 	System.Net.NameResolution.dll!System.Net.Dns.GetHostAddressesAsync(string hostNameOrAddress, System.Net.Sockets.AddressFamily family, System.Threading.CancellationToken cancellationToken) Line 241	C#
 	System.Net.Sockets.dll!System.Net.Sockets.SocketAsyncEventArgs.DnsConnectAsync(System.Net.DnsEndPoint endPoint, System.Net.Sockets.SocketType socketType, System.Net.Sockets.ProtocolType protocolType) Line 667	C#
 	System.Net.Sockets.dll!System.Net.Sockets.Socket.ConnectAsync(System.Net.Sockets.SocketAsyncEventArgs e, bool userSocket, bool saeaCancelable) Line 2697	C#
 	System.Net.Sockets.dll!System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ConnectAsync(System.Net.Sockets.Socket socket) Line 1182	C#
 	System.Net.Sockets.dll!System.Net.Sockets.Socket.ConnectAsync(System.Net.EndPoint remoteEP, System.Threading.CancellationToken cancellationToken) Line 101	C#
 	System.Net.Http.dll!System.Net.Http.HttpConnectionPool.ConnectToTcpHostAsync(string host, int port, System.Net.Http.HttpRequestMessage initialRequest, bool async, System.Threading.CancellationToken cancellationToken) Line 1600	C#
 	System.Net.Http.dll!System.Net.Http.HttpConnectionPool.ConnectAsync(System.Net.Http.HttpRequestMessage request, bool async, System.Threading.CancellationToken cancellationToken) Line 1526	C#
 	System.Net.Http.dll!System.Net.Http.HttpConnectionPool.CreateHttp11ConnectionAsync(System.Net.Http.HttpRequestMessage request, bool async, System.Threading.CancellationToken cancellationToken) Line 1631	C#
 	System.Net.Http.dll!System.Net.Http.HttpConnectionPool.AddHttp11ConnectionAsync(System.Net.Http.HttpConnectionPool.RequestQueue<System.Net.Http.HttpConnection>.QueueItem queueItem) Line 480	C#
 	System.Net.Http.dll!System.Net.Http.HttpConnectionPool.CheckForHttp11ConnectionInjection.AnonymousMethod__77_0((System.Net.Http.HttpConnectionPool thisRef, System.Net.Http.HttpConnectionPool.RequestQueue<System.Net.Http.HttpConnection>.QueueItem queueItem) state) Line 541	C#

One of the first things Socket.ConnectAsync() does is call AwaitableSocketAsyncEventArgs.ConnectAsync() to start the connection process (and after that it waits for completion, with proper cancellation handling for the wait):

ValueTask connectTask = saea.ConnectAsync(this);

This just sets up some bookkeeping and takes us to an internal overload of Socket.ConnectAsync():

internal bool ConnectAsync(SocketAsyncEventArgs e, bool userSocket, bool saeaCancelable)

If we are dealing with a DnsEndPoint (i.e. the URL had a hostname instead of an IP address), this now calls DnsConnectAsync():

pending = e.DnsConnectAsync(dnsEP, default, default);

DnsConnectAsync() will try to grab a CancellationToken for the DNS query from a field called _multipleConnectCancellation. I do not understand what that is, but in my experiments its value is simply null, so we end up not passing any CancellationToken to Dns.GetHostAddressesAsync(). The actual CancellationToken accompanying our Socket.ConnectAsync() call was left behind several method calls ago.

internal bool DnsConnectAsync(DnsEndPoint endPoint, SocketType socketType, ProtocolType protocolType)
{
    Debug.Assert(endPoint.AddressFamily == AddressFamily.Unspecified ||
        endPoint.AddressFamily == AddressFamily.InterNetwork ||
        endPoint.AddressFamily == AddressFamily.InterNetworkV6);

    CancellationToken cancellationToken = _multipleConnectCancellation?.Token ?? default;

    // In .NET 5 and earlier, the APM implementation allowed for synchronous exceptions from this to propagate
    // synchronously. This call is made here rather than in the Core async method below to preserve that behavior.
    Task<IPAddress[]> addressesTask = Dns.GetHostAddressesAsync(endPoint.Host, endPoint.AddressFamily, cancellationToken);

In other words, even if a socket connection attempt is cancelled (e.g. due to timeout), the DNS query it triggered remains active.

On its own, this just sounds inefficient. However, when we combine it with #49171, we get in trouble. We can now get into the following situation:

  • A large queue of pending connection attempts has built up in this one-at-a-time DNS queue (e.g. because there was a delay and the OS resolver needed 5-10 seconds of retries for a single request, while the app at the same time continued its job of creating thousands of connections).
  • Because the original slow request took a long time, the majority of the queue may already have timed out! The DNS requests are still in the queue and will still be executed because the CancellationToken is not passed to Dns.GetHostAddressesAsync().
  • Some of these DNS queries may also be slow (after all, there are bound to be some failures among thousands of them) and again require 5+ seconds of waiting before the OS resolver gets a response after retries (or never gets one at all).
  • Meanwhile, the HttpConnectionPool keeps retrying and making new connections, but they all keep timing out because the DNS queries are still sitting in the queue by the time the timeout elapses and the CancellationToken (which DNS is not listening to) is signaled.

In a sample scenario with 8000 HttpClients, after the initial slow DNS resolve starts the described chain of events, I am seeing average DNS resolve durations of 30-40 seconds (this includes the time spent in the queue).

Repro app: https://github.com/sandersaares/whatiswrongwithyourdog-dns-does-not-get-canceled/blob/main/Program.cs

Running this on a new D32 VM on Ubuntu 22 with .NET 7, I soon start to see output similar to the below (it varies by run, but takes no more than 30 seconds):

azureuser@hack01:~/33/net7.0$ dotnet ConsoleApp33.dll
Successfully completed 0 requests, last one took 0.0 ms and finished 4.6 ms ago. 0 failures seen. 0/8000 tasks successfully completed at least one request.
Successfully completed 62 requests, last one took 902.8 ms and finished 77.6 ms ago. 0 failures seen. 62/8000 tasks successfully completed at least one request.
Successfully completed 68 requests, last one took 1906.8 ms and finished 49.5 ms ago. 0 failures seen. 68/8000 tasks successfully completed at least one request.
Successfully completed 68 requests, last one took 1906.8 ms and finished 1049.7 ms ago. 6 failures seen. 68/8000 tasks successfully completed at least one request.
Successfully completed 68 requests, last one took 1906.8 ms and finished 2049.7 ms ago. 12 failures seen. 68/8000 tasks successfully completed at least one request.
Successfully completed 68 requests, last one took 1906.8 ms and finished 3049.4 ms ago. 15 failures seen. 68/8000 tasks successfully completed at least one request.
Successfully completed 68 requests, last one took 1906.8 ms and finished 4048.9 ms ago. 39 failures seen. 68/8000 tasks successfully completed at least one request.
Successfully completed 68 requests, last one took 1906.8 ms and finished 5049.6 ms ago. 184 failures seen. 68/8000 tasks successfully completed at least one request.
Successfully completed 68 requests, last one took 1906.8 ms and finished 6049.8 ms ago. 330 failures seen. 68/8000 tasks successfully completed at least one request.
Successfully completed 68 requests, last one took 1906.8 ms and finished 7049.8 ms ago. 468 failures seen. 68/8000 tasks successfully completed at least one request.
Successfully completed 68 requests, last one took 1906.8 ms and finished 8050.0 ms ago. 611 failures seen. 68/8000 tasks successfully completed at least one request.
Successfully completed 68 requests, last one took 1906.8 ms and finished 9049.9 ms ago. 763 failures seen. 68/8000 tasks successfully completed at least one request.
Successfully completed 68 requests, last one took 1906.8 ms and finished 10049.0 ms ago. 904 failures seen. 68/8000 tasks successfully completed at least one request.
Successfully completed 68 requests, last one took 1906.8 ms and finished 11049.7 ms ago. 1034 failures seen. 68/8000 tasks successfully completed at least one request.
Successfully completed 68 requests, last one took 1906.8 ms and finished 12049.6 ms ago. 1171 failures seen. 68/8000 tasks successfully completed at least one request.
Successfully completed 68 requests, last one took 1906.8 ms and finished 13049.9 ms ago. 1313 failures seen. 68/8000 tasks successfully completed at least one request.

None of the fresh (not yet connected) tasks at this point will successfully make an HTTP request - all retries will fail because they will all time out while sitting in the DNS queue.

dotnet-counters shows that in this particular case the time spent waiting in the DNS queue pushes the average lookup duration to over 1 minute!

[System.Net.NameResolution]
    Average DNS Lookup Duration (ms)                                  67,039.097
    Current DNS Lookups                                                7,932
    DNS Lookups Requested                                             22,315

Related to #81023 but different, as in my case I have an endless loop of failed requests because the app keeps retrying fast enough to maintain a critical mass in the DNS request queue.

@ghost ghost added the untriaged label Sep 14, 2023
@ghost
ghost commented Sep 14, 2023

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.

Author: sandersaares
Assignees: -
Labels: area-System.Net, untriaged
Milestone: -

@ghost
ghost commented Sep 14, 2023

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.

Author: sandersaares
Assignees: -
Labels: area-System.Net.Sockets, untriaged
Milestone: -

@antonfirsov antonfirsov added this to the 9.0.0 milestone Sep 14, 2023
@ghost ghost removed the untriaged label Sep 14, 2023
@antonfirsov antonfirsov self-assigned this Sep 14, 2023
@matt-augustine

A related problem is that when a CancellationToken that has been passed to Dns.GetHostAddressesAsync is cancelled, the current-dns-lookups event counter value skews higher. I believe this is because Dns.RunAsync logs BeforeResolution:

long startingTimestamp = NameResolutionTelemetry.Log.BeforeResolution(key);

but no corresponding AfterResolution is logged in the canceled task continuation:

((ICollection<KeyValuePair<object, Task>>)s_tasks).Remove(new KeyValuePair<object, Task>(key!, task));

I ran into this when working around the original problem reported in this issue by adding a SocketsHttpHandler.ConnectCallback implementation that calls Dns.GetHostAddressesAsync and caches the result.
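
For reference, a minimal sketch of that kind of ConnectCallback workaround (without the caching, and with host and error handling simplified):

// using System.Net; using System.Net.Http; using System.Net.Sockets;
var handler = new SocketsHttpHandler
{
    ConnectCallback = async (context, cancellationToken) =>
    {
        // Resolve the host ourselves so the CancellationToken (driven by ConnectTimeout)
        // actually applies to the DNS query.
        IPAddress[] addresses = await Dns.GetHostAddressesAsync(context.DnsEndPoint.Host, cancellationToken);

        var socket = new Socket(SocketType.Stream, ProtocolType.Tcp) { NoDelay = true };
        try
        {
            // This overload tries each resolved address in turn.
            await socket.ConnectAsync(addresses, context.DnsEndPoint.Port, cancellationToken);
            return new NetworkStream(socket, ownsSocket: true);
        }
        catch
        {
            socket.Dispose();
            throw;
        }
    }
};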

@antonfirsov
Member

@matt-augustine that problem is tracked by #92045.

@kristjanjogi-msft

kristjanjogi-msft commented Oct 19, 2023

I've come up with a workaround which seems to work based on a quick experiment. It goes like this:

var host = "example.com"; // host name only - Dns.GetHostAddressesAsync does not accept a full URL

var ipAddresses = await Dns.GetHostAddressesAsync(host, cancel);

var socket = new Socket(SocketType.Stream, ProtocolType.Tcp);

await socket.ConnectAsync(ipAddresses.First(), 443, cancel);

The overall idea is to not let the Socket connect to a DnsEndPoint, but instead make it connect to the IP address directly.

Is this a valid workaround or are there any pitfalls here? Let's ignore the obvious missing error handling around checking how many IP addresses Dns.GetHostAddressesAsync() returned.
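
A variant of the same idea that sidesteps the question of which returned address to pick (a sketch only, not authoritative guidance) is to pass the whole resolved array to the Socket.ConnectAsync overload that tries each address in turn:

// using System.Net; using System.Net.Sockets;
var addresses = await Dns.GetHostAddressesAsync("example.com", cancel);

var socket = new Socket(SocketType.Stream, ProtocolType.Tcp);

// This overload attempts each resolved address until one of them connects.
await socket.ConnectAsync(addresses, 443, cancel);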

@ghost ghost removed the in-pr label Jan 4, 2024
@rokonec rokonec assigned rokonec and unassigned antonfirsov Jul 2, 2024
@antonfirsov antonfirsov added the bug label Jul 2, 2024
@antonfirsov
Member

antonfirsov commented Jul 2, 2024

The issue title "Socket.ConnectAsync(EndPoint, CT) does not pass CT to DNS query, enabling code with many concurrent connections to the same DNS name to reach a state where no further connections get established because all DNS queries time out" breaks down into two separate problems:

  • Socket.ConnectAsync(EndPoint, CT) does not pass CT to DNS query -- This is a bug in the Socket code which is easy to fix. It would be nice to have it done in .NET 9, since the CT currently has no effect on cancelling DNS resolution on either Windows or Unix. Unfortunately, fixing it is not sufficient to resolve the Unix symptoms presented by the repro app in the opening comment.
  • Code with many concurrent connections to the same DNS name can reach a state where no further connections get established because all DNS queries time out -- This is in fact a duplicate of Stalled getaddrinfo syscall can lead to massive HttpClient timeouts #81023. The best solution to address that issue is to implement Proposal for DNS client and custom DNS queries #19443.

@dotnet-policy-service dotnet-policy-service bot added the in-pr label Jul 4, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Aug 12, 2024