Socket.ConnectAsync(EndPoint, CT) does not pass CT to DNS query, enabling code with many concurrent connections to the same DNS name to reach a state where no further connections get established because all DNS queries time out #92054

Closed
sandersaares opened this issue Sep 14, 2023 · 6 comments · Fixed by #104420
Labels: area-System.Net.Sockets, bug, in-pr (there is an active PR which will close this issue when it is merged)

Comments

@sandersaares

sandersaares commented Sep 14, 2023

Consider some code that is trying to establish 1000s of TCP connections to the same URL (e.g. a load test tool trying to simulate 8000 users, each with its own connection).

The internet is not perfect, so some (or even many) of these connections may fail and need to be retried. For example, the OS may fail to resolve a DNS query for whatever reason (maybe it got lost in the network, maybe the OS itself just has a 16-entry buffer and drops requests if too many are concurrently executing).

Socket.ConnectAsync(EndPoint, CancellationToken) takes a CancellationToken:

public ValueTask ConnectAsync(EndPoint remoteEP, CancellationToken cancellationToken)

One would expect to be able to use this to cancel connection attempts on timeout. Indeed, this is where the CancellationToken constructed from SocketsHttpHandler.ConnectTimeout ends up when we try to establish HTTP connections.
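For illustration, here is roughly the pattern one would expect to work; the host name and timeout below are placeholders, not values from the repro:

// using System.Net; using System.Net.Sockets; using System.Threading;
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(5)); // stand-in for ConnectTimeout
using var socket = new Socket(SocketType.Stream, ProtocolType.Tcp);

// The expectation: cancelling the token aborts the whole attempt,
// including the DNS resolution triggered by the DnsEndPoint.
await socket.ConnectAsync(new DnsEndPoint("example.com", 443), cts.Token);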

But let's zoom in on the DNS part of the connection attempt:

>	System.Net.NameResolution.dll!System.Net.Dns.GetHostEntryOrAddressesCoreAsync(string hostName, bool justReturnParsedIp, bool throwOnIIPAny, bool justAddresses, System.Net.Sockets.AddressFamily family, System.Threading.CancellationToken cancellationToken) Line 501	C#
 	System.Net.NameResolution.dll!System.Net.Dns.GetHostAddressesAsync(string hostNameOrAddress, System.Net.Sockets.AddressFamily family, System.Threading.CancellationToken cancellationToken) Line 241	C#
 	System.Net.Sockets.dll!System.Net.Sockets.SocketAsyncEventArgs.DnsConnectAsync(System.Net.DnsEndPoint endPoint, System.Net.Sockets.SocketType socketType, System.Net.Sockets.ProtocolType protocolType) Line 667	C#
 	System.Net.Sockets.dll!System.Net.Sockets.Socket.ConnectAsync(System.Net.Sockets.SocketAsyncEventArgs e, bool userSocket, bool saeaCancelable) Line 2697	C#
 	System.Net.Sockets.dll!System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ConnectAsync(System.Net.Sockets.Socket socket) Line 1182	C#
 	System.Net.Sockets.dll!System.Net.Sockets.Socket.ConnectAsync(System.Net.EndPoint remoteEP, System.Threading.CancellationToken cancellationToken) Line 101	C#
 	System.Net.Http.dll!System.Net.Http.HttpConnectionPool.ConnectToTcpHostAsync(string host, int port, System.Net.Http.HttpRequestMessage initialRequest, bool async, System.Threading.CancellationToken cancellationToken) Line 1600	C#
 	System.Net.Http.dll!System.Net.Http.HttpConnectionPool.ConnectAsync(System.Net.Http.HttpRequestMessage request, bool async, System.Threading.CancellationToken cancellationToken) Line 1526	C#
 	System.Net.Http.dll!System.Net.Http.HttpConnectionPool.CreateHttp11ConnectionAsync(System.Net.Http.HttpRequestMessage request, bool async, System.Threading.CancellationToken cancellationToken) Line 1631	C#
 	System.Net.Http.dll!System.Net.Http.HttpConnectionPool.AddHttp11ConnectionAsync(System.Net.Http.HttpConnectionPool.RequestQueue<System.Net.Http.HttpConnection>.QueueItem queueItem) Line 480	C#
 	System.Net.Http.dll!System.Net.Http.HttpConnectionPool.CheckForHttp11ConnectionInjection.AnonymousMethod__77_0((System.Net.Http.HttpConnectionPool thisRef, System.Net.Http.HttpConnectionPool.RequestQueue<System.Net.Http.HttpConnection>.QueueItem queueItem) state) Line 541	C#

One of the first things Socket.ConnectAsync() does is call AwaitableSocketAsyncEventArgs.ConnectAsync() to start the connection process (and after that it waits for completion, with proper cancellation handling for the wait):

ValueTask connectTask = saea.ConnectAsync(this);

This just sets up some bookkeeping and takes us to an internal overload of Socket.ConnectAsync():

internal bool ConnectAsync(SocketAsyncEventArgs e, bool userSocket, bool saeaCancelable)

If we are dealing with a DnsEndPoint (i.e. the URL had a hostname instead of an IP address), this now calls DnsConnectAsync():

pending = e.DnsConnectAsync(dnsEP, default, default);

DnsConnectAsync() will try to grab a CancellationToken for the DNS query from a field called _multipleConnectCancellation. I do not understand what that is, but in my experiments its value is simply null, so we end up not passing any CancellationToken to Dns.GetHostAddressesAsync(). The actual CancellationToken accompanying our Socket.ConnectAsync() call was left behind several method calls ago.

internal bool DnsConnectAsync(DnsEndPoint endPoint, SocketType socketType, ProtocolType protocolType)
{
    Debug.Assert(endPoint.AddressFamily == AddressFamily.Unspecified ||
        endPoint.AddressFamily == AddressFamily.InterNetwork ||
        endPoint.AddressFamily == AddressFamily.InterNetworkV6);

    CancellationToken cancellationToken = _multipleConnectCancellation?.Token ?? default;

    // In .NET 5 and earlier, the APM implementation allowed for synchronous exceptions from this to propagate
    // synchronously. This call is made here rather than in the Core async method below to preserve that behavior.
    Task<IPAddress[]> addressesTask = Dns.GetHostAddressesAsync(endPoint.Host, endPoint.AddressFamily, cancellationToken);

In other words, even if a socket connection attempt is cancelled (e.g. due to timeout), the DNS query it triggered remains active.

On its own, this just sounds inefficient. However, when we combine it with #49171, we get in trouble. We can now get into the following situation:

  • A large queue of pending connection attempts has built up in this one-at-a-time DNS queue (e.g. because there was a delay and the OS resolver needed 5-10 seconds of retries for a single request, while the app at the same time continued its job of creating thousands of connections).
  • Because the original slow request took a long time, the majority of the queue may already have timed out! The DNS requests are still in the queue and will still be executed because the CancellationToken is not passed to Dns.GetHostAddressesAsync().
  • Some of these DNS queries may also be slow (after all, there are bound to be some failures among thousands of them) and again require 5+ seconds of waiting before the OS resolver gets a response after retries (or never gets one at all).
  • Meanwhile, the HttpConnectionPool keeps retrying and making new connections, but they all keep timing out because the DNS queries are still sitting in the queue by the time the timeout elapses and the CancellationToken (which DNS is not listening to) is signaled.

In a sample scenario with 8000 HttpClients, after the initial slow DNS resolve starts the described chain of events, I am seeing average DNS resolve durations of 30-40 seconds (this includes the time spent in the queue).

Repro app: https://github.com/sandersaares/whatiswrongwithyourdog-dns-does-not-get-canceled/blob/main/Program.cs

Running this on a new D32 VM on Ubuntu 22 with .NET 7, I soon start to see output similar to the below (it varies by run, but takes no more than 30 seconds):

azureuser@hack01:~/33/net7.0$ dotnet ConsoleApp33.dll
Successfully completed 0 requests, last one took 0.0 ms and finished 4.6 ms ago. 0 failures seen. 0/8000 tasks successfully completed at least one request.
Successfully completed 62 requests, last one took 902.8 ms and finished 77.6 ms ago. 0 failures seen. 62/8000 tasks successfully completed at least one request.
Successfully completed 68 requests, last one took 1906.8 ms and finished 49.5 ms ago. 0 failures seen. 68/8000 tasks successfully completed at least one request.
Successfully completed 68 requests, last one took 1906.8 ms and finished 1049.7 ms ago. 6 failures seen. 68/8000 tasks successfully completed at least one request.
Successfully completed 68 requests, last one took 1906.8 ms and finished 2049.7 ms ago. 12 failures seen. 68/8000 tasks successfully completed at least one request.
Successfully completed 68 requests, last one took 1906.8 ms and finished 3049.4 ms ago. 15 failures seen. 68/8000 tasks successfully completed at least one request.
Successfully completed 68 requests, last one took 1906.8 ms and finished 4048.9 ms ago. 39 failures seen. 68/8000 tasks successfully completed at least one request.
Successfully completed 68 requests, last one took 1906.8 ms and finished 5049.6 ms ago. 184 failures seen. 68/8000 tasks successfully completed at least one request.
Successfully completed 68 requests, last one took 1906.8 ms and finished 6049.8 ms ago. 330 failures seen. 68/8000 tasks successfully completed at least one request.
Successfully completed 68 requests, last one took 1906.8 ms and finished 7049.8 ms ago. 468 failures seen. 68/8000 tasks successfully completed at least one request.
Successfully completed 68 requests, last one took 1906.8 ms and finished 8050.0 ms ago. 611 failures seen. 68/8000 tasks successfully completed at least one request.
Successfully completed 68 requests, last one took 1906.8 ms and finished 9049.9 ms ago. 763 failures seen. 68/8000 tasks successfully completed at least one request.
Successfully completed 68 requests, last one took 1906.8 ms and finished 10049.0 ms ago. 904 failures seen. 68/8000 tasks successfully completed at least one request.
Successfully completed 68 requests, last one took 1906.8 ms and finished 11049.7 ms ago. 1034 failures seen. 68/8000 tasks successfully completed at least one request.
Successfully completed 68 requests, last one took 1906.8 ms and finished 12049.6 ms ago. 1171 failures seen. 68/8000 tasks successfully completed at least one request.
Successfully completed 68 requests, last one took 1906.8 ms and finished 13049.9 ms ago. 1313 failures seen. 68/8000 tasks successfully completed at least one request.

None of the fresh (not yet connected) tasks at this point will successfully make an HTTP request - all retries will fail because they will all time out while sitting in the DNS queue.

dotnet-counters shows that in this particular case the time spent waiting in the DNS queue pushes the average lookup duration to over 1 minute!

[System.Net.NameResolution]
    Average DNS Lookup Duration (ms)                                  67,039.097
    Current DNS Lookups                                                7,932
    DNS Lookups Requested                                             22,315

Related to #81023 but different, as in my case I have an endless loop of failed requests because the app keeps retrying fast enough to maintain a critical mass in the DNS request queue.

@ghost ghost added the untriaged label Sep 14, 2023
@ghost
ghost commented Sep 14, 2023

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.

Author: sandersaares
Assignees: -
Labels: area-System.Net, untriaged
Milestone: -

@ghost
ghost commented Sep 14, 2023

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.

Author: sandersaares
Assignees: -
Labels: area-System.Net.Sockets, untriaged
Milestone: -

@antonfirsov antonfirsov added this to the 9.0.0 milestone Sep 14, 2023
@ghost ghost removed the untriaged label Sep 14, 2023
@antonfirsov antonfirsov self-assigned this Sep 14, 2023
@matt-augustine

A related problem is that when a CancellationToken that has been passed to Dns.GetHostAddressesAsync is cancelled, the current-dns-lookups event counter value skews higher. I believe this is because Dns.RunAsync logs BeforeResolution:

long startingTimestamp = NameResolutionTelemetry.Log.BeforeResolution(key);

but no corresponding AfterResolution is logged in the canceled task continuation:

((ICollection<KeyValuePair<object, Task>>)s_tasks).Remove(new KeyValuePair<object, Task>(key!, task));

I ran into this when working around the original problem reported in this issue by adding a SocketsHttpHandler.ConnectCallback implementation that calls Dns.GetHostAddressesAsync and caches the result.
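
For reference, a minimal sketch of that kind of ConnectCallback workaround (without the caching, and with host and error handling simplified):

// using System.Net; using System.Net.Http; using System.Net.Sockets;
var handler = new SocketsHttpHandler
{
    ConnectCallback = async (context, cancellationToken) =>
    {
        // Resolve the host ourselves so the CancellationToken (driven by ConnectTimeout)
        // actually applies to the DNS query.
        IPAddress[] addresses = await Dns.GetHostAddressesAsync(context.DnsEndPoint.Host, cancellationToken);

        var socket = new Socket(SocketType.Stream, ProtocolType.Tcp) { NoDelay = true };
        try
        {
            // This overload tries each resolved address in turn.
            await socket.ConnectAsync(addresses, context.DnsEndPoint.Port, cancellationToken);
            return new NetworkStream(socket, ownsSocket: true);
        }
        catch
        {
            socket.Dispose();
            throw;
        }
    }
};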

@antonfirsov
Member

@matt-augustine that problem is tracked by #92045.

@kristjanjogi-msft

kristjanjogi-msft commented Oct 19, 2023

I've come up with a workaround which seems to work based on a quick experiment. It goes like this:

var host = "example.com"; // host name only - Dns.GetHostAddressesAsync does not accept a full URL

var ipAddresses = await Dns.GetHostAddressesAsync(host, cancel);

var socket = new Socket(SocketType.Stream, ProtocolType.Tcp);

await socket.ConnectAsync(ipAddresses.First(), 443, cancel);

The overall idea is to not let the Socket connect to a DnsEndPoint, but instead make it connect to the IP address directly.

Is this a valid workaround or are there any pitfalls here? Let's ignore the obvious missing error handling around checking how many IP addresses Dns.GetHostAddressesAsync() returned.
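
A variant of the same idea that sidesteps the question of which returned address to pick (a sketch only, not authoritative guidance) is to pass the whole resolved array to the Socket.ConnectAsync overload that tries each address in turn:

// using System.Net; using System.Net.Sockets;
var addresses = await Dns.GetHostAddressesAsync("example.com", cancel);

var socket = new Socket(SocketType.Stream, ProtocolType.Tcp);

// This overload attempts each resolved address until one of them connects.
await socket.ConnectAsync(addresses, 443, cancel);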

@ghost ghost removed the in-pr label Jan 4, 2024
@rokonec rokonec assigned rokonec and unassigned antonfirsov Jul 2, 2024
@antonfirsov antonfirsov added the bug label Jul 2, 2024
@antonfirsov
Member

antonfirsov commented Jul 2, 2024

The issue title "Socket.ConnectAsync(EndPoint, CT) does not pass CT to DNS query, enabling code with many concurrent connections to the same DNS name to reach a state where no further connections get established because all DNS queries time out" breaks down into two separate problems:

  • Socket.ConnectAsync(EndPoint, CT) does not pass CT to DNS query -- This is a bug in the Socket code which is easy to fix. It would be nice to have it done in .NET 9, since the CT currently has no effect on cancelling DNS resolution on either Windows or Unix. Unfortunately, fixing it is not sufficient to resolve the Unix symptoms presented by the repro app in the opening comment.
  • Code with many concurrent connections to the same DNS name can reach a state where no further connections get established because all DNS queries time out -- This is in fact a duplicate of Stalled getaddrinfo syscall can lead to massive HttpClient timeouts #81023. The best solution to address that issue is to implement Proposal for DNS client and custom DNS queries #19443.

@dotnet-policy-service dotnet-policy-service bot added the in-pr label Jul 4, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Aug 12, 2024