Socket.ConnectAsync(EndPoint, CT) does not pass CT to DNS query, enabling code with many concurrent connections to the same DNS name to reach a state where no further connections get established because all DNS queries time out #92054
Comments
Tagging subscribers to this area: @dotnet/ncl
A related problem is that when a
but no corresponding
I ran into this when working around the original problem reported in this issue by adding a
@matt-augustine that problem is tracked by #92045.
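I've come up with a workaround which seems to work based on a quick experiment. It goes like this:

    // Dns.GetHostAddressesAsync expects a host name, not a URL.
    var host = "example.com";
    // Resolve the name ourselves so the DNS query observes the CancellationToken.
    var ipAddresses = await Dns.GetHostAddressesAsync(host, cancel);
    var socket = new Socket(SocketType.Stream, ProtocolType.Tcp);
    // Connect by IP address so the Socket does not start its own DNS query.
    await socket.ConnectAsync(ipAddresses.First(), 443, cancel);

The overall idea is to resolve the host name up front with a cancellable DNS call, so that the Socket is only ever given an IP address. Is this a valid workaround or are there any pitfalls here? Let's ignore the obvious missing error-handling regarding checking how many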
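A slightly fuller sketch of the same idea follows, in case it helps the discussion. It is an illustration only, not a reviewed fix: the helper names `ResolveAndConnectAsync` and `ConnectToAnyAsync` are hypothetical, and it assumes .NET 6 or later so that `Dns.GetHostAddressesAsync` accepts a `CancellationToken`. Whether cancelling actually removes a query that is already queued inside the runtime or OS resolver is not something this sketch can guarantee.

```csharp
using System;
using System.Net;
using System.Net.Sockets;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: try each address in turn, with a fresh socket per attempt.
async Task<Socket> ConnectToAnyAsync(IPAddress[] addresses, int port, CancellationToken cancel)
{
    // Thrown as-is if the address list is empty.
    Exception lastError = new SocketException((int)SocketError.HostNotFound);
    foreach (IPAddress address in addresses)
    {
        cancel.ThrowIfCancellationRequested();
        var socket = new Socket(address.AddressFamily, SocketType.Stream, ProtocolType.Tcp);
        try
        {
            await socket.ConnectAsync(address, port, cancel);
            return socket; // the caller owns and disposes the connected socket
        }
        catch (SocketException ex)
        {
            // This address failed; remember why and try the next one.
            lastError = ex;
            socket.Dispose();
        }
        catch
        {
            // Cancellation or an unexpected error: clean up and propagate.
            socket.Dispose();
            throw;
        }
    }
    throw lastError;
}

// Hypothetical helper: one timeout covers both the DNS query and the connect.
async Task<Socket> ResolveAndConnectAsync(string hostName, int port, TimeSpan timeout)
{
    using var cts = new CancellationTokenSource(timeout);
    IPAddress[] addresses = await Dns.GetHostAddressesAsync(hostName, cts.Token);
    return await ConnectToAnyAsync(addresses, port, cts.Token);
}
```

A call such as `await ResolveAndConnectAsync("example.com", 443, TimeSpan.FromSeconds(5))` would then replace the direct connect to a host name. Trying every returned address rather than only the first avoids failing outright when the first address is unreachable.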
The issue title "Socket.ConnectAsync(EndPoint, CT) does not pass CT to DNS query, enabling code with many concurrent connections to the same DNS name to reach a state where no further connections get established because all DNS queries time out" breaks down to two separate problems:
Consider some code that is trying to establish 1000s of TCP connections to the same URL (e.g. a load test tool trying to simulate 8000 users, each with its own connection).
The internet is not perfect, so some (or even many) of these connections may fail and need to be retried. For example, the OS may fail to resolve a DNS query for whatever reason (maybe it got lost in the network, maybe the OS itself just has a 16-entry buffer and drops requests if too many are concurrently executing).
Socket.ConnectAsync(EndPoint, CancellationToken) takes a CancellationToken:
runtime/src/libraries/System.Net.Sockets/src/System/Net/Sockets/Socket.Tasks.cs
Line 86 in 62e50b6
One would expect to be able to use this to cancel connection attempts on timeout. Indeed, this is where the CancellationToken constructed from SocketsHttpHandler.ConnectTimeout ends up when we try to establish HTTP connections.
But let's zoom in on the DNS part of the connection attempt:
One of the first things Socket.ConnectAsync() does is call AwaitableSocketAsyncEventArgs.ConnectAsync() to start the process of connecting (and after that it waits for it to complete, with proper cancellation handling for the wait):
runtime/src/libraries/System.Net.Sockets/src/System/Net/Sockets/Socket.Tasks.cs
Line 100 in 62e50b6
This just sets up some bookkeeping and takes us to an internal overload of Socket.ConnectAsync():
runtime/src/libraries/System.Net.Sockets/src/System/Net/Sockets/Socket.cs
Line 2724 in 62e50b6
If we are dealing with a DnsEndPoint (i.e. the URL had a hostname instead of an IP address), this now calls DnsConnectAsync():
runtime/src/libraries/System.Net.Sockets/src/System/Net/Sockets/Socket.cs
Line 2762 in 62e50b6
DnsConnectAsync() will try to grab a CancellationToken for the DNS query from a variable called _multipleConnectCancellation. I do not understand what that is but in my experiments the value is simply null, so we end up not passing any CancellationToken to Dns.GetHostAddressesAsync(). The actual CancellationToken accompanying our Socket.ConnectAsync() we left behind several method calls ago.
runtime/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncEventArgs.cs
Lines 649 to 659 in 62e50b6
In other words, even if a socket connection attempt is cancelled (e.g. due to timeout), the DNS query it triggered remains active.
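The pattern can be illustrated without any networking. The sketch below is a simulation, not the runtime's code: `SimulatedResolveAsync` is a hypothetical stand-in for a slow DNS query, and the two `Connect...` helpers contrast cancelling only the wait with passing the token down to the query. It assumes .NET 6 or later for `Task.WaitAsync`.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Number of simulated DNS queries that are still running.
int resolvesInFlight = 0;

// Hypothetical stand-in for a slow DNS query (2 seconds unless cancelled).
async Task<string> SimulatedResolveAsync(CancellationToken cancel)
{
    Interlocked.Increment(ref resolvesInFlight);
    try
    {
        await Task.Delay(TimeSpan.FromSeconds(2), cancel);
        return "192.0.2.1"; // documentation-only address, used as a placeholder result
    }
    finally
    {
        Interlocked.Decrement(ref resolvesInFlight);
    }
}

// Mirrors the reported behaviour: the caller's wait is cancellable,
// but the query itself never sees the token.
Task<string> ConnectDroppingTokenAsync(CancellationToken cancel) =>
    SimulatedResolveAsync(CancellationToken.None).WaitAsync(cancel);

// The behaviour one would expect: the token reaches the query.
Task<string> ConnectPassingTokenAsync(CancellationToken cancel) =>
    SimulatedResolveAsync(cancel);

// Runs one attempt with a 50 ms timeout and reports how many
// simulated queries are still running once the caller has given up.
async Task<int> InFlightAfterTimeoutAsync(Func<CancellationToken, Task<string>> connect)
{
    using var cts = new CancellationTokenSource(TimeSpan.FromMilliseconds(50));
    try
    {
        await connect(cts.Token);
    }
    catch (OperationCanceledException)
    {
        // Expected: the attempt timed out.
    }
    return Volatile.Read(ref resolvesInFlight);
}
```

Calling `await InFlightAfterTimeoutAsync(ConnectPassingTokenAsync)` should report no query left running, while `await InFlightAfterTimeoutAsync(ConnectDroppingTokenAsync)` should report one query still in flight after the caller has already timed out.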
On its own, it just sounds inefficient. However, when we combine this with #49171, we get in trouble. We can now get into the following situation:
Dns.GetHostAddressesAsync().
In a sample scenario with 8000 HttpClients, after the initial slow DNS resolve starts the described chain of events, I am seeing average DNS resolve durations of 30-40 seconds (this includes the time spent in the queue).
Repro app: https://github.com/sandersaares/whatiswrongwithyourdog-dns-does-not-get-canceled/blob/main/Program.cs
Running this on a new D32 VM on Ubuntu 22 with .NET 7, I soon (depends on run, but no more than 30 seconds) start to see output similar to:
None of the fresh (not yet connected) tasks at this point will successfully make an HTTP request - all retries will fail because they will all time out while sitting in the DNS queue.
dotnet-counters shows the queue size in this particular case to be over 1 minute!
Related to #81023 but different, as in my case I have a forever-loop of failed requests because the app keeps retrying fast enough to maintain critical mass in the DNS request queue.
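To make the "critical mass" effect concrete, here is a toy model of the queue. It is a sketch under stated assumptions and not a measurement: it assumes a single worker draining DNS queries in FIFO order, a fixed service time per query, and callers that all retry at each timeout boundary. The function name and all numbers are hypothetical.

```csharp
using System;

// Toy model: counts how many callers get an answer within their timeout.
// cancelRemovesQuery = false models queries that stay queued after the caller gave up.
int SimulateSuccesses(int callers, long serviceMs, long timeoutMs, int rounds, bool cancelRemovesQuery)
{
    long serverFreeAt = 0;   // time at which the single worker becomes idle
    int pending = callers;   // callers that still need an answer
    int succeeded = 0;

    for (int round = 0; round < rounds; round++)
    {
        long roundStart = round * timeoutMs;
        long deadline = roundStart + timeoutMs;
        int stillPending = 0;

        for (int i = 0; i < pending; i++)
        {
            long start = Math.Max(serverFreeAt, roundStart);
            if (cancelRemovesQuery && start >= deadline)
            {
                // Cancelled while queued: removed without using any worker time.
                stillPending++;
                continue;
            }

            long finish = start + serviceMs;
            if (finish <= deadline)
            {
                succeeded++;
                serverFreeAt = finish;
            }
            else
            {
                // Timed out. Without cancellation the stale query still runs to completion.
                stillPending++;
                serverFreeAt = cancelRemovesQuery ? deadline : finish;
            }
        }

        pending = stillPending;
    }

    return succeeded;
}

Console.WriteLine(SimulateSuccesses(100, 100, 1000, 3, cancelRemovesQuery: false));
Console.WriteLine(SimulateSuccesses(100, 100, 1000, 3, cancelRemovesQuery: true));
```

With these made-up numbers (100 callers, 100 ms per query, 1 second timeout, 3 rounds) the model yields 10 successes when stale queries stay in the queue and 30 when cancellation removes them: in the first case every retry lands behind a backlog that is already longer than the timeout, which matches the behaviour described above.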