Start new per-host queue when an async over sync Dns call is cancelled #92862
Conversation
Tagging subscribers to this area: @dotnet/ncl
/// serialized. Once the data for that host is cached locally by the OS, the subsequent requests should all complete
/// very quickly, and if the head-of-line request is taking a long time due to the connection to the server, we won't
/// block lots of threads all getting data for that one host. We also still want to issue the request to the OS, rather
Once the data for that host is cached locally by the OS, the subsequent requests should all complete very quickly, and if the head-of-line request is taking a long time due to the connection to the server, we won't block lots of threads all getting data for that one host
This turned out to be a false assumption. In Ubuntu there is no caching by default, and each getaddrinfo call sends an A & AAAA DNS request and takes 5-50ms. If the packet is lost, it waits 5 sec before retrying.
This turned out to be a false assumption. In Ubuntu there is no caching by default, and each getaddrinfo call sends an A & AAAA DNS request and takes 5-50ms.
On what are you basing this? On Ubuntu for:
[Benchmark]
public object GetHostEntry() => Dns.GetHostEntry("www.microsoft.com");
I get:
| Method       | Mean     |
|--------------|----------|
| GetHostEntry | 498.9 us |
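(For anyone who wants to reproduce this, a minimal self-contained BenchmarkDotNet harness along these lines should work; this is a sketch rather than the exact project used for the numbers above.)

```csharp
using System.Net;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

[MemoryDiagnoser]
public class DnsBenchmarks
{
    // Same call as in the comment above; returned as object to keep the signature simple.
    [Benchmark]
    public object GetHostEntry() => Dns.GetHostEntry("www.microsoft.com");
}

public class Program
{
    public static void Main() => BenchmarkRunner.Run<DnsBenchmarks>();
}
```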
[Benchmark]
public void GetHostAddresses() => Dns.GetHostAddresses("fb.lt.main.ml.mdn.skype.net");
averages to 24.84 ms in my local 22.04 environment and to 9.051 ms in the Azure D32 VM from #92054.
fb.lt.main.ml.mdn.skype.net
That name resolves through a CNAME of fb.lt.main.ml.mdn.trafficmanager.net, which advertises a CNAME for fileserver-backend-lb.lt-westus-01.ic3-middle-lane-main-lt.westus-test.cosmic-int.office.net with a 0 second TTL. So my guess is the OS is just honoring the TTLs in this case.
The www.microsoft.com example above, on the other hand, uses non-0 TTLs.
(that's in no way saying that this isn't a real scenario, DNS-based connection load balancing works this way)
You can also check whether DNS caching in your system is enabled, e.g.
systemctl is-active systemd-resolved
My point is simply that stating that it's a false assumption that OSes cache DNS information is itself a false assumption. Every major OS I'm aware of is capable of doing DNS caching. Whether it's enabled is a different question, though to my knowledge it is by default in Ubuntu 22.04. And we shouldn't just be killing the associated comment. If you want to augment the comment to say that caching might be disabled, that's fine.
I was wrong thinking that answers are never cached on Ubuntu by default. However, to me the comment suggests that we can generally rely on short completion times because of the caching, which is also not true; e.g., as Miha pointed out, answers with TTL=0 are not cached, which then increases the probability of hitting #81023.
I will add the comment back & augment it.
if (cancellationToken.CanBeCanceled)
{
task.ContinueWith((task, key) =>
task.ContinueWith(delegate
Previously this wasn't allocating a new delegate and a new closure on every call. Now it is, as it's closing over the key, the task, and the startingTimestamp. It'll result in that allocation even if !CanBeCanceled, because the state it's closing over is at the top-level method scope.
This is a fix for #92045. I don't see a way to capture startingTimestamp without an allocation. We can pass the info in a tuple to avoid the allocation in the !CanBeCanceled case. Or we can create a private class & cache instances to reduce the allocations in general.
At a minimum you can pass the required state in via the object state parameter. That'll require a tuple allocation/boxing, but the one allocation for that is still better than the two allocations for the closure and delegate. And it'll make it specific to the CanBeCanceled case, whereas with how it's currently written the closure allocation will happen first thing in the method body.
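A rough sketch of that suggestion (illustrative only, not the actual runtime code; the key/timestamp names are placeholders, and Stopwatch.GetElapsedTime assumes .NET 7+):

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

static class ContinuationStateSketch
{
    public static Task AttachCleanup(Task task, string key, long startingTimestamp, CancellationToken token)
    {
        if (token.CanBeCanceled)
        {
            // A static lambda plus the object-state overload costs one boxed tuple instead of
            // a closure + delegate allocation, and only on the CanBeCanceled path.
            task.ContinueWith(
                static (completed, state) =>
                {
                    var (k, ts) = ((string, long))state!;
                    Console.WriteLine($"Lookup for '{k}' finished after {Stopwatch.GetElapsedTime(ts)}.");
                },
                (key, startingTimestamp),
                CancellationToken.None,
                TaskContinuationOptions.ExecuteSynchronously,
                TaskScheduler.Default);
        }

        return task;
    }
}
```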
using (cancellationToken.UnsafeRegister(static s =>
{
    // In case a request is cancelled, start a new queue for this key.
    // This helps avoid congestion when a getaddrinfo call runs for too long.
What's the rationale for why a cancellation request means we no longer want to serialize?
If configured well, a cancellation would likely kick in for a long-running or stalled request. Closing the existing queue and starting a new one makes sure that requests following the cancellation are not blocked by the long-running request. This helped in my experiments, and validations are in progress in customer environments.
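To make the mechanism more concrete, here is a rough, self-contained sketch of the idea (illustrative only, not the runtime's actual implementation; all names are made up):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

static class PerHostQueueSketch
{
    // One task chain per host name; lookups for the same host run serialized behind it.
    private static readonly ConcurrentDictionary<string, Task> s_queues = new();

    public static Task<T> EnqueueAsync<T>(string host, Func<T> lookup, CancellationToken cancellationToken)
    {
        Task previous = s_queues.GetOrAdd(host, Task.CompletedTask);

        // Chain the new (blocking) lookup behind whatever is already queued for this host.
        Task<T> work = previous.ContinueWith(
            (_, state) => ((Func<T>)state!)(),
            lookup,
            CancellationToken.None,
            TaskContinuationOptions.DenyChildAttach,
            TaskScheduler.Default);

        // Publish the new tail of the queue.
        s_queues[host] = work;

        if (cancellationToken.CanBeCanceled)
        {
            // The mitigation discussed here: when a caller's token fires (typically a connect
            // timeout), forget the current queue for this host. The in-flight lookup keeps
            // running, but later requests no longer wait behind it.
            cancellationToken.UnsafeRegister(
                static state => s_queues.TryRemove((string)state!, out _),
                host);
        }

        return work;
    }
}
```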
One of the original motivations for doing this queueing was a DNS request that could get stalled, and thus all requests for that DNS endpoint would end up blocking all threads in the pool. Doesn't this effectively put us back into the same place?
From what I see, #48566 doesn't talk about stalled (up to several seconds) requests but about slow (tens of milliseconds?) requests using up the ThreadPool threads. This does not remove the serialization of those. The queue is only being reset when a cancellation kicks in, which is typically triggered by a ConnectTimeout, which is usually in the multiple-seconds range. If the customer mentioned in #34633 had stalled requests, they would have likely come back to us reporting #81023 much earlier.
cc @davidfowl if you can recall how long the slow requests took in #34633. Without that I would assume it's a case similar to what we discussed in #92862 (comment) and "slow" means that the requests are in the 5-50 ms range.
Right, so if every request for a given destination ends up taking a longer period of time and ends up being canceled as a result, every cancellation is going to cause subsequent requests to run concurrently with ones already there, and if it happens enough, won't it similarly spread out over all of the pool threads?
when a cancellation kicks in which is typically triggered by a ConnectTimeout
That's one source. The Dns APIs that support cancellation are public.
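For reference, a minimal example of cancellation reaching a lookup through the public API (arbitrary host name and timeout; the cancellable overload exists since .NET 6):

```csharp
using System;
using System.Net;
using System.Threading;

// Arbitrary host and timeout; Dns.GetHostAddressesAsync(string, CancellationToken) is public API.
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(2));
IPAddress[] addresses = await Dns.GetHostAddressesAsync("www.example.com", cts.Token);
foreach (IPAddress address in addresses)
    Console.WriteLine(address);
```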
if every request for a given destination ends up taking a longer period of time and ends up being canceled as a result, every cancellation is going to cause subsequent requests to run concurrently with ones already there, and if it happens enough, won't it similarly spread out over all of the pool threads?
Yes, but how likely is this scenario, and is it more probable than hitting #81023? To me it looks like having all requests to the same host stalled for a long enough period to starve the ThreadPool is a catastrophic event which would have serious impact on the application's functionality even without the pool starvation, and the customer has to go and fix the environmental setup anyway. Occasional stalled requests are more common IMO, and the problem is that they are currently killing a possibly long chain of follow-up requests.
That's one source. The Dns APIs that support cancellation are public.
IMO this doesn't really change things. Regularly occurring cases of cancellation are most likely timeouts, and cancelling a significant percentage of DNS lookups by setting an overly short timeout doesn't look like a sane scenario to me.
Occasional stalled requests are more common IMO
What is the cause of the stalled requests you're seeing that makes this the case?
In #92054 we see the OS resolver (successfully) retrying a query after the default timeout of 5 seconds, and getaddrinfo is blocking for this 5 sec period. The retries are sporadic and do not come in bursts. They are caused by packet loss in my understanding, which shouldn't be very uncommon.
I don't have such confident information from the other cases, because this is typically happening in production systems in a sporadic manner. Currently I'm working with @matt-augustine checking whether a private build of the changes in this PR helps with their symptoms. Would that bring enough confidence for you that the change is net positive?
We see the issue in a production service with a containerized application running in a Kubernetes cluster on AKS. It usually occurs in new pods shortly after they are created. The cluster is using the default Kubernetes DNS configuration, which as I understand it means that DNS lookups are not cached in the container/pod, so every DNS query is routed to the CoreDNS service as the nameserver, which might be running on a different node.
Our theory is that UDP packet loss might be responsible for failed/retried DNS queries when new pods are created, but we have not confirmed that. We are currently testing a private build to see if it helps, and I'm going to try to make a simpler way to test it in a more repeatable way.
Currently #81023 is no longer blocking the customers who originally reported the issue. Closing the PR in favor of a broader investigation into potential DNS improvements for 9.0.
We had several customer reports of #81023 -- long running getaddrinfo lookups stalling Dns.*Async calls as a result of #49171. This is a simple mitigation of the problem by starting a new per-hostname Task queue in case a cancellation fires, and also fixing #92054 to make sure the CancellationToken flows down to GetHostAddressesAsync.
The downside of this mitigation is that it actually needs a cancellation (~ConnectTimeout) to work. IMO this is still better than not having a solution to the problem at the .NET level at all. We could implement a more sophisticated fix (see #92863), but I'm assuming that we are looking for a low-risk change to service .NET 8 and maybe earlier versions.
Fixes #92054
Fixes #92045
Mitigates #81023 (doesn't fully fix the issue)