Stalled getaddrinfo syscall can lead to massive HttpClient timeouts #81023
Comments
Tagging subscribers to this area: @dotnet/ncl
|
We're aware of this issue on .NET 6, and it can be mitigated by setting ConnectTimeout on the SocketsHttpHandler:
new SocketsHttpHandler
{
    ConnectTimeout = TimeSpan.FromSeconds(5)
}
Are you seeing similar behavior on .NET 7?
If this situation happens again, could you also collect the following events?
That would give us more information about what the connection pool was doing at the time. |
Seeing this happening on .NET 7.0 is very weird, I wouldn't be surprised if this is a different issue, especially if it's true that:
@sksk571 do you see the same scale of I would really like to see the additional traces collected on .NET 7.0, if setting |
The issue definitely happens on .NET 7, though maybe to a lesser extent than on .NET 6. The workaround of setting ConnectTimeout on SocketsHttpHandler doesn't work for us, as the affected code is in a netstandard2.0 library where SocketsHttpHandler is not available. I was able to collect the trace with the additional events. How can I share it privately? The trace is from a production system, so I'd rather not attach it to the issue. |
Can you send it to |
After exchanging information with @sksk571 it looks like there are sporadic stalled We can consider mitigating this by:
Whether we are open to implementing such mitigations depends on whether such DNS misbehavior is common. It would be good to understand what's causing this on your infra @sksk571. Is the DNS server unresponsive? Is there any timeout configured in |
Looks like a +1 on #19443 |
Triage: only one customer report so far, the root cause is environmental. Putting to "Future" for now, if we get more customers hitting this we can bump the priority. |
We are experiencing similar behavior in a high-traffic ASP.NET Core application on .NET 7.0.101. The application runs on Kubernetes, and the issue typically occurs when new pods are created and begin processing a high volume of requests. The application makes numerous outbound HTTP requests to various cloud services while processing each inbound request. The symptoms include:
|
@matt-augustine I'm experimenting with a fix now. Would be great if you could try a custom build for your case for validation. Will reach out on Teams. |
Moving to 9.0 as it hit 3 customers. They found workarounds, but it would be nice to avoid the problem for others in future. |
I believe we were severely affected by this issue, which resulted in massive stability issues that were very difficult to debug. We were able to work around it by using a custom DNS resolver and SocketsHttpHandler with ConnectCallback, but this seems to be more widespread than 3 customers, and it can cause massive damage for things like edge services that communicate via HTTP. |
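For readers wondering what such a workaround might look like, here is a minimal sketch. It is not the commenter's actual code, which was not shared. The class name `BoundedDnsHandler` and the `dnsTimeout` parameter are hypothetical. It assumes .NET 6 or later and uses `SocketsHttpHandler.ConnectCallback` to resolve the host explicitly, giving up on the lookup after a bounded wait.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Net.Sockets;
using System.Threading.Tasks;

// Hypothetical helper, not part of the issue: builds a handler whose
// connect path stops waiting for DNS after dnsTimeout.
public static class BoundedDnsHandler
{
    public static SocketsHttpHandler Create(TimeSpan dnsTimeout)
    {
        return new SocketsHttpHandler
        {
            ConnectCallback = async (context, cancellationToken) =>
            {
                // WaitAsync throws TimeoutException once dnsTimeout elapses.
                // It only stops waiting; a stalled lookup may keep running
                // in the background.
                IPAddress[] addresses = await Dns
                    .GetHostAddressesAsync(context.DnsEndPoint.Host, cancellationToken)
                    .WaitAsync(dnsTimeout, cancellationToken);

                var socket = new Socket(SocketType.Stream, ProtocolType.Tcp) { NoDelay = true };
                try
                {
                    await socket.ConnectAsync(addresses, context.DnsEndPoint.Port, cancellationToken);
                    // The stream owns the socket and disposes it when closed.
                    return new NetworkStream(socket, ownsSocket: true);
                }
                catch
                {
                    socket.Dispose();
                    throw;
                }
            }
        };
    }
}
```

A client could then be created with `new HttpClient(BoundedDnsHandler.Create(TimeSpan.FromSeconds(2)))`. The two-second value is only an illustration. Because the timeout abandons the wait rather than cancelling the underlying lookup, this limits the impact on requests but does not address why resolution stalls in the first place.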
Edit by @antonfirsov: This is caused by stalled DNS resolution attempts, see #81023 (comment).
Description
We have a small web application that started sporadically throwing TaskCanceledExceptions after migrating from .NET 5 to .NET 6 and later .NET 7. The application consists of just one controller method that retrieves an image from Amazon S3, resizes it, and saves the resized copy back to S3. The application works fine under .NET 5, but under .NET 6 and .NET 7 it sporadically throws thousands of TaskCanceledException errors from method HttpConnectionPool+HttpConnectionWaiter`1+
Configuration
Regression?
This is a regression in .NET6+ compared to .NET5.
Data
Dump files captured during one of the incidents contain thousands of async calls looking like
and
The events from a trace captured using System.Net.Http provider:
dotnet_20230119_154543_2847c867.9.excel.xlsx
Analysis
Tracing some of the failed requests to connections using ActivityID shows that the corresponding connection was established and closed, but not used for some reason.
This issue may be related.
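For anyone who wants to reproduce this kind of analysis without an external trace tool, the following sketch shows one possible in-process approach. The issue does not say this is how the trace above was captured. The class name `HttpEventListener` is hypothetical. It subscribes to the System.Net.Http event source and prints each event with its ActivityId so that events belonging to one request can be correlated.

```csharp
using System;
using System.Diagnostics.Tracing;

// Hypothetical listener, not taken from the issue: prints System.Net.Http
// events together with the ActivityId used to correlate them.
public sealed class HttpEventListener : EventListener
{
    protected override void OnEventSourceCreated(EventSource eventSource)
    {
        if (eventSource.Name == "System.Net.Http")
        {
            EnableEvents(eventSource, EventLevel.LogAlways);
        }
        else if (eventSource.Name == "System.Threading.Tasks.TplEventSource")
        {
            // Keyword 0x80 (TasksFlowActivityIds) makes the ActivityId
            // flow across async continuations.
            EnableEvents(eventSource, EventLevel.LogAlways, (EventKeywords)0x80);
        }
    }

    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        // TPL events are enabled only for activity tracking; skip printing them.
        if (eventData.EventSource.Name != "System.Net.Http")
        {
            return;
        }

        Console.WriteLine(
            $"{eventData.TimeStamp:O} {eventData.ActivityId} {eventData.EventName}");
    }
}
```

Creating the listener once at startup (`using var listener = new HttpEventListener();`) before any requests are sent is enough to start receiving events. Writing to the console is a simplification; a high-traffic service would more likely buffer or sample the output.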