Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stalled getaddrinfo syscall can lead to massive HttpClient timeouts #81023

Open
sksk571 opened this issue Jan 23, 2023 · 12 comments
Open

Stalled getaddrinfo syscall can lead to massive HttpClient timeouts #81023

sksk571 opened this issue Jan 23, 2023 · 12 comments
Labels
Milestone

Comments

@sksk571
Copy link

sksk571 commented Jan 23, 2023

Edit by @antonfirsov: This is caused by stalled DNS resolution attempts, see #81023 (comment).

Description

We have a small web application that started sporadically throwing TaskCancelledException's after migration from .NET5 to .NET6 and later .NET7. The application consists of just one controller method that retrieves image from Amazon S3, resizes it and saves a resized copy back to S3. The application works fine under .NET5, but under .NET6 and .NET7 it sporadically throws thousands of TaskCancelledException errors from method HttpConnectionPool+HttpConnectionWaiter`1+

Configuration

  • Version: .NET6+
  • OS version: Alpine Linux 3.16
  • Architecture: x64

Regression?

This is a regression in .NET6+ compared to .NET5.

Data

Dump files captured during one of the incidents contain thousands of async calls looking like

.00007f0f9ec03878 (0) System.Net.Http.HttpConnectionPool+HttpConnectionWaiter`1+<WaitForConnectionAsync>d__5[[System.__Canon, System.Private.CoreLib]]
..00007f0f9ec03918 (4) System.Net.Http.HttpConnectionPool+<SendWithVersionDetectionAndRetryAsync>d__84
...00007f0f9ec03a38 (0) System.Net.Http.DiagnosticsHandler+<SendAsyncCore>d__10
....00007f0f9ec03b00 (0) System.Net.Http.HttpClient+<<SendAsync>g__Core|83_0>d
.....00007f0f9ec03bc0 (0) Amazon.Runtime.HttpWebRequestMessage+<GetResponseAsync>d__20

and

.00007f0f9ec1c840 (1) System.Net.Http.HttpConnectionPool+<ConnectToTcpHostAsync>d__99
..00007f0f9ec1c910 (0) System.Net.Http.HttpConnectionPool+<ConnectAsync>d__98
...00007f0f9ec1c9d8 (0) System.Net.Http.HttpConnectionPool+<CreateHttp11ConnectionAsync>d__100
....00007f0f9ec1ca98 (0) System.Net.Http.HttpConnectionPool+<AddHttp11ConnectionAsync>d__74

The events from a trace captured using System.Net.Http provider:
dotnet_20230119_154543_2847c867.9.excel.xlsx

Analysis

Tracing some of the failed requests to connections using ActivityID shows that the corresponding connection was established and closed, but not used for some reason.
This issue may be related.

@sksk571 sksk571 added the tenet-performance Performance related issue label Jan 23, 2023
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Jan 23, 2023
@ghost
Copy link

ghost commented Jan 23, 2023

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.

Issue Details

Description

We have a small web application that started sporadically throwing TaskCancelledException's after migration from .NET5 to .NET6 and later .NET7. The application consists of just one controller method that retrieves image from Amazon S3, resizes it and saves a resized copy back to S3. The application works fine under .NET5, but under .NET6 and .NET7 it sporadically throws thousands of TaskCancelledException errors from method HttpConnectionPool+HttpConnectionWaiter`1+

Configuration

  • Version: .NET6+
  • OS version: Alpine Linux 3.16
  • Architecture: x64

Regression?

This is a regression in .NET6+ compared to .NET5.

Data

Dump files captured during one of the incidents contain thousands of async calls looking like

.00007f0f9ec03878 (0) System.Net.Http.HttpConnectionPool+HttpConnectionWaiter`1+<WaitForConnectionAsync>d__5[[System.__Canon, System.Private.CoreLib]]
..00007f0f9ec03918 (4) System.Net.Http.HttpConnectionPool+<SendWithVersionDetectionAndRetryAsync>d__84
...00007f0f9ec03a38 (0) System.Net.Http.DiagnosticsHandler+<SendAsyncCore>d__10
....00007f0f9ec03b00 (0) System.Net.Http.HttpClient+<<SendAsync>g__Core|83_0>d
.....00007f0f9ec03bc0 (0) Amazon.Runtime.HttpWebRequestMessage+<GetResponseAsync>d__20

and

.00007f0f9ec1c840 (1) System.Net.Http.HttpConnectionPool+<ConnectToTcpHostAsync>d__99
..00007f0f9ec1c910 (0) System.Net.Http.HttpConnectionPool+<ConnectAsync>d__98
...00007f0f9ec1c9d8 (0) System.Net.Http.HttpConnectionPool+<CreateHttp11ConnectionAsync>d__100
....00007f0f9ec1ca98 (0) System.Net.Http.HttpConnectionPool+<AddHttp11ConnectionAsync>d__74

The events from a trace captured using System.Net.Http provider:
dotnet_20230119_154543_2847c867.9.excel.xlsx

Analysis

Tracing some of the failed requests to connections using ActivityID shows that the corresponding connection was established and closed, but not used for some reason.
This issue may be related.

Author: sksk571
Assignees: -
Labels:

area-System.Net.Http, tenet-performance

Milestone: -

@MihaZupan
Copy link
Member

after migration from .NET5 to .NET6 and later .NET7

We're aware of this issue on .NET 6, and it can be mitigated by setting the ConnectTimeout (it's infinite by default).

new SocketsHttpHandler
{
    ConnectTimeout = TimeSpan.FromSeconds(5)
}

and later .NET7

Are you seeing similar behavior on .NET 7?
7.0 has a mitigation in place that should prevent this issue from affecting many requests - #71785.

The events from a trace captured using System.Net.Http provider

If this situation happens again, could you also collect the following events?

System.Net.Sockets
System.Net.Security
Private.InternalDiagnostics.System.Net.Http

That would give us more information about what the connection pool was doing at the time.

@antonfirsov
Copy link
Member

Seeing this happening on .NET 7.0 is very weird, I wouldn't be surprised if this is a different issue, especially if it's true that:

Tracing some of the failed requests to connections using ActivityID shows that the corresponding connection was established and closed, but not used for some reason.

@sksk571 do you see the same scale of TaskCancelledException-s and closed connections on .NET 7.0 as on .NET 6.0?

I would really like to see the additional traces collected on .NET 7.0, if setting ConnectTimeout (on .NET 6.0!) doesn't help.

@sksk571
Copy link
Author

sksk571 commented Jan 24, 2023

The issue definitely happens on .NET7, though maybe to a lesser extent than on .NET6. The workaround with setting ConnectTimeout in SocketsHttpHandler doesn't work for us, as the affected code is in a netstandard2.0 library where SocketsHttpHandler is not available.

I was able to collect the trace with the additional events, how can I share it privately? The trace is from production system so I'd avoid attaching it to the issue.

@MihaZupan
Copy link
Member

Can you send it to Miha.Zupan and Anton.Firszov @microsoft.com?

@antonfirsov
Copy link
Member

antonfirsov commented Jan 26, 2023

After exchanging information with @sksk571 it looks like there are sporadic stalled getaddrinfo calls on their server. As a result of #49171, a single stalled call will hang any follow-up DNS resolution request which go to the same host. This leads to a massive accumulation of pending connection attempts in HttpConnectionPool. These will only complete (fail) when the first stalled DNS resolution attempt completes. By this time the originating requests are timed out and end up throwing.

We can consider mitigating this by:

Whether we are open to implement such mitigations depends whether such DNS misbehavior is common. It would be good to understand what's causing this on your infra @sksk571. Is the DNS server unresponsive? Is there any timeout configured in resolv.conf? Does it help if you lower it?

@antonfirsov antonfirsov changed the title TaskCancelledException in HttpConnectionPool Stalled getaddrinfo syscall can lead to massive HttpClient timeouts Jan 26, 2023
@MihaZupan
Copy link
Member

Looks like a +1 on #19443

@antonfirsov
Copy link
Member

Triage: only one customer report so far, the root cause is environmental. Putting to "Future" for now, if we get more customers hitting this we can bump the priority.

@matt-augustine
Copy link

We are experiencing similar behavior in a high-traffic ASP.NET Core application on .NET 7.0.101. The application runs on Kubernetes, and the issue typically occurs when new pods are created and begin processing a high volume of requests. The application makes numerous outbound HTTP requests to various cloud services while processing each inbound request.

The symptoms include:

  • http11-requests-queue-duration event counter values reaching 40 seconds or more.
  • current-dns-lookups values of up to 2,000.
  • dns-lookups-duration values of 40 seconds or more.
  • current-outgoing-connect-attempts values of 100 or more.
  • threadpool-queue-length values of 100 or more.
  • Many logged exceptions similar to the one in the original report:
System.Threading.Tasks.TaskCanceledException: The operation was canceled.
 ---> System.TimeoutException: A connection could not be established within the configured ConnectTimeout.
   --- End of inner exception stack trace ---
   at System.Net.Http.HttpConnectionPool.CreateConnectTimeoutException(OperationCanceledException oce)
   at System.Net.Http.HttpConnectionPool.AddHttp11ConnectionAsync(QueueItem queueItem)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.ExecutionContextCallback(Object s)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext(Thread threadPoolThread)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext()
   at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(IAsyncStateMachineBox box, Boolean allowInlining)
   at System.Threading.Tasks.Task.RunContinuations(Object continuationObject)
   at System.Threading.Tasks.Task.TrySetCanceled(CancellationToken tokenToRecord, Object cancellationException)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.SetException(Exception exception, Task`1& taskField)
   at System.Net.Http.HttpConnectionPool.CreateHttp11ConnectionAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.ExecutionContextCallback(Object s)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext(Thread threadPoolThread)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext()
   at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(IAsyncStateMachineBox box, Boolean allowInlining)
   at System.Threading.Tasks.Task.RunContinuations(Object continuationObject)
   at System.Threading.Tasks.Task.TrySetCanceled(CancellationToken tokenToRecord, Object cancellationException)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.SetException(Exception exception, Task`1& taskField)
   at System.Runtime.CompilerServices.AsyncValueTaskMethodBuilder`1.SetException(Exception exception)
   at System.Net.Http.HttpConnectionPool.ConnectAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.ExecutionContextCallback(Object s)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext(Thread threadPoolThread)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext()
   at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(IAsyncStateMachineBox box, Boolean allowInlining)
   at System.Threading.Tasks.Task.RunContinuations(Object continuationObject)
   at System.Threading.Tasks.Task.TrySetCanceled(CancellationToken tokenToRecord, Object cancellationException)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.SetException(Exception exception, Task`1& taskField)
   at System.Net.Http.HttpConnectionPool.ConnectToTcpHostAsync(String host, Int32 port, HttpRequestMessage initialRequest, Boolean async, CancellationToken cancellationToken)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.ExecutionContextCallback(Object s)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext(Thread threadPoolThread)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext()
   at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(IAsyncStateMachineBox box, Boolean allowInlining)
   at System.Threading.Tasks.Task.RunContinuations(Object continuationObject)
   at System.Threading.Tasks.Task.TrySetCanceled(CancellationToken tokenToRecord, Object cancellationException)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.SetException(Exception exception, Task`1& taskField)
   at System.Net.Sockets.Socket.<ConnectAsync>g__WaitForConnectWithCancellation|281_0(AwaitableSocketAsyncEventArgs saea, ValueTask connectTask, CancellationToken cancellationToken)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.ExecutionContextCallback(Object s)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext(Thread threadPoolThread)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext()
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.InvokeContinuation(Action`1 continuation, Object state, Boolean forceAsync, Boolean requiresExecutionContextFlow)
   at System.Net.Sockets.SocketAsyncEventArgs.<DnsConnectAsync>g__Core|112_0(MultiConnectSocketAsyncEventArgs internalArgs, Task`1 addressesTask, Int32 port, SocketType socketType, ProtocolType protocolType, CancellationToken cancellationToken)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.ExecutionContextCallback(Object s)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext(Thread threadPoolThread)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext()
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Threading.ThreadPoolWorkQueue.Dispatch()
   at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
--- End of stack trace from previous location ---
   at System.Threading.Tasks.TaskCompletionSourceWithCancellation`1.WaitWithCancellationAsync(CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.HttpConnectionWaiter`1.WaitForConnectionAsync(Boolean async, CancellationToken requestCancellationToken)
   at System.Net.Http.HttpConnectionPool.SendWithVersionDetectionAndRetryAsync(HttpRequestMessage request, Boolean async, Boolean doRequestAuth, CancellationToken cancellationToken)
   at System.Net.Http.DiagnosticsHandler.SendAsyncCore(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpClient.<SendAsync>g__Core|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken)

@antonfirsov
Copy link
Member

antonfirsov commented Sep 21, 2023

@matt-augustine I'm experimenting with a fix now. Would be great if you could try a custom build for your case for validation. Will reach out on Teams.

@antonfirsov antonfirsov self-assigned this Oct 1, 2023
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Oct 1, 2023
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Nov 23, 2023
@karelz karelz modified the milestones: Future, 9.0.0 Jan 9, 2024
@karelz
Copy link
Member

karelz commented Jan 9, 2024

Moving to 9.0 as it hit 3 customers. They found workarounds, but it would be nice to avoid the problem for others in future.

@antonfirsov antonfirsov removed their assignment Jun 5, 2024
@MihaZupan MihaZupan removed this from the 9.0.0 milestone Jun 5, 2024
@DejanPelzel
Copy link

DejanPelzel commented Jun 19, 2024

I believe we were severely affected by this issue, which resulted in massive stability issues that were very difficult to debug. We were able to work around by using a custom DNS resolver and SocketsHttpHandler with ConnectCallback, but it seems this is wider spread than 3 customers and can cause massive damage for things like edge services that communicate via HTTP.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants