Rework request retry logic to be based on retry count limit #48758
Conversation
Tagging subscribers to this area: @dotnet/ncl
/azp run runtime-libraries-coreclr outerloop
Azure Pipelines successfully started running 1 pipeline(s).
WinHttp may also need a project file change.
Hmm, and this one is interesting.
/azp run runtime-libraries-coreclr outerloop
Azure Pipelines successfully started running 1 pipeline(s).
There's something wacky going on deep in the sockets code that is causing an assert in SocketAsyncContext. I suspect this is a pre-existing issue that has been exposed by the combination of #47648 and this PR. Something involving sync socket IO and timeouts. cc @wfurt @antonfirsov @scalablecory @stephentoub... any ideas?
/azp run runtime-libraries-coreclr outerloop
Azure Pipelines successfully started running 1 pipeline(s).
/azp run runtime-libraries-coreclr outerloop
Azure Pipelines successfully started running 1 pipeline(s).
/azp run runtime-libraries-coreclr outerloop
Azure Pipelines successfully started running 1 pipeline(s).
This is now ready for review. The socket assert got fixed in #50788. I also reworked a questionable WebSocket test that was failing due to a change in timing.
@@ -569,10 +569,13 @@ public async Task<HttpResponseMessage> SendAsyncCore(HttpRequestMessage request,
                _readOffset = 0;
                _readLength = bytesRead;
            }
            else
            {
                await FillAsync(async).ConfigureAwait(false);
To make sure I understand: this was already done later as part of ReadNextResponseHeaderLineAsync, but we're preemptively doing it here so that we can definitively say after this that data was received?
Yes.
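For readers following the exchange above, here is a minimal, non-authoritative sketch of the pattern being discussed: perform one read immediately after the request has been written, so that from that point on we can definitively say whether any response bytes were received. The class and method names here are invented for illustration; only the idea of the preemptive fill (and the `_readLength` field) mirrors the diff, and this is not the actual HttpConnection code.

```csharp
using System.IO;
using System.Threading.Tasks;

// Simplified sketch, not the real HttpConnection: zero bytes from the initial
// read means EOF before any response data, which is the retryable case; any
// bytes received mean the request is no longer safe to retry.
internal sealed class InitialReadSketch
{
    private readonly Stream _stream;
    private readonly byte[] _readBuffer = new byte[4096];
    private int _readLength;

    public InitialReadSketch(Stream stream) => _stream = stream;

    public async Task<bool> TryReceiveInitialResponseBytesAsync()
    {
        // Preemptive fill, analogous to the FillAsync call in the diff above.
        _readLength = await _stream.ReadAsync(_readBuffer, 0, _readBuffer.Length).ConfigureAwait(false);
        return _readLength > 0;
    }
}
```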
src/libraries/System.Net.Http/src/System/Net/Http/SocketsHttpHandler/HttpConnectionPool.cs
src/libraries/System.Net.WebSockets.Client/tests/ClientWebSocketOptionsTests.cs
Force-pushed from 233d69f to f6381cb
…nnectionFailureRetries) instead of isNewConnection logic
…retries, not including initial attempt
…t against timing issues
…erver test more robust against timing issues
… EOF from the server
Force-pushed from 83f9e88 to 608a96f
/azp run runtime-libraries-coreclr outerloop
Azure Pipelines successfully started running 1 pipeline(s).
/azp run runtime-libraries-coreclr outerloop
Azure Pipelines successfully started running 1 pipeline(s).
/azp run runtime-libraries-coreclr outerloop
Azure Pipelines successfully started running 1 pipeline(s).
We currently disallow request retries on connection failure when the failure occurs on a "new" connection (one that hasn't been used for previous requests). We do this mainly to ensure that the retry logic doesn't end up in an infinite loop -- as connections fail, sooner or later the request will be retried on a "new" connection, and we will break out of the retry loop.
This logic is suboptimal for a couple reasons:
(1) There's nothing particularly unique about the first request on a connection. Servers can die or choose to close connections at any time.
(2) We currently do a bad job of deciding which request is the first request for an HTTP2 connection. It is timing dependent. This means that in certain scenarios, when the server sends valid REFUSED_STREAM errors, we end up not retrying requests that should be retried.
(3) We can in theory end up retrying a request many, many times until we actually use a new connection and break out of the retry loop. This is particularly problematic for HTTP2, for several reasons: (a) a single connection can handle many requests, yet connection failure only causes one to stop retrying; (b) we treat REFUSED_STREAM errors as retryable in all cases except for the initial request, even though the connection is still valid, which means that we may never actually choose a new connection for the request; (c) we handle GOAWAY to determine which requests are allowed to be retried, but these same requests could just end up being subject to REFUSED_STREAM or GOAWAY on a different connection, etc.
This PR changes the request retry logic to be based on a fixed retry count limit. The limit is 5 retries; we could adjust this as appropriate or make it configurable if desired.
Note that failure to successfully establish a connection will still cause a request to fail immediately. Requests are only retryable when an established connection causes a failure.
Also:
Fixes #44669
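For illustration, below is a minimal sketch of a send loop bounded by a fixed retry count, in the spirit of the change described above. This is not the actual HttpConnectionPool code: the `RetryingSender` class, its delegates, and the `IsRetryableFailure` classification are assumptions made only to show the shape of the change; the limit of 5 retries is the one stated in the description.

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch only; names and structure do not come from the real code.
internal sealed class RetryingSender
{
    // Fixed limit from the PR description (5 retries, not counting the initial attempt).
    private const int MaxConnectionFailureRetries = 5;

    private readonly Func<HttpRequestMessage, CancellationToken, Task<HttpResponseMessage>> _sendOnConnection;
    private readonly Func<Exception, bool> _isRetryableFailure;

    public RetryingSender(
        Func<HttpRequestMessage, CancellationToken, Task<HttpResponseMessage>> sendOnConnection,
        Func<Exception, bool> isRetryableFailure)
    {
        _sendOnConnection = sendOnConnection;
        _isRetryableFailure = isRetryableFailure;
    }

    public async Task<HttpResponseMessage> SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
    {
        int retryCount = 0;
        while (true)
        {
            try
            {
                // Connection-establishment failures are expected to surface immediately;
                // only failures on an already-established connection are classified as retryable.
                return await _sendOnConnection(request, cancellationToken).ConfigureAwait(false);
            }
            catch (Exception e) when (_isRetryableFailure(e) && retryCount < MaxConnectionFailureRetries)
            {
                // Bounded retry count instead of the old "is this a new connection?" heuristic.
                retryCount++;
            }
        }
    }
}
```

A fixed count bounds the worst-case work per request no matter how many connections fail, which is the property the old "new connection" heuristic was only approximating.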
UPDATE:
From exploring certain test failures that were caused by this PR, it's clear to me that we are too lenient about allowing retries in many cases. For example, we are retrying on IO timeouts in one of the HttpWebRequest tests, even though the user is explicitly setting this timeout and presumably wants to fail (not retry) when the timeout is exceeded.
I believe many of these weird cases of lenient retry policy were masked by the way the old retry logic worked, which was that it never allowed retry on the first request on a connection. This means most unit tests never induced retry, even in failure cases that would have caused retry if the request were not the first on the connection -- which is extremely common in practice, but not common in our tests, unfortunately.
So it's actually good that this new retry policy has exposed these issues -- they already existed but were largely hidden.
To address these issues, I am changing the retry logic to be more conservative. We will no longer retry on arbitrary IO errors. We will only retry in cases where we believe the server is attempting to gracefully tear down a connection -- that is, receiving EOF before any other response data for HTTP/1.1, or receiving GOAWAY for HTTP2.
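Roughly, the narrower classification can be summarized as follows. This is a hedged sketch with invented names (`FailureKind`, `RetryPolicy`), not the actual SocketsHttpHandler types; it only encodes the rule stated above: retry on EOF before any response data (HTTP/1.1) or GOAWAY (HTTP/2), and fail immediately otherwise.

```csharp
// Invented names for illustration only.
internal enum FailureKind
{
    ConnectionEstablishmentFailed, // could not establish a connection: fail immediately
    EofBeforeResponseData,         // HTTP/1.1: server closed the connection before any response bytes
    Http2GoAway,                   // HTTP/2: server sent GOAWAY (graceful teardown) affecting this request
    OtherIoError                   // arbitrary IO errors and user-configured timeouts: fail, do not retry
}

internal static class RetryPolicy
{
    public static bool IsRetryable(FailureKind kind) =>
        kind == FailureKind.EofBeforeResponseData || kind == FailureKind.Http2GoAway;
}
```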
Please take a look and comment. @stephentoub @scalablecory