[BUG] Connection reset by peer is not handled well by retry policy #21091

Closed
kasobol-msft opened this issue Apr 30, 2021 · 8 comments · Fixed by #21110
Labels:

  • Azure.Core azure-core
  • Client: This issue points to a problem in the data-plane of the library.
  • customer-reported: Issues that are reported by GitHub users external to the Azure organization.
  • pillar-reliability: The issue is related to reliability, one of our core engineering pillars (includes stress testing).

Comments

@kasobol-msft
Contributor

kasobol-msft commented Apr 30, 2021

We observe "Connection reset by peer" errors being propagated to callers in our tests, and customers have reported them as well.
We are looking for a holistic solution to this problem in the core.

References:

Customer reported: #21066

kasobol-msft added the Client and Azure.Core azure-core labels Apr 30, 2021
gapra-msft added the customer-reported label Apr 30, 2021
@somanshreddy

@kasobol-msft A similar issue was seen in #15215.

CC: @rickle-msft @alzimmermsft

@alzimmermsft
Member

Thanks for submitting this issue @kasobol-msft.

Just finished a rough investigation on this. It appears that when the expected response body of a request is void/Void, the networking layer doesn't eagerly read the response body (as none is expected). Later in the call stack, when the response headers and body (there is no body in this case) are converted into the returned response, there is an attempt to consume/fast-forward the response body. This is where the issue lies: if the remote has closed, or closes, the connection in the meantime, that read throws an exception. Unfortunately, by this point the response has already egressed past the retry policy, so the exception is thrown to the caller instead of being captured and retried.
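
To make the ordering problem concrete, here is a minimal Java sketch. It is not azure-core code: `RetryScopeSketch`, `LazyResponse`, `withRetry`, and `transport` are hypothetical names invented for illustration, and the "connection reset" is simulated. It only shows why an exception raised while draining the body after the retry loop has returned cannot be retried, while the same drain performed inside the retried operation can.

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch; none of these types are azure-core APIs. */
final class RetryScopeSketch {

    /** Stand-in for a response whose body has not been read yet. */
    interface LazyResponse {
        int statusCode();
        void drainBody() throws IOException; // may fail if the peer reset the connection
    }

    /** Retries the operation on IOException, for at most maxAttempts attempts in total. */
    static <T> T withRetry(int maxAttempts, Callable<T> operation) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.call();
            } catch (IOException e) {
                last = e; // treated as transient: try again
            }
        }
        throw last;
    }

    /** Simulated transport: the first failingDrains responses fail when their body is drained. */
    static Callable<LazyResponse> transport(AtomicInteger sends, int failingDrains) {
        return () -> {
            int n = sends.incrementAndGet();
            return new LazyResponse() {
                @Override
                public int statusCode() {
                    return 201;
                }

                @Override
                public void drainBody() throws IOException {
                    if (n <= failingDrains) {
                        throw new IOException("Connection reset by peer");
                    }
                }
            };
        };
    }

    /** Problematic ordering: the body is drained after the retry loop has already returned. */
    static int drainOutsideRetry(Callable<LazyResponse> send) throws Exception {
        LazyResponse response = withRetry(3, send);
        response.drainBody(); // an IOException here goes straight to the caller
        return response.statusCode();
    }

    /** Alternative ordering: the drain is part of the retried operation. */
    static int drainInsideRetry(Callable<LazyResponse> send) throws Exception {
        Integer status = withRetry(3, () -> {
            LazyResponse response = send.call();
            response.drainBody(); // an IOException here triggers another attempt
            return response.statusCode();
        });
        return status;
    }

    public static void main(String[] args) throws Exception {
        AtomicInteger sends = new AtomicInteger();
        try {
            drainOutsideRetry(transport(sends, 1));
        } catch (IOException e) {
            System.out.println("outside retry: '" + e.getMessage()
                + "' reached the caller after " + sends.get() + " send(s)");
        }
        sends.set(0);
        int status = drainInsideRetry(transport(sends, 1));
        System.out.println("inside retry: status " + status
            + " after " + sends.get() + " send(s)");
    }
}
```

With one simulated reset, the first variant surfaces the exception after a single send, while the second sends again and succeeds. Whether the actual fix moves the drain inside the retry scope or handles it some other way is not something this sketch claims.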

@kasobol-msft
Contributor Author

There seem to be a few missing cases that we should cover:

  • buffered upload
  • download
  • pageable operations

https://dev.azure.com/azure-sdk/internal/_build/results?buildId=900469&view=ms.vss-test-web.build-test-results-tab&runId=19372912&resultId=103744&paneView=debug

alzimmermsft added the pillar-reliability label Jun 18, 2021
alzimmermsft self-assigned this Jul 21, 2021
@alzimmermsft
Member

@kasobol-msft, has this happened recently, or has this issue been fixed by #22647?

@kasobol-msft
Contributor Author

kasobol-msft commented Aug 12, 2021

This has happened recently, most likely because we exhausted the retry count (though we're not 100% sure). We should follow up with an aggregated exception and logs from the retry policy.
The frequency of this has dropped a lot, though.
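
As a rough illustration of what an "aggregated exception and logs" could look like, here is a minimal Java sketch. It does not reflect the azure-core retry policy: `AggregatingRetrySketch`, `IoCall`, and `withRetry` are hypothetical names, and logging to `System.err` is a placeholder for a real logger. The only library feature it relies on is the standard `Throwable.addSuppressed`, used to attach every earlier failure to the exception that is finally thrown.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not the azure-core retry policy. */
final class AggregatingRetrySketch {

    /** An operation that may fail with a (presumed transient) IOException. */
    interface IoCall<T> {
        T call() throws IOException;
    }

    /**
     * Runs the operation for at most maxAttempts attempts. If every attempt fails, throws one
     * IOException whose cause is the last failure and whose suppressed list holds the earlier ones.
     */
    static <T> T withRetry(int maxAttempts, IoCall<T> operation) throws IOException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        List<IOException> failures = new ArrayList<>();
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.call();
            } catch (IOException e) {
                failures.add(e);
                // Placeholder logging so each failed attempt is visible.
                System.err.println("attempt " + attempt + "/" + maxAttempts + " failed: " + e.getMessage());
            }
        }
        IOException last = failures.get(failures.size() - 1);
        IOException aggregate =
            new IOException("Retries exhausted after " + maxAttempts + " attempts", last);
        for (int i = 0; i < failures.size() - 1; i++) {
            aggregate.addSuppressed(failures.get(i));
        }
        throw aggregate;
    }
}
```

With something along these lines, a stack trace from the caller would show whether the retry count was actually exhausted and what each attempt failed with.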

@alzimmermsft
Member

Closing, as this hasn't been reported for a while. If this is seen again, a new issue should be opened.

azure-sdk pushed a commit to azure-sdk/azure-sdk-for-java that referenced this issue Oct 19, 2022
S360 swagger correctness - added missing property totalCount (Azure#21091)

Co-authored-by: Umang Shah <umangshah@microsoft.com>
github-actions bot locked and limited conversation to collaborators Apr 12, 2023