Random MultipartUpload -> RequestError -> "use of closed network connection" when uploading a lot of data to S3 #3406
Comments
Hi @segevfiner, thanks for reaching out to us about this. It sounds like the SDK is attempting to re-use a connection that has been closed by S3. This should be solvable by implementing a custom HTTP client in your session's config with a
This issue has not received a response in 1 week. If you want to keep this issue open, please just leave a comment below and auto-close will be canceled.
(Commenting to stop auto close)
@diehlaws I followed your recommendation for setting the keep-alive interval value to less than 30s, but I still see the Here's the relevant snippet. Note that the one difference is that I also set the

// The values specified here are the default values defined in the net/http
// package's DefaultTransport instance, except where noted.
httpClient, err := NewHTTPClientWithSettings(HTTPClientSettings{
	// The HTTP Client total timeout value for the request.
	ClientTimeout: 30 * time.Second,
	Connect:       30 * time.Second,
	// Set the keep-alive interval to less than 30s.
	ConnKeepAlive:    10 * time.Second,
	ExpectContinue:   1 * time.Second,
	IdleConn:         30 * time.Second,
	MaxAllIdleConns:  100,
	MaxHostIdleConns: 2,
	ResponseHeader:   5 * time.Second,
	TLSHandshake:     10 * time.Second,
})
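For readers unsure which setting controls what, here is a minimal standard-library sketch of how such settings could map onto `net/http`. The names `clientSettings`, `newTransport`, and `newHTTPClient` are hypothetical, and mapping `ClientTimeout` to `http.Client.Timeout` is an assumption; this is not the helper used in the comment above.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

// clientSettings mirrors the field names from the snippet above.
// The type itself is hypothetical and exists only for this sketch.
type clientSettings struct {
	ClientTimeout    time.Duration
	Connect          time.Duration
	ConnKeepAlive    time.Duration
	ExpectContinue   time.Duration
	IdleConn         time.Duration
	MaxAllIdleConns  int
	MaxHostIdleConns int
	ResponseHeader   time.Duration
	TLSHandshake     time.Duration
}

// newTransport builds an http.Transport from the settings.
func newTransport(s clientSettings) *http.Transport {
	return &http.Transport{
		Proxy: http.ProxyFromEnvironment,
		DialContext: (&net.Dialer{
			Timeout:   s.Connect,
			KeepAlive: s.ConnKeepAlive, // TCP keep-alive probe interval
		}).DialContext,
		MaxIdleConns:          s.MaxAllIdleConns,
		MaxIdleConnsPerHost:   s.MaxHostIdleConns,
		IdleConnTimeout:       s.IdleConn, // how long an idle pooled connection is kept
		TLSHandshakeTimeout:   s.TLSHandshake,
		ResponseHeaderTimeout: s.ResponseHeader,
		ExpectContinueTimeout: s.ExpectContinue,
	}
}

// newHTTPClient wraps the transport; Timeout bounds the whole request
// (assumed here to be what ClientTimeout means).
func newHTTPClient(s clientSettings) *http.Client {
	return &http.Client{Transport: newTransport(s), Timeout: s.ClientTimeout}
}

func main() {
	c := newHTTPClient(clientSettings{
		ClientTimeout:    30 * time.Second,
		ConnKeepAlive:    10 * time.Second,
		IdleConn:         30 * time.Second,
		MaxHostIdleConns: 2,
	})
	fmt.Println(c.Timeout, c.Transport.(*http.Transport).IdleConnTimeout)
}
```

Under this mapping, `ConnKeepAlive` only sets the TCP keep-alive probe interval on the dialer, while `IdleConn` decides how long an idle connection stays in the pool for reuse.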
I've been looking at this as I've experienced the same behaviour. It looks to me like:
The original error is defined here. Perhaps
Scratch that. Adding the following, passing test case to
{
	Err:       awserr.New(ErrCodeRequestError, "send request failed", errors.New("use of closed network connection")),
	Retryable: true,
},
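As an illustration of the kind of check such a test case exercises, here is a minimal standard-library sketch that classifies an error as a closed-connection failure. The helper names are hypothetical and this is not the SDK's retry logic. It assumes Go 1.16 or later, where `net.ErrClosed` is exported, and falls back to matching the message text for errors that were flattened into strings.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"strings"
)

// closedConnMsg is the message text seen in this thread.
const closedConnMsg = "use of closed network connection"

// isClosedConnMessage reports whether msg contains the closed-connection text.
func isClosedConnMessage(msg string) bool {
	return strings.Contains(msg, closedConnMsg)
}

// isClosedConnError reports whether err looks like an operation on an
// already closed connection. It prefers errors.Is and only falls back to
// string matching when the error chain does not wrap net.ErrClosed.
func isClosedConnError(err error) bool {
	if err == nil {
		return false
	}
	if errors.Is(err, net.ErrClosed) { // requires Go 1.16+
		return true
	}
	return isClosedConnMessage(err.Error())
}

func main() {
	wrapped := fmt.Errorf("send request failed: %w", net.ErrClosed)
	flattened := errors.New("write tcp: use of closed network connection")
	other := errors.New("connection refused")
	fmt.Println(isClosedConnError(wrapped), isClosedConnError(flattened), isClosedConnError(other))
}
```

String matching is fragile because the text can differ by platform or Go version, so the `errors.Is` path is preferable wherever the original error is still wrapped.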
We see this error multiple times per day uploading Postgres backups to S3 with many TBs of data. We have a failure rate of about 5% of our backups. The error manifests through wal-g, a Go library that uploads a Postgres backup to S3. Here's the wal-g stack trace: The AWS SDK used in our version is v1.26.1.
When uploading large amounts of data to S3, we occasionally see failures where the AWS sdk tries to use a closed network connection. The upstream bug appears to be aws/aws-sdk-go#3406. I'm not sure why the error manifests but it's causing us significant pain. Rather than retry the entire base backup, we'll retry the WAL segment upload.

```
ERROR: 2020/08/06 08:39:52.198782 failed to upload 'basebackups_005/base_0000000100006F04000000E4/tar_partitions/part_19148.tar.br' to bucket 'S3_BUCKET': MultipartUpload: upload multipart failed caused by: RequestError: send request failed caused by: Put https://S3_BUCKET/basebackups_005/base_0000000100006F04000000E4/tar_partitions/part_19148.tar.br?partNumber=2: write tcp 10.64.18.161:42118->52.216.134.19:443: use of closed network connection
ERROR: 2020/08/06 08:39:52.198805 upload: could not upload 'base_0000000100006F04000000E4/tar_partitions/part_19148.tar.br'
ERROR: 2020/08/06 08:39:52.198818 failed to upload 'basebackups_005/base_0000000100006F04000000E4/tar_partitions/part_19148.tar.br' to bucket 'S3_BUCKET': MultipartUpload: upload multipart failed caused by: RequestError: send request failed caused by: Put https://S3_BUCKET/basebackups_005/base_0000000100006F04000000E4/tar_partitions/part_19148.tar.br?partNumber=2 write tcp 10.64.18.161:42118->52.216.134.19:443: use of closed network connection
ERROR: 2020/08/06 08:39:52.198833 Unable to complete uploads
```
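The idea of retrying only the failed upload, rather than the whole backup, can be sketched with the standard library as follows. `retryUpload`, `errClosedConn`, and the doubling backoff are assumptions made for illustration; this is not wal-g's actual implementation.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

// errClosedConn is a sample error used by the demo below.
var errClosedConn = errors.New("write tcp: use of closed network connection")

// retryUpload calls upload up to attempts times. It retries only when the
// error text indicates a closed connection, doubling the delay each time,
// and returns any other error immediately.
func retryUpload(attempts int, delay time.Duration, upload func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 0; i < attempts; i++ {
		if err = upload(); err == nil {
			return nil
		}
		if !strings.Contains(err.Error(), "use of closed network connection") {
			return err // not a known transient failure
		}
		if i < attempts-1 {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	err := retryUpload(3, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errClosedConn // simulate two transient failures
		}
		return nil
	})
	fmt.Println(calls, err)
}
```

The upload function must be safe to call again, which means the request body has to be re-readable from the start on each attempt.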
Release v1.34.3 (2020-08-12)
===

### Service Client Updates
* `service/cloud9`: Updates service API and documentation
  * Add ConnectionType input parameter to CreateEnvironmentEC2 endpoint. New parameter enables creation of environments with SSM connection.
* `service/comprehend`: Updates service documentation
* `service/ec2`: Updates service API and documentation
  * Introduces support for IPv6-in-IPv4 IPsec tunnels. A user can now send traffic from their on-premise IPv6 network to AWS VPCs that have IPv6 support enabled.
* `service/fsx`: Updates service API and documentation
* `service/iot`: Updates service API, documentation, and paginators
  * Audit finding suppressions: Device Defender enables customers to turn off non-compliant findings for specific resources on a per check basis.
* `service/lambda`: Updates service API and examples
  * Support for creating Lambda Functions using 'java8.al2' and 'provided.al2'
* `service/transfer`: Updates service API, documentation, and paginators
  * Adds security policies to control cryptographic algorithms advertised by your server, additional characters in usernames and length increase, and FIPS compliant endpoints in the US and Canada regions.
* `service/workspaces`: Updates service API and documentation
  * Adds optional EnableWorkDocs property to WorkspaceCreationProperties in the ModifyWorkspaceCreationProperties API

### SDK Enhancements
* `codegen`: Add XXX_Values functions for getting slice of API enums by type.
  * Fixes [#3441](#3441) by adding a new XXX_Values function for each API enum type that returns a slice of enum values, e.g `DomainStatus_Values`.
* `aws/request`: Update default retry to retry "use of closed network connection" errors ([#3476](#3476))
  * Fixes [#3406](#3406)

### SDK Bugs
* `private/protocol/json/jsonutil`: Fixes a bug that truncated millisecond precision time in API response to seconds. ([#3474](#3474))
  * Fixes [#3464](#3464)
  * Fixes [#3410](#3410)
* `codegen`: Export event stream constructor for easier mocking ([#3473](#3473))
  * Fixes [#3412](#3412) by exporting the operation's EventStream type's constructor function so it can be used to fully initialize fully when mocking out behavior for API operations with event streams.
* `service/ec2`: Fix max retries with client customizations ([#3465](#3465))
  * Fixes [#3374](#3374) by correcting the EC2 API client's customization for ModifyNetworkInterfaceAttribute and AssignPrivateIpAddresses operations to use the aws.Config.MaxRetries value if set. Previously the API client's customizations would ignore MaxRetries specified in the SDK's aws.Config.MaxRetries field.
I think this issue may still exist on Windows machines. I'm using https://github.com/peak/s5cmd with aws-sdk-go v1.34.12 built for Windows. Everything is working fine but I receive this error very often: It seems likely that this error message is unique to the Windows TCP/IP stack's WSAECONNRESET error code - see https://docs.microsoft.com/en-us/windows/win32/winsock/windows-sockets-error-codes-2. Because it's Windows-specific, it's being missed by the error string matching, which only checks for a *nix message. I'm not familiar with how Go handles error messages on different platforms, so it's possible I'm going down the wrong path with this. However, adding additional handling for this error message in the retry logic seemed to fix the issue for me.
Happens in a Linux container
update aws-sdk to fix aws/aws-sdk-go#3406
We have recently been seeing more occurrences of this issue. Is there a reason why this ticket is open again?
I believe the problem here is that the AWS SDK for Go is The workaround is to disable HTTP transport keep-alive (which is NOT the same thing as Dialer keep-alive, that is unrelated to the problem), or set the idle connection timeout very short on the client side in the hope that you never get close to the actual server timeout.
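The two workarounds described in this comment can be sketched with the standard library alone. The function names are hypothetical, and the 5-second idle timeout in the demo is an arbitrary illustrative value, not a recommendation from this thread. Both functions clone `http.DefaultTransport`, which assumes it has not been replaced with a different `RoundTripper`.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// noKeepAliveTransport disables HTTP connection reuse entirely, so every
// request opens a fresh connection. This is Transport.DisableKeepAlives,
// not the TCP-level net.Dialer.KeepAlive setting.
func noKeepAliveTransport() *http.Transport {
	tr := http.DefaultTransport.(*http.Transport).Clone()
	tr.DisableKeepAlives = true
	return tr
}

// shortIdleTransport keeps connection reuse but drops idle connections
// after the given duration, ideally well before the server closes them.
func shortIdleTransport(idle time.Duration) *http.Transport {
	tr := http.DefaultTransport.(*http.Transport).Clone()
	tr.IdleConnTimeout = idle
	return tr
}

func main() {
	a := &http.Client{Transport: noKeepAliveTransport()}
	b := &http.Client{Transport: shortIdleTransport(5 * time.Second)}
	fmt.Println(a.Transport.(*http.Transport).DisableKeepAlives,
		b.Transport.(*http.Transport).IdleConnTimeout)
}
```

Disabling keep-alive costs a new TCP and TLS handshake per request, so the short idle timeout is usually the cheaper option when throughput matters. Either client can be handed to the v1 SDK through the `HTTPClient` field of `aws.Config`.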
Hi everyone, since v1 is going into maintenance mode, and this is manifesting in v2 as well, I'm going to close this and ask that you refer to @lucix-aws's last comment on that thread and let us know there if you need further assistance. Thanks,
Comments on closed issues are hard for our team to see.
Confirm by changing [ ] to [x] below to ensure that it's a bug:
Describe the bug
We have code that downloads and uploads S3 objects across different buckets (we can't use `CopyObject`/`UploadPartCopy`, as it is done from public buckets that are out of our control). On larger buckets, after some time passes, we randomly get an error that looks something like this:
The SDK is configured, as per default, to retry requests (I think the default for S3 is 3 retries). But it might not be retrying for this specific error. At least I wasn't able to find a reference to it in the SDK code. (I think this is one of those errors that Go hasn't exported for unknown reasons golang/go#4373)
I'm not sure if this is an error that can/should occur sporadically and should be retried by the SDK or it arises from some race condition/bug somewhere in the SDK or Go.
Version of AWS SDK for Go?
v1.31.15
Version of Go (`go version`)?
go version go1.13.12 linux/amd64
To Reproduce (observed behavior)
https://github.com/segevfiner/s3-download-upload-stress
There is one MWE there to copy an object from one bucket over and over to another, and one that copies a bucket recursively to another.
Expected behavior
The entire copy process should work to the end and not crash midway with a random error, the SDK should retry internally for transient errors.