
feat(lib-storage): use PUT for small uploads #2605

Merged: 1 commit into aws:main, Aug 7, 2021

Conversation

@Linkgoron (Contributor) commented on Jul 21, 2021

Use PUT for uploads smaller than the part size in lib-storage

Issue

When upgrading from s3.upload to lib-storage upload, performance for small files degraded and the number of API calls tripled. This is because v2's s3.upload includes an optimization that uses PUT for uploads that fit in a single part (one API call), while lib-storage always performs a multipart upload, which takes at least three API calls (CreateMultipartUpload, UploadPart, CompleteMultipartUpload).

fixes #2593

Description

This PR uses PUT for small uploads instead of running the multipart workflow for files that fit in a single part.
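For illustration, a minimal sketch of the single-part shortcut; the names here are hypothetical and this is not the exact code in the PR:

```ts
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

// If the whole body fits in a single part, one PutObject call is enough;
// the multipart workflow (Create/UploadPart/Complete) is only needed otherwise.
async function uploadIfSmall(
  client: S3Client,
  params: { Bucket: string; Key: string; Body: Uint8Array },
  partSize: number
): Promise<boolean> {
  if (params.Body.byteLength <= partSize) {
    await client.send(new PutObjectCommand(params)); // single API call
    return true;
  }
  return false; // caller falls back to the multipart path
}
```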

Testing

Added multiple tests to Upload.spec.js that cover both "large" multipart uploads and smaller single-part uploads.

Additional context

The most complex part of this PR (IMO) is delaying the CreateMultipartUpload call until we are certain it is needed. It now has to happen inside one of the concurrent uploaders, while the other uploaders must wait for it to finish, since they need the UploadId to upload their own parts.
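A rough sketch of that shared-state pattern, assuming a memoized promise guards the CreateMultipartUpload call; the class and method names are illustrative, not the PR's actual code:

```ts
import {
  S3Client,
  CreateMultipartUploadCommand,
  CreateMultipartUploadCommandOutput,
} from "@aws-sdk/client-s3";

// The first uploader that needs an UploadId starts CreateMultipartUpload;
// every other concurrent uploader awaits the same memoized promise, so the
// call happens at most once and only when a second part actually exists.
class MultipartStarter {
  private createPromise?: Promise<CreateMultipartUploadCommandOutput>;

  constructor(private client: S3Client, private params: { Bucket: string; Key: string }) {}

  async getUploadId(): Promise<string | undefined> {
    if (!this.createPromise) {
      this.createPromise = this.client.send(new CreateMultipartUploadCommand(this.params));
    }
    return (await this.createPromise).UploadId;
  }
}
```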

As an aside, I've seen quite a few buffer copies that I think can be reduced or removed entirely. Should I open another PR for that, or incorporate those changes into this one?


By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@codecov-commenter commented on Jul 21, 2021

Codecov Report

❗ No coverage uploaded for pull request base (main@5789ff4).
The diff coverage is n/a.


@@           Coverage Diff           @@
##             main    #2605   +/-   ##
=======================================
  Coverage        ?   60.36%           
=======================================
  Files           ?      516           
  Lines           ?    27475           
  Branches        ?     6603           
=======================================
  Hits            ?    16585           
  Misses          ?    10890           
  Partials        ?        0           

Powered by Codecov. Last update 5789ff4...e70f7e0.

Use Put for uploads smaller than part size in lib-storage
@Linkgoron force-pushed the lib-storage-put-small-files branch from feaabe8 to e70f7e0 on July 21, 2021 09:12
@aws-sdk-js-automation

AWS CodeBuild CI Report

  • CodeBuild project: sdk-staging-test
  • Commit ID: e70f7e0
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@Linkgoron changed the title from "feat(lib-storage): use PUT from small uploads" to "feat(lib-storage): use PUT for small uploads" on Jul 21, 2021
@AllanZhengYP AllanZhengYP self-requested a review July 21, 2021 16:04
@AllanZhengYP (Contributor) left a comment

Thank you very much for putting this together! I only have a few comments.

lib/lib-storage/src/chunks/getChunkBuffer.ts (review thread resolved)
lib/lib-storage/src/Upload.ts (review thread resolved)
  async __createMultipartUpload() {
    if (!this.createMultiPartPromise) {
      const createCommandParams = { ...this.params, Body: undefined };
      this.createMultiPartPromise = this.client.send(new CreateMultipartUploadCommand(createCommandParams));
Contributor:

You don't need the variable to handle concurrent CreateMultipartUpload requests if you move the __createMultipartUpload() call into __doMultipartUpload().

Contributor Author:

Maybe I'm missing something, but why not? Say uploader A calls CreateMultipartUploadCommand, and before the command returns, uploader B gets another chunk. Uploader B needs to wait for uploader A to receive the UploadId, so they need some kind of shared state.

Contributor:

I'm suggesting moving the CreateMultipartUpload API call out of the uploader workflow, so that CreateMultipartUploadCommand is sent up front, before the concurrent uploaders are fired off.

Contributor Author (@Linkgoron), Jul 27, 2021:

The problem is that we might not actually want to execute the create command at all, and we only know that after we get the first chunk from the chunker.

We could change the flow so that the first chunk is handled separately and the uploaders are created only after it has been yielded (maybe this is what you're suggesting?). If the first chunk is also the last, use PUT; otherwise spin up the concurrent uploaders and use multipart. That would be a much bigger and probably more complex change than what I did, but it could work.
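For illustration, a rough sketch of that alternative "peek the first chunk" flow; the helper callbacks are hypothetical and not part of this PR:

```ts
type Chunk = { data: Uint8Array; lastPart: boolean };

// Pull the first chunk before spawning any uploader. If it is also the last
// chunk, a single PUT suffices; otherwise create the multipart upload first,
// so every uploader starts with the UploadId already known.
async function uploadWithPeek(
  chunks: AsyncIterable<Chunk>,
  putSinglePart: (data: Uint8Array) => Promise<void>,
  createMultipartUpload: () => Promise<string>,
  runUploaders: (first: Chunk, rest: AsyncIterator<Chunk>, uploadId: string) => Promise<void>
): Promise<void> {
  const iterator = chunks[Symbol.asyncIterator]();
  const first = await iterator.next();
  if (first.done) return;                          // empty body: nothing to upload
  if (first.value.lastPart) {
    return putSinglePart(first.value.data);        // one PutObject call
  }
  const uploadId = await createMultipartUpload();  // resolved before uploaders start
  return runUploaders(first.value, iterator, uploadId);
}
```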

Contributor:

@Linkgoron I did a little POC and found that approach to be overkill for supporting this feature. The current solution should work without significant overhead. I will approve your change!

@AllanZhengYP AllanZhengYP self-assigned this Jul 26, 2021
@@ -132,9 +183,6 @@ export class Upload extends EventEmitter {
  }

  async __doMultipartUpload(): Promise<ServiceOutputTypes> {
    const createMultipartUploadResult = await this.client.send(new CreateMultipartUploadCommand(this.params));
    this.uploadId = createMultipartUploadResult.UploadId;

    // Set up data input chunks.
    const dataFeeder = getChunk(this.params.Body, this.partSize);
Contributor Author (@Linkgoron), Aug 6, 2021:

Yes, it wouldn't be too difficult; as I said before, the main issue is that the code becomes more complex, in my opinion. Note that once we go down this route we'd also need to manage the lifetime of the iterator: we'd have to wrap things in a try/finally and call dataFeeder.return() so that cleanup isn't missed, otherwise some resources might never be released (the for await loop does this "for free"). If that's fine, I can change the code; I've already implemented something similar locally after you suggested it previously.
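A minimal sketch of that cleanup concern, assuming manual iteration over the chunker; the names are illustrative, not the PR's code:

```ts
type Part = { data: Uint8Array; partNumber: number; lastPart: boolean };

// Once we iterate manually instead of using `for await`, the iterator's
// return() must be called on early exit or error, otherwise underlying
// resources (e.g. a source stream) may never be released.
async function consumeParts(parts: AsyncIterable<Part>, handle: (p: Part) => Promise<void>) {
  const dataFeeder = parts[Symbol.asyncIterator]();
  try {
    let next = await dataFeeder.next();
    while (!next.done) {
      await handle(next.value);
      next = await dataFeeder.next();
    }
  } finally {
    // `for await` would do this implicitly; manual iteration must not skip it.
    await dataFeeder.return?.();
  }
}
```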

@AllanZhengYP AllanZhengYP merged commit 7374720 into aws:main Aug 7, 2021
@AllanZhengYP (Contributor) commented on Aug 7, 2021

Thanks a lot for bringing this feature to v3! @Linkgoron

@github-actions

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs and link to relevant comments in this thread.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 22, 2021
Successfully merging this pull request may close these issues.

Upload() should use PutObject API for uploading smaller objects
5 participants