This project demonstrates an approach that leverages S3 multipart upload and Step Functions (Distributed Map) to concurrently download a large file, up to 100 TB (10 GB * 10,000 parts) in theory, from any given URL (the server must support range requests) and upload it to an S3 bucket.
This project also demonstrates a way to host the source code in CodeCommit and deploy via CDK Pipeline.
### Partitioner
A Python Lambda takes `URL` and `SingleTaskSize` as input and fetches the total size of the file at the given URL. Based on the given single task size, it splits the upload into smaller tasks and passes them to the next state.
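As a rough illustration of the partitioning step, the sketch below is not the project's actual handler; the helper name and the task fields (`PartNumber`, `Range`, `Tasks`) are assumptions based on the description above. It fetches the file size with an HTTP HEAD request and splits the byte range into equally sized tasks:

```python
import math
import urllib.request


def handler(event, context):
    """Hypothetical Partitioner: split a remote file into byte-range tasks."""
    url = event["URL"]
    single_task_size = int(event["SingleTaskSize"])

    # Fetch the total file size via a HEAD request (the server must support range requests).
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        total_size = int(response.headers["Content-Length"])

    # Build one task per chunk; PartNumber is 1-based, as S3 multipart upload requires.
    tasks = []
    for part_number in range(1, math.ceil(total_size / single_task_size) + 1):
        start = (part_number - 1) * single_task_size
        end = min(start + single_task_size, total_size) - 1
        tasks.append({"URL": url, "PartNumber": part_number, "Range": f"bytes={start}-{end}"})

    return {"TotalSize": total_size, "Tasks": tasks}
```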
### Uploader

A Python Lambda triggered by Step Functions leverages HTTP range requests to download a portion of the file and upload it to S3 using multipart upload.
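A minimal sketch of what one upload task might look like, assuming one byte-range task per invocation; the event fields (`Bucket`, `Key`, `UploadId`, `Range`, `PartNumber`) are illustrative and not taken from the project's source:

```python
import urllib.request

import boto3

s3 = boto3.client("s3")


def handler(event, context):
    """Hypothetical Uploader: download one byte range and upload it as a multipart part."""
    # Download only the requested portion of the file using an HTTP Range request.
    request = urllib.request.Request(event["URL"], headers={"Range": event["Range"]})
    with urllib.request.urlopen(request) as response:
        body = response.read()

    # Upload the chunk as one part of an existing multipart upload.
    result = s3.upload_part(
        Bucket=event["Bucket"],
        Key=event["Key"],
        UploadId=event["UploadId"],
        PartNumber=event["PartNumber"],
        Body=body,
    )

    # The ETag is needed later to complete the multipart upload.
    return {"PartNumber": event["PartNumber"], "ETag": result["ETag"]}
```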
### Step Functions

A state machine handles task validation, fan-out, retries, and error handling, and also manages the S3 multipart upload create, complete, and abort steps.
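The create/complete/abort steps that the state machine coordinates map onto the standard S3 multipart upload API. A hedged boto3 sketch of that lifecycle (bucket, key, and the collected parts are placeholders, and this is not the project's actual code):

```python
import boto3

s3 = boto3.client("s3")

BUCKET = "my-bucket"              # placeholder bucket name
KEY = "downloads/large-file.zip"  # placeholder object key


def run_upload(parts_from_uploaders):
    """Coordinate the multipart upload lifecycle that the state machine drives."""
    # Create the multipart upload before fanning out the Uploader tasks.
    upload_id = s3.create_multipart_upload(Bucket=BUCKET, Key=KEY)["UploadId"]
    try:
        # parts_from_uploaders is the collected fan-out output:
        # [{"PartNumber": 1, "ETag": "..."}, {"PartNumber": 2, "ETag": "..."}, ...]
        s3.complete_multipart_upload(
            Bucket=BUCKET,
            Key=KEY,
            UploadId=upload_id,
            MultipartUpload={"Parts": sorted(parts_from_uploaders, key=lambda p: p["PartNumber"])},
        )
    except Exception:
        # Abort on failure so S3 does not retain (and bill for) orphaned parts.
        s3.abort_multipart_upload(Bucket=BUCKET, Key=KEY, UploadId=upload_id)
        raise
```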
Simply run `make test` to run lint and unit tests on Partitioner and Uploader.
- An AWS IAM user with sufficient permissions to deploy:
  - CodeCommit
  - CodeBuild
  - CodePipeline
  - Step Functions
  - Lambda
  - S3
- Set up `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_DEFAULT_REGION` and `CDK_DEFAULT_ACCOUNT` in a `.env` file (see the example below).
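For example, a `.env` file might look like the following (placeholder values only):

```
AWS_ACCESS_KEY_ID=AKIAXXXXXXXXXXXXXXXX
AWS_SECRET_ACCESS_KEY=your-secret-access-key
AWS_DEFAULT_REGION=us-east-1
CDK_DEFAULT_ACCOUNT=123456789012
```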
This project uses AWS CodeCommit to host the source code and CDK Pipeline to deploy. Simply run `make ci-deploy` to run lint and build, create a new repository in CodeCommit, push the source code, and deploy the project's CDK Pipeline.
Below is an example Step Functions payload to upload an AWS CLI installer file to S3.
```json
{
  "URL": "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip",
  "SingleTaskSize": 6000000
}
```
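To start an execution with this payload programmatically, something like the following boto3 call could be used; the state machine ARN below is a placeholder, not the ARN this project actually creates:

```python
import json

import boto3

sfn = boto3.client("stepfunctions")

# Placeholder ARN; use the ARN of the state machine deployed by this project.
state_machine_arn = "arn:aws:states:us-east-1:123456789012:stateMachine:large-file-uploader"

sfn.start_execution(
    stateMachineArn=state_machine_arn,
    input=json.dumps({
        "URL": "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip",
        "SingleTaskSize": 6000000,
    }),
)
```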