AWS S3 Sync Issues #599

jamesls · 2014-01-17T19:45:28Z

There have been a few issues with respect to the sync command, particularly in the case of syncing down from S3 (s3 -> local). I'd like to try to summarize the known issues as well as a few proposals of possible options, and give people the opportunity to share any feedback they might have.

Sync Behavior Overview

The sync behavior is intended to be an efficient cp; only copy over the files from the source to the destination that are different. In order to do that we need to be able to determine whether or not a file in s3/local are different. To do this, we use two values:

File Size (from stat'ing the file locally and from the Size key in a ListObjects response)
Last modified time (mtime of the local file and the LastModified key in a ListObjects response)

As an aside, we use the ListObjects operation because we get up to 1000 objects returned in a single call. This means that we're limited to information that comes back from a ListObjects response which is LastModified, ETag, StorageClass, Key, Owner, Size.

Now given the remote and local files file size and last modified times we try to determine if the file is different. The file size is easy, if the file sizes are different, then we know the files are different and we need to sync the file. However, last modified time is more interesting. While the mtime of the local file is a true mtime, the LastModified time from ListObjects is really the time the object was uploaded. So imagine this scenario:

aws s3 sync local/ s3://bucket/
sleep 10
aws s3 sync s3://bucket local/

After the first sync command (local->s3), the local files will have an mtime of 0, and the contents in s3 will have a LastModified time of 10 (using relative offsets). When we run the second aws s3 sync command, which is syncing from s3 to local we'll first do the file size check. In this case the file sizes are the same so we look at the last modified time checks. In this case they are different (local == 0, s3 == 10). If we were doing a strict equality comparison then, because the last modified times are different, we would unnecessarily sync the files from s3 to local. So we can say that if the file sizes are the same and the last modified time in s3 is greater (newer) than the local file, then we don't sync. This is the current behavior.

However, this creates a problem if the remote file is updated out of band (via the console or some other SDK) and the size remains the same. If we run aws s3 sync s3://bucket local/ we will not sync the remote file even though we're suppose to.

Potential Solutions

Below are potential solutions.

Change the time checks to be a strict equality comparison. If the times are different, we sync. This has the issue that aws s3 sync local s3://bucket && aws s3 s3://bucket local will unnecessarily sync files. However, when we download a file, we set the mtime of the file to match the LastModified time so if you were to run aws s3 sync s3://bucket local again, it would not sync any files.
Modify the local mtimes when files are uploaded to match the Last Modified time of S3. We'd then change the file time checks to a strict equal/not equal check. The only downside to this approach is whether or not people would expect us to mess with the mtime of their files (I would expect not).
Add custom metadata to each object. This could potentially add some combination of local mtime, md5 so we can also compare md5 checksums if we want. However, we would have to switch from ListObjects to a HeadObject for every request. This would require 1000 times more API calls, and would slow down the sync in the case of many small files. In the steady state case as long as we can HeadObjects faster that we can download/upload them there shouldn't be an issue. A downside to this approach is that other tools won't be adding this metadata, which would make interop with other tools harder (if an object is missing the metadata we'd likely just sync).
Keep a local ETag cache. We could associate the file name with an ETag/local mtime/last modified time/md5 of the file. On the plus side we still use ListObjects, so we get 1000 objects back at a time. Downside is that every client will have to keep a cache. This is also the most complex solution in terms of implementation.

If there any other potential solutions I've left out, please chime in.

The text was updated successfully, but these errors were encountered:

jdub · 2014-01-18T00:16:44Z

Rather than store cache of server-provided ETags, can the ETag be calculated on the client side? Then it would be almost like doing md5 checks, but using data available in the ListObjects response. As long as the ETag algorithm doesn't rely on server-side state…

jdub · 2014-01-18T00:26:04Z

Ah: https://forums.aws.amazon.com/thread.jspa?messageID=203510&state=hashArgs%23203510

sebsto · 2014-01-18T00:29:00Z

You found it :-)

-- Seb

On 17 Jan 2014, at 16:26, Jeff Waugh <notifications@github.com mailto:notifications@github.com> wrote:

Ah: https://forums.aws.amazon.com/thread.jspa?messageID=203510&state=hashArgs%23203510

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/599#issuecomment-32668382.

Amazon EU Societe a responsabilite limitee, 5 Rue Plaetis, L - 2338 Luxembourg, R.C.S Luxembourg n B 101818, capital social: EUR 37.500. Autorisation d establissement en qualite de commercante n 104408.

jamesls · 2014-01-18T00:50:39Z

Yep, we can't reliably calculate the ETag for multipart uploads, otherwise that would be a great solution.

joeharris76 · 2014-01-22T10:09:52Z

Could you add a new flag (or 2) for the time behaviour? Perhaps --check-timestamps for option 1 and --update-local-timestamps for option 2. That way the user can specify a more robust check for changes and accept the consequences at the same time.

jamesls · 2014-01-22T22:00:04Z

Yeah, I think adding flags for options 1 and 2 would be a reasonable approach. One potential concern is that the default (no specified options) behavior has cases where sync doesn't behave how one would expect, but I'm not sure changing the default behavior to either of these options is a good thing here, given the potential tradeoffs we'd be making.

johnboxall · 2014-02-04T07:39:01Z

@jamesls I'm using the sync command to deploy a generated static website.

With the current version, I'm re-uploading all files every sync because the mtime changes when the site is regenerated, even if the content does not.

For my purposes (and I imagine a healthy number of other folks using this fabulous tool to upload their static sites) syncing via ETag as suggested in #575 would be most awesome, but given my reading of that issue it doesn't seem to be an option.

Barring that, for the purposes of static sites, a length only check (though maybe slightly dangerous) would work.

Another option would be for us to disable multi-part uploads and use #575 - we'd see huge saving immediately.

DALDEI · 2014-02-21T17:34:32Z

I have found the reverse problem. I changed a file in S3 that has the same size but newer timestamp and s3 sync doesnt pull it down

aws s3 sync s3://bucket/path/ dir

Looking at the data in S3 ... I think that its because of timezone issues.
The Properties show a time of

Last Modified: 2/21/2014 10:50:33 AM

But the HTTP headers show

Last-Modified: Fri, 21 Feb 2014 15:50:33 GMT

Note that the Last Modified property doesnt show the timezone ?

Since my s3 sync command is running on a different server with different timezone from where I put the file it thinks the file is in the past and doesnt pull it.

I had to change to s3 cp to make sure it gets all the files

jamesls · 2014-03-21T01:21:29Z

I think as a first step, we should implement the --size-only argument. It doesn't solve the problem in the general case, but for certain scenarios this will help and it's easy to understand/explain, particularly the use case referenced above with static sites being synced to s3.

hoegertn · 2014-03-21T06:19:15Z

I think, sync should have an option to always sync files if the file to sync is newer than the target. We are syncing files from machine A to S3 and afterwards from S3 to machine B. If the size of a file does not change (but the content does), this file will not reach machine B. This behavior is broken. I do not car eif I sync to much files but changed files should never be left out.

DALDEI · 2014-03-21T14:47:40Z

As per my previous post, "newer" needs to take into account the timezone as well.
Currently it is not so if you pushing a file to S3 from one timezone then syncing from another it wont correctly detect that the file is newer .

whiteand72 · 2014-03-27T14:59:07Z

@jamesls Further to the --size-only argument, I would be interested in using a --name-only argument. That is, don't check either the File Size or Last Modification Time. Simply copy files that exist in the source but not in the target. In our scenario we sync from s3 to local and once we have downloaded a file we don't expect it to ever change on s3. If this option resulted in fewer operations against our local (nfs) filesystem it could yield a performance improvement.

timvisher · 2014-04-07T16:46:57Z

@jamesls Should --size-only et al be available in 1.3.6?

ngbranitsky · 2014-04-28T15:41:49Z

My AWS Support rep for Case 186692581 says he forwarded my suggestion following to you.
I thought I would post it here anyway for comment:

I think a simple solution would be to introduce a fuzz factor.
If it normally wouldn't take more than 5 minutes for the local -> S3 copy,
then use a 10 minute fuzz factor on subsequent time comparisons.
Treat relative times within 10 minutes as equal.
If the S3 time is more than 10 minutes newer then sync from S3 -> local.
Perhaps add "--fuzz=10m" as an option.

tR-aoltersdorf · 2014-05-15T08:34:48Z

@jamesls @adamsb6
Wouldn't be #575 a good option at least for single part uploaded files?

If you check the ETAG Format of the file on S3, you could differ whether the file was uploaded as single ("ETAG = "MD5 Hash") or multipart (ETAG = "MD5 Hash" - "Number of Parts"). So you could compare all files local MD5to their ETAG and in the case that a file was uploaded as multipart you could skip it.

We've got a customer that has lots of video clips in certain folders on an S3 Bucket, that are synced to ec2 instances in all AWS Regions. All files are uploaded as single part.
In the moment we've got a problem due to s3cmd, that on some instances some files are corrupted. If we would do a full sync again we'll have 14 TB Traffic that will be charged.

Our problem: The corrupted files have exactly the same size like the original file on s3 and due to wrong timestamps through s3cmd we can't use the options mentioned above. In this case the --compare-on-etag would be a great solution to prevent syncing all files again.

Even for normal syncing the --compare-on-etag Option would be great, if you just have single part uploaded files, because aws s3 sync will only sync changed files.

vk6flab · 2014-06-03T05:17:03Z

I've just spent the better part of 3 hours attempting to find the minimum permissions required to use the sync command. The error I was getting was:
A client error (AccessDenied) occurred when calling the ListObjects operation: Access Denied

When really the error should have been:
A client error (AccessDenied) occurred when calling the ListBucket operation: Access Denied

A help item which shows a table with the minimum permissions for each command would be very helpful.

vk6flab · 2014-06-03T06:10:01Z

#404 seems like a really good idea, seeing that until I read that I thought that sync already did this.

Edit: To clarify, add rsync like behaviour to "aws s3 sync". It seems that that issue as reported is not quite what I initially understood it to be.

ngbranitsky · 2014-07-11T20:42:09Z

Since the latest AWS-CLI-bundle.zip does not contain the fix implemented above, I did a git clone. I can see the new code in a folder called "customizations". However, it is not clear to me how to create an aws-cli command using this code. Do I have to run make-bundle?

andrew512 · 2014-07-11T20:57:17Z

Yep. I use the following steps to install it onto new servers (Ubuntu):

git clone https://github.com/andrew512/aws-cli.git
cd aws-cli
pip install -r requirements.txt
sudo python setup.py install

ngbranitsky · 2014-07-11T22:14:53Z

OK.
I see the modified code in version 1.3.18.
It accepts my --exact-timestamps parameter.
I thought the latest download bundle I had previously installed was 1.3.21.

andrew512 · 2014-07-11T22:47:38Z

Reliable versioning will really only apply to the official AWS releases. I forked the repo at 1.3.18 so that's the version it will report, but it's already a few versions out of date, with 1.3.22 being the most recent as of right now. Hopefully AWS accepts the pull request and includes the feature in future official releases. It's been very valuable to us and helps address a pretty questionable default behavior.

simlu · 2017-10-22T04:53:03Z

Would be great to have an etag sync option. I know that there are scenarios where it fails, but for me it would be super valuable.

ASayre · 2018-02-06T10:12:43Z

Good Morning!

We're closing this issue here on GitHub, as part of our migration to UserVoice for feature requests involving the AWS CLI.

This will let us get the most important features to you, by making it easier to search for and show support for the features you care the most about, without diluting the conversation with bug reports.

As a quick UserVoice primer (if not already familiar): after an idea is posted, people can vote on the ideas, and the product team will be responding directly to the most popular suggestions.

We’ve imported existing feature requests from GitHub - Search for this issue there!

And don't worry, this issue will still exist on GitHub for posterity's sake. As it’s a text-only import of the original post into UserVoice, we’ll still be keeping in mind the comments and discussion that already exist here on the GitHub issue.

GitHub will remain the channel for reporting bugs.

Once again, this issue can now be found by searching for the title on: https://aws.uservoice.com/forums/598381-aws-command-line-interface

-The AWS SDKs & Tools Team

This entry can specifically be found on UserVoice at : https://aws.uservoice.com/forums/598381-aws-command-line-interface/suggestions/33168808-aws-s3-sync-issues

davidmaxwaterman · 2018-02-06T12:00:13Z

https://aws.uservoice.com/forums/598381-aws-command-line-interface/suggestions/33168808-aws-s3-sync-issues

four43 · 2018-02-06T14:01:43Z

Moving away from github issues? That seems like a mistake...

davidmaxwaterman · 2018-02-06T14:20:13Z

Agreed. It seems more like the method Microsoft use for judging the importance/impact of issues, but I find it quite irritating.

jamesls · 2018-04-06T22:30:32Z

Based on community feedback, we have decided to return feature requests to GitHub issues.

acejam · 2018-05-29T19:14:35Z

Bumping this back up!

acejam · 2019-03-19T20:54:15Z

To the top!

queglay · 2019-08-15T11:20:06Z

comparison of md5 would be great, I'd also add that it would be helpful to output md5 on upload or download. this could be stored in our own db and would help determining if sync is needed through our database to limit requests.

danil-smirnov · 2019-09-01T16:41:54Z

@jamesls Could you comment this issue please?
#4460

This will avoid re-upload if the file size doesn't change. Currently the check on mtime is not very useful, as the local time of last modification never will match the upload mtime on S3. A better/safer mechanism would be to hash the file content and store it as e-tag. These tickets discuss such a feature, but it isn't implemented yet: - aws/aws-cli#599 - aws/aws-cli#575

@mhart

* fix: Functional tests must run on localhost to work in Windows (aws#552) * fix: spacing typo in Log statement in start-lambda (aws#559) * docs: Fix syntax highlighting in README.md (aws#561) * docs: Change jest to mocha in Nodejs init README (aws#564) * docs: Fix @mhart link in README (aws#562) * docs(README): removed cloudtrail, added SNS to generate-event (aws#569) * docs: Update repo name references (aws#577) * feat(debugging): Fixing issues around debugging Golang functions. (aws#565) * fix(init): Improve current init samples around docs and fixes (aws#558) * docs(README): Update launch config to SAM CLI from SAM Local (aws#587) * docs(README): Update sample code for calling Local Lambda Invoke (aws#584) * refactor(init): renamed handler for camel case, moved callback call up (aws#586) * chore: aws-lambda-java-core 1.1.0 -> 1.2.0 for java sam init (aws#578) * feat(validate): Add profile and region options (aws#582) Currently, `sam validate` requires AWS Creds (due to the SAM Translator). This commits adds the ability to pass in the credientials through a profile that is configured through `aws configure`. * docs(README): Update README prerequisites to include awscli (aws#596) * fix(start-lambda): Remove Content-Type Header check (aws#594) * docs: Disambiguation "Amazon Kinesis" (aws#599) * docs: Adding instructions for how to add pyenv to your PATH for Windows (aws#600) * docs: Update README with small grammar fix (aws#601) * fix: Update link in NodeJS package.json (aws#603) * docs: Creating instructions for Windows users to install sam (aws#605) * docs: Adding a note directing Windows users to use pipenv (aws#606) * fix: Fix stringifying λ environment variables when using Python2 (aws#579) * feat(generate-event): Added support for 50+ events (aws#612) * feat(invoke): Add region parameter to all invoke related commands (aws#608) * docs: Breaking up README into separate files to make it easier to read (aws#607) * chore: Update JVM size params to match docker-lambda (aws#615) * feat(invoke): Invoke Function Without Parameters through --no-event (aws#604) * docs: Update advanced_usage.rst with clarification on --env-vars usage (aws#610) * docs: Remove an extra word in the sam packaging command (aws#618) * UX: Improves event names to reflect Lambda Event Sources (aws#619) * docs: Fix git clone typo in installation docs (aws#630) * docs(README): Callout go1.x runtime support (aws#631) * docs(installation): Update sam --version command (aws#634) * chore(0.6.0): SAM CLI Version bump (aws#635)

BourgoisMickael · 2022-02-13T22:14:32Z

Any news about this after 8 years ? Will the solution be implemented before the end of the century ?

Zenexer · 2022-03-03T19:13:06Z

@BourgoisMickael It was already implemented: --exact-timestamps

muliyul · 2022-06-04T09:28:28Z

@BourgoisMickael It was already implemented: --exact-timestamps

This is relevant to S3 -> local sync ops but not the other way around (if I understand correctly).

This was referenced Jan 17, 2014

S3 Sync Ignoring Items #450

Closed

s3 sync a changed file of equal size is dependent on time #406

Closed

jamesls mentioned this issue Jan 18, 2014

Added option to compare on ETag/MD5 #575

Closed

laupow mentioned this issue Feb 13, 2014

S3 sync: s3 -> local redownloads unchanged files #648

Closed

jamesls added the s3sync label Mar 21, 2014

jamesls mentioned this issue Mar 21, 2014

Sync occurs despite the fact that the contents haven't changed #710

Closed

johnboxall mentioned this issue Mar 22, 2014

Adds --size-only param to aws s3 sync to allow syncing by filesize only. #719

Merged

andrew512 added a commit to andrew512/aws-cli that referenced this issue Jun 23, 2014

Optional s3 sync flag to address timestamp issues described in aws#599

6c43a77

andrew512 mentioned this issue Jun 24, 2014

Optional s3 sync flag to address https://github.com/aws/aws-cli/issues/599 #824

Merged

ASayre closed this as completed Feb 6, 2018

jamesls reopened this Apr 6, 2018

justnance pinned this issue Jun 22, 2019

swetashre unpinned this issue Jun 24, 2019

justnance mentioned this issue Jul 25, 2019

aws sync truly different files (not based on size or mtime) #4235

Closed

danil-smirnov mentioned this issue Sep 1, 2019

s3 sync --exact-timestamps flag ignored for uploads #4460

Open

kdaily added the automation-exempt Issue will not be subject to stale-bot label Jul 29, 2020

kdaily mentioned this issue Nov 23, 2020

s3 sync repeatedly downloads files without modification #5730

Closed

stobrien89 mentioned this issue Mar 31, 2021

s3 missing 1 file #6047

Closed

tim-finnigan mentioned this issue Jan 4, 2022

S3 sync and cp commands should have a flag to show the local (and remote) file hashes #6631

Closed

thoward-godaddy pushed a commit to thoward-godaddy/aws-cli that referenced this issue Feb 12, 2022

docs: Disambiguation "Amazon Kinesis" (aws#599)

c21a080

kdaily mentioned this issue Jun 3, 2022

S3 sync based on MD5 diff #7011

Closed

2 tasks

tim-finnigan mentioned this issue Sep 1, 2022

aws s3 sync downloading unchanged files. #7228

Closed

tim-finnigan added needs-review This issue or pull request needs review from a core team member. p2 This is a standard priority issue labels Nov 15, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

AWS S3 Sync Issues #599

AWS S3 Sync Issues #599

jamesls commented Jan 17, 2014

jdub commented Jan 18, 2014

jdub commented Jan 18, 2014

sebsto commented Jan 18, 2014

jamesls commented Jan 18, 2014

joeharris76 commented Jan 22, 2014

jamesls commented Jan 22, 2014

johnboxall commented Feb 4, 2014

DALDEI commented Feb 21, 2014

jamesls commented Mar 21, 2014

hoegertn commented Mar 21, 2014

DALDEI commented Mar 21, 2014

whiteand72 commented Mar 27, 2014

timvisher commented Apr 7, 2014

ngbranitsky commented Apr 28, 2014

tR-aoltersdorf commented May 15, 2014

vk6flab commented Jun 3, 2014

vk6flab commented Jun 3, 2014

ngbranitsky commented Jul 11, 2014

andrew512 commented Jul 11, 2014

ngbranitsky commented Jul 11, 2014

andrew512 commented Jul 11, 2014

simlu commented Oct 22, 2017

ASayre commented Feb 6, 2018 •

edited

Loading

davidmaxwaterman commented Feb 6, 2018

four43 commented Feb 6, 2018 via email

davidmaxwaterman commented Feb 6, 2018

jamesls commented Apr 6, 2018

acejam commented May 29, 2018

acejam commented Mar 19, 2019

queglay commented Aug 15, 2019

danil-smirnov commented Sep 1, 2019

BourgoisMickael commented Feb 13, 2022

Zenexer commented Mar 3, 2022

muliyul commented Jun 4, 2022

AWS S3 Sync Issues #599

AWS S3 Sync Issues #599

Comments

jamesls commented Jan 17, 2014

Sync Behavior Overview

Potential Solutions

jdub commented Jan 18, 2014

jdub commented Jan 18, 2014

sebsto commented Jan 18, 2014

jamesls commented Jan 18, 2014

joeharris76 commented Jan 22, 2014

jamesls commented Jan 22, 2014

johnboxall commented Feb 4, 2014

DALDEI commented Feb 21, 2014

jamesls commented Mar 21, 2014

hoegertn commented Mar 21, 2014

DALDEI commented Mar 21, 2014

whiteand72 commented Mar 27, 2014

timvisher commented Apr 7, 2014

ngbranitsky commented Apr 28, 2014

tR-aoltersdorf commented May 15, 2014

vk6flab commented Jun 3, 2014

vk6flab commented Jun 3, 2014

ngbranitsky commented Jul 11, 2014

andrew512 commented Jul 11, 2014

ngbranitsky commented Jul 11, 2014

andrew512 commented Jul 11, 2014

simlu commented Oct 22, 2017

ASayre commented Feb 6, 2018 • edited Loading

davidmaxwaterman commented Feb 6, 2018

four43 commented Feb 6, 2018 via email

davidmaxwaterman commented Feb 6, 2018

jamesls commented Apr 6, 2018

acejam commented May 29, 2018

acejam commented Mar 19, 2019

queglay commented Aug 15, 2019

danil-smirnov commented Sep 1, 2019

BourgoisMickael commented Feb 13, 2022

Zenexer commented Mar 3, 2022

muliyul commented Jun 4, 2022

ASayre commented Feb 6, 2018 •

edited

Loading