Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

s3: fix for corrupted cache on multipart uploads #1867

Merged
merged 8 commits into from
Apr 15, 2019
Merged

s3: fix for corrupted cache on multipart uploads #1867

merged 8 commits into from
Apr 15, 2019

Conversation

olveirap
Copy link
Contributor

@olveirap olveirap commented Apr 10, 2019

Added a check for when the external output of a script to s3 is multipart, and setting the multipart chunksize to the correct one to recover the same ETag.

Also set the multipart threshold to ContentLength + 1 of the object if it's single part to avoid the copy to do it in multiple parts.

Added an exception for when the output object's ETag doesn't match the one copied to the cache.

Patch for s3 in #1410. Multipart uploads of variable size not supported

dvc/remote/s3.py Outdated Show resolved Hide resolved
@pared
Copy link
Contributor

pared commented Apr 11, 2019

I think some test would be appropriate for this one, but I guess we need test s3 instance. Am I right on this one @efiop ?

@efiop
Copy link
Contributor

efiop commented Apr 11, 2019

@pared Indeed, we've agreed that I'll add a test for it, since our s3 testing is not transparent and requires access to our s3 account. I'll send one ASAP.

olveirap and others added 3 commits April 12, 2019 08:26
The etag of multipart objects depends of the number of parts, when copying to the cache we should do so in the same number of parts that the original object was moved/uploaded in.

Fixes part of #1410
Signed-off-by: Ruslan Kuprieiev <ruslan@iterative.ai>
Signed-off-by: Ruslan Kuprieiev <ruslan@iterative.ai>
Specifically because of this getmoto/moto#1941

Signed-off-by: Ruslan Kuprieiev <ruslan@iterative.ai>
Signed-off-by: Ruslan Kuprieiev <ruslan@iterative.ai>
Turned out to be quite buggy.

Signed-off-by: Ruslan Kuprieiev <ruslan@iterative.ai>
@efiop efiop merged commit 75b4d0e into iterative:master Apr 15, 2019
@efiop efiop self-assigned this Apr 15, 2019
@efiop
Copy link
Contributor

efiop commented Apr 15, 2019

@olveirap Adjusted your patch to support multiparts with variable chunk size and added a test for that to our test suit. We will release this during this week. Thank you so much! 🙂

rpip pushed a commit to rpip/dvc that referenced this pull request Apr 16, 2019
* master: (45 commits)
  s3: fix for corrupted cache on multipart uploads (iterative#1867)
  Update PULL_REQUEST_TEMPLATE.md
  Create PULL_REQUEST_TEMPLATE.md
  Delete pull_request_template.md
  github: add PR template with a small checklist
  ignore: introduce .dvcignore, apply when searching for stage files
  add: dont use 'file' as var name during iteration
  test: move corrupted hardlink cache test test_repro -> test_run
  test_repro: make cache use hardlink
  cache: make reflink, copy default links
  diff: cli: add documentation link
  Update README.rst
  tests: be paranoic with caplog
  dvc: bump to 0.35.7
  dvc: deprecate --remove-outs
  logger: use exception instead of error
  version: show when the release version is dirty
  'till'=>'until' in `b_ref` arg. help text
  Adds space after period. (See previous commit.)
  Changes ',' for '.' in `-t` help text.
  ...
@olveirap olveirap deleted the fix-1410 branch April 16, 2019 18:47
@efiop
Copy link
Contributor

efiop commented May 12, 2019

@olveirap Just a heads up: 0.40.0 with this fix is out. 🙂

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants