
api: Support CopyObject for all sizes #617

Closed · 3 tasks done

donatello opened this issue Feb 23, 2017 · 9 comments

Comments

@donatello
Member

donatello commented Feb 23, 2017

High-level CopyObject requirements:

  • Support copying objects of all sizes.
  • Support source objects with an arbitrary range header (i.e. any valid start and end offset of a source object) via multipart copy object.
  • Support all copy conditions (this is already supported).

Currently, the library only supports copying objects up to 5 GiB in size. Larger objects can be copied via a multipart copy-object strategy.

The multipart copy object operation consists of starting a new multipart upload, followed by 1 or more copy-object-part requests, and finally a complete-multipart request.

Note that copy object via a single PUT request does not support range headers, but copy-object-part does support this.
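The part-splitting step of the sequence above can be sketched as pure planning logic. This is a minimal illustration under the usual S3 limits (5 GiB maximum per copy-object-part), not the library's actual implementation; the name splitCopyRanges is made up here:

```go
package main

import "fmt"

const (
	minPartSize = 5 * 1024 * 1024        // 5 MiB: S3 minimum part size (all parts except the last)
	maxPartSize = 5 * 1024 * 1024 * 1024 // 5 GiB: S3 maximum size for one copy-object-part
)

// partRange is an inclusive byte range, as used in the
// x-amz-copy-source-range header: "bytes=Start-End".
type partRange struct {
	Start, End int64
}

// splitCopyRanges splits a source object of totalSize bytes into
// consecutive ranges of at most maxPartSize each, one per
// copy-object-part request.
func splitCopyRanges(totalSize int64) []partRange {
	var ranges []partRange
	for start := int64(0); start < totalSize; start += maxPartSize {
		end := start + maxPartSize - 1
		if end > totalSize-1 {
			end = totalSize - 1
		}
		ranges = append(ranges, partRange{start, end})
	}
	return ranges
}

func main() {
	// A 12 GiB source object needs three parts: 5 GiB + 5 GiB + 2 GiB.
	const gib = int64(1024 * 1024 * 1024)
	for i, r := range splitCopyRanges(12 * gib) {
		fmt.Printf("part %d: bytes=%d-%d\n", i+1, r.Start, r.End)
	}
}
```

The driver would then call initiate-multipart-upload, issue one copy-object-part per range, and finish with complete-multipart-upload.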

This feature was recently implemented in the Haskell SDK.

@harshavardhana
Member

harshavardhana commented Feb 23, 2017

Support source objects with an arbitrary range header (i.e. any valid start and end offset of a source object) via multipart copy object.

Why should this be supported? What benefit does it provide to a user? And at what point does a user actually know that they need to copy only a certain range of an object?

Since we cannot append to the destination, I don't see how this API behavior benefits anyone.

@hashbackup

hashbackup commented Feb 23, 2017

HashBackup could put a multipart ranged copy to good use. HB packs files into arc files during the backup. These default to 100MB but can be larger, like 4GB. Over time, arc files get "holes" poked in them as files are deleted due to retention policies.

For example, you backup a 75MB file and a 25MB file into 1 arc file and store it on S3. The first file is marked deleted. To actually recover space, the 100MB arc file has to be downloaded, packed, and uploaded. The download is where high costs are incurred.

By using a series of multipart copy requests, this packing operation could be done remotely, without requiring a download. I think the only cost would be the request cost: I couldn't see where Amazon charges fees for copy based on the size of the data.

(Just realized this is for the Go binding, and I'm using the Python binding)

@harshavardhana
Member

harshavardhana commented Feb 23, 2017

Yes, but Minio libraries are not meant to expose lower-level multipart operations. For that you should use the AWS SDKs, or copy the Minio library source into your repo.

I don't see why we should explore range APIs while not exposing the underlying multipart APIs.

@hashbackup

A range list of start-end offsets could be added to copy_object without exposing multipart.

@donatello
Member Author

Why should this be supported? What benefit does it provide to a user? And at what point does a user actually know that they need to copy only a certain range of an object?

Since we cannot append to the destination, I don't see how this API behavior benefits anyone.

@harshavardhana Here is my reasoning about this:

  • Adding a range header is a simple extension to copy-object, in the same spirit as put-object, which handles all sizes transparently.
  • API-wise, adding a single range is similar to get-object-partial, which lets a user download part of an object. For copy-object, it just lets a user copy a part (i.e. a single contiguous segment) of an object into a new object.
  • The range header just adds another possible source for copy-object, and it is a simple extension - the logic to create a large object (>5GiB) already needs to use range headers in the lower-level copy-object-part API. It takes very little code to allow the caller of the (high-level) copy-object to specify start-end offsets.
  • The input to this high-level API is the same as for the (low-level) copy-object-part API, i.e. source object, optional range offsets (only start and end), and optional copy-conditions.

When discussing this with @balamurugana, he suggested an even more general API that accepts multiple source objects, with one or more start-end offset pairs for each, which can be used to create a single object on the server side using only copy operations. He believed this is a useful operation for working with related objects that are created separately and finally need to be stitched together (e.g. large video production/rendering applications, and @hashbackup's application above). This was going to be my next proposal.
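The stitching idea can be made concrete with a small planning sketch: given several sources, each with a byte range, emit the ordered copy-object-part requests that would compose them, splitting any segment larger than S3's 5 GiB per-part copy limit. All names here (srcSpec, planCompose, the bucket/object names) are illustrative, not a proposed API:

```go
package main

import "fmt"

const maxPartSize = 5 * 1024 * 1024 * 1024 // 5 GiB per copy-object-part

// srcSpec names a source object and an inclusive byte range within it.
type srcSpec struct {
	Bucket, Object string
	Start, End     int64
}

// copyPart is one planned copy-object-part request.
type copyPart struct {
	Bucket, Object string
	Start, End     int64
	PartNumber     int
}

// planCompose flattens the source specs into an ordered list of
// copy-object-part requests, splitting any segment larger than
// maxPartSize across several parts.
func planCompose(srcs []srcSpec) []copyPart {
	var parts []copyPart
	n := 1
	for _, s := range srcs {
		for start := s.Start; start <= s.End; start += maxPartSize {
			end := start + maxPartSize - 1
			if end > s.End {
				end = s.End
			}
			parts = append(parts, copyPart{s.Bucket, s.Object, start, end, n})
			n++
		}
	}
	return parts
}

func main() {
	// Stitch the first 75 MiB of arc1 and all 25 MiB of arc2 into one object.
	const mib = int64(1024 * 1024)
	plan := planCompose([]srcSpec{
		{"backups", "arc1", 0, 75*mib - 1},
		{"backups", "arc2", 0, 25*mib - 1},
	})
	for _, p := range plan {
		fmt.Printf("part %d: %s/%s bytes=%d-%d\n", p.PartNumber, p.Bucket, p.Object, p.Start, p.End)
	}
}
```

Note that this sketch ignores the other S3 multipart constraint discussed below: every part except the last must be at least 5 MiB, so a real implementation would have to reject or merge too-small segments.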

@hashbackup

A negative aspect of exposing ranges is that it might not actually work as expected. After reading about copy object with ranges on S3, it seems that each range must be at least 5 MiB, because it uses the multipart API. So if a user says to copy bytes 0-5 and bytes 20-30, what should happen? You could get very general and do a download, create a temp file with only the bytes needed, then upload it as a new file, but that seems way out of scope for Minio, and whether/how to do it would depend heavily on the storage service's capabilities.

@harshavardhana harshavardhana changed the title Support CopyObject for all sizes api: Support CopyObject for all sizes Mar 10, 2017
@deekoder deekoder added this to the Future milestone Apr 6, 2017
@harshavardhana
Member

Marking this as blocked, to discuss with @abperiasamy

@deekoder deekoder assigned donatello and unassigned donatello Jun 1, 2017
@harshavardhana
Member

BTW this is not blocked anymore @deekoder

@deekoder deekoder removed the blocked label Jun 27, 2017
@donatello
Member Author

CopyObject now supports objects of all sizes, copy-conditions, source object ranges, server-side-encryption with decryption of source and encryption of destination, and copying/setting user-metadata on the destination.

In addition, the ComposeObject function is added, which enables creating objects from multiple source objects by providing a concatenation specification.

These changes are available in version 3.0.0 onwards.
