Cloud storage on a target-by-target basis #1112

Closed · wlandau opened this issue Dec 14, 2019 · 10 comments

wlandau commented Dec 14, 2019

Proposal

Certainly not a new idea, but I think we are now ready to try. We just need to send individual target data files to the cloud, e.g. Amazon S3. The rest of the storr cache can stay local. The big data can live on the cloud, and the swarm of tiny metadata files can still live locally. That way, local drake caches are highly portable and shareable, and we can more easily trust the big data not to break when we move around caches.

What makes this idea possible now? Why not earlier? Because of specialized data formats: https://books.ropensci.org/drake/plans.html#special-data-formats-for-targets. Due to the mechanics of the implementation, drake can bypass the storr for big files, while letting storr keep taking care of the small files. That means we should be able to shuffle around big data when it counts, while avoiding unnecessary network transactions (e.g. richfitz/storr#72 would send metadata to the cloud as well, which would severely slow down data processing).

API

I am thinking about a new argument to make() and drake_config()

make(plan, storage = "amazon_s3")

and target-specific configurable storage

plan <- drake_plan(
  small_data = get_small_data(), # not worth uploading to the cloud
  large_data = target(
    get_large_data(),
    storage = "amazon_s3"
  )
)

wlandau commented Dec 14, 2019

Another thing: most of our specialized formats from https://books.ropensci.org/drake/plans.html#special-data-formats-for-targets should be suitable for cloud storage.

plan <- drake_plan(
  small_data = get_small_data(), # not worth uploading to the cloud
  large_data = target(
    get_large_data(),
    format = "fst",
    storage = "amazon_s3"
  ),
  small_model = target(
    fit_model(small_data),
    format = "keras"
  ),
  large_model = target(
    fit_model(large_data),
    format = "keras",
    storage = "amazon_s3"
  )
)

wlandau commented Dec 14, 2019

Related: DiskFrame/disk.frame#163 (comment), cc @xiaodaigh.

wlandau commented Dec 14, 2019

Also, any non-local storage should force the "rds" format if the user does not already supply one.
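
A quick sketch of how that default could work. The helper below is hypothetical, not an existing drake internal.

# Hypothetical helper, not drake code: fall back to "rds" when a target is
# cloud-stored but the user did not choose a specialized format.
resolve_storage_format <- function(format = NULL, storage = "local") {
  if (!identical(storage, "local") && is.null(format)) {
    return("rds") # most general serialization, safe to ship to the cloud
  }
  format
}

resolve_storage_format(storage = "amazon_s3")                 # "rds"
resolve_storage_format(format = "fst", storage = "amazon_s3") # "fst"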

wlandau commented Jan 5, 2020

I can see a couple different directions:

  1. Build the target on the cloud and store the result there.
  2. Build the target locally and upload the result to the cloud.

Option 1 avoids shuffling the data over a network, but it requires a full EC2 instance on top of the storage, and it requires us to send drake's environment and the target's dependencies to the cloud. Option 2 is far easier to implement and only requires an S3 bucket (no EC2 instance), but downloading and uploading each target would be slow. What do you all think?
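
For concreteness, a rough sketch of what option 2 could look like today, outside of drake, with the cloudyr aws.s3 package. The helper and bucket name are made up for illustration, not a proposed API.

# Sketch of option 2: build the target locally, then push the serialized
# value to an existing S3 bucket. upload_target() and the bucket name are
# placeholders, not part of drake.
library(aws.s3)

upload_target <- function(value, name, bucket = "my-drake-bucket") {
  path <- tempfile(fileext = ".rds")
  saveRDS(value, path)                                    # serialize locally
  put_object(file = path, object = name, bucket = bucket) # upload to S3
  invisible(name)
}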

wlandau commented Jan 13, 2020

cloudyr and the WebTechnologies task view have a bunch of helpful packages. cloudyr is going through a transition, but I am confident something actively maintained will turn up eventually.

wlandau commented Feb 19, 2020

Maybe pins could be a useful abstraction?
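
If pins turned out to be the right abstraction, the round trip might look roughly like this. This sketch uses the current pins board API, which postdates this comment; the bucket and pin names are made up.

# Minimal sketch of a pins-backed S3 round trip for a target's value.
library(pins)

large_data <- data.frame(x = rnorm(10))  # stand-in for a real target value
board <- board_s3("my-drake-bucket")     # credentials from the usual AWS env vars
pin_write(board, large_data, name = "large_data", type = "rds")  # upload
large_data_copy <- pin_read(board, "large_data")                 # download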

wlandau commented Feb 22, 2020

#1178 laid the groundwork for dynamic files on the cloud. If we're going to send data to AWS, I plan on implementing it roughly like this.

store_on_aws <- function(key) {
  write_file(key)  # write the target's data to a local file first
  send_to_aws(key) # then push that file to the S3 bucket
  drake_aws_object(key, auth_info = getOption("drake_aws_auth_info"))
}

plan <- drake_plan(
  upload_step = target(
    store_on_aws("key"),
    format = "aws"
  )
)

As with #1179, we need some way of verifying that the data actually reached the cloud and is available. Surely AWS S3 has checksum functionality...

wlandau commented Mar 5, 2020

Just found out about https://github.com/paws-r/paws.
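
Picking up the checksum question from the previous comment, here is a hedged sketch of how paws might cover both the upload and the verification, using the fact that the S3 ETag of a single-part upload is the object's MD5. The function name, bucket, and key are illustrative, not proposed drake API.

# Upload a local file with paws and confirm the object arrived intact by
# comparing the S3 ETag against a locally computed MD5.
library(paws)

upload_and_verify <- function(path, bucket, key) {
  s3 <- paws::s3()
  body <- readBin(path, what = "raw", n = file.size(path))
  s3$put_object(Bucket = bucket, Key = key, Body = body)        # send the file
  etag <- gsub("\"", "", s3$head_object(Bucket = bucket, Key = key)$ETag)
  identical(etag, unname(tools::md5sum(path)))                  # TRUE if intact
}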

wlandau commented Mar 12, 2020

Change of direction

My original plan was to work up to this issue with #1168, which was valuable in its own right because static-only files were a stumbling block for so many users. But in this case, I think we should back off. Reasons:

  1. External workarounds should exist using the right condition or change trigger (see the sketch after this list), and they are probably not so bad compared to configuring one's project to use the cloud from R. I would welcome reprexes of use cases so I can help with best practices.
  2. On reflection, I do not think automatic cloud storage is a pressing need. Few people would use it relative to the considerable effort and even more considerable technical debt involved (even after the pre-work in Dynamic files #1168).
  3. I do not compute on the cloud routinely enough to be sure the proposal in this issue is the right approach. It may just be better to do both storage and computation on the same cloud instance, and drake is already equipped for this.
  4. We would need to keep up with the changing landscape of cloud platforms and possible changes in authentication requirements, a constant drain on maintenance.
  5. To reliably unit-test this feature set, we would need to maintain an AWS S3 bucket, along with analogous storage on other platforms we choose to support. The constant monetary drain and maintenance burden of the tests is not worth it.
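
To make point 1 concrete, here is a hedged sketch of the kind of external workaround meant there: a dedicated upload target driven by drake's trigger() API. upload_to_s3() and s3_etag() are hypothetical user-side helpers (which could wrap aws.s3 or paws), not drake functions.

# External workaround sketch: a separate target re-uploads large_data whenever
# the upstream target changes (default depend trigger) or whenever the object
# stored on S3 changes or disappears (change trigger).
plan <- drake_plan(
  large_data = get_large_data(),
  large_data_upload = target(
    upload_to_s3(large_data),
    trigger = trigger(change = s3_etag("my-bucket", "large_data"))
  )
)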

@wlandau wlandau closed this as completed Mar 12, 2020