Cloud storage on a target-by-target basis #1112

Closed · wlandau opened this issue Dec 14, 2019 · 10 comments

wlandau commented Dec 14, 2019

Proposal

Certainly not a new idea, but I think we are now ready to try. We just need to send individual target data files to the cloud, e.g. Amazon S3. The rest of the storr cache can stay local. The big data can live on the cloud, and the swarm of tiny metadata files can still live locally. That way, local drake caches are highly portable and shareable, and we can more easily trust the big data not to break when we move around caches.

What makes this idea possible now? Why not earlier? Because of specialized data formats: https://books.ropensci.org/drake/plans.html#special-data-formats-for-targets. Due to the mechanics of the implementation, drake can bypass the storr for big files, while letting storr keep taking care of the small files. That means we should be able to shuffle around big data when it counts, while avoiding unnecessary network transactions (e.g. richfitz/storr#72 would send metadata to the cloud as well, which would severely slow down data processing).

API

I am thinking about a new argument to make() and drake_config()

make(plan, storage = "amazon_s3")

and target-specific configurable storage

plan <- drake_plan(
  small_data = get_small_data(), # not worth uploading to the cloud
  large_data = target(
    get_large_data(),
    storage = "amazon_s3"
  )
)

wlandau commented Dec 14, 2019

Another thing: most of our specialized formats from https://books.ropensci.org/drake/plans.html#special-data-formats-for-targets should be suitable for cloud storage.

plan <- drake_plan(
  small_data = get_small_data(), # not worth uploading to the cloud
  large_data = target(
    get_large_data(),
    format = "fst",
    storage = "amazon_s3"
  ),
  small_model = target(
    fit_model(small_data),
    format = "keras"
  ),
  large_model = target(
    fit_model(large_data),
    format = "keras",
    storage = "amazon_s3"
  )
)

wlandau commented Dec 14, 2019

Related: DiskFrame/disk.frame#163 (comment), cc @xiaodaigh.

wlandau commented Dec 14, 2019

Also, any non-local storage should force the "rds" format if the user does not already supply one.
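
A quick sketch of how that default could work. The helper below is hypothetical, not an existing drake internal.

# Hypothetical helper, not drake code: fall back to "rds" when a target is
# cloud-stored but the user did not choose a specialized format.
resolve_storage_format <- function(format = NULL, storage = "local") {
  if (!identical(storage, "local") && is.null(format)) {
    return("rds") # most general serialization, safe to ship to the cloud
  }
  format
}

resolve_storage_format(storage = "amazon_s3")                 # "rds"
resolve_storage_format(format = "fst", storage = "amazon_s3") # "fst"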

wlandau commented Jan 5, 2020

I can see a couple different directions:

  1. Build the target on the cloud and store the result there.
  2. Build the target locally and upload the result to the cloud.

Option 1 avoids shuffling the data over a network, but it requires a full EC2 instance on top of the storage, and it requires us to send drake's environment and the target's dependencies to the cloud. Option 2 is far easier to implement and only requires an S3 bucket (no EC2 instance), but downloading and uploading each target would be slow. What do you all think?
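
For concreteness, a rough sketch of what option 2 could look like today, outside of drake, with the cloudyr aws.s3 package. The helper and bucket name are made up for illustration, not a proposed API.

# Sketch of option 2: build the target locally, then push the serialized
# value to an existing S3 bucket. upload_target() and the bucket name are
# placeholders, not part of drake.
library(aws.s3)

upload_target <- function(value, name, bucket = "my-drake-bucket") {
  path <- tempfile(fileext = ".rds")
  saveRDS(value, path)                                    # serialize locally
  put_object(file = path, object = name, bucket = bucket) # upload to S3
  invisible(name)
}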

wlandau commented Jan 13, 2020

cloudyr and the WebTechnologies task view have a bunch of helpful packages. cloudyr is going through a transition, but I am confident something actively maintained will turn up eventually.

wlandau commented Feb 19, 2020

Maybe pins could be a useful abstraction?
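
If pins turned out to be the right abstraction, the round trip might look roughly like this. This sketch uses the current pins board API, which postdates this comment; the bucket and pin names are made up.

# Minimal sketch of a pins-backed S3 round trip for a target's value.
library(pins)

large_data <- data.frame(x = rnorm(10))  # stand-in for a real target value
board <- board_s3("my-drake-bucket")     # credentials from the usual AWS env vars
pin_write(board, large_data, name = "large_data", type = "rds")  # upload
large_data_copy <- pin_read(board, "large_data")                 # download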

wlandau commented Feb 22, 2020

#1178 laid the groundwork for dynamic files on the cloud. If we're going to send data to AWS, I plan on implementing it roughly like this.

store_on_aws <- function(key) {
  write_file(key)  # write the target's data to a local file first
  send_to_aws(key) # then push that file to the S3 bucket
  drake_aws_object(key, auth_info = getOption("drake_aws_auth_info"))
}

plan <- drake_plan(
  upload_step = target(
    store_on_aws("key"),
    format = "aws"
  )
)

As with #1179, we need some way of verifying that the data actually reached the cloud and is available. Surely AWS S3 has checksum functionality...

wlandau commented Mar 5, 2020

Just found out about https://github.com/paws-r/paws.
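
Picking up the checksum question from the previous comment, here is a hedged sketch of how paws might cover both the upload and the verification, using the fact that the S3 ETag of a single-part upload is the object's MD5. The function name, bucket, and key are illustrative, not proposed drake API.

# Upload a local file with paws and confirm the object arrived intact by
# comparing the S3 ETag against a locally computed MD5.
library(paws)

upload_and_verify <- function(path, bucket, key) {
  s3 <- paws::s3()
  body <- readBin(path, what = "raw", n = file.size(path))
  s3$put_object(Bucket = bucket, Key = key, Body = body)        # send the file
  etag <- gsub("\"", "", s3$head_object(Bucket = bucket, Key = key)$ETag)
  identical(etag, unname(tools::md5sum(path)))                  # TRUE if intact
}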

wlandau commented Mar 12, 2020

Change of direction

My original plan was to work up to this issue with #1168, which was valuable in its own right because static-only files were a stumbling block for so many users. But in this case, I think we should back off. Reasons:

  1. External workarounds should exist using the right condition or change trigger (see the sketch after this list), and they are probably not so bad compared to configuring one's project to use the cloud from R. I would welcome reprexes of use cases so I can help with best practices.
  2. On reflection, I do not think automatic cloud storage is a pressing need. Few people would use it relative to the considerable effort and even more considerable technical debt involved (even after the pre-work in Dynamic files #1168).
  3. I do not compute on the cloud routinely enough to be sure the proposal in this issue is the right approach. It may just be better to do both storage and computation on the same cloud instance, and drake is already equipped for this.
  4. We would need to keep up with the changing landscape of cloud platforms and possible changes in authentication requirements, a constant drain on maintenance.
  5. To reliably unit-test this feature set, we would need to maintain an AWS S3 bucket, along with analogous storage on other platforms we choose to support. The constant monetary drain and maintenance burden of the tests is not worth it.
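
To make point 1 concrete, here is a hedged sketch of the kind of external workaround meant there: a dedicated upload target driven by drake's trigger() API. upload_to_s3() and s3_etag() are hypothetical user-side helpers (which could wrap aws.s3 or paws), not drake functions.

# External workaround sketch: a separate target re-uploads large_data whenever
# the upstream target changes (default depend trigger) or whenever the object
# stored on S3 changes or disappears (change trigger).
plan <- drake_plan(
  large_data = get_large_data(),
  large_data_upload = target(
    upload_to_s3(large_data),
    trigger = trigger(change = s3_etag("my-bucket", "large_data"))
  )
)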

@wlandau wlandau closed this as completed Mar 12, 2020