Cloud storage on a target-by-target basis #1112
Another thing: most of our specialized formats from https://books.ropensci.org/drake/plans.html#special-data-formats-for-targets should be suitable for cloud storage.

```r
plan <- drake_plan(
  small_data = get_small_data(), # not worth uploading to the cloud
  large_data = target(
    get_large_data(),
    format = "fst",
    storage = "amazon_s3"
  ),
  small_model = target(
    fit_model(small_data),
    format = "keras"
  ),
  large_model = target(
    fit_model(large_data),
    format = "keras",
    storage = "amazon_s3"
  )
)
```
Related: DiskFrame/disk.frame#163 (comment), cc @xiaodaigh.

Also, any non-local storage should force the "rds" format if the user does not already supply one.
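A minimal sketch of that fallback, assuming a hypothetical internal helper (`resolve_format()` is not a real drake function):

```r
# Hypothetical helper: if a target asks for non-local storage but the user
# did not supply a format, fall back to "rds" so there is a file to upload.
resolve_format <- function(format = NULL, storage = "local") {
  if (!identical(storage, "local") && is.null(format)) {
    return("rds")
  }
  format
}

resolve_format(storage = "amazon_s3")                  # "rds"
resolve_format(format = "fst", storage = "amazon_s3")  # "fst"
```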
I can see a couple different directions:

Option 1 avoids shuffling the data over a network, but it requires a full Amazon cloud instance on top of the storage, and it requires us to send
Re #1112 (comment), let's focus this issue on (2). (1) is similar enough to enabling cloud-based HPCs through

Maybe
#1178 laid the groundwork for dynamic files on the cloud. If we're going to send data to AWS, I plan on implementing it roughly like this.

```r
store_on_aws <- function(key) {
  write_file(key)
  send_to_aws(key)
  drake_aws_object(key, auth_info = getOption("drake_aws_auth_info"))
}

plan <- drake_plan(
  upload_step = target(
    store_on_aws("key"),
    format = "aws"
  )
)
```

As with #1179, we need some way of verifying that the data actually reached the cloud and is available. Surely AWS S3 has checksum functionality...
Just found out about https://github.com/paws-r/paws.
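One way to check that an object actually landed in S3 could be to compare checksums via paws; a rough sketch, assuming single-part uploads (where the ETag is typically the object's MD5) and placeholder bucket/key names:

```r
library(paws)

# Hypothetical verification step: upload a local file and confirm the
# remote ETag matches the local MD5. Multipart uploads would need a
# different comparison.
upload_and_verify <- function(path, bucket, key) {
  s3 <- paws::s3()
  s3$put_object(
    Bucket = bucket,
    Key = key,
    Body = readBin(path, what = "raw", n = file.size(path))
  )
  remote_etag <- gsub('"', "", s3$head_object(Bucket = bucket, Key = key)$ETag)
  identical(remote_etag, unname(tools::md5sum(path)))
}
```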
Change of direction

My original plan was to work up to this issue with #1168, which was valuable in its own right because static-only files were a stumbling block for so many users. But in this case, I think we should back off. Reasons:
Prework

Abide by drake's code of conduct.

Proposal
Certainly not a new idea, but I think we are now ready to try. We just need to send individual target data files to the cloud, e.g. Amazon S3. The rest of the storr cache can stay local. The big data can live on the cloud, and the swarm of tiny metadata files can still live locally. That way, local drake caches are highly portable and shareable, and we can more easily trust the big data not to break when we move around caches.

What makes this idea possible now? Why not earlier? Because of specialized data formats: https://books.ropensci.org/drake/plans.html#special-data-formats-for-targets. Due to the mechanics of the implementation, drake can bypass the storr for big files, while letting storr keep taking care of the small files. That means we should be able to shuffle around big data when it counts, while avoiding unnecessary network transactions (e.g. richfitz/storr#72 would send metadata to the cloud as well, which would severely slow down data processing).

API
I am thinking about a new argument to make() and drake_config(), and target-specific configurable storage.
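As a rough sketch of what that interface could look like (the storage argument and its values are hypothetical here, not a committed API):

```r
library(drake)

plan <- drake_plan(
  large_data = target(
    get_large_data(),        # user-defined loader, assumed to exist
    format = "fst",
    storage = "amazon_s3"    # hypothetical per-target override
  ),
  summary_stats = summarize_data(large_data)  # falls back to the default
)

# Hypothetical global default, overridden per target above.
make(plan, storage = "local")
```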