
Add AWS S3 support for rds storr #72

Open
wants to merge 4 commits into base: master

Conversation

@ben-gready ben-gready commented Apr 26, 2018

Hi there,

I'm not sure whether you're accepting pull requests, or what the proper etiquette is for asking, but I recently forked your (great) repo, and I think I have a useful contribution. I needed a server-agnostic back end that I could access from anywhere, and since I'm using AWS S3 for other things, I decided to implement a driver using that. I copied your rds driver file and added S3 equivalents for reads, writes, listing files, etc. It seems to work well, although I have not written tests at this point. I've added a little documentation (which I know is currently lacking, and in the wrong place).

Before I spend any more time on cleaning things up, I was wondering whether you're interested in accepting such a pull request in the first place, and if so, whether you'd mind taking a look at what I have done and letting me know where you'd like to see changes, improvements, etc. I have tried to code everything using base functions where possible, so as not to increase package dependencies.

Thanks for your time,
Ben

@richfitz
Owner

Hi Ben - thanks so much for this! There are people who have asked for this specifically (for example: #11 (comment)).

In general I am totally up for PRs, but I use storr as an infrastructure package so try to keep its footprint fairly light. Adding aws.s3 (which is totally the right way of doing what you're doing here) is going to drag in more dependencies than I'd like.

I set up storr to be somewhat extensible - see here for a Redis storr driver. So the options as I see it are:

  1. We could include this PR into the package, but probably would have to move aws.s3 into Suggests. That doesn't provide an ideal user experience
  2. Spin this out into its own micropackage (e.g., storr.s3) - I'd be happy to help and this is not a great deal of work aside from dealing with CRAN
  3. It could be folded into the aws.s3 package itself but I suspect that @leeper probably feels the same way as me about dependencies :)

@richfitz
Owner

I see you already made aws.s3 a Suggests - thanks.

A fourth option is to make a package storr.remote and include things like access over ssh (using the ssh package)...

@leeper

leeper commented Apr 26, 2018

Agree. aws.s3 is meant to be low-level. Glad to see it being used, though!


@richfitz richfitz left a comment


Minor comments throughout - most of which are style at the moment. Testing is the big thing that needs doing. To do this we need:

  1. a function in a helper-xxx file like skip_if_no_aws_credentials
  2. full spec test of the new driver - see https://github.com/richfitz/redux/blob/542f10d56fcf87505ea640f420542e297de81391/tests/testthat/test-storr.R#L3-L8 and https://github.com/richfitz/storr/blob/master/tests/testthat/test-auto.R for examples
  3. Tests of any s3 internals
  4. Ideally, set up AWS credentials (from a free-tier account) for use on Travis
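For the first item, a minimal sketch of such a skip helper (hypothetical code following the usual testthat skip-helper pattern; the function name comes from the comment above, the environment-variable names are the standard AWS ones):

```r
## Hypothetical helper for a tests/testthat/helper-aws.R file: skip a
## test when no AWS credentials are available in the environment, so
## the suite still passes on machines (and CI) without S3 access.
skip_if_no_aws_credentials <- function() {
  if (nzchar(Sys.getenv("AWS_ACCESS_KEY_ID")) &&
      nzchar(Sys.getenv("AWS_SECRET_ACCESS_KEY"))) {
    return(invisible(TRUE))
  }
  testthat::skip("AWS credentials not available")
}
```

Each S3-touching test would then call `skip_if_no_aws_credentials()` as its first line.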

These will be needed/desirable regardless of where this ends up (comments in main discussion)

@@ -22,6 +22,7 @@ Suggests:
knitr,
mockr,
rbenchmark,
testthat (>= 1.0.0)
testthat (>= 1.0.0),
aws.s3
Owner


Can you put the dependencies in alphabetical order?

}

##' @export
##' @rdname storr_rds
Owner


Should be storr_rds_s3

Owner


(This is why the Travis build is failing.)

##' @param default_namespace Default namespace (see
##' \code{\link{storr}}).
##' @export
##' @examples
Owner


This section needs redoing

##' exists (or no mangledness if creating a new storr).
##'
##' @title rds object cache driver
##' @param path Path for the store. \code{tempdir()} is a good choice
Owner


Consider @inheritParams storr_rds

},

type = function() {
"rds_s3"
Owner


I wonder if the type should simply be s3?

## However it will throw an error if the file we are trying to delete looks like a directory.
## S3 has no actual notion of directory, we just fake it using "/". As a result, it's possible to get into a muddle.
## to play it safe, line below can be uncommented to force it to delete just the path given, but throw a warning, if it does look like a directory
## can also change to if_dir = "del_recursive" to delete the whole directory with a warning. May never actually show up as an issue, this is just a note.
Owner


If we're the only ones modifying the bucket, is this a real concern? I don't know enough about S3! Can you reflow to 80 characters, please?

invisible(exists)
}

s3_writeLines <- function(text, path, bucket){
Owner


Space between ( and {


s3_delete_file <- function(bucket, path, if_dir = c("stop", "del_only_key", "del_recursive")){
files <- aws.s3::get_bucket_df(bucket = bucket, prefix = path, max = Inf)[["Key"]]
if(length(files) > 1){
Owner


General style cleanup through here, please

@@ -12,7 +12,7 @@ Simple object cacher for R. `storr` acts as a very simple key-value store (supp
* Fetch from an external source (e.g. website) if a key is not found locally
* Pluggable storage backends - currently
- environment (memory)
- rds (disk)
- rds (disk, AWS S3)
Owner


I think that it makes sense to think of this as an S3 driver rather than an rds driver

@@ -43,3 +43,15 @@ remotes::install_github("richfitz/storr@develop", upgrade = FALSE)

* [storr](https://richfitz.github.io/storr/articles/storr.html) `vignette("storr")` outlines basic use and core implementation details.
* [external](https://richfitz.github.io/storr/articles/external.html) `vignette("external")` shows how to use storr to cache external resources such as files, web resources, etc, using the `storr_external` object.

## RDS on AWS S3
Owner


I think an entirely new vignette would be the best solution for documenting this

@ben-gready
Author

Thanks so much for the replies and feedback.

I'd be happy going down any of the routes you mentioned - I particularly like your suggestion of storr.remote. While I won't have time to add other remote drivers in the near future, I think I could abstract out all of the AWS-specific read/write/list functions so that they can be easily reproduced for other back ends, if others wish to contribute.

Assuming you think a separate package is the way forward, I will set up a new repo and copy this stuff over, taking into account your recommendations above. This is mostly an "evenings" project right now, so I will link you to the new repo once I have created it. I would gratefully take you up on the offer of help in getting this onto CRAN, since this would be my first submission :)

Thanks again for the great feedback!

Ben

@ben-gready
Author

ben-gready commented Apr 27, 2018

Hello again,

Just to let you know, I have created the storr.remote repo here. On the master branch, I have included the stuff in this PR & addressed most of your above comments, although not all yet.

I am also working on a branch where I have tried to abstract all S3 file operations into their own R6 class that can be fed into the R6_driver_rds class (I have actually created a version of that called R6_driver_rds_remote). It's not quite working yet, but my hope is that it will be possible to just replicate the S3 file-operations class for another backend, and everything else will plug in.
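The shape of such a file-operations abstraction might look roughly like this (a hypothetical sketch, not the actual code in the repo - the class name and method names are illustrative, and only the aws.s3 calls are real):

```r
## Hypothetical sketch: a self-contained file-operations class the rds
## driver could delegate to. An ssh (or other remote) backend would
## expose the same methods, so the rest of the driver plugs in unchanged.
s3_ops <- R6::R6Class(
  "s3_ops",
  public = list(
    bucket = NULL,

    initialize = function(bucket) {
      self$bucket <- bucket
    },

    ## Fetch an object and decode it as a single string
    read_string = function(path) {
      rawToChar(aws.s3::get_object(path, bucket = self$bucket))
    },

    ## Store a string under the given key
    write_string = function(text, path) {
      aws.s3::put_object(charToRaw(text), object = path,
                         bucket = self$bucket)
    },

    ## Does a key exist in the bucket?
    exists = function(path) {
      aws.s3::object_exists(path, bucket = self$bucket)
    },

    ## Remove a single key
    delete = function(path) {
      aws.s3::delete_object(path, bucket = self$bucket)
    }
  )
)
```

The driver would then hold one of these objects and call only these methods, never aws.s3 directly.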

Still no testing or real documentation (I will get there, I promise), and I'm sure lots of other things. Hopefully I can find time to continue in the next few days.

Thanks,
Ben

@richfitz
Owner

Awesome work! I will look at this early next week 🎉

ben-gready/storr.remote#1

@richfitz
Owner

richfitz commented May 3, 2018

see also: #61

@richfitz
Owner

richfitz commented May 3, 2018

Next place to look is: #74

@richfitz
Owner

richfitz commented May 3, 2018

OK, I got a bit more dragged into this today than I was expecting, so most of the pieces are now in place. I think this is the process from here:

  1. please review Remote file operations and driver #74 for the infrastructure changes in storr. These introduce no new dependencies, but they do depend quite heavily on the underlying rds driver, so I think it is best to put them into storr itself. There is a dummy driver in the test directory (see the PR itself for details). We can change anything here if the abstraction does not seem right (names of classes and drivers, methods, method signatures, etc.)
  2. please review ssh support and a second round of abstract file operations ben-gready/storr.remote#2 (which replaces Abstract file ops ben-gready/storr.remote#1 but includes content from it!) - note that Remote file operations and driver #74 needs to be merged before the storr.remote changes make any real sense
  3. test the blind changes I made to the S3 part of the PR
  4. set up automatic testing - see the ssh tests for a complete example and the drivers vignette
  5. close Add AWS S3 support for rds storr #72 and Abstract file ops ben-gready/storr.remote#1

@richfitz
Owner

Hi @ben-gready - have you had a chance to look at any of this yet?
