Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Import drake caches into other drake caches #1015

Closed
2 tasks done
wlandau opened this issue Sep 25, 2019 · 16 comments
Closed
2 tasks done

Import drake caches into other drake caches #1015

wlandau opened this issue Sep 25, 2019 · 16 comments

Comments

@wlandau
Copy link
Member

wlandau commented Sep 25, 2019

Prework

Description

$import() in storr gets us part of the way there, but we also have history files and specialized data files to wrangle too. (So we need wlandau/txtq#17). With some work on drake's decorated storr, this should be possible to do the following with impunity:

cache1 <- drake_cache("path1")
cache2 <- drake_cache("path2")
cache1$import(cache2)

@billdenney and @noamross, how much would this address the issues around remote caches and collaborative projects you brought up recently?

@wlandau
Copy link
Member Author

wlandau commented Sep 25, 2019

The decorated storr would need a serious $import method instead of delegating to the inner storr.

import = function(...) .self$storr$import(...),

@billdenney
Copy link
Contributor

Thanks for starting this one!

In my use case, I’d like to have a bit more and different functionality. I have a few different concerns and use cases:

For my collaboration work, people are often working with similar data types and therefore target names. For example, cache1 and cache2 may both have targets named “vital_signs”. For that, a direct import would fail without putting the caches into some separated namespace.

The other issue that I encounter is that I’d like to be able to run the plans that originally generated cache1 and cache2 in case new data comes in without modifying the plans.

In practice, I usually want only a subset of the targets out of cache1 and cache2 in the new, combined cache. Often, I only need one target from both when I’m combining them.

In my ideal scenario the way I’m thinking about it now, I think it would be possible to have something akin to namespaces within caches. Then, I could import cache1 and cache2 into a new cache3. I could access cache1 objects from cache3 like: cache1::targetname or cache2::targetname, and then when generating new items in cache3 they could refer to those target names.

And, overall, I would be ok rerunning everything and my use case is more at the plan than the cache level (though both would be helpful in different ways). For me, most of my plans take minutes to hours to run, and a one time cost of rerunning them upon combination would be ok.

@noamross
Copy link
Contributor

I think this would cover it in my case. In my scenario people are working on the same project / plan, so I don't expect clashes between target names from different plans.

I note that the alternative workflow I'm thinking of using a shared S3-based storr, which require someone, maybe me, wrapping up this old PR to storr for an S3 back-end: richfitz/storr#72 . But this import method will probably be more efficient while running, because it won't require so many network calls for small transactions.

@wlandau
Copy link
Member Author

wlandau commented Sep 26, 2019

@noamross:

I think this would cover it in my case. In my scenario people are working on the same project / plan, so I don't expect clashes between target names from different plans.

Sounds good. I think the path to a first implementation is clear.

I note that the alternative workflow I'm thinking of using a shared S3-based storr, which require someone, maybe me, wrapping up this old PR to storr for an S3 back-end: richfitz/storr#72 . But this import method will probably be more efficient while running, because it won't require so many network calls for small transactions.

Yes, if the storr begins as an S3 file system, network delays will slow down make() considerably. Not only that, but history and specially-formatted targets will still be stored locally.

@billdenney, unfortunately, drake already uses several storr namespaces, so we cannot define different namespaces to indicate different caches. In your case, would an import_target() method help for bringing in individual targets?

@billdenney
Copy link
Contributor

@wlandau, yes, I think that an import_target() method would help. Ideally, it would perform some type of currency check so that out of date targets would not be imported or so that it would have some form of notification for an out of date target (perhaps without a force argument).

@wlandau
Copy link
Member Author

wlandau commented Sep 27, 2019

Hmm... I hesitate to check for outdatedness because it is comparatively slow and requires an examination of the whole plan. Ultimately, you will need to call outdated() or r_outdated() after you import stuff. If stuff falls out of date and you want to go back, you should be able to run make(recover = TRUE) and still have the imported targets in the cache until you run garbage collection.

@wlandau
Copy link
Member Author

wlandau commented Sep 27, 2019

Another thing I just realized. In drake, an "import" is a function or other object drake brings in from your (usually global) environment. In storr, we talk about importing one key-value store into another. The methods $import and $import_target() are really extensions of storr, so it might be best to defer to storr's name conventions.

@billdenney
Copy link
Contributor

Could there be a flag to check for the imported target being outdated when imported? I ask because, while slow/expensive, I would appreciate the consistency that it would guarantee for the plan combination.

@wlandau
Copy link
Member Author

wlandau commented Sep 28, 2019

Taking a step back

Sorry, I think that's the point where things get too complicated to automate well.

Another way to think about it: a target you import could be invalid in your cache and plan even if it was originally up to date in your colleague's cache and plan. So there is no way to be sure about the kind of target status that ultimately matters. Spelling this out actually makes me question the wisdom of import_target()...

The more I think about it, the more I think the merging you describe is going to require a fully-fledged version control system. For all I know, git may be able to merge everyone's .drake/'s and everyone's plans. It is a big risk, but it goes farther than anything drake itself can provide cleanly.

Can we talk about why you want to merge different drake projects together in the first place? Maybe we should instead focus on better ways to chain multiople pipelines together. That's a legitimate way to go.

Proposal

  1. Use the change trigger to carry over build decisions from one plan to another.
  2. Use ignore() on the cache objects so R6 funny business (R6 object keeps rebuilding #345) does not trigger too many builds.
library(drake)

cache1 <- new_cache("cache1")
cache2 <- new_cache("cache2")

plan1 <- drake_plan(x = 1)
plan2 <- drake_plan(
  x = target(
    ignore(cache1)$get("x"),
    trigger = trigger(change = ignore(cache1)$get_hash("x"))
  )
)

# First runthroughs
make(plan1, cache = cache1)
#> target x
make(plan2, cache = cache2)
#> target x

# Change nothing
make(plan1, cache = cache1)
#> All targets are already up to date.
make(plan2, cache = cache2)
#> All targets are already up to date.

# Change x in the *first* plan
plan1 <- drake_plan(x = 2)
make(plan1, cache = cache1)
#> target x

# x changes in plan2 because of the change trigger
make(plan2, cache = cache2)
#> target x

Created on 2019-09-27 by the reprex package (v0.3.0)

@billdenney
Copy link
Contributor

That type of chaining would do effectively all of what I need. Everything else I'm asking for (like detecting outdated objects from the other cache) is more of a want. I don't think I would have gotten to the ignore(cache1) on my own.

As a simplification, would it be reasonable to do something like this instead for plan2?

plan2 <- drake_plan(
  x = readd(x, cache=cache1)
)

If so, perhaps the cache argument of readd() could be special-cased with an ignore() automatically wrapped around it for this scenario. My guess is that even if mine will work, your method is more efficient.

@wlandau
Copy link
Member Author

wlandau commented Sep 28, 2019

Special-casing would unfortunately break the promise of static code analysis that the values of objects behind the symbols do not matter. But if it helps make the plan look cleaner, this will probably work:

cache1 <- function() {
  drake_cache("cache1")
}

plan2 <- drake_plan(
  x = target(
    cache1()$get("x"),
    trigger = trigger(change = cache1()$get_hash("x"))
  )
)

@billdenney
Copy link
Contributor

In that case, I'd go for the slightly-less-indirection method you originally proposed. I think that from my perspective, you can consider this closed. My only slight additional request would be that your suggestion in #1015 (comment) would be helpful to have in some form of documentation (perhaps it's not exactly a "frequently" asked question, but knowing how to do it could perhaps help others ;) ).

@wlandau
Copy link
Member Author

wlandau commented Sep 29, 2019

I think an FAQ is easiest to way to insert this workaround in the docs. If you find a place in the manual that fits, please feel free to submit a PR.

I changed my mind on cache$import() though. Implementing it was more involved than I expected.

  • The way drake uses namespaces clashes with storr's import method. We have to manually exclude the "progress" namespace or else it will error out.
  • storr loads everything into memory, which does not bode well for typical drake use cases. The method I implemented loads targets one at a time, supports parallel computing, and uses optional garbage collection. PR forthcoming.

As a side-effect of this new strategy, you will be able to import targets one at a time.

@billdenney
Copy link
Contributor

Thanks!

@wlandau
Copy link
Member Author

wlandau commented Nov 17, 2019

@billdenney, just an FYI: in the next release of drake, you will no longer need to explicitly ignore() caches (#1071) so the workaround in #1015 (comment) will no longer be necessary. Turns out we can special-case caches in the storage phase (as opposed to the code analysis phase).

@billdenney
Copy link
Contributor

That's great, thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants