Import drake caches into other drake caches #1015

wlandau · 2019-09-25T19:55:15Z

Prework

Read and abide by drake's code of conduct.
Search for duplicates among the existing issues, both open and closed.

Description

$import() in storr gets us part of the way there, but we also have history files and specialized data files to wrangle too. (So we need wlandau/txtq#17). With some work on drake's decorated storr, this should be possible to do the following with impunity:

cache1 <- drake_cache("path1")
cache2 <- drake_cache("path2")
cache1$import(cache2)

@billdenney and @noamross, how much would this address the issues around remote caches and collaborative projects you brought up recently?

The text was updated successfully, but these errors were encountered:

wlandau · 2019-09-25T19:56:15Z

The decorated storr would need a serious $import method instead of delegating to the inner storr.

drake/R/decorate_storr.R

Line 130 in 9d50193

import = function(...) .self$storr$import(...),

billdenney · 2019-09-25T20:45:10Z

Thanks for starting this one!

In my use case, I’d like to have a bit more and different functionality. I have a few different concerns and use cases:

For my collaboration work, people are often working with similar data types and therefore target names. For example, cache1 and cache2 may both have targets named “vital_signs”. For that, a direct import would fail without putting the caches into some separated namespace.

The other issue that I encounter is that I’d like to be able to run the plans that originally generated cache1 and cache2 in case new data comes in without modifying the plans.

In practice, I usually want only a subset of the targets out of cache1 and cache2 in the new, combined cache. Often, I only need one target from both when I’m combining them.

In my ideal scenario the way I’m thinking about it now, I think it would be possible to have something akin to namespaces within caches. Then, I could import cache1 and cache2 into a new cache3. I could access cache1 objects from cache3 like: cache1::targetname or cache2::targetname, and then when generating new items in cache3 they could refer to those target names.

And, overall, I would be ok rerunning everything and my use case is more at the plan than the cache level (though both would be helpful in different ways). For me, most of my plans take minutes to hours to run, and a one time cost of rerunning them upon combination would be ok.

noamross · 2019-09-25T20:49:18Z

I think this would cover it in my case. In my scenario people are working on the same project / plan, so I don't expect clashes between target names from different plans.

I note that the alternative workflow I'm thinking of using a shared S3-based storr, which require someone, maybe me, wrapping up this old PR to storr for an S3 back-end: richfitz/storr#72 . But this import method will probably be more efficient while running, because it won't require so many network calls for small transactions.

wlandau · 2019-09-26T20:18:17Z

@noamross:

I think this would cover it in my case. In my scenario people are working on the same project / plan, so I don't expect clashes between target names from different plans.

Sounds good. I think the path to a first implementation is clear.

I note that the alternative workflow I'm thinking of using a shared S3-based storr, which require someone, maybe me, wrapping up this old PR to storr for an S3 back-end: richfitz/storr#72 . But this import method will probably be more efficient while running, because it won't require so many network calls for small transactions.

Yes, if the storr begins as an S3 file system, network delays will slow down make() considerably. Not only that, but history and specially-formatted targets will still be stored locally.

@billdenney, unfortunately, drake already uses several storr namespaces, so we cannot define different namespaces to indicate different caches. In your case, would an import_target() method help for bringing in individual targets?

billdenney · 2019-09-27T05:18:30Z

@wlandau, yes, I think that an import_target() method would help. Ideally, it would perform some type of currency check so that out of date targets would not be imported or so that it would have some form of notification for an out of date target (perhaps without a force argument).

wlandau · 2019-09-27T15:51:16Z

Hmm... I hesitate to check for outdatedness because it is comparatively slow and requires an examination of the whole plan. Ultimately, you will need to call outdated() or r_outdated() after you import stuff. If stuff falls out of date and you want to go back, you should be able to run make(recover = TRUE) and still have the imported targets in the cache until you run garbage collection.

wlandau · 2019-09-27T15:53:41Z

Another thing I just realized. In drake, an "import" is a function or other object drake brings in from your (usually global) environment. In storr, we talk about importing one key-value store into another. The methods $import and $import_target() are really extensions of storr, so it might be best to defer to storr's name conventions.

billdenney · 2019-09-27T17:28:46Z

Could there be a flag to check for the imported target being outdated when imported? I ask because, while slow/expensive, I would appreciate the consistency that it would guarantee for the plan combination.

wlandau · 2019-09-28T02:28:37Z

Taking a step back

Sorry, I think that's the point where things get too complicated to automate well.

Another way to think about it: a target you import could be invalid in your cache and plan even if it was originally up to date in your colleague's cache and plan. So there is no way to be sure about the kind of target status that ultimately matters. Spelling this out actually makes me question the wisdom of import_target()...

The more I think about it, the more I think the merging you describe is going to require a fully-fledged version control system. For all I know, git may be able to merge everyone's .drake/'s and everyone's plans. It is a big risk, but it goes farther than anything drake itself can provide cleanly.

Can we talk about why you want to merge different drake projects together in the first place? Maybe we should instead focus on better ways to chain multiople pipelines together. That's a legitimate way to go.

Proposal

Use the change trigger to carry over build decisions from one plan to another.
Use ignore() on the cache objects so R6 funny business (R6 object keeps rebuilding #345) does not trigger too many builds.

library(drake)

cache1 <- new_cache("cache1")
cache2 <- new_cache("cache2")

plan1 <- drake_plan(x = 1)
plan2 <- drake_plan(
  x = target(
    ignore(cache1)$get("x"),
    trigger = trigger(change = ignore(cache1)$get_hash("x"))
  )
)

# First runthroughs
make(plan1, cache = cache1)
#> target x
make(plan2, cache = cache2)
#> target x

# Change nothing
make(plan1, cache = cache1)
#> All targets are already up to date.
make(plan2, cache = cache2)
#> All targets are already up to date.

# Change x in the *first* plan
plan1 <- drake_plan(x = 2)
make(plan1, cache = cache1)
#> target x

# x changes in plan2 because of the change trigger
make(plan2, cache = cache2)
#> target x

^{Created on 2019-09-27 by the reprex package (v0.3.0)}

billdenney · 2019-09-28T03:54:28Z

That type of chaining would do effectively all of what I need. Everything else I'm asking for (like detecting outdated objects from the other cache) is more of a want. I don't think I would have gotten to the ignore(cache1) on my own.

As a simplification, would it be reasonable to do something like this instead for plan2?

plan2 <- drake_plan(
  x = readd(x, cache=cache1)
)

If so, perhaps the cache argument of readd() could be special-cased with an ignore() automatically wrapped around it for this scenario. My guess is that even if mine will work, your method is more efficient.

wlandau · 2019-09-28T22:09:13Z

Special-casing would unfortunately break the promise of static code analysis that the values of objects behind the symbols do not matter. But if it helps make the plan look cleaner, this will probably work:

cache1 <- function() {
  drake_cache("cache1")
}

plan2 <- drake_plan(
  x = target(
    cache1()$get("x"),
    trigger = trigger(change = cache1()$get_hash("x"))
  )
)

billdenney · 2019-09-29T00:43:53Z

In that case, I'd go for the slightly-less-indirection method you originally proposed. I think that from my perspective, you can consider this closed. My only slight additional request would be that your suggestion in #1015 (comment) would be helpful to have in some form of documentation (perhaps it's not exactly a "frequently" asked question, but knowing how to do it could perhaps help others ;) ).

wlandau · 2019-09-29T01:38:06Z

I think an FAQ is easiest to way to insert this workaround in the docs. If you find a place in the manual that fits, please feel free to submit a PR.

I changed my mind on cache$import() though. Implementing it was more involved than I expected.

The way drake uses namespaces clashes with storr's import method. We have to manually exclude the "progress" namespace or else it will error out.
storr loads everything into memory, which does not bode well for typical drake use cases. The method I implemented loads targets one at a time, supports parallel computing, and uses optional garbage collection. PR forthcoming.

As a side-effect of this new strategy, you will be able to import targets one at a time.

billdenney · 2019-09-29T02:06:05Z

Thanks!

wlandau · 2019-11-17T13:00:30Z

@billdenney, just an FYI: in the next release of drake, you will no longer need to explicitly ignore() caches (#1071) so the workaround in #1015 (comment) will no longer be necessary. Turns out we can special-case caches in the storage phase (as opposed to the code analysis phase).

billdenney · 2019-11-17T14:55:48Z

That's great, thank you!

wlandau added difficulty: advanced topic: reproducibility labels Sep 25, 2019

wlandau mentioned this issue Sep 28, 2019

Bring history into the decorated storr API #1020

Closed

2 tasks

wlandau added the type: faq label Sep 29, 2019

wlandau mentioned this issue Sep 29, 2019

Import and export targets to and from drake caches #1021

Merged

4 tasks

wlandau closed this as completed in 10e183d Sep 29, 2019

This was referenced Nov 13, 2019

Dynamic subtargets that depend on each other #1061

Closed

Avoid the need for ignore() for storrs #1071

Closed

brendanf mentioned this issue Dec 9, 2019

Using targets imported from another cache #1100

Closed

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Import drake caches into other drake caches #1015

Import drake caches into other drake caches #1015

wlandau commented Sep 25, 2019

wlandau commented Sep 25, 2019

billdenney commented Sep 25, 2019

noamross commented Sep 25, 2019

wlandau commented Sep 26, 2019 •

edited

Loading

billdenney commented Sep 27, 2019

wlandau commented Sep 27, 2019

wlandau commented Sep 27, 2019

billdenney commented Sep 27, 2019

wlandau commented Sep 28, 2019

billdenney commented Sep 28, 2019

wlandau commented Sep 28, 2019

billdenney commented Sep 29, 2019

wlandau commented Sep 29, 2019

billdenney commented Sep 29, 2019

wlandau commented Nov 17, 2019

billdenney commented Nov 17, 2019

Import drake caches into other drake caches #1015

Import drake caches into other drake caches #1015

Comments

wlandau commented Sep 25, 2019

Prework

Description

wlandau commented Sep 25, 2019

billdenney commented Sep 25, 2019

noamross commented Sep 25, 2019

wlandau commented Sep 26, 2019 • edited Loading

billdenney commented Sep 27, 2019

wlandau commented Sep 27, 2019

wlandau commented Sep 27, 2019

billdenney commented Sep 27, 2019

wlandau commented Sep 28, 2019

Taking a step back

Proposal

billdenney commented Sep 28, 2019

wlandau commented Sep 28, 2019

billdenney commented Sep 29, 2019

wlandau commented Sep 29, 2019

billdenney commented Sep 29, 2019

wlandau commented Nov 17, 2019

billdenney commented Nov 17, 2019

wlandau commented Sep 26, 2019 •

edited

Loading