
Environments, reproducibility, and consistent target validation #761

Closed
wlandau opened this issue Mar 1, 2019 · 30 comments

@wlandau
Member

wlandau commented Mar 1, 2019

To Chicago R Unconference-goers

Let's talk about the API proposed here and documented here. Should the new functionality have different behavior? Better names?

Original post

The behavior of make() depends on the functions and data in your environment. If you change a function in the R console and then forget about it, you could falsely invalidate targets (prior discussions here, here, and here). This is a downside of drake's deliberately extreme domain-specificity (discussed here).

Temporary recommendations for users

  • Before you run make(), it is good practice to start a fresh R session and be consistent in the way you load your packages and functions. I recommend that you keep a collection of setup scripts and source() them in a new session before doing anything serious (example here). Docker and callr::r_vanilla() may also help.
  • Check outdated() and maybe deps_profile() before calling make() (dependency_profile() in drake version <= 6.2.1).
  • Alternatively, you could manually create a new envir <- new.env(parent = globalenv()), load all your scripts there with source(local = envir), and then call make(envir = envir). This is advanced usage, and it is inconvenient in many cases.
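For concreteness, the manual-environment workaround might look like the following sketch (the script paths are hypothetical placeholders for your own project files):

```r
# Build a clean environment that inherits from globalenv() so packages
# are still reachable, source the project scripts into it, then run make().
envir <- new.env(parent = globalenv())
source("R/functions.R", local = envir)
source("R/plan.R", local = envir)
make(envir$plan, envir = envir)
```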

Thoughts on development

drake will continue to have this problem as long as users are responsible for managing their R sessions. The most extreme and thorough solution would force every call to make(), outdated(), vis_drake_graph(), etc. to start a fresh session and run a bunch of R script files in a new, carefully controlled environment. This would be a major course correction, not impossible, but potentially fraught with major breaking changes and a loss of interactivity and domain-specificity.

But what if make() were to prompt the user if some targets are outdated and the session is interactive? (Disabled with make(force = TRUE) or options(drake_force_interactive = TRUE).)

plan <- drake_plan(
  data = get_data(),
  analysis = run_analysis(data)
)
make(plan)
#> Outdated targets:
#>  data
#>  analysis
#> Predicted total runtime (from predict_runtime()): ~56 min
#> Really run make() in interactive mode?
#> More on interactive vs batch mode: (link to a new chapter in the manual)
#>
#> 1: yes
#> 2: no
#>
#> Selection:

@MilesMcBain, how much frustration would this kind of menu eliminate?

@ha0ye

ha0ye commented Mar 1, 2019

The specific problem that I encountered in this space was not false positives for invalid targets, but rather false negatives for valid targets (because of internal function changes).

I think expose_imports() can solve this in some situations, but it would also be nice to have the ability to manually add dependencies on different kinds of environmental information, such as package versions, external config files (without using file_in()), etc.

@wlandau
Member Author

wlandau commented Mar 1, 2019

The specific problem that I encountered in this space was not false positives for invalid targets, but rather false negatives for valid targets (because of internal function changes).

So maybe make() in interactive mode should always throw a message directing users to this thread (except when verbose is 0 or FALSE).

I think expose_imports() can solve this in some situations, but it would also be nice to have the ability to manually add dependencies on different kinds of environmental information, such as package versions, external config files (without using file_in()), etc.

I thought about diving into packages, global options, etc., but it gets tricky. If we depend too strongly on those details, workflows will become brittle and targets will almost never stay valid. packrat and Docker are much better solutions because they extend a project's shelf life. For drake, it turns out to be more important to have well-defined boundaries when it comes to dependency detection. Ref: #6.

If you want to manually add more dependencies, you can add to commands

drake_plan(
  x = {
    dep1
    dep2
    actual(command(...))
  }
)

or use triggers.
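As a hedged sketch of the trigger route (this uses the target()/trigger() API from drake >= 7.0.0; the target name, command, and package are hypothetical):

```r
# A "change" trigger re-runs the target whenever the value of an
# arbitrary expression changes, e.g. an installed package version.
plan <- drake_plan(
  x = target(
    run_analysis(data),
    trigger = trigger(change = packageVersion("mypkg"))
  )
)
```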

@wlandau
Member Author

wlandau commented Mar 1, 2019

Regardless of where this goes in the long run, I think we really do need a utils::menu() for interactive mode. I can definitely throw one in before the CRAN release on March 10.

@wlandau wlandau changed the title Environments, reproducibility, and consistency Environments, reproducibility, and consistent target invalidation Mar 1, 2019
wlandau-lilly added a commit that referenced this issue Mar 1, 2019
@wlandau wlandau changed the title Environments, reproducibility, and consistent target invalidation Environments, reproducibility, and consistent target validation Mar 1, 2019
@MilesMcBain
Contributor

MilesMcBain commented Mar 1, 2019

My problem was similar to @ha0ye's.

My thoughts on menus are that they're good for infrequently encountered situations where you need help to make a decision. For frequent situations, you'll learn what you want, and if you're locked into a menu with no way around it, it will quickly become annoying. You've provided a way around the menu with the options, which is helpful.

In my development workflow, which I've been loving for the most part, I've been calling make very frequently, so I can see the menu becoming annoying and I will almost certainly disable it. This would leave me back almost at square one. I say almost because at least now I am aware of the issue.

But I'll still have no machinery provided by drake for running make interactively AND reproducibly. I'm still pretty new to drake, but it seems like one of the issues is that it's hard to build this machinery into make() because make needs to have dependencies loaded into its calling environment. And make() can't easily determine whether that environment is contaminated or not, because it gives users complete freedom to set that up as they wish. I actually really like this feature, and I agree that being more heavy-handed about it is probably not worth the other issues of brittleness etc.

I've solved this problem for myself by building some IDE machinery that wraps up drake: https://github.com/MilesMcBain/spacemacs_cfg/blob/windows/private/ess/packages.el#L352. So now I never call make() directly, I use my keybinding which will reboot my R session and then source my make.R. One potential drawback of this is that there are assumptions about project structure namely:

  • make.R exists in the top level project folder and is responsible for loading all deps, defining the plan, and calling drake::make()
  • clean.R exists in the top level project folder.

This makes me wonder if another approach to helping users with this overall issue is to provide some basic machinery that sets up their project in a way that adheres to convenient assumptions like this in combination with interactive helpers that ensure reproducibility.

For example:

  • drake::use_pipeline() - sets up a project for you with make.R, clean.R and folders for your sources.
    • within make.R there is boilerplate that sources all .R files within the folders
    • within the folders there are example R files like packages.R
  • An RStudio addin similar to my drake-restart-make that will restart the user's R session and then source make.R.

@pat-s
Member

pat-s commented Mar 1, 2019

This sounds like a usethis::use_drake() function that could create a research compendium structure focused on drake. There is already an RStudio addin and Ben Marvick's repo. It would be great to see a combination.

(almost off-topic, sorry for that. Quick idea on the phone)

@MilesMcBain
Contributor

I did some looking into this, and at present it's not possible to restart an R session cleanly using the rstudioapi, see: rstudio/rstudioapi#111.

@wlandau
Member Author

wlandau commented Mar 2, 2019

Comments

@MilesMcBain,

In my development workflow, which I've been loving for the most part, I've been calling make very frequently, so I can see the menu becoming annoying and I will almost certainly disable it. This would leave me back almost at square one. I say almost because at least now I am aware of the issue.

Good point. What about a less frequent default? Maybe once per session? I agree, a reminder to avert disaster defeats its own purpose if we overdo it.

This makes me wonder if another approach to helping users with this overall issue is to provide some basic machinery that sets up their project in a way that adheres to convenient assumptions like this in combination with interactive helpers that ensure reproducibility.

Interesting. This is where I thought @krlmlr intended to go with redrake. I love your idea of an RStudio add-in. To work around #761 (comment), maybe we could follow reprex and just call make() in a separate process.

@pat-s,

This sounds like a usethis::use_drake() function that could create a research compendium structure focused on drake. There is already an RStudio addin and Ben Marvick's repo. It would be great to see a combination.

Agreed. Also related: lockedata/starters#39

More brainstorming

These are all the major API functions prone to environment-related instability:

  • make()
  • drake_build()
  • drake_debug()
  • outdated()
  • missed()
  • drake_graph_info()
  • vis_drake_graph()
  • sankey_drake_graph()
  • drake_ggraph()
  • predict_runtime()
  • predict_workers()

They all accept a config argument generated by drake_config(), which is a large list that has all the arguments to make(): a plan, a character vector of package names, the environment with all the functions, etc. The environment (config$envir, usually the envir argument to make()) can be any environment that inherits from globalenv() (so packages are reachable).

What if config could alternatively accept an R script that returns a drake_config() object? Maybe even as a default value ("drake.R")? We could launch a new process to create the config from the script, run the function on that config, and then send the return value back to the master process. If done right, I do not think this would disrupt existing functionality.

EDIT

It may be confusing to overload make() etc. to let config be a file name. I feel less confident in this design choice.

@wlandau
Member Author

wlandau commented Mar 2, 2019

Some changes in b37ef00:

  • Only the drake_make_menu global option (note the new name) can control the menu. The force argument of make() no longer has anything to do with it.
  • If getOption("drake_make_menu") is either TRUE or unset, the menu will appear only once per session. The menu is an experimental band-aid, so it should probably not be more aggressive than that.
  • The help file of make() now has a section dedicated to interactive mode. The menu prompt directs users there.

The more permanent solution will require more thought and care, and it will not be part of the upcoming version 7.0.0. This is fine since we are going for enhanced reproducibility without breaking changes.

@wlandau
Member Author

wlandau commented Mar 2, 2019

Proposal

Create API wrappers that look and act like callr processes.

Functions

  • r_make()
  • r_drake_build()
  • r_outdated()
  • r_missed()
  • r_drake_graph_info()
  • r_vis_drake_graph()
  • r_sankey_drake_graph()
  • r_drake_ggraph()
  • r_predict_runtime()
  • r_predict_workers()

Arguments

  1. source: a script that returns a drake_config() object. (We could set a default with a global option or environment variable.)
  2. ...: secondary arguments of the inner function, e.g. the make_imports argument of outdated(). r_make() will have none because everything is already in the config object.
  3. r_fn: a function from callr.
  4. r_args: a list of arguments to r_fn (except func and args).

Behavior

  1. Start a new session with the callr function supplied to r_fn.
  2. In that process, run the script from the source argument to get a drake_config() object.
  3. Run the inner function in the same process (e.g. r_outdated(make_imports = TRUE) calls outdated(config = config, make_imports = TRUE)).
  4. Return the value we get from the callr function from the r_fn argument. For callr::r(), this is just the return value of the inner function. For callr::r_bg(), users will need to call $get_result() to get the value.
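Under this proposal, usage might look like the following sketch (the argument defaults shown here are still up for discussion):

```r
# Blocking call via the default callr::r(): returns the outdated targets
# computed in a fresh, isolated R session.
out <- r_outdated(source = "_drake.R", make_imports = TRUE)

# Non-blocking call via callr::r_bg(): returns a process handle.
px <- r_make(source = "_drake.R", r_fn = callr::r_bg)
px$get_result()  # retrieve the value once the background process finishes
```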

Timing

This callr-like API is new and experimental, so I think we should let it play out in the GitHub version for a full development cycle. That means it will probably not make it into the March 18 CRAN release of version 7.0.0. If all goes well, however, I expect it to be part of version 7.1.0 (hopefully April or May).

Collaboration

I am opening this issue up to the Chicago R Unconference, though I will probably not merge any branches until after version 7.0.0 is accepted to CRAN.

Also cc @rkrug, @noamross.

@MilesMcBain
Contributor

I am trying to understand how my simple workflow would work with the callr API. As I mentioned, I have a make.R in my project root. If I were to use this with r_make(), the differences would be:

  • This file would build a config object with drake_config()?
  • How would this be returned from the file? i.e. how would the callr process know which object is the config for make?
  • Can this file still do the work of sourcing my dependencies and loading packages etc.?
    • Or must this all be pushed into the config object, e.g. using prework?

@wlandau
Member Author

wlandau commented Mar 3, 2019

As I picture it, your _drake.R file (or whatever file you choose) would look something like this.

source("R/packages.R")
source("R/functions.R")
source("R/plan.R")
options(clustermq.scheduler = "slurm", clustermq.template = "custom.tmpl")
drake_config(
  plan,
  parallelism = "clustermq",
  jobs = 4,
  verbose = 6
)

The callr process would call source("_drake.R")$value to get the config object. I hesitate to require a specific variable name for the config; a return value requires fewer assumptions. In-memory data such as packages, functions, global options, and environment variables would be set as side effects. (Maybe not so ideal that we have both side effects and return values.)

There may be better ways to accomplish this, and I am open to suggestions. But I think the priorities are clear.

  1. We need the callr process to be totally independent of the environment of the master session.
  2. We should be as R-focused/domain-specific as possible.

To me, the proposed R script follows naturally.

Or must this all be pushed into the config object e.g. using prework?

Glad you asked. prework is a bit different, and most users probably will not need it. The main use case is distributed computing workflows in which all the targets need the same set of global options or environment variables. When you set these things in the master process, they generally do not get sent to the workers.
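A hedged sketch of prework inside _drake.R (the option set here is only an illustration, not a recommendation):

```r
# prework is code each worker runs before building its targets, so global
# options land on the workers instead of only in the master session.
drake_config(
  plan,
  parallelism = "clustermq",
  jobs = 2,
  prework = quote(options(warn = 1))
)
```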

wlandau pushed a commit that referenced this issue Mar 3, 2019
@wlandau
Member Author

wlandau commented Mar 3, 2019

FYI: just implemented #761 (comment) in #765.

@wlandau
Member Author

wlandau commented Mar 3, 2019

I know I promised that this would be an unconf issue, but I became so opinionated about the design that I thought it best to draft an implementation myself. I would still love feedback. A group discussion next week would be really helpful.

@wlandau
Member Author

wlandau commented Mar 3, 2019

Also, I am changing my mind about the timing of the rollout. The solution is new, but the problem is not. I have been thinking about environment-related pitfalls since the days of make(session = callr::r), and I believe drake needs something that combines interactivity with reproducibility without complicating usage, breaking existing features, or undermining drake's domain-specificity. I believe the proposed callr API does all of this cleanly.

The r_*() API functions were super simple to implement and test, and so I expect fewer bugs than in other new features like the DSL. I think we can include it in the 7.0.0 release if we treat it as experimental. (The r_make() help file now has a cautionary note at the top.) We still have a week to think it over, which may include a discussion at the unconf.

I will still keep this issue open in order to encourage discussion over the next week.

@MilesMcBain
Contributor

Thanks for the clarification, @wlandau. I like the simplicity of the proposal 👍

Some things you could potentially discuss at the chirunconf:

  1. What does the r_* prefix mean? I am guessing reproducible? Maybe the choice of prefix could make it clearer that these are the functions that aid reproducibility. Some ideas:
  • rep_
  • repro_
  • iso_
  2. What about the name of the default R file? Personally, I think I will stick with make.R because I am working alongside programmers from varying backgrounds, and I feel a file called 'make' at root level should trigger them to think 'That might help me understand how this project works.' To me, something that starts with '_' triggers almost the opposite: 'That looks internal.'

  3. We talked about a use_ type function. What structure should this set up, and how should this affect the default 'make' file boilerplate?

I can tell you with my current project I have something like this going:

 .drake/
 .git/
 00_plan/
 01_data_fetch/
 02_categorise_incidents/
 03_create_time_series/
 04_fit_models/
 05_forecast/
 06_output/
 07_interpret/
 .Rhistory
 README.md
 clean.R
 demand_forecast.db
 make.R

So then in my make.R I can have some generic code like this to load my R deps, while ignoring other R files:

## Find all the workflow steps and source them.

r_files <- list.files(path = "./",
                      full.names = TRUE,
                      pattern = "\\.R$",
                      recursive = TRUE)

workflow_files <- grep(pattern = "[0-9]{2}_",
                       x = r_files,
                       value = TRUE
                       )

lapply(workflow_files, source)

## remove working vars so as not to pollute global namespace.
rm(list = c('r_files', 'workflow_files'))

@wlandau
Member Author

wlandau commented Mar 4, 2019

Great points.

What does the r_* prefix mean? I am guessing reproducible? Maybe the choice of prefix could make it clearer that these are the functions that aid reproducibility.

The API deliberately mimics functions r(), r_bg(), r_vanilla(), r_copycat(), etc. from callr. The primary intent is to create a new transient callr process for an individual function call. We capitalize on callr's pre-established nomenclature and avoid having to define our own convention. Plus, the single letter means we avoid overly long function names.

What about the name of the default R file? Personally I think I will stick with make.R because I am working alongside programmers from varying backgrounds and I feel a file called 'make' at root level should trigger them to think 'That might help me understand how this project works.'. To me something that starts with '_' triggers almost the opposite: 'That looks internal'.

The _drake.R script not only serves r_make(), but also r_outdated(), r_vis_drake_graph(), r_predict_runtime(), etc. Its purpose is to populate the environment and return a drake_config() object, but it is up to each respective r_*() function to do the actual work. Users will not otherwise call the script directly. In other words, _drake.R merely assists from the background. On the other hand, make.R implies a direct call to make(), the star of the show.

We talked about a use_ type function, what structure should this set up and how should this affect the default 'make' file boilerplate?

In my own projects, I often come back to the file structure now described in the chapter on drake projects, which I just rewrote today.

.
+-- make.R   # master script that calls make(). Intended for batch mode.
+-- _drake.R # configuration file to serve all the r_() functions
+-- R/
|   +-- packages.R
|   +-- functions.R
|   +-- plan.R

I think this could be enough for a function like use_drake(). We could include the relevant script files from drake_example("main") with all the code commented out.

@wlandau
Member Author

wlandau commented Mar 4, 2019

Do you think use_drake() should be part of drake itself, or should there be a pull request to usethis? usethis already has proj_get() and proj_set(), which we would have to mimic in order to reliably locate the root of an R project. The duplicated effort does not sound ideal, and I would rather not add usethis to Suggests:.

@wlandau
Member Author

wlandau commented Mar 4, 2019

For the record: r_*() and _drake.R do make interactive mode more reproducible, but for typical drake projects with long runtimes, batch mode is still much more convenient. It is super inconvenient to wait for a blocking make() call in interactive mode. Up to this point, I have recommended Mac and Linux users run the following.

nohup nice -19 R CMD BATCH make.R &

On Windows, r_make(r_fn = callr::r_bg, r_args = list(supervise = FALSE)) should basically get you there, even from an interactive session.

@MilesMcBain
Contributor

Cool I follow the naming choice for _drake.R now.

Re r_: I've used callr a number of times myself but only ever the callr::r() function. I was unaware the r_ convention existed.

And I think that's worth reflecting on because you're saying the r_ convention reflects an implementation, but it's one that your users may not be aware of. Even if they are aware of it, they don't particularly want to use callr, so they won't be thinking about it. They just want a safe way to run their drake commands interactively in development.

@wlandau
Member Author

wlandau commented Mar 4, 2019

And I think that's worth reflecting on because you're saying the r_ convention reflects an implementation, but it's one that your users may not be aware of. Even if they are aware of it, they don't particularly want to use callr, so they won't be thinking about it. They just want a safe way to run their drake commands interactively in development.

True, I will reflect on it more. Explicit alignment with callr is appealing, but the package has only been around for a couple of years, so I am not actually sure how much weight the historical context carries.

So then what intent should the prefix communicate? Reproducibility covers a huge range of topics and very strong scientific claims, so I am not sure it would be specific or appropriate enough to motivate this naming choice. On the other hand, we know exactly what it means to start a fresh R session. callr had to communicate the same idea in its own API, and I think the choice was successful. Then again, like you said, drake's user base is totally different. Could the increased suggestiveness of a different prefix outweigh the inconvenience of longer function names? I am not 100% sure.

@pat-s
Member

pat-s commented Mar 7, 2019

Up to this point, I have recommended Mac and Linux users run the following.

I usually use byobu (or similar tools such as screen) for long-running jobs. The point is that you can close the terminal window and the command keeps running; you can later re-attach it. It is more flexible than nohup.

Question

How can I get progress output (i.e. targets) when running r_make()?
I've tried setting r_make(r_args = list(stdout = stdout())) but without success.

Edit: My config for make looks as follows:

drake_config(
  plan,
  targets = "benchmark_evaluation_report_diplodia",
  verbose = 2,
  console_log_file = stdout(),
  lazy_load = "promise",
  caching = "worker",
  template = list(log_file = "log-all%a.txt", n_cpus = 32),
  garbage_collection = TRUE,
  jobs = 3,
  parallelism = "clustermq",
  force = TRUE
)

In addition: let's say I change targets in the drake_config() call above. Will this invalidate all previous work? (Edit: no, it doesn't.)

@pat-s
Member

pat-s commented Mar 7, 2019

Another thing that is different in using r_make(): Due to the nature of using callr, all environment modules needed for the respective worker need to be reloaded in the scheduler template to be available for the R process of the worker.

Before, it was enough to load the environment module when starting the R session that will call make() as the workers spawned will have access to the env modules of the master process.

A mistake on my side, all good.

The same applies to the clustermq options of the workers: because everything now runs in a new R session via callr, the clustermq options set in the master session are not set there. Hence I get the following warning in the log file of the workers:

* Option 'clustermq.scheduler' not set, defaulting to 'SLURM'
     --- see: https://github.com/mschubert/clustermq/wiki/Configuration

In the worker, the clustermq options are probably not important (at least nothing crashes here), but the warning might confuse users.

@pat-s
Member

pat-s commented Mar 7, 2019

Another thing that I just noticed is that browser() statements within functions are not respected when running r_make(). Using the old approach, the execution stops at the browser() call.

@wlandau wlandau pinned this issue Mar 7, 2019
@wlandau
Member Author

wlandau commented Mar 7, 2019

I usually use byobu (or similar tools such as screen) for long-running jobs. The point is that you can close the terminal window and the command keeps running; you can later re-attach it. It is more flexible than nohup.

Cool, thanks for the tip about byobu.

How can I get progress output (i.e. targets) when running r_make()?
I've tried setting r_make(r_args = list(stdout = stdout()))

Unfortunately, the stdout stream of the callr process is different from that of the master session. covr has a similar problem (r-lib/covr#121). For now, I recommend writing drake_config(console_log_file = "your.log") at the end of your _drake.R. Either that or giving a log file to stdout in r_args. And on Linux, you can open a terminal window and watch the progress flow in with something like watch -n .1 tail -n 40 your.log. If R ever gets multithreading support, there will be better solutions.
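For reference, a minimal version of that log-file setup (the file name is arbitrary):

```r
# End of _drake.R: route drake's console output to a log file that can be
# watched from another terminal, e.g. with `tail -f drake.log`.
drake_config(plan, console_log_file = "drake.log")
```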

The same applies to the clustermq options of the workers: because everything now runs in a new R session via callr, the clustermq options set in the master session are not set there. Hence I get the following warning in the log file of the workers

Yeah, so that's an adjustment for sure. I tried to point it out here in the manual in the commented-out options() call. If you think of other good places for hints in the docs, I would really appreciate a PR.

As for environment modules, I recommend putting them all in the template file (example).

Another thing that I just noticed is that browser() statements within functions are not respected when running r_make(). Using the old approach, the execution stops at the browser() call.

Good point. browser() is designed for interactive mode, not batch mode. r_make() is strange in that you call it from interactive mode but most of the work happens in batch mode.

@pat-s
Member

pat-s commented Mar 8, 2019

Unfortunately, the stdout stream of the callr process is different from that of the master session. covr has a similar problem (r-lib/covr#121). For now, I recommend writing drake_config(console_log_file = "your.log") at the end of your _drake.R. Either that or giving a log file to stdout in r_args. And on Linux, you can open a terminal window and watch the progress flow in with something like watch -n .1 tail -n 40 your.log. If R ever gets multithreading support, there will be better solutions.

Very cool, doing exactly this now!

Good point. browser() is designed for interactive mode, not batch mode. r_make() is strange in that you call it from interactive mode but most of the work happens in batch mode.

Ok I see. So for debugging purposes I would do

  1. Manually source _drake.R
  2. Run make()

@wlandau
Member Author

wlandau commented Mar 8, 2019

Ok I see. So for debugging purposes I would do

  1. Manually source _drake.R
  2. Run make()

Yes, for interactive debugging, everything must run in the same session.
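As a sketch, that interactive debugging workflow would be (assuming _drake.R ends by returning a drake_config() object):

```r
# Run everything in the current session so browser() statements trigger.
config <- source("_drake.R")$value
make(config = config)  # or drake_build(<target>, config = config) for one target
```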

@wlandau wlandau mentioned this issue Mar 8, 2019
@wlandau
Member Author

wlandau commented Mar 8, 2019

Re #761 (comment), I just took another look at the docs of callr::r(). What about r_make(r_args = list(show = TRUE))?

@pat-s
Member

pat-s commented Mar 9, 2019

Re #761 (comment), I just took another look at the docs of callr::r(). What about r_make(r_args = list(show = TRUE))?

Boom! Would it be possible to make this the default? I guess users would expect output when they set the verbosity in _drake.R via the verbose argument.

@wlandau
Member Author

wlandau commented Mar 9, 2019

Good idea. Done in cf87fc5.

@wlandau
Member Author

wlandau commented Mar 11, 2019

All this functionality is now in version 7.0.0 on CRAN. We did not discuss the API decisions specifically (we mostly focused on this stuff) but I taught drake to some new users and they picked up on r_make() rather quickly.
