
How should I mix non-R code (e.g. Python and shell scripts) in a large drake workflow? #277

Closed
bmchorse opened this issue Feb 22, 2018 · 18 comments


@bmchorse
Contributor

Adding here to tag in FAQ as per @wlandau's request.

I'm not sure how drake will handle a workflow that has steps outside of R. My workflow consists of the following rough steps:

  1. Do some R-script processing on raw data files. [Straightforward to handle with drake]
    • In: raw data
    • Out: a handful of processed data files
  2. Use a shell script to call a handful of Python scripts with arguments [Unsure how to handle this and next steps with drake]
    • In: processed data
    • Out: ~300 text files
    • Note that this shell script itself is actually submitted to a SLURM setup using sbatch (which I suppose could be abstracted out to its own shell script, such that running it will submit the jobs to the cluster). Not sure how much that distinction tangles things up, but either way it sounds like this is relevant to the parallelism vignette, perhaps both for thinking about executing system commands and thinking about scheduling parallel runs on high-performance computing environments.
  3. Use more shell scripts to process the results of the above step
    • In: ~300 text files
    • Out: a few new text files, some R scripts that are automatically run in this same step, some PDFs
  4. Do more R-script processing on the output of above.... [back to a more usual drake usage]

And so on. This continues through to further big analytical steps similar to Step 2.

It seems that system2() is a way to handle system commands from within R. I am not sure if commands will need to be different based on OS (e.g., I sometimes run parts of this workflow on my Windows laptop, but the rest is executed on a Linux cluster).
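
For illustration, a minimal sketch (hypothetical script names) of an OS-aware system2() call:

# Pick the right entry point for the current OS, then run it and wait for it
# to finish before moving on.
script <- if (.Platform$OS.type == "windows") "scripts/step2.bat" else "scripts/step2.sh"
system2(script, args = c("--input", "processed_data"), wait = TRUE)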

@krlmlr
Collaborator

krlmlr commented Feb 22, 2018

processx is way better than system2(); the GitHub version is currently preferred until an update reaches CRAN.
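
A minimal sketch of the processx approach (the script path and arguments are hypothetical):

library(processx)

# Run a shell script, stream its output to the console, and raise an R error
# on a nonzero exit code.
res <- run(
  command = "bash",
  args = c("scripts/step2.sh", "--input", "data/processed"),
  echo = TRUE,
  error_on_status = TRUE
)
res$status  # exit code of the finished process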

drake requires that each target represents one file or one R object.

@wlandau
Member

wlandau commented Feb 23, 2018

@bmchorse Your mileage from drake depends on how R-focused you want your workflow to be. Since the computationally heavy piece is in Python, you might also look at snakemake. I think it depends on which language you prefer overall. If you are more of an R user, you could call your Python scripts from R using source_python() from RStudio's reticulate package. And if you aggressively port your Python code to custom R functions that use reticulate commands, drake will import and analyze those functions and no longer overreact to trivial formatting changes in external Python scripts.
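
For example, a minimal sketch (the script and function names are hypothetical) of calling Python through reticulate instead of the shell:

library(reticulate)

# source_python() loads the script into the R session, so its functions
# become callable R objects.
source_python("python/preprocess.py")  # defines, say, preprocess()
processed <- preprocess("data/raw_input.csv")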

Whichever tool you choose, I recommend that you abandon the shell scripts and rely entirely on the workflow manager's native HPC support. For drake, I picture something like this:

library(drake)
library(magrittr)   # for the %>% pipe
library(reticulate) # for Python code

py_plan <- drake_plan({
    source_python(file_in("python_INDEX.py"))
    file_out("text_INDEX.txt")
  },
  strings_in_dots = "literals" # Only needed until the legacy file API is defunct.
) %>%
  evaluate_plan(
    wildcard = "INDEX",
    values = paste0("script", 1:4)
  )

DT::datatable(py_plan)

[screenshot: interactive table view of py_plan]

# Visualize the workflow network to make sure
# you wrote your plan properly
vis_drake_graph(drake_config(py_plan))

[screenshot: dependency graph of py_plan]

# Generate a starter batchtools configuration file, batchtools.slurm.tmpl.
# This file talks to the cluster and sets a bunch of resource parameters.
# You may need to modify it based on the
# resource requirements of your workflow
drake_batchtools_tmpl_file("slurm")

# Tell `drake` that you're using a SLURM cluster and batchtools.slurm.tmpl.
library(future.batchtools) # provides the batchtools_slurm backend
future::plan(
  batchtools_slurm,
  template = "batchtools.slurm.tmpl",
  workers = 16
)

# Deploy your python jobs to SLURM with drake
make(py_plan, parallelism = "future_lapply")

# Alternatively, deploy with the new `"future"` backend
# from #227. It is experimental, but it has
# efficiency advantages when it comes to scheduling.
make(py_plan, parallelism = "future", jobs = 16)

This is a really important use case, and I am glad you filed an issue about it.

@wlandau
Member

wlandau commented Feb 23, 2018

By the way, @kendonB has a lot of experience connecting drake to SLURM. We worked through the initial setup together in #115 and HenrikBengtsson/future.batchtools#11, and we solved a bunch of time and memory inefficiencies last fall.

@wlandau wlandau changed the title How to run shell (or other non-R) scripts that are part of a workflow? How should I mix non-R code (e.g. Python) in a large drake workflow? Feb 23, 2018
@wlandau wlandau changed the title How should I mix non-R code (e.g. Python) in a large drake workflow? How should I mix non-R code (e.g. Python scripts) in a large drake workflow? Feb 23, 2018
@wlandau wlandau changed the title How should I mix non-R code (e.g. Python scripts) in a large drake workflow? How should I mix non-R code (e.g. Python and shell scripts) in a large drake workflow? Feb 23, 2018
@wlandau
Member

wlandau commented Feb 23, 2018

Another advantage of aggressively porting to reticulate: you can drop a whole lot of input and output files, relying on drake's cache instead. Drake is a lot easier and cleaner to use that way. But only you can know if the time and effort is worth it.

py_plan <- drake_plan(
  data = preprocess_data(),
  result = call_reticulate_commands_INDEX(data),
  strings_in_dots = "literals"
) %>%
  evaluate_plan(
    wildcard = "INDEX",
    values = 1:4
  )

DT::datatable(py_plan)

[screenshot: interactive table view of the reticulate-based py_plan]

@wlandau
Member

wlandau commented Feb 23, 2018

Incidentally, I would love to know how snakemake compares with drake. In fact, snakemake might be the excuse I have been looking for to finally return to Python after so many years.

@bmchorse
Contributor Author

The Python I'm using is a whole repository from GitHub (not something on PyPI with stable release versions or anything), and it's set up to be called from the command line (hence all the shell scripts). In that sense all the Python is happening 'under the hood' and I think going with snakemake would be less helpful, particularly since I have no standalone Python scripts that are handling data processing/plotting/etc.

Would what I'm describing still work for using source_python() from reticulate or am I stuck using a package such as processx to handle system commands? I do not wish to refactor for calling individual functions from the Python repo rather than using the command line; the code in it is pretty complicated and not written to be handled that way as far as I can tell. I can provide a link to the repository if it would be helpful.

If processx is to be preferred for using R to handle the command line calls, is this processx3 repo the place I should be looking?

Also, can drake be set to override some update steps, i.e., cache the state of some step and inform me that it's out-of-date but not automatically update that step upon a make()? I'm thinking of the use case you mentioned where drake might overreact to small changes in Python code, particularly since the Python repo I'm using is under active development and in my files it is pulled directly from the repository as a git submodule. Furthermore, there may come a point where the analysis should remain 'final' regardless of changes to the source code of the Python repo. I didn't see anything in the index that fit what I was looking for here, but I may not be thinking of the correct words.

Perhaps another route for that question would be to not include the Python source code as a direct input in the drake workflow, since it's (presumably, depending on outcome of above questions) only accessed from a command line call? These are computationally expensive simulations (5-10hrs x 100 jobs on a cluster) that take quite a while, so I would prefer not to re-run except when I choose to, even though this somewhat goes against the spirit of drake. If this is not an option, maybe I can simply update the git submodule from master less frequently.

Thanks very much for the feedback and suggestions. This is helpful already, and I hope my particular troubles aren't too obscure. It may have been easier for me to start figuring out drake with an entirely R-based workflow, but oh well! I appreciate the friendliness and general positive atmosphere of this project and it makes me very interested in finding a way to use drake, if not now then certainly for future R projects.

@wlandau
Member

wlandau commented Feb 23, 2018

The Python I'm using is a whole repository from GitHub (not something on PyPI with stable release versions or anything), and it's set up to be called from the command line (hence all the shell scripts). In that sense all the Python is happening 'under the hood' and I think going with snakemake would be less helpful, particularly since I have no standalone Python scripts that are handling data processing/plotting/etc.

Ah, so it seems like you have a command line tool that just happens to be implemented in Python. Is that right?

Would what I'm describing still work for using source_python() from reticulate or am I stuck using a package such as processx to handle system commands? I do not wish to refactor for calling individual functions from the Python repo rather than using the command line; the code in it is pretty complicated and not written to be handled that way as far as I can tell. I can provide a link to the repository if it would be helpful.

If it were practical to access the Python source directly, then reticulate and source_python() would be useful. But it sounds like processx is the way to go here. system2(..., wait = TRUE) is not as good, but it could be much quicker to set up in a pinch.

If processx is to be preferred for using R to handle the command line calls, is this processx3 repo the place I should be looking?

I think so: https://github.com/r-lib/processx3. @krlmlr, please correct me if I am wrong.

Also, can drake be set to override some update steps, i.e., cache the state of some step and inform me that it's out-of-date but not automatically update that step upon a make()?

Drake has something similar: triggers (thanks again to @kendonB for pushing me on this last fall). The "missing" trigger may be what you want here because it only builds the target if it does not already exist. See also `drake_example("packages")`, which uses the `"always"` trigger for one of the targets.
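
As a rough sketch with the string-based trigger API drake uses at the moment (run_expensive_simulations() is a hypothetical long-running step):

library(drake)

plan <- drake_plan(
  simulation_results = run_expensive_simulations()
)

# Build targets only if they are missing from the cache, regardless of
# code or dependency changes. A per-target `trigger` column in the plan
# is also supported for finer control.
make(plan, trigger = "missing")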

I'm thinking of the use case you mentioned where drake might overreact to small changes in Python code, particularly since the Python repo I'm using is under active development and in my files it is pulled directly from the repository as a git submodule.

If you call shell commands instead of the Python source, then you have the opposite problem: drake will not automatically see changes that happen to the underlying Python code and will not react if you pull the latest commit. From your later comments, it sounds like this is actually the desired behavior. But if you want drake to react more easily and more often, here are a few options.

  • Use git2r to get the current commit hash of the repo. Treat the hash as a dependency for downstream targets so that new commits to the repo trigger rebuilds. Alternatively, you could be a bit more relaxed and only trigger rebuilds when there is a new tag/release.
library(git2r)
library(drake)

plan <- drake_plan(
  python_repo_fingerprint = {
    repo <- repository("python_repo")
    commits(repo)[[1]]@sha # Or tags(repo)[[1]]@sha if you want your project to be less brittle.
  },
  downstream_target = use_python_repo(fingerprint = python_repo_fingerprint),
  strings_in_dots = "literals"
)
  • Potentially more brittle: dive into the .git directory itself to grab the hash for the branch currently checked out.
library(git2r)
library(drake)

plan <- drake_plan(
  python_repo_fingerprint = scan(
    file_in("python_repo/.git/refs/heads/master"),
    what = character(),
    quiet = TRUE
  ),
  downstream_target = use_python_repo(fingerprint = python_repo_fingerprint),
  strings_in_dots = "literals"
)
  • Select which source files in the repo you want to depend on and list them all as file inputs in your workflow plan.
library(drake)

plan <- drake_plan(
  downstream_target = {
    file_in(
      "python_repo/src/file1.py",
      "python_repo/src/file2.py",
      "python_repo/src/file3.py"
    )
    use_python_repo()
  },
  strings_in_dots = "literals"
)

Furthermore, there may come a point where the analysis should remain 'final' regardless of changes to the source code of the Python repo. I didn't see anything in the index that fit what I was looking for here but I may not be thinking of the correct words.

Containerization can seriously extend your project's shelf life and enhance reproducibility. As long as you keep a local copy of the final snapshot you use of the Python repo, it will no longer matter what happens to the remote upstream copy later on. Rather than Docker, I recommend Singularity, which is friendlier for academic research workflows and HPC systems, though I have not actually used either in earnest.

Perhaps another route for that question would be to not include the Python source code as a direct input in the drake workflow, since it's (presumably, depending on outcome of above questions) only accessed from a command line call? These are computationally expensive simulations (5-10hrs x 100 jobs on a cluster) that take quite a while, so I would prefer not to re-run except when I choose to, even though this somewhat goes against the spirit of drake. If this is not an option, maybe I can simply update the git submodule from master less frequently.

Since you have control over which snapshot of the Python repo you use, I think this is reasonable. Maybe workflows of that scale should not be brittle. In #6, I grappled with this issue for R packages specifically. With packages, drake only reacts to the exported functions you call, not any functions nested inside. Diving deeply into package functions for dependencies would have strengthened the promise of reproducibility, but it would have caused a lot of projects to update too often. I eventually just decided to recommend packrat and let drake stay reasonably hands-off.
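
For what it's worth, a minimal sketch of that recommendation (nothing drake-specific here):

# Give the project its own private package library and record the exact
# package versions in use; packrat::restore() rebuilds that library elsewhere.
packrat::init()
packrat::snapshot()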

Thanks very much for the feedback and suggestions. This is helpful already, and I hope my particular troubles aren't too obscure. It may have been easier for me to start figuring out drake with an entirely R-based workflow, but oh well! I appreciate the friendliness and general positive atmosphere of this project and it makes me very interested in finding a way to use drake, if not now then certainly for future R projects.

You're welcome, I am glad you find drake friendly and positive. I think these kinds of challenges are widespread, and solving them should help many other people. In fact, I plan to curate a list of real drake-powered projects out in the wild. If drake turns out to work for you, it would be great to include yours when you are ready to release it.

@wlandau
Member

wlandau commented Feb 24, 2018

Closing because I think I addressed most of the question, but let's keep talking on the thread.

@jennysjaarda

jennysjaarda commented Dec 9, 2019

I just wanted to follow up on your suggestion regarding grabbing the git sha to track non-R code, as I'm not really sure how best to implement it - suggestions welcome!

I have a plan as follows (slightly modified for ease of reading):

test_list <- c("X","Y","Z")
analysis <- drake_plan(
  create_mod1_folder = dir.create("analysis/mod1",showWarnings=F),
  create_mod1_inter_folder = if(!is.null(create_mod1_folder)){dir.create("analysis/mod1/interaction",showWarnings=F)},
  create_mod1_linear_folder = if(!is.null(create_mod1_folder)){dir.create("analysis/mod1/linear",showWarnings=F)},


  # repo_fingerprint = {
  #   repo <- repository(here::here())
  #   commits(repo)[[1]]$sha 
  # }
  
  run_interaction = target(
    if(!is.null(create_mod1_inter_folder)){processx::run(command="sh",c( interaction_script, variable), error_on_status=F)},
    transform = cross(variable = !!test_list)
  )
)

A few questions:

  1. Is there a nicer way to create directories/subdirectories?
  2. Similar to 1, the output of interaction_script is written to "analysis/mod1/interaction", but it feels messy to include as part of the command when I call processx::run. Any suggestion here?
  3. Lastly, do you have suggestion as to the best way to include the fingerprint variable to somehow tell run_interaction to run again when the code has changed?

Thanks a lot!

@wlandau
Member

wlandau commented Dec 10, 2019

I just wanted to follow up on your suggestion regarding grabbing the git sha to track non-R code, as I'm not really sure how best to implement it - suggestions welcome!

On reflection, I think file_in() is probably cleaner and more reliable for non-R code files than manually tracking the sha key.

Is there a nicer way to create directories/subdirectories?

For commands you find cumbersome in the plan, I recommend wrapping the code into functions. NB file_out() can handle entire directories. Related: https://books.ropensci.org/drake/plans.html#functions

create_folder <- function(create_GWAS_folder, dir) {
  if(!is.null(create_GWAS_folder)){
    dir.create(dir, showWarnings = FALSE)
  }
}

analysis <- drake_plan(
  create_mod1_inter_folder = create_folder(
    create_GWAS_folder,
    file_out("analysis/mod1/interaction")
  ),
  # ...
)

More broadly, I would create all those directories within a single target, or better yet, ensure they all exist before running make(). drake is object-based rather than file-based, so it requires a change in mindset. Maybe have a look at https://books.ropensci.org/drake/plans.html#how-to-choose-good-targets.
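
For example, a sketch of the second option (the helper name and directory names are illustrative, mirroring your plan): a small helper you call once before make().

# Create every analysis directory up front; recursive = TRUE also creates
# "analysis/mod1" itself, and existing directories are left alone.
create_analysis_dirs <- function(base = "analysis/mod1") {
  for (d in file.path(base, c("interaction", "linear"))) {
    dir.create(d, recursive = TRUE, showWarnings = FALSE)
  }
  invisible(TRUE)
}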

Similar to 1, the output of interaction_script is written to "analysis/mod1/interaction", but it feels messy to include as part of the command when I call processx::run. Any suggestion here?

file_out() in commands tells drake that the process of building a target also creates a file as a side effect, and that that file is important. So maybe in the run_interaction target, consider including file_out("literal_path_to_interaction_script.sh") in place of the interaction_script variable.

Lastly, do you have suggestion as to the best way to include the fingerprint variable to somehow tell run_interaction to run again when the code has changed?

Ah, I see we're talking about remote code files. It's been a long time since I looked at this issue, so I do not remember everything right away. For what it's worth, you can supply URLs of individual data/code files to file_in() and drake will track them for changes. Details are in the help file, e.g. ?file_in.
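
A minimal sketch (the URL and downstream function are placeholders):

plan <- drake_plan(
  remote_results = {
    # drake checks the URL's ETag / last-modified time and reruns the target
    # when the remote file changes.
    file_in("https://raw.githubusercontent.com/OWNER/REPO/master/script.py")
    run_remote_analysis()
  }
)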

Does all this help? I am not sure if I fully understood your questions (maybe because it's late at night for me). Please let me know what remains unclear.

@jennysjaarda

Hello,
Thanks for the tips! I took your suggestion and lumped my directory creation into one function that returns TRUE. Then I used this as an input when calling processx::run so drake knew to create the dirs before running it. I then defined my shell script within the drake plan using file_in so drake knew if it had changed or not. This is how it ended up looking, and it seems to do what I want! Let me know if you see anything glaringly problematic with this framework.

# define location of non-R scripts to run (in reality I source these from a `settings.R` script so all my input files are stored in one place)

interaction_script <- "code/interaction.sh"
linear_script <- "code/linear.sh"

## also define some variables to test - these are inputs for my shell script: 
test_vars <- c("var1", "var2", "var3")

## create drake plan
analysis <- drake_plan(
  make_dirs_out = create_analysis_dirs(),
  interaction_track = file_in(!!interaction_script),
  run_gwas_interaction = target(
    if(make_dirs_out){processx::run(command = "sh", c( interaction_track, variable), error_on_status = F)},
    transform = map(variable = !!test_vars)
  )
)

@jennysjaarda

By the way, sorry if I've missed it somewhere, but are the wildcards still necessary?

As far as I can tell, this:

file <- "test.txt"
test <- drake_plan (
  file_name = file_in(!!file),
)
make(test)
loadd(file_name)
file_name

is identical to this:

file <- "test.txt"
test <- drake_plan (
  file_name = file_in(file),
)
make(test)
loadd(file_name)
file_name

When to use/not use the wildcards? I included them based on the FAQ in #353 - is this discussion outdated?

@wlandau
Member

wlandau commented Dec 12, 2019

A couple recommendations on #277 (comment):

  • Presumably the output files of create_analysis_dirs() are important, right? If so, I recommend using a file_out("analysis/mod1") in your command on the top-level directory of the output. In the current version of drake, file_out() works on entire directories, not just single files.
  • Instead of if(make_dirs_out){...}, I recommend file_in("analysis/mod1"). Ultimately, the files are what we want to track, and I think we want to make sure they exist before attempting run_gwas_interaction.
analysis <- drake_plan(
  make_dirs_out = create_analysis_dirs(file_out("analysis/mod1")),
  interaction_track = file_in(!!interaction_script),
  run_gwas_interaction = target({
      file_in("analysis/mod1")
      processx::run(command = "sh", c( interaction_track, variable), error_on_status = F)
    },
    transform = map(variable = !!test_vars)
  )
)

@wlandau
Member

wlandau commented Dec 12, 2019

Re #277 (comment), tidy evaluation via !! makes sure the literal file path string gets inserted in the plan. This is still necessary to make sure drake depends on the files you want it to depend on.

With !!:

library(drake)
  
file <- "test.txt"
test <- drake_plan (
  file_name = file_in(!!file),
)

test
#> # A tibble: 1 x 2
#>   target    command            
#>   <chr>     <expr>             
#> 1 file_name file_in("test.txt")

config <- drake_config(test)
vis_drake_graph(config)

Created on 2019-12-12 by the reprex package (v0.3.0)

Without !!:

library(drake)
  
file <- "test.txt"
test <- drake_plan (
  file_name = file_in(file),
)

test
#> # A tibble: 1 x 2
#>   target    command      
#>   <chr>     <expr>       
#> 1 file_name file_in(file)

config <- drake_config(test)
vis_drake_graph(config)

Created on 2019-12-12 by the reprex package (v0.3.0)

@jennysjaarda

Thanks for the suggestions. A few questions.

  • Presumably the output files of create_analysis_dirs() are important, right? If so, I recommend using a file_out("analysis/mod1") in your command on the top-level directory of the output. In the current version of drake, file_out() works on entire directories, not just single files.

Just to clarify, create_analysis_dirs() is only creating directories. So what you are suggesting is having the function take the top directory as an argument so it can be explicitly tracked by drake? (i.e. create_analysis_dirs(file_out("analysis/mod1")) in your example)?

So what do you recommend if a function is called that writes and/or reads a whole bunch of files? Ideally the function would have an input argument that names the folder containing the in/output?

  • Instead of if(make_dirs_out){...}, I recommend file_in("analysis/mod1"). Ultimately, the files are what we want to track, and I think we want to make sure they exist before attempting run_gwas_interaction.

Exactly, the goal is just to make sure that these files (rather, directories in this case) exist before running the shell command. I hadn't realized that I could include multiple commands via {...} syntax, so this seems like a cleaner option versus the if(...){...}.

@wlandau
Member

wlandau commented Dec 13, 2019

Just to clarify, create_analysis_dirs() is only creating directories. So what you are suggesting is having the function take the top directory as an argument so it can be explicitly tracked by drake? (i.e. create_analysis_dirs(file_out("analysis/mod1")) in your example)?

Yes. The simple mention of file_out() is what counts because drake detects files with static code analysis. It does not actually need to be an argument of create_analysis_dirs(), but create_analysis_dirs(file_out("analysis/mod1")) is a clean way to write it. An equivalent alternative is this:

analysis <- drake_plan(
  make_dirs_out = {
    create_analysis_dirs()
    file_out("analysis/mod1")
  }
)

So what do you recommend if a function is called that writes and/or reads a whole bunch of files? Ideally the function would have an input argument that names the folder containing the in/output?

A common pattern is drake_plan(x = some_function(file_in("input", "files"), file_out("output", "more_files"))) for functions that read and/or write files. It assumes "output" and "more_files" are produced by x and modified by no other target.
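
Spelled out as a sketch (the file and function names are placeholders):

plan <- drake_plan(
  x = some_function(
    file_in("input", "files"),        # everything the function reads
    file_out("output", "more_files")  # everything the function writes
  )
)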

I am realizing that my original suggestion for your analysis may not totally meet your need. I assumed nothing else modifies the contents of analysis/mod1 once it is created. If downstream targets put stuff in analysis/mod1, then make_dirs_out should focus on the parts that other targets do not modify. Maybe you could write something like create_analysis_dirs(file_out("analysis/mod1/initial_data1", "analysis/mod1/initial_data2")).

Exactly, the goal is just to make sure that these files (rather, directories in this case) exist before running the shell command.

Even easier: if you're only creating directories, why not do that outside the plan and drop file_out() entirely? Creating directories and/or asserting their existence should be a really quick operation. You could have a top-level script that looks something like this:

library(drake)

plan <- drake_plan(
  interaction_track = file_in(!!interaction_script),
  run_gwas_interaction = target({
      file_in("analysis/mod1/specific_data_you_need")
      file_out("analysis/mod1/data_generated")
      processx::run(command = "sh", c( interaction_track, variable), error_on_status = F)
    },
    transform = map(variable = !!test_vars)
  )
)

create_analysis_dirs("analysis/mod1") # Always run this before calling make()

make(plan)

@jennysjaarda

Ah I see, that all makes sense - thanks very much for all the help!

@wlandau
Member

wlandau commented Feb 22, 2020

Again, I would like to plug #1178.
