How should I mix non-R code (e.g. Python and shell scripts) in a large drake workflow? #277
processx is way better than system2() for running external commands from R. Also keep in mind that drake requires that each target represent one file or one R object.
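To make that concrete, here is a minimal sketch (the script name, arguments, and output file are hypothetical) of a shell step wrapped in a drake target, using processx::run() plus file_in()/file_out() so drake can track both the script and its output:

library(drake)

shell_plan <- drake_plan(
  shell_step = {
    processx::run(
      command = "sh",
      args = c(file_in("scripts/run_analysis.sh"), "--out", "results.txt")
    )
    file_out("results.txt")
  },
  strings_in_dots = "literals" # Matches the legacy file API used elsewhere in this thread.
)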
@bmchorse Your mileage may vary. Whichever tool you choose, I recommend that you abandon the shell scripts and rely entirely on the workflow manager's native HPC support. For example:

library(drake)
library(magrittr)          # for the pipe
library(reticulate)        # for Python code
library(future.batchtools) # for the SLURM backend

py_plan <- drake_plan({
    source_python(file_in("python_INDEX.py"))
    file_out("text_INDEX.txt")
  },
  strings_in_dots = "literals" # Only needed until the legacy file API is defunct.
) %>%
  evaluate_plan(
    wildcard = "INDEX",
    values = paste0("script", 1:4)
  )

DT::datatable(py_plan)

# Visualize the workflow network to make sure you wrote your plan properly.
vis_drake_graph(drake_config(py_plan))

# Generate a starter batchtools configuration file, batchtools.slurm.tmpl.
# This file talks to the cluster and sets a bunch of resource parameters.
# You may need to modify it based on the resource requirements of your workflow.
drake_batchtools_tmpl_file("slurm")

# Tell drake that you're using a SLURM cluster and batchtools.slurm.tmpl.
future::plan(
  batchtools_slurm,
  template = "batchtools.slurm.tmpl",
  workers = 16
)

# Deploy your Python jobs to SLURM with drake.
make(py_plan, parallelism = "future_lapply")

# Alternatively, deploy with the new "future" backend from #227.
# It is experimental, but it has efficiency advantages when it comes to scheduling.
make(py_plan, parallelism = "future", jobs = 16)

This is a really important use case, and I am glad you filed an issue about it.
By the way, @kendonB has a lot of experience connecting drake to SLURM clusters.
Another advantage of aggressively porting to reticulate: if your Python functions can return in-memory objects instead of writing files, the plan can use ordinary R targets and skip the file system entirely. Something like this:

py_plan <- drake_plan(
  data = preprocess_data(),
  result = call_reticulate_commands_INDEX(data),
  strings_in_dots = "literals"
) %>%
  evaluate_plan(
    wildcard = "INDEX",
    values = 1:4
  )

DT::datatable(py_plan)
Incidentally, I would love to know how this approach works out for you.
The Python I'm using is a whole repository from GitHub (not something on PyPI with stable release versions or anything), and it's set up to be called from the command line (hence all the shell scripts). In that sense all the Python is happening 'under the hood', and I think going with shell calls rather than reticulate may be more practical.

Would what I'm describing still work for using drake's HPC support? If I call the scripts from the shell rather than through reticulate, will drake still notice when the underlying Python code changes?

Also, can processx handle commands like these on both Windows and Linux?

Perhaps another route for that question would be to not include the Python source code as a direct input in the plan.

Thanks very much for the feedback and suggestions. This is helpful already, and I hope my particular troubles aren't too obscure. It may have been easier for me to start figuring out drake on a workflow that was entirely in R.
Ah, so it seems like you have a command line tool that just happens to be implemented in Python. Is that right?
If it were practical to access the Python source directly, then source_python() with file_in(), as in the plan above, would let drake watch those scripts for changes.
I think so: https://github.com/r-lib/processx3. @krlmlr, please correct me if I am wrong.
If you call shell commands instead of the Python source, then you have the opposite problem: drake has no direct way to know when the underlying Python code changes, so you need to track the repository some other way. A few possibilities:
# Option 1: fingerprint the repo with its latest commit SHA via git2r.
library(git2r)
library(drake)
plan <- drake_plan(
  python_repo_fingerprint = {
    repo <- repository("python_repo")
    commits(repo)[[1]]@sha # Or tags(repo)[[1]]@sha if you want your project to be less brittle.
  },
  downstream_target = use_python_repo(fingerprint = python_repo_fingerprint),
  strings_in_dots = "literals"
)
# Option 2: read the git ref file directly, no git2r required.
library(drake)
plan <- drake_plan(
  python_repo_fingerprint = scan(
    file_in("python_repo/.git/refs/heads/master"),
    what = character(),
    quiet = TRUE
  ),
  downstream_target = use_python_repo(fingerprint = python_repo_fingerprint),
  strings_in_dots = "literals"
)
# Option 3: declare the individual Python files as inputs with file_in().
# The file_in() call alone makes the target depend on those scripts.
library(drake)
plan <- drake_plan(
  downstream_target = {
    file_in(
      "python_repo/src/file1.py",
      "python_repo/src/file2.py",
      "python_repo/src/file3.py"
    )
    use_python_repo()
  },
  strings_in_dots = "literals"
)
Containerization can seriously extend your project's shelf life and enhance reproducibility. As long as you keep a local copy of the final snapshot of the Python repo you use, it no longer matters what happens to the remote upstream copy later on. Rather than Docker, I recommend Singularity, which is friendlier for academic research workflows and HPC systems, though I have not actually used either in earnest.
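As a rough sketch of how a containerized step could still fit into a plan (the image name, script path, and output file below are hypothetical, and this assumes the singularity executable is on the PATH):

library(drake)

container_plan <- drake_plan(
  containerized_step = {
    processx::run(
      command = "singularity",
      args = c("exec", "python_tools.sif", "python", file_in("python_repo/src/file1.py"))
    )
    file_out("results/output.txt")
  },
  strings_in_dots = "literals"
)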
Since you have control over which snapshot of the Python repo you use, I think this is reasonable. Maybe workflows of that scale should not be that brittle anyway. In #6, I grappled with this issue for R packages specifically.
You're welcome, I am glad you find drake helpful.
Closing because I think I addressed most of the question, but let's keep talking on the thread.
I just wanted to follow up on your suggestion regarding grabbing the git fingerprint of the Python repo. I have a plan as follows (slightly modified for ease of reading):
A few questions:
Thanks a lot!
On reflection, I think
For commands you find cumbersome in the plan, I recommend wrapping the code into functions. NB:

create_folder <- function(create_GWAS_folder, dir) {
  if (!is.null(create_GWAS_folder)) {
    dir.create(dir, showWarnings = FALSE)
  }
}

analysis <- drake_plan(
  create_mod1_inter_folder = create_folder(
    create_GWAS_folder,
    file_out("analysis/mod1/interaction")
  ),
  # ...
)

More broadly, I would create all those directories within a single target, or better yet, ensure they all exist before running make() at all.
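A minimal sketch of that last idea, using the directory from the plan above (add whichever other output folders your workflow needs):

dirs <- c("analysis/mod1/interaction")
for (d in dirs) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}
make(analysis)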
Ah, I see we're talking about remote code files. It's been a long time since I looked at this issue, so I do not remember everything right away. For what it's worth, you can supply URLs of individual data/code files to file_in() (see the sketch below).

Does all this help? I am not sure if I fully understood your questions (maybe because it's late at night for me). Please let me know what remains unclear.
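For instance (the URL below is a placeholder, run_python_script() is a hypothetical wrapper, and this assumes a drake version recent enough to accept URLs in file_in()):

library(drake)

plan <- drake_plan(
  remote_script = file_in("https://raw.githubusercontent.com/some-user/python_repo/master/src/file1.py"),
  result = run_python_script(remote_script)
)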
Hello,
By the way, sorry if I've missed it somewhere, but are the wildcards necessary anymore? As far as I can tell, this:
is identical to this:
When should I use (or not use) the wildcards? I included them based on the FAQ in #353. Is that discussion outdated?
A couple recommendations on #277 (comment):
analysis <- drake_plan(
  make_dirs_out = create_analysis_dirs(file_out("analysis/mod1")),
  interaction_track = file_in(!!interaction_script),
  run_gwas_interaction = target({
      file_in("analysis/mod1")
      processx::run(
        command = "sh",
        args = c(interaction_track, variable),
        error_on_status = FALSE
      )
    },
    transform = map(variable = !!test_vars)
  )
)
Re #277 (comment), tidy evaluation via !! resolves the variable while drake_plan() builds the plan, so the literal file path ends up in the command and drake can track the file. With !!:

library(drake)
file <- "test.txt"
test <- drake_plan(
  file_name = file_in(!!file)
)
test
#> # A tibble: 1 x 2
#>   target    command
#>   <chr>     <expr>
#> 1 file_name file_in("test.txt")
config <- drake_config(test)
vis_drake_graph(config)

Created on 2019-12-12 by the reprex package (v0.3.0)

Without !!, the command keeps the unevaluated symbol, so drake does not register test.txt as a file dependency:

library(drake)
file <- "test.txt"
test <- drake_plan(
  file_name = file_in(file)
)
test
#> # A tibble: 1 x 2
#>   target    command
#>   <chr>     <expr>
#> 1 file_name file_in(file)
config <- drake_config(test)
vis_drake_graph(config)

Created on 2019-12-12 by the reprex package (v0.3.0)
Thanks for the suggestions. A few questions.
Just to clarify, the file_in("analysis/mod1") inside the target is just there to tell drake about the dependency, right?

So what do you recommend if a function is called that writes and/or reads a whole bunch of files? Ideally the function would have an input argument that names the folder containing the input/output?
Exactly, the goal is just to make sure that these files (or rather directories, in this case) exist before running the shell command. I hadn't realized that I could include multiple commands in a single target by wrapping them in curly braces.
Yes. The simple mention of file_in("analysis/mod1") inside the command is enough to make drake treat that directory as a dependency of the target.
"So what do you recommend if a function is called that writes and/or reads a whole bunch of files? Ideally the function would have an input argument that names the folder containing the input/output?"

A common pattern is to give the function a folder argument and declare that folder with file_in() or file_out() in the plan, so drake watches the directory as a whole.

I am realizing that my original suggestion for your analysis may not totally meet your need. I assumed nothing else modifies the contents of analysis/mod1 after the shell command runs.
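A minimal sketch of that directory-argument pattern, assuming a hypothetical run_gwas() function and a drake version whose file_out() accepts directories:

run_gwas <- function(out_dir) {
  # Writes many result files into out_dir (details omitted for this sketch).
  invisible(out_dir)
}

analysis <- drake_plan(
  gwas_results = run_gwas(file_out("analysis/mod1"))
)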
Even easier: if you're only creating directories, why not do that outside the plan and drop those targets from the plan altogether?
Ah I see, that all makes sense. Thanks very much for all the help!
Again, I would like to plug #1178.
Adding this here so it can be tagged in the FAQ, as per @wlandau's request.
I'm not sure how drake will handle a workflow that has steps outside of R. My workflow consists of a series of rough steps, only some of which happen in R and are natural drake targets; the big analytical steps in between run outside of R through shell scripts that call Python. And so on: the workflow continues through further big analytical steps similar to the non-R one described above.
It seems that system2() is a way to handle system commands from within R. I am not sure if commands will need to be different based on OS (e.g., I sometimes run parts of this workflow on my Windows laptop, but the rest is executed on a Linux cluster).
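On the OS question, one option (the script path here is hypothetical, and the Windows branch assumes a bash such as Git Bash or WSL is installed) is to wrap the call in a small helper that branches on the platform, so the same plan runs on both machines:

run_script <- function(script, args = character()) {
  if (.Platform$OS.type == "windows") {
    processx::run("cmd", c("/c", "bash", script, args)) # assumes bash is available on Windows
  } else {
    processx::run("sh", c(script, args))
  }
}

# Usage inside a plan:
# drake_plan(step2 = run_script(file_in("scripts/run_analysis.sh"), "--cores=4"))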