Accommodation of script-based imperative workflows #994
Comments
I agree that promoting the usage of scripts rather than functions should be avoided. My thought was to allow users to bring the pre-existing workflows they have into drake. If source_in() were used and it promoted the usage of the other "traditional" drake functions (i.e. file_in(), file_out(), loadd(), readd()) to find and document dependencies, I think there would be less friction. In addition, loud warnings and complaints about using external scripts could be added.
I totally agree.
It seems odd to make things smoother with scripts. If the issue is converting old imperative workflows to drake, maybe something like this is already enough:

```r
library(drake)
parse_script <- function(file) {
lines <- paste(readLines(file), sep = "\n")
code <- parse(text = lines, keep.source = FALSE)
out <- call("{")
for (i in seq_along(code)) {
out[[i + 1]] <- code[[i]]
}
out
}
script1 <- tempfile()
script2 <- tempfile()
writeLines(c("data <- my_data()", "munge(data)"), script1)
writeLines("analyze(munged)", script2)
plan <- drake_plan(
munged = !!parse_script(script1),
analysis = !!parse_script(script2)
)
drake_plan_source(plan)
#> drake_plan(
#> munged = {
#> data <- my_data()
#> munge(data)
#> },
#> analysis = {
#> analyze(munged)
#> }
#> )
config <- drake_config(plan)
vis_drake_graph(config)
```

Created on 2019-08-22 by the reprex package (v0.3.0)
All we have to do is add this new function to drake. I consider this approach far lighter than the alternatives.
To emphasize: the plan itself will contain the script's code, so drake's usual dependency detection still applies.
I was thinking along the same lines. My working version is similar, and I was also looking for ways to add the file tracking that you get with file_in().
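A minimal sketch of how file_in() tracking could combine with a generated function (the helper name script_to_function() and the script path are hypothetical here; file_in() itself is real drake API):

```r
do_munge <- script_to_function("02_munge.R")  # hypothetical helper

plan <- drake_plan(
  # file_in() makes drake watch the script file itself,
  # so editing 02_munge.R invalidates the munged target.
  munged = do_munge(file_in("02_munge.R"))
)
```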
Speed

This feature will probably not cause a bottleneck, but speed is still worth a look. A quick benchmark of the parsing options:

```r
library(microbenchmark)
lines <- readLines(url("http://bioconductor.org/biocLite.R"))
# Used throughout drake's internals:
drake:::safe_parse("# 123")
#> expression()
microbenchmark(
parse_src = parse(text = lines, keep.source = TRUE),
parse = parse(text = lines, keep.source = FALSE),
str = str2expression(lines),
drake = drake:::safe_parse(lines)
)
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> parse_src 862.956 907.467 969.0071 918.4455 992.3515 1484.220 100
#> parse 568.086 596.110 622.2085 602.0705 629.5830 844.246 100
#> str 567.813 596.388 626.6053 601.8305 646.4530 881.024 100
#> drake 579.209 604.267 636.5746 612.3220 655.9230 845.641 100
```

Created on 2019-08-23 by the reprex package (v0.3.0)

Implementation

Your one-liner is way better than my parse_script() above.

```r
script_in <- function(path) {
safe_parse(c("{", readLines(path), "}"))
}
```

More thoughts on names

This is the hardest part.

script_in()

On its own, this would be an excellent name. I just worry that it is too much like the existing file_in() and knitr_in().

source_script()

You brought up a great point yesterday that from the user's perspective, the script will ultimately be sourced for all intents and purposes when the user calls make().

inline_script()

Inline functions are already an established concept in C/C++, and the analogy roughly fits here.

insert_script()

This one is my favorite. It evokes what is really going on, and I think people will know what we mean by insertion.

include_script()

I thought about it, but I don't like it that much. Inclusion does not necessarily mean we literally insert the code (e.g. it could just mean referencing the file).

file_in()-like tracking

The code will already be tracked in the plan, so I believe this will not be necessary.
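For a sense of what the user-facing call might look like under the favorite name (a sketch only; none of these functions exist in drake, and the script names are hypothetical):

```r
plan <- drake_plan(
  munged   = insert_script("02_munge.R"),
  analysis = insert_script("03_analyze.R")
)
```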
A big problem with this whole approach is that we will end up with an enormous plan that users will ultimately need to refactor. How about converting each script into a function instead?
```r
library(drake)
script_to_function <- function(path) {
lines <- readLines(path)
lines <- c("function(...) {", lines, "}")
text <- paste(lines, sep = "\n")
drake:::safe_parse(text)
}
munge_script <- tempfile()
analysis_script <- tempfile()
writeLines(c("data <- my_data()", "munge(data)"), munge_script)
writeLines("analyze(munged)", analysis_script)
do_munging <- script_to_function(munge_script)
do_analysis <- script_to_function(analysis_script)
do_munging
#> function(...) {
#> data <- my_data()
#> munge(data)
#> }
do_analysis
#> function(...) {
#> analyze(munged)
#> }
drake_plan(
munged_value = do_munging(),
analysis_value = do_analysis(munged_value)
)
#> # A tibble: 2 x 2
#> target command
#> <chr> <expr>
#> 1 munged_value do_munging()
#> 2 analysis_value do_analysis(munged_value)
```

Created on 2019-08-23 by the reprex package (v0.3.0)
Benefits
Concerns
It looks like your example wouldn't run, because the body of do_analysis() refers to munged, which is not defined when the function is called. What about something like this?

```r
script_to_function <- function(path, args) {
#need to check that args is an unnamed character vector of valid names
lines <- readLines(path)
lines <- c("function(", paste(args, sep = ", "), ") {", lines, "}")
text <- paste(lines, sep = "\n")
drake:::safe_parse(text)
}
```
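A usage sketch under the assumption of the args version above, reusing the temp script files from the earlier example; passing args = "munged" gives the generated function a formal argument so the script's reference to munged resolves:

```r
do_munging  <- script_to_function(munge_script, args = character(0))
do_analysis <- script_to_function(analysis_script, args = "munged")

plan <- drake_plan(
  munged_value   = do_munging(),
  analysis_value = do_analysis(munged = munged_value)
)
```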
Another issue that I expect would bite new users of this method is that file dependencies inside the script would be untracked by drake.
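One possible mitigation (a sketch, not something settled in this thread; the file name and generated function are hypothetical): pass the raw files through file_in() in the plan command so drake still watches them.

```r
plan <- drake_plan(
  # drake tracks raw_data.csv even though the script reads it internally.
  data_step = do_data(file_in("raw_data.csv"))
)
```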
In this scenario, the target names are unrelated to the preexisting scripts of a non-drake workflow.
Just occurred to me: people use R Markdown for script-oriented workflows too. drake already has code that extracts chunks from knitr / R Markdown reports:

Lines 777 to 790 in 6fca3fd

where the tangling helper itself is defined here:

Lines 301 to 312 in 6fca3fd

If we go with chunk extraction, should the new function handle R Markdown files automatically? We could add an argument to control it, or detect the format from the file itself.
We might want to make code_to_function() more similar to code_to_plan(). Are you thinking that the Rmd would still be able to be rendered at each step too?
I was thinking we could just extract the code from the active chunks and stick it in a function. No rendering required. Sound good?
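For reference, knitr::purl() is one way to extract chunk code without rendering (a sketch; the implementation below ends up using an internal tangling helper instead):

```r
# Extract the code from an R Markdown file's chunks without rendering it.
extract_chunk_code <- function(path) {
  tangled <- tempfile(fileext = ".R")
  knitr::purl(path, output = tangled, documentation = 0, quiet = TRUE)
  readLines(tangled)
}
```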
Another thing: we need to think about the return values of the functions. For example, if all the functions generated by code_to_function() return the same value every time, downstream targets will not know to rerun when an upstream script changes. One idea:

```r
code_to_function <- function(path) {
lines <- readLines(path)
knitr_pattern <- "^(### chunk number|<<[^>]*>>=|```\\{r.*\\})"
if (any(grepl(knitr_pattern, lines))) {
lines <- get_tangled_text(path)
}
lines <- c(
"function(...) {",
lines,
"standardize_function(sys.function())", # From drake. Calls deparse(), but this shouldn't be a bottleneck...
"}"
)
text <- paste(lines, sep = "\n")
eval(safe_parse(text))
}
```
So then user-side refactoring might not be simple. We really need to talk about the sophisticated dependency tracking you get when you start with functions. But at least we can slap a strong recommendation in the manual to eventually refactor into functions.
What do you think about adding file_in()-style tracking of the script itself?
Would you sketch what you are thinking? Not 100% sure I follow exactly. I could be convinced otherwise, but my current (and strong) preference is to connect the concept of a script to the concept of a function that gets defined outside the plan, and then all the plan has to do is connect the predefined pieces together (#994 (comment)). This is when drake workflows are cleanest and most manageable.
I see. My thought is this: the generated function would track the script itself as a file, not just its parsed code.

Is this totally against how you want to incorporate external R scripts? This was another shower thought whose functionality I wanted to explore. In addition, this setup allows a rebuild to be triggered when the file changes, so updates are captured. The function setup does not need to be done within the plan, but it was a thought.
TL;DR - also seems you're still in deep discussion.
@thebioengineer, clever, but I would rather not go that route because it tracks scripts as files instead of totally relying on the parsed code in the R session. I think code_to_function() should be like source("R/functions.R"). The goal is to get people closer to using drake properly, and I also think we should keep the implementation simple. @pat-s, I am curious if #994 (comment) + #994 (comment) + a walkthrough in the manual would have helped you better understand drake back when you posted #193.
@wlandau Gotcha. I was approaching it as if the user would want to just update their script and then run make(plan), without having to think about regenerating the plan object. Thinking more, the file_in() really didn't need to exist; it just added tracking for me. The steps to add drake to a pre-existing workflow would then be:

1. Keep the existing scripts as they are.
2. Convert each script into a function with code_to_function().
3. Write a drake_plan() that connects the generated functions, passing upstream targets as arguments.
4. Run make(plan).
Does that workflow jive with your mental model?
👋 Hey @thebioengineer... Letting you know,
Yeah, that's basically it! Here's how I would explain it. Suppose we have a workflow with traditional scripts 01_data.R, 02_munge.R, and 03_analyze.R, where the top-level script is:

```r
source("01_data.R")
source("02_munge.R")
source("03_analyze.R")
```

I propose we change the top-level script to:

```r
library(drake)
do_data <- code_to_function("01_data.R")
do_munge <- code_to_function("02_munge.R")
do_analyze <- code_to_function("03_analyze.R")
plan <- drake_plan(
data = do_data(),
munged = do_munge(data),
analysis = do_analyze(munged)
)
make(plan)
```

Take-home messages:

- The scripts themselves do not need to change; each one just becomes the body of a function.
- The plan connects the generated functions, and make(plan) figures out the correct order.
- When a script changes, its generated function changes, so only the affected targets rebuild.
This is what I am planning to flesh out for ropensci-books/drake#41.
The goal is to make the conceptual leap from imperative (script-oriented) workflows to function-oriented workflows. Does #994 (comment) accomplish this? I have trouble understanding why function-oriented workflows are so confusing to people, so I do not always know how to help.
Gotcha. I think most people are trained to approach workflows as a series of steps and use functions more as a "DRY" principle, rather than seeing each step as an opportunity to write a function. At least that has been my typical approach. I think working in tandem with the manual to develop the idea of how to translate workflow steps into functions is the key.
I think this is good: #994 (comment). It's a clean approach for users to try out drake. Additionally, when users finally decide to adopt the drake approach, something like write_code_to_function() could write the generated functions out to an R file.

From there on, users can start using the drake function approach.
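A rough sketch of what such a helper could look like (hypothetical; drake does not export write_code_to_function()):

```r
# Append a function version of a script to a file of hand-maintained functions,
# to help users transition from generated functions to ordinary ones.
write_code_to_function <- function(path, name, out = "R/functions.R") {
  lines <- readLines(path)
  fn <- c(
    paste0(name, " <- function(...) {"),
    paste0("  ", lines),
    "}",
    ""
  )
  cat(fn, file = out, sep = "\n", append = TRUE)
}
```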
Yeah, let's see if there are other parts of the transition we can automate. write_code_to_function() certainly helps get the idea across, but duplicated code in the same workspace can be difficult to maintain, especially if users need to go back and forth between the drake and non-drake stuff. I will mull it over.
#994 (comment) is certainly a nice start that makes things clearer. From my side, I would recommend making drake's function-based approach much clearer right from the start and linking to the appropriate examples in the manual. Leaving aside how drake deals with script-based workflows in the end, users need to be briefed right from the start that their view of building a workflow might need to change (or drake might even softly push them to change). In my field (ecological modelling), almost all people come from a script-based workflow, simply because they lack the skill of writing functions (or just the experience).
@pat-s Good points. Getting drake to "accept" script-based workflows might not be the most difficult part; the harder part is convincing people of the value of workflow management. I am going to pore over the current manual to see how we can make the value added for script-based workflows more apparent, and how to move from a script-based flow to a function-based one!
I recommend chapter 5 ("drake projects"), which explains how best to organize code into files for drake. @thebioengineer, are you saying you want traditional imperative script-based workflows to be a final solution/destination for drake use? Do you think we can do that in a way that stays true to drake's core values?
My comment was based on @pat-s's comment that a number of people in his field have difficulty seeing the value in function-based workflows. Making drake handle script-based workflows, but with a little friction like we have now, might be an opportunity to then espouse the value of converting to more of a function-based workflow. In the manual we could provide some advice on how to perform the conversion and what exactly the value added is: not just that it is easier for drake, but that it is easier on the person maintaining the workflow to have discrete functions that perform specific tasks.
Ah, got it, thanks for clarifying. I agree that we should argue for a function-based approach in and of itself, drake or no drake.
@thebioengineer, do you still want to submit a PR with code_to_function() + docs, or should I? I realize you may have been waiting for me to return to the office.
Hey @wlandau, yes, sorry, this has been in my queue to submit. I will do it tonight! I have been thinking about how best to incorporate drake into my existing script-based workflows so I can speak more from experience in the manual about how to use this function and what it is for. Hopefully what I write ends up coherent :)
👋 Hey @thebioengineer... Letting you know,
Re #1007 (comment), I propose a different return value for the generated functions:

```r
code_to_function <- function(path) {
lines <- readLines(path)
knitr_pattern <- "^(### chunk number|<<[^>]*>>=|```\\{r.*\\})"
if (any(grepl(knitr_pattern, lines))) {
lines <- get_tangled_text(path)
}
lines <- c(
"function(...) {",
lines,
"c(format(Sys.time(), \"%Y-%m-%d %H:%M:%OS9 %z GMT\"), tempfile())",
"}"
)
text <- paste(lines, sep = "\n")
eval(parse(text = text))
}
```
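A small sketch of how that return value plays out in a plan (the generated functions are hypothetical): because each one returns a fresh timestamp, a downstream target that takes it as an argument reruns whenever the upstream target does, much like a Makefile rule.

```r
plan <- drake_plan(
  munge_step   = do_munge(),             # returns a new timestamp every run
  analyze_step = do_analyze(munge_step)  # reruns whenever munge_step reruns
)
```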
I like that solution; as you said over in #1007, it should then act like a basic make system. I was flying yesterday, so I had come up with this solution before I saw your response.
It evaluates the environment that gets created when the script is run and returns that environment. If the environment changes, the output changes and triggers downstream builds.
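A minimal sketch of one way that environment-based idea could look (hypothetical helper name; not the exact snippet from the discussion):

```r
# Generate a function that runs the script in a fresh environment and
# returns everything the script created, so value changes propagate downstream.
code_to_function_env <- function(path) {
  force(path)
  function(...) {
    env <- new.env(parent = globalenv())
    sys.source(path, envir = env)
    as.list(env)
  }
}
```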
Interesting. Unfortunately, though, I think #994 (comment) is likely to get us into trouble.
We are bypassing drake's usual code analysis, and the returned environment can contain objects (such as external pointers) whose values change from run to run even when nothing meaningful has changed.
I see. That makes sense; I had forgotten about the cases where raw pointers can change. As I said, I came up with that solution before I saw yours and wanted to propose it. I will incorporate the proposed solution and add additional tests!
@wlandau, I have added a test that I think checks the idea. I have resolved a number of the lintr comments and added tests for using Rmd files as scripts as well.
I resolved this by adding
Awesome! I will look at your tests.
Thank you @thebioengineer for implementing this in #1007!
Prework

I read drake's code of conduct.

Idea

Suggested by @thebioengineer. May improve the migration of old projects to drake and contribute to ropensci-books/drake#41. script_in("file.R") could be shorthand for source("file.R")$value at runtime. At code analysis time, script_in() could tell drake to analyze the code in file.R for dependencies in the usual way.

Concerns
This way of doing things goes against drake's function-oriented style, and it makes more room for suboptimal programming practices. Plus, users can already achieve script-based behavior like this:
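For instance, a plan command can already source a script directly (a sketch with a hypothetical script name; file_in() keeps drake watching the file):

```r
library(drake)

plan <- drake_plan(
  munged = source(file_in("munge_script.R"))$value
)
```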
I am eager to discuss, and my mind could be changed. However, my current opinion is that we should not make script-based imperative workflows easier. I think we should keep nudging users to write functions.