Use language to mark input and output files #232
I will be glad when we have this feature. What should we do about target names in workflow plans? Should we automatically use the names inside `file_output()`?
Suggestion: use a dummy target wherever there is a `file_output()`.
@AlexAxthelm thinking more about your idea, it comes really close to working. Here's how I picture it: the user sets up the workflow plan without worrying about files in the target names, and the file connections get sorted out when the dependency graph is constructed. There are still a few loose ends.
Desired behavior for `file_input()`:

```r
file_input(x, 'y', "z")
## [1] "x" "y" "z"
```

The option to drop quotes would be super nice to have, though it may be tricky because of how the commands get parsed and deparsed.
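One possible way to get that behavior (a minimal sketch assuming plain non-standard evaluation, not necessarily how `drake` will end up doing it): capture the unevaluated arguments and deparse anything that is not already a string.

```r
# Minimal sketch of quote-optional file markers via non-standard evaluation.
# The exact behavior here is an illustrative assumption.
file_input <- function(...) {
  args <- as.list(substitute(list(...)))[-1]  # unevaluated arguments
  vapply(args, function(a) {
    if (is.character(a)) a else deparse(a)    # keep strings, deparse symbols
  }, character(1))
}

file_input(x, 'y', "z")
## [1] "x" "y" "z"
```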
Putting this issue together with #7, #9, and #190, I picture the interface evolving like this:

```r
knit('report.Rmd', output = "report.md")                  # Current `drake`.
knit(file_input("report.Rmd"), file_output("report.md"))  # After this issue is solved.
knit(file_input(report.Rmd), file_output(report.md))      # Quoting should be optional here.
```

There are some potential setbacks, though.
Language objects in R are just nested symbols. Couldn't you just traverse the language object and search for calls like `file_input()` and `file_output()`? This makes the detection a recursive walk.
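For illustration, here is a minimal sketch of that kind of recursive walk (the helper `find_file_calls()` is hypothetical, not drake's actual code, and it ignores namespaced calls like `drake::file_input()`):

```r
# Walk a language object recursively and collect the arguments of
# file_input() and file_output() calls.
find_file_calls <- function(expr) {
  found <- list(inputs = character(0), outputs = character(0))
  if (!is.call(expr)) {
    return(found)                      # symbols and literals have no children
  }
  fn <- expr[[1]]
  args <- as.list(expr)[-1]
  if (is.name(fn) && as.character(fn) == "file_input") {
    found$inputs <- vapply(args, deparse, character(1))
  }
  if (is.name(fn) && as.character(fn) == "file_output") {
    found$outputs <- vapply(args, deparse, character(1))
  }
  for (arg in args) {                  # recurse into nested calls
    nested <- find_file_calls(arg)
    found$inputs <- c(found$inputs, nested$inputs)
    found$outputs <- c(found$outputs, nested$outputs)
  }
  found
}

find_file_calls(quote(knit(file_input("report.Rmd"), file_output(report.md))))
## $inputs
## [1] "\"report.Rmd\""
##
## $outputs
## [1] "report.md"
```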
Seems like that would work in this situation. Still, I think it's a good opportunity to think about how to re-represent the syntax tree in a way that's easier to work with. There may be more complicated edge cases we want to handle in the future. Currently, when I traverse an expression, I have to check cryptic low-level conditions at every node.
And then there's a whole separate walk to look for namespaced function calls.
Another advantage of focusing on the graph: we could build it directly while analyzing the commands. This could be a whole lot faster and more elegant than `build_drake_graph()`, which is how `drake` currently assembles the dependency network.
@dapperjapper I somehow missed this explicit guidance on walking expressions. After reading it, my thinking aligns more with yours. It seems practical to walk the expressions directly and collect the special calls along the way. We might still return a structured list of the detected dependencies.
FYI: I just started an implementation in a development branch.
By the way, here is how the analysis handles a whole function body:

```r
f <- function(){
  knitr::knit(file_input("file.Rmd"), output = file_output("file.md"))
  render(file_output(a, b, c), input = file_input(x, y, z))
}

special_dependencies(f)
## $namespaced_functions
## [1] "knitr::knit"
##
## $file_inputs
## [1] "\"file.Rmd\"" "\"x\""        "\"y\""
## [4] "\"z\""
##
## $file_outputs
## [1] "\"file.md\"" "\"a\""       "\"b\""       "\"c\""
##
## $knitr_sources
## [1] "\"file.Rmd\"" "\"x\""        "\"y\""
## [4] "\"z\""
```
Another thing, @dapperjapper: quoting is now optional inside `file_input()` and `file_output()`:

```r
file_output(my_report.html)
## [1] "my_report.html"

file_input(data.csv, 'spreadsheet.xlsx', "knitr_source.Rnw")
## [1] "data.csv"         "spreadsheet.xlsx" "knitr_source.Rnw"
```
Hmm... thinking about a special case where the file handling is hidden inside another function:

```r
drake_plan(file.md = disguise('file.Rmd')) # old API

disguise <- function(input){
  knit(input, output = "file.md")
}
```

This will also make the file connections hard to detect from the command alone.
Does it understand the following? I think it's fine for now if it doesn't...

```r
"file.csv" %>% file_input() %>% read_csv()

x <- "filename"
file_output(str_c(x, ".md"))
```
Good point. I was not planning on it for the first iteration, but we can definitely work on it. Kicking off this feature is already super complicated.
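For what it's worth, here is a tiny illustration of why dynamically computed names are hard for a purely static pass: the unevaluated command only contains the expression, not the final path (the example assumes `stringr::str_c()` as above).

```r
# The static analyzer only sees the literal expression, so it cannot know the
# eventual file name without evaluating str_c(x, ".md").
deparse(quote(file_output(str_c(x, ".md"))))
## [1] "file_output(str_c(x, \".md\"))"
```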
Another preview, this time using a single command wrapped in `quote()`:

```r
one_command <- quote({
  x <- y + z
  # Scoping is optional for knitr_input(), file_input(), and file_output().
  rmarkdown::render(drake::knitr_input(report.Rmd), output_format = "all")
  file_output(report.md, report.pdf, report.html) # Outputs we care about.
  ## The actual value of the target is always the command's return value.
  readRDS(file_input(data.rds))
})

special_dependencies(one_command)
## $candidate_globals
## [1] "{"          "<-"         "x"          "+"          "y"
## [6] "z"          "::"         "rmarkdown"  "render"     "drake"
## [11] "\"all\""   "readRDS"
##
## $namespaced_functions
## [1] "rmarkdown::render"
##
## $knitr_input
## [1] "\"report.Rmd\""
##
## $file_output
## [1] "\"file_output\"" "\"report.md\""   "\"report.pdf\""
## [4] "\"report.html\""
##
## $file_input
## [1] "\"file_input\"" "\"data.rds\""
```
I no longer think we should create a new target for every file output.
File outputs are trickier than I thought. Just like any other target, we need to check if a file output has changed or needs processing. To maintain the parallelism and the checking, maybe that does mean we need to submit each file output to its own parallel job. I think that's what I'll implement as a first go-round. We can revisit these and other efficiency issues later.
Also: internally, I think all files should be labeled with double quotes, e.g. `"report.md"` rather than `'report.md'`.
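For illustration only (the helper name below is hypothetical, not part of drake), the double-quote convention could be applied by a tiny labeling function:

```r
# Hypothetical helper: wrap a file path in literal double quotes so that file
# targets get a consistent internal label.
as_file_label <- function(path) {
  paste0("\"", path, "\"")
}

as_file_label("report.md")
## [1] "\"report.md\""
```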
Something counterintuitive I just realized: suppose we have a plan with file inputs and outputs:

```r
drake_plan(
  return_value = {
    contents <- read.csv(file_input(input.csv))
    write.csv(contents, file_output(output.csv))
  }
)
```

Where does `output.csv` belong in the dependency graph? But I think I figured it out.
I was wrong: file outputs are effectively both upstream and downstream of the target. But to make sure the command actually runs before downstream targets use the file outputs, the file outputs should lie downstream of the target / return value in the graph.
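A rough sketch of the intended topology, assuming the plan above plus a made-up `downstream_target` that reads `output.csv` (this is just an illustration with igraph, not drake's internals): with the file output downstream of the command, a topological sort forces the command to run before anything that consumes the file.

```r
# Edge list for the hypothetical plan: "input.csv" feeds the command, and
# "output.csv" is produced by it, so the file sits downstream.
library(igraph)

edges <- data.frame(
  from = c("\"input.csv\"", "return_value",   "\"output.csv\""),
  to   = c("return_value",  "\"output.csv\"", "downstream_target"),
  stringsAsFactors = FALSE
)
graph <- graph_from_data_frame(edges, directed = TRUE)

as_ids(topo_sort(graph))  # a valid build order for the scheduler
## [1] "\"input.csv\""     "return_value"      "\"output.csv\""
## [4] "downstream_target"
```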
What if, when a target has a file output, we did something like this?

```r
myplan <- drake_plan(
  return_value_temp = {
    contents <- read.csv(file_input(input.csv))
    write.csv(contents, file_output(output.csv))
    contents
  },
  output.csv = file_output_of(return_value_temp),
  return_value = return_value_temp
)
```

where `file_output_of()` would mark `output.csv` as a file created by the `return_value_temp` command. Just a sketch of an idea, but I'm guessing that something like this would allow for interpreting intent and translating it into the existing paradigm.
Are you saying we need a way to match each `file_output()` to the target whose command produces it?
That sounds like a great plan. My thought of making dummy targets for the file outputs was heading in a similar direction.
And to answer the actual question: yes, I was thinking that there should be a single explicit dependency on `return_value_temp`. We want `drake` to know which command actually produces each output file.
Why is all this extra complexity needed? Because there can now be more than one tracked output file per target? If this starts to get too complicated, I would be in favor of just enforcing one file output per target.
@dapperjapper I am starting to agree. The overall concept is slowly gaining clarity, but my attempt to depart from bijectivity (allowing multiple file outputs per target) is not going well.
Another option would be:

```r
drake_plan(
  `file_output(target.csv)` = write_csv(data, "target.csv")
)
```

Clunky to write, but it is clearer what's going on.
I would be in favor of something like this, especially if it helps move away from the single/double quote issues that I have.
I thought about that too, but it requires repeating the target name on the left-hand side as well. Here is what I am leaning toward instead:

```r
drake_plan(
  {
    function_inside_unnamed_command()
    x <- read.csv('input.csv') # old API
    write.csv(summarize(x), file_output(output.csv))
  },
  no_files_here(but_no_target_name_either),
  overwritten_target_name = knit(knitr_input(report.Rmd), file_output(report.md)),
  true_target_name = my_function(file_input(values.rds) * 3.14)
)
## # A tibble: 4 x 2
##   target           command
##   <chr>            <chr>
## 1 "\"output.csv\"" "{\n    function_inside_unnamed_command()\n    x <- read.csv…
## 2 drake_target_2   no_files_here(but_no_target_name_either)
## 3 "\"report.md\""  knit(knitr_input(report.Rmd), file_output(report.md))
## 4 true_target_name my_function(file_input(values.rds) * 3.14)
```

All files are still quoted (double-quoted), but this happens automatically on the backend. We avoid potential name conflicts with non-file targets that way, and it's easy to differentiate between files and non-files. I am open to suggestions about doing away with any kind of quoting, but it would require a lot of error-prone refactoring internally.
Hi all, sorry I took a bit to chime in here...

My somewhat solicited advice (I was tagged in here and work in this space extensively) is to really, really consider how important optional quotes are to you. I can almost promise it's not worth the loss of clarity, especially when, as in this case, the thing without quotes around it is actually a string value: it's a path. Just have them put quotes around it. That said (and again, I really, honestly think you shouldn't), you can easily allow things inside `file_input()` and `file_output()` to go unquoted with a simple custom handler for calls to those functions within the CodeDepends framework.
Sorry, I'm playing catch-up a bit here. Can you give me an example of where you need the nesting and what you're using it for? As for knowing that the stuff inside `file_input()` is a file, the function handler framework I put into CodeDepends is designed explicitly for this kind of thing. codetools has the problems you mention above, but they would be very easy to avoid in the CodeDepends framework.
It's a side effect, though thinking in CodeDepends terms (i.e. inputs and outputs of an expression), you could argue that the file itself (not `output.csv` the symbol) is an output of that expression; then any other expressions that have it (again, the file, not the symbol) as an input would have a dependency on that expression. That said, even though conceptually I made a very strong (and important) distinction between the symbol and the file itself, I'd have to think a bit more about how far you could get by just spoofing that symbol as an output here and an input elsewhere. You'd need to know it ultimately is a file so you could check its existence, but you mention elsewhere that that is pretty easy, and for the actual dependency modeling I don't know if that matters so long as it's handled consistently throughout the entire static analysis pass.
Thanks, @gmbecker.
You have convinced me. And for full paths with spaces or other special characters, quotes would be needed anyway.
Actually, the nesting didn't turn out to be as much of a problem as I thought. Previously, I assumed we would have to detect whether a `file_input()` or `file_output()` call was nested inside some other function call.
I am ready to propose an implementation! Please see #258.
Allows storing commands as a list of language objects. A couple of versions from now, we won't need to distinguish between `'` and `"` anymore.