
Use language to mark input and output files #232

Closed · krlmlr opened this issue Feb 4, 2018 · 35 comments

@krlmlr (Collaborator) commented Feb 4, 2018

drake_plan(
  knit(file_input("report.Rmd"), output = file_output("report.md"), ...)
)

Storing commands as a list of language objects (possible in a future version) would mean we no longer need to distinguish between ' and ".

@wlandau (Member) commented Feb 11, 2018

I will be glad when we have this feature.

What should we do about target names in workflow plans? Should we automatically use the names inside file_output() calls and ignore what the user puts in the target column? If so, how do we make sure the file_output() names do not conflict with symbolic targets? To minimize refactoring, we could sanitize the workflow plan so that the strings inside file_output() become double-quoted target names in the target column. Other approaches would require more refactoring but could be better in the long run.
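
For illustration, here is a minimal sketch of that sanitization step, assuming commands are stored as strings; first_file_output() is a hypothetical helper, not drake's actual sanitizer.

first_file_output <- function(command) {
  # Hypothetical: find the first file_output("...") call in the command text.
  match <- regmatches(
    command,
    regexpr("file_output\\(['\"][^'\"]+['\"]\\)", command)
  )
  if (!length(match)) return(NA_character_)
  # Strip the wrapper and double-quote the file name for the target column.
  paste0("\"", gsub("file_output\\(['\"]|['\"]\\)", "", match), "\"")
}
first_file_output(
  "knit(file_input('report.Rmd'), output = file_output('report.md'))"
)

## [1] "\"report.md\""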

@AlexAxthelm (Collaborator) commented:

Suggestion: use a dummy target wherever there is a file_output(), and have an explicit dependency on that dummy target for the file. I don't know how to do that without breaking old plans, though.

@wlandau (Member) commented Feb 12, 2018

@AlexAxthelm, thinking more about your idea, it comes really close to working. Here's how I picture it: the user sets up the workflow plan without worrying about files in the "target" column. The symbolic LHS targets are always return values from commands, and we can reproducibly track them along with any files generated.

drake_plan(
  some_return_value = {
    knit(file_input("report.Rmd"), output = file_output("report.md"), ...)
    write.csv(some_output, file_output(some_output.csv)) # quoting should be optional in `file_*()`
  }
)

## # A tibble: 1 x 2
##   target            command                                                   
##   <chr>             <chr>                                                     
## 1 some_return_value "{\n    knit(file_input('report.Rmd'), output = file_outp…

When the dependency graph is constructed:

  1. some_return_value should depend on report.Rmd.
  2. Every file_output() in the command should depend on some_return_value and all its dependencies. (A dependency on some_return_value alone will not usually be enough.)
  3. The command of each file_output() should be a non-executing representation of the command of some_return_value.
  4. file_output()s and file_input()s of imported functions should be resolved this way too.

[screenshot: sketch of the resulting dependency graph]
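
To make rules 1 and 2 concrete, here is a rough sketch (toy code, not drake internals) of the edges those rules imply for the example plan above:

library(igraph)
edges <- data.frame(
  from = c("report.Rmd", "some_return_value", "some_return_value"),
  to   = c("some_return_value", "report.md", "some_output.csv")
)
g <- graph_from_data_frame(edges)
plot(g) # report.Rmd -> some_return_value -> {report.md, some_output.csv}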

Loose ends:

  1. We still need to differentiate between file and non-file targets somehow. This seems like the sort of thing to wrap inside drake_meta().
  2. Do we need to expand the workflow plan to account for this? I would rather take it into account when we construct the internal igraph of the workflow.

@wlandau (Member) commented Feb 12, 2018

Desired behavior for file_input() and file_output():

file_input(x, 'y', "z")

## [1] "x" "y" "z"

The option to drop quotes would be super nice to have, though it may be tricky because of how codetools::findGlobals() finds dependencies.
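
A minimal sketch of that behavior using substitute(), not drake's eventual implementation: deparse unquoted symbols and pass strings through unchanged.

file_input <- function(...) {
  args <- as.list(substitute(list(...)))[-1] # unevaluated arguments
  vapply(
    args,
    function(a) if (is.character(a)) a else deparse(a),
    character(1)
  )
}
file_input(x, 'y', "z")

## [1] "x" "y" "z"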

@wlandau (Member) commented Feb 14, 2018

Putting this issue together with #7, #9, and #190, I think drake needs its own custom static code analysis. The current approach is to call codetools::findGlobals() and then work around all the exceptions. That makes it hard to depart from what codetools already does, and with file_input() and file_output(), I think we hit a wall. For example, the following three workflow plan commands mean the same thing, but they present problems for drake's current approach.

knit('report.Rmd', output = "report.md")                 # Current `drake`.
knit(file_input("report.Rmd"), file_output("report.md")) # After this issue is solved.
knit(file_input(report.Rmd), file_output(report.md))     # Quoting should be optional here.

Example setbacks:

  • In knitr.R, the file_input() would interfere with the detection of "report.Rmd".
  • In the last line above, codetools thinks report.Rmd is an R variable.

Drake's goals for static code analysis are so different that I think we need to depart from the landscape of existing tools such as codetools, CodeDepends by @gmbecker and @duncantl, and globals by @HenrikBengtsson. Not only do we need to get all the symbols and strings from an expression, we also need to know how they are nested. I suggest we:

  1. Get the abstract syntax tree as a sensible data structure. A list or igraph would be ideal. It's a pity that pryr::ast() just outputs text that needs to be parsed all over again. Drake version 1 used pryr::ast(), and the text processing made it extremely prone to errors.
  2. Rely heavily on the explicit hierarchy in the AST to detect the file outputs and the dependencies, as well as the decisions knitr and rmarkdown will make.
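
As an illustration of step 1, a sketch (not drake's code) that flattens an expression into an edge list, which graph_from_data_frame() can turn into an igraph:

ast_edges <- function(expr, parent = "<root>", counter = new.env()) {
  if (is.null(counter$i)) counter$i <- 0
  counter$i <- counter$i + 1
  id <- paste0(counter$i, ": ", deparse(expr)[1]) # unique node label
  edges <- data.frame(from = parent, to = id, stringsAsFactors = FALSE)
  if (is.call(expr)) {
    # Recurse into the call's head and arguments.
    for (child in as.list(expr)) {
      edges <- rbind(edges, ast_edges(child, parent = id, counter = counter))
    }
  }
  edges
}
e <- quote(knit(file_input("report.Rmd"), output = file_output("report.md")))
ast_edges(e)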

@violetcereza (Contributor) commented Feb 14, 2018

Language objects in R are just nested symbols:

rlang::expr(knit(file_input(report.Rmd), file_output(report.md))) %>% View()

[screenshot: View() of the nested language object]

Couldn't you just traverse the language object, search for . == rlang::sym("file_output") in the target code, and extract that function from the parse tree while leaving its contents untouched?

This makes file_input and file_output not real functions (and up for dangerous misinterpretation by users), but it's a simple workaround.
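
A sketch of exactly that traversal (find_file_output() is a hypothetical name):

find_file_output <- function(expr) {
  if (!is.call(expr)) return(character(0))
  if (identical(expr[[1]], as.name("file_output"))) {
    # Deparse the arguments but leave them unevaluated.
    return(vapply(as.list(expr)[-1], deparse, character(1)))
  }
  unlist(lapply(as.list(expr), find_file_output))
}
find_file_output(quote(knit(file_input(report.Rmd), file_output(report.md))))

## [1] "report.md"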

@wlandau (Member) commented Feb 14, 2018

Seems like that would work in this situation. Still, I think it's a good opportunity to think about how to re-represent the syntax tree in a way that's easier to work with. There may be more complicated edge cases we want to handle in the future. Currently, when I traverse an expression, I have to check cryptic things like is.function() and is.recursive() along the way. pryr::ast() even checks is.pairlist() and keeps track of a level argument (depth of the current node in the tree). Why not just convert the whole AST into an igraph or maybe even a ggraph first? Maybe I just need to educate myself more on rlang. I am having trouble finding vignettes and tutorials on non-standard evaluation that explain how to do this sort of thing correctly.

@wlandau (Member) commented Feb 15, 2018

And then there's a whole separate walk to look for namespaced function calls.
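
That walk is simple enough on its own: a namespaced call like pkg::fun is itself a call whose head is the `::` operator, so a recursive walker can pick those out. A sketch with a hypothetical helper:

find_namespaced <- function(expr) {
  if (!is.call(expr)) return(character(0))
  if (deparse(expr[[1]]) %in% c("::", ":::")) return(deparse(expr))
  unlist(lapply(as.list(expr), find_namespaced))
}
find_namespaced(quote(knitr::knit(file_input("report.Rmd"))))

## [1] "knitr::knit"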

@wlandau (Member) commented Feb 15, 2018

Another advantage of focusing on the graph: we could

  1. Create the igraph for each imported function and workflow plan command. Not every symbol in a function body is a global object, so these graphs would need some trimming.
  2. Take the union of all the import/command graphs.

This could be a whole lot faster and more elegant than build_drake_graph(), the current way drake creates the dependency network from a workflow plan and an environment of imports.
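
A sketch of the union idea with toy target names (not drake internals):

library(igraph)
g1 <- graph_from_data_frame(data.frame(from = "raw.csv", to = "data"))
g2 <- graph_from_data_frame(data.frame(from = "data", to = "report.md"))
g  <- igraph::union(g1, g2) # vertices with the same name are merged
sort(V(g)$name)

## [1] "data"      "raw.csv"   "report.md"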

@wlandau (Member) commented Feb 15, 2018

@dapperjapper I somehow missed this explicit guidance on walking expressions. After reading it, my thinking aligns more with yours. It seems practical to

  1. Walk the expression to find namespaced functions (containing :: and :::), file inputs, file outputs, knitr source files, and dependencies in knitr code chunks. This should be more unified than how drake currently does it, and it should take into account how things are nested.
  2. Call codetools::findGlobals() to find the rest.

We might still return igraphs from the code analyses so that igraph::union() can put it all together in build_drake_graph().

@wlandau (Member) commented Feb 15, 2018

FYI: I just started an implementation in the i232-file-api branch. The internal special_dependencies() function walks through the code only once and detects namespaced functions, file inputs, file outputs, and source files passed to rmarkdown::render() and knitr::knit(). Remaining challenges:

  • Write file_input() and file_output() as exported functions with help files. They need to work in commands as-is.
  • Before drake_config() does any code analysis, preprocess workflow plans so that the file output is the target name.
  • Make special_dependencies() collect candidates for global symbols. Since quotes in file_input() and file_output() are optional, it is important to exclude some symbols before checking against codetools::findGlobals().
  • Dive into codetools::findGlobals() and knitr code chunks to complete the code analysis.
  • Return a partial igraph to build_drake_graph().
  • Take the igraph::union() of these pieces to build up the full graph.
  • Tell drake that file outputs in the graph are not imported files.
  • Repair workflow network visualizations.
  • Update all the tests and test cases.
  • Repair the tests for all the parallel backends.
  • Convert the README, examples, and vignettes to the new API.
  • Update NEWS.md.
  • Update RawGit image hashes after merging to master.
  • Update pkgdown documentation website after merging to master.

@wlandau (Member) commented Feb 15, 2018

By the way: because of special_dependencies(), you can specify multiple files in file_input() and file_output(), and quotes are totally optional. This is really exciting! cc @noamross

f <- function(){
  knitr::knit(file_input("file.Rmd"), output = file_output("file.md"))
  render(file_output(a, b, c), input = file_input(x, y, z))
}
special_dependencies(f)

## $namespaced_functions
## [1] "knitr::knit"
## 
## $file_inputs
## [1] "\"file.Rmd\"" "\"x\""        "\"y\""       
## [4] "\"z\""       
## 
## $file_outputs
## [1] "\"file.md\"" "\"a\""       "\"b\""       "\"c\""      
## 
## $knitr_sources
## [1] "\"file.Rmd\"" "\"x\""        "\"y\""       
## [4] "\"z\"" 

@wlandau (Member) commented Feb 15, 2018

Another thing: @dapperjapper, file_input() and file_output() are now actual functions that will work in-place inside commands.

file_output(my_report.html)

## [1] "my_report.html"

file_input(data.csv, 'spreadsheet.xlsx', "knitr_source.Rnw")

## [1] "data.csv"         "spreadsheet.xlsx" "knitr_source.Rnw"

@wlandau (Member) commented Feb 15, 2018

Hmm... thinking about a special knitr_input() function. Right now, drake has to do some gymnastics to guess your knitr sources, and it is easily fooled:

disguise <- function(input){
  knit(input, output = "file.md")
}
drake_plan(file.md = disguise('file.Rmd')) # old API: the knit() call is hidden

This will also make special_dependencies() more reliable.
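
Under this proposal, the user would mark the knitr source explicitly inside the command, so the guessing disappears. A sketch of the proposed API, reusing disguise() from above (target naming as in the old API):

drake_plan(file.md = disguise(knitr_input("file.Rmd")))
# knitr_input() both declares the file dependency and tells drake to scan the
# active code chunks of file.Rmd, no matter how deeply the knit() call is buried.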

@violetcereza (Contributor) commented:

Does it understand the following? I think it's fine for now if it doesn't...

"file.csv" %>% file_input() %>% read_csv()
x <- "filename"
file_output(str_c(x, ".md"))

@wlandau (Member) commented Feb 15, 2018

Good point. I was not planning on it for the first iteration, but we can definitely work on it. Kicking off this feature is already super complicated.

@wlandau (Member) commented Feb 15, 2018

Another preview, this time using knitr_input().

one_command <- quote({
  x <- y + z
  # The drake:: prefix is optional for knitr_input(), file_input(), and file_output().
  rmarkdown::render(drake::knitr_input(report.Rmd), output_format = "all")
  file_output(report.md, report.pdf, report.html) # Outputs we care about.
  ## The actual value of the target is always the command's return value.
  readRDS(file_input(data.rds))
})
special_dependencies(one_command)

## $candidate_globals
##  [1] "{"         "<-"        "x"         "+"         "y"        
##  [6] "z"         "::"        "rmarkdown" "render"    "drake"    
## [11] "\"all\""   "readRDS"  
## 
## $namespaced_functions
## [1] "rmarkdown::render"
## 
## $knitr_input
## [1] "\"report.Rmd\""
## 
## $file_output
## [1] "\"file_output\"" "\"report.md\""   "\"report.pdf\"" 
## [4] "\"report.html\""
## 
## $file_input
## [1] "\"file_input\"" "\"data.rds\""  

wlandau mentioned this issue Feb 15, 2018
@wlandau (Member) commented Feb 15, 2018

I no longer think we should create a new target for every file_output(). We don't want to launch a new HPC job just to check that a file was generated. If we communicate the file output information to conclude_build(), then we can store each file_output() with the same metadata as the concomitant target. However, we should still add file_output()s to the graph for the sake of visualization.

@wlandau (Member) commented Feb 16, 2018

File outputs are trickier than I thought. Just like any other target, we need to check if a file output has changed or needs processing. To maintain the parallelism and the checking, maybe that does mean we need to submit each file output to its own parallel job. I think that's what I'll implement as a first go-round. We can revisit these and other efficiency issues later.

@wlandau (Member) commented Feb 16, 2018

Also: internally, I think all files should be labeled with double quotes: "\"report.md\"", "\"report.Rmd\"", etc. That helps us avoid name conflicts between files and non-file targets (for example, the function read.csv() and an imported file named "read.csv"). I am using igraph attributes to help distinguish between file inputs and file outputs without too much loss of speed.
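
A minimal sketch of the labeling convention (hypothetical helpers, not drake's internal functions):

as_file_label <- function(x) paste0("\"", x, "\"")
is_file_label <- function(x) grepl("^\".*\"$", x)
as_file_label("report.md")

## [1] "\"report.md\""

is_file_label(c("\"read.csv\"", "read.csv")) # imported file vs. function

## [1]  TRUE FALSE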

@wlandau (Member) commented Feb 16, 2018

Something counterintuitive I just realized: suppose we have a plan with file inputs and outputs:

drake_plan(
  return_value = {
    contents <- read.csv(file_input(input.csv))
    write.csv(contents, file_output(output.csv))
  }
)

output.csv shares all the dependencies of return_value. That's clear enough. However, output.csv is neither upstream nor downstream of return_value. Why?

  • Not upstream: output.csv cannot be fingerprinted and cached until return_value is computed.
  • Not downstream: if output.csv changes, we need to re-run the command that produced return_value.

But I think I figured it out:

  1. Treat output.csv as upstream anyway.
  2. Trigger a build on return_value, but not output.csv.
  3. In conclude_build() for return_value, check for file outputs like output.csv and process them (sketched below).
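
In code, step 3 might look like the following sketch, with toy stand-ins for drake internals (all names hypothetical):

store_target    <- function(target, value) message("cached ", target)
store_file_hash <- function(file) message("fingerprinted ", file)
conclude_build  <- function(target, value, file_outputs) {
  store_target(target, value)  # cache the command's return value first
  for (file in file_outputs) { # then process the files the command wrote
    store_file_hash(file)
  }
}
conclude_build("return_value", value = 42, file_outputs = "output.csv")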

@wlandau (Member) commented Feb 16, 2018

I was wrong: file outputs are both upstream and downstream of the target. output.csv needs to be processed and fingerprinted beforehand to see if return_value needs building. Then after return_value is computed, output.csv needs to be fingerprinted and tracked because it was just generated. That means we should look at the file outputs twice: once during file_trigger() (which needs refactoring) and once at the very beginning of conclude_build().

Also, to make sure the command is actually run before downstream targets use the file outputs, the file outputs should lie downstream of the target / return value in the graph.

@AlexAxthelm (Collaborator) commented:

What if, when a target has a file_output() as part of the command, we reassign the command to <<target>>_temp and then generate something along the lines of:

myplan <- drake_plan(
  return_value_temp = {
    contents <- read.csv(file_input(input.csv))
    write.csv(contents, file_output(output.csv))
    contents
  },
  output.csv = file_output_of(return_value_temp),
  return_value = return_value_temp
)

where file_output_of() would signal to the parser that the target is a file, and the argument is the dependency.

[sketch: dependency graph for the rewritten plan]

Just a sketch of an idea, but I'm guessing that something like this would allow for interpreting intent and translating it into the existing paradigm.

@wlandau (Member) commented Feb 16, 2018

Are you saying we need a way to match output.csv with return_value_temp? I agree. I actually ran into that problem last night, but I think my work this morning fixed it. The only thing that points to output.csv should be return_value_temp. So we already know the file_output_of(return_value_temp) just by looking at the graph. We now have a function called concomitant_file_output() that gets the downstream adjacent files, and the metadata of return_value_temp keeps track of them.

@AlexAxthelm (Collaborator) commented:

That sounds like a great plan. My thought with the _temp target was to keep the current paradigm of a one-to-one correspondence between hashed objects (targets) and the commands that build them. Do you have an example of the graph for this example plan? Or is that not finished yet?

@AlexAxthelm (Collaborator) commented:

And to answer the actual question: yes, I was thinking that there should be a single explicit dependency on return_value_temp for the output.csv file, because we want return_value to rebuild under two conditions:

  1. output.csv changes. Because the file is a dependency of the final object, this works.
  2. input.csv (or whatever other dependencies) change. This will cause return_value_temp to rebuild, which then propagates downstream.

We want output.csv to rebuild under those same two conditions. The second is easy and built into the graph. The tricky part would be making file_output_of() mark return_value_temp as outdated whenever output.csv is outdated or missing (propagating outdated-ness upstream). I can see why this would be a problem and require some weird, fragile exceptions.

@violetcereza (Contributor) commented:

Why is all this extra complexity needed? Because there can now be more than one tracked output file per target? If this starts to get too complicated, I would be in favor of just enforcing one file output per target.

@wlandau (Member) commented Feb 16, 2018

@dapperjapper I am starting to agree. The overall concept is slowly gaining clarity, but my attempt to depart from bijectivity is not going well. Drake is fundamentally designed to have one target per command, and the i232-file-api branch has ugly code and broken build logic. On the other hand, the static code analysis is great. So I think we should roll out file_input() etc. first and then think about non-bijectivity later.

@violetcereza (Contributor) commented:

Another option would be

drake_plan(
  `file_output(target.csv)` = write_csv(data, "target.csv")
)

Clunky to write, but it's clearer what's going on.

@AlexAxthelm (Collaborator) commented:

I would be in favor of something like this, especially if it helps move away from the single/double-quote issues that I have.

@wlandau (Member) commented Feb 16, 2018

I thought about that too, but it requires repeating the target name on the left-hand side as well. Remake had a special target_name keyword for this. Both options may be a little unclear for new users who are unfamiliar with Make-like build management. For now, my favorite option is to just have drake pick the target names for you. I have a new branch now, and this is the current behavior. The touching-up happens inside sanitize(), which is called early and often to try to minimize surprises.

drake_plan(
  {
    function_inside_unnamed_command()
    x <- read.csv('input.csv') # old API
    write.csv(summarize(x), file_output(output.csv))
  },
  no_files_here(but_no_target_name_either),
  overwritten_target_name = knit(knitr_input(report.Rmd), file_output(report.md)),
  true_target_name = my_function(file_input(values.rds) * 3.14)
)

## # A tibble: 4 x 2
##   target           command                                                      
##   <chr>            <chr>                                                        
## 1 "\"output.csv\"" "{\n    function_inside_unnamed_command()\n    x <- read.csv…
## 2 drake_target_2   no_files_here(but_no_target_name_either)
## 3 "\"report.md\""  knit(knitr_input(report.Rmd), file_output(report.md))        
## 4 true_target_name my_function(file_input(values.rds) * 3.14) 

All files are still quoted (double-quoted), but this happens automatically on the backend. We avoid potential name conflicts with non-file targets that way, and it's easy to differentiate between files and non-files. I am open to suggestions about doing away with any kind of quoting, but it would require a lot of error-prone refactoring internally.

@gmbecker commented:

Hi all,

Sorry I took a bit to chime in here...

> Putting this issue together with #7, #9, and #190, I think drake needs its own custom static code analysis. The current approach is to call codetools::findGlobals() and then work around all the exceptions. That makes it hard to depart from what codetools already does, and with file_input() and file_output(), I think we hit a wall. For example, the following three workflow plan commands mean the same thing, but they present problems for drake's current approach.
>
> knit('report.Rmd', output = "report.md")                 # Current drake.
> knit(file_input("report.Rmd"), file_output("report.md")) # After this issue is solved.
> knit(file_input(report.Rmd), file_output(report.md))     # Quoting should be optional here.

My somewhat solicited advice (I was tagged in here, and I work in this space extensively) is to really, really consider how important optional quotes are to you. I can almost promise you it's not worth the loss of clarity, especially when, as in this case, the thing without quotes around it is actually a string value. It's a path; just have them put quotes around it.

That said (and again, I really, honestly think you shouldn't), you can easily allow things inside file_input() and file_output() to go without quotes via a simple custom handler for calls to those functions within the CodeDepends framework.

> Example setbacks:
>
>   • In knitr.R, the file_input() would interfere with the detection of "report.Rmd".
>   • In the last line above, codetools thinks report.Rmd is an R variable.
>
> Drake's goals for static code analysis are so different that I think we need to depart from the landscape of existing tools such as codetools, CodeDepends by @gmbecker and @duncantl, and globals by @HenrikBengtsson. Not only do we need to get all the symbols and strings from an expression, we also need to know how they are nested. I suggest we:

Sorry, I'm playing catch-up a bit here. Can you give me an example where you need the nesting and what you're using it for?

As for knowing that the stuff inside file_input() is a file: the function-handler framework I put into CodeDepends is designed explicitly to be able to do this kind of thing. codetools has the problems you mention above, but they'd be very easy to avoid in the CodeDepends framework.

>   1. Get the abstract syntax tree as a sensible data structure. A list or igraph would be ideal. It's a pity that pryr::ast() just outputs text that needs to be parsed all over again. Drake version 1 used pryr::ast(), and the text processing made it extremely prone to errors.
>   2. Rely heavily on the explicit hierarchy in the AST to detect the file outputs and the dependencies, as well as the decisions knitr and rmarkdown will make.

> Something counterintuitive I just realized: suppose we have a plan with file inputs and outputs:
>
> drake_plan(
>   return_value = {
>     contents <- read.csv(file_input(input.csv))
>     write.csv(contents, file_output(output.csv))
>   }
> )
>
> output.csv shares all the dependencies of return_value. That's clear enough. However, output.csv is neither upstream nor downstream of return_value. Why?

It's a side effect, though thinking in CodeDepends terms (i.e., inputs and outputs of an expression), you could argue that the file itself (not the symbol output.csv) is an output of that expression; then any other expressions that have it (again, the file, not the symbol) as an input would have a dependency on that expression.

That said, even though conceptually I made a very strong (and important) distinction between the symbol and the file itself, I'd have to think a bit more about how far you could get by just spoofing that symbol as an output here and an input elsewhere. You'd need to know it ultimately is a file so you could check its existence, but you mention elsewhere that that is pretty easy, and for the actual dependency modeling I don't know if that's important so long as it's handled consistently throughout the entire static-analysis pass.

@wlandau (Member) commented Feb 17, 2018

Thanks, @gmbecker.

> My somewhat solicited advice (I was tagged in here, and I work in this space extensively) is to really, really consider how important optional quotes are to you. I can almost promise you it's not worth the loss of clarity, especially when, as in this case, the thing without quotes around it is actually a string value. It's a path; just have them put quotes around it.

You have convinced me. And for full paths with "/", users will need quotes anyway.

> Sorry, I'm playing catch-up a bit here. Can you give me an example where you need the nesting and what you're using it for?

Actually, the nesting didn't turn out to be as much of a problem as I thought. Previously, I assumed we would have to detect whether a file_input() was called inside a call to knit() or render(). But we're getting around the problem and adding more robustness by just having users write knitr_input(). That tells drake whether to dive into the active code chunks of a knitr source file to look for dependencies.

@wlandau (Member) commented Feb 17, 2018

I am ready to propose an implementation! Please see #258.

wlandau pushed a commit that referenced this issue Feb 20, 2018
@wlandau (Member) commented Feb 20, 2018

Another big thanks to @krlmlr and @gmbecker for steering us in the right direction! I merged #258 into master and I am closing this issue, but let's keep talking about and refining this feature going forward.
