
Use language to mark input and output files #232

Closed · krlmlr opened this issue Feb 4, 2018 · 35 comments

@krlmlr (Collaborator) commented Feb 4, 2018

drake_plan(
  knit(file_input("report.Rmd"), output = file_output("report.md"), ...)
)

Storing commands as a list of language objects (possible in a future version) would mean we no longer need to distinguish between ' and ".

@wlandau (Member) commented Feb 11, 2018

I will be glad when we have this feature.

What should we do about target names in workflow plans? Should we automatically use the names inside file_output() calls and ignore what the user puts in the target column? If so, how do we make sure the file_output() names do not conflict with symbolic targets? To minimize refactoring, we could sanitize the workflow plan so that the strings inside file_output() become double-quoted target names in the target column. Other approaches would require more refactoring but could be better in the long run.
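
For illustration, here is a minimal sketch of that sanitization step, assuming commands are stored as strings; first_file_output() is a hypothetical helper, not drake's actual sanitizer.

first_file_output <- function(command) {
  # Hypothetical: find the first file_output("...") call in the command text.
  match <- regmatches(
    command,
    regexpr("file_output\\(['\"][^'\"]+['\"]\\)", command)
  )
  if (!length(match)) return(NA_character_)
  # Strip the wrapper and double-quote the file name for the target column.
  paste0("\"", gsub("file_output\\(['\"]|['\"]\\)", "", match), "\"")
}
first_file_output(
  "knit(file_input('report.Rmd'), output = file_output('report.md'))"
)

## [1] "\"report.md\""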

@AlexAxthelm (Collaborator) commented:

Suggestion: use a dummy target wherever there is a file_output(), and have an explicit dependency on that dummy target for the file. I don't know how to do that without breaking old plans, though.

@wlandau (Member) commented Feb 12, 2018

@AlexAxthelm, thinking more about your idea, it comes really close to working. Here's how I picture it: the user sets up the workflow plan without worrying about files in the "target" column. The symbolic LHS targets are always return values from commands, and we can reproducibly track them along with any files generated.

drake_plan(
  some_return_value = {
    knit(file_input("report.Rmd"), output = file_output("report.md"), ...)
    write.csv(some_output, file_output(some_output.csv)) # quoting should be optional in `file_*()`
  }
)

## # A tibble: 1 x 2
##   target            command                                                   
##   <chr>             <chr>                                                     
## 1 some_return_value "{\n    knit(file_input('report.Rmd'), output = file_outp…

When the dependency graph is constructed:

  1. some_return_value should depend on report.Rmd.
  2. Every file_output() in the command should depend on some_return_value and all its dependencies. (A dependency on some_return_value alone will not usually be enough.)
  3. The command of each file_output() should be a non-executing representation of the command of some_return_value.
  4. file_output()s and file_input()s of imported functions should be resolved this way too.

[screenshot: sketch of the resulting dependency graph]
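
To make rules 1 and 2 concrete, here is a rough sketch (toy code, not drake internals) of the edges those rules imply for the example plan above:

library(igraph)
edges <- data.frame(
  from = c("report.Rmd", "some_return_value", "some_return_value"),
  to   = c("some_return_value", "report.md", "some_output.csv")
)
g <- graph_from_data_frame(edges)
plot(g) # report.Rmd -> some_return_value -> {report.md, some_output.csv}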

Loose ends:

  1. We still need to differentiate between file and non-file targets somehow. This seems like the sort of thing to wrap inside drake_meta().
  2. Do we need to expand the workflow plan to account for this? I would rather take it into account when we construct the internal igraph of the workflow.

@wlandau (Member) commented Feb 12, 2018

Desired behavior for file_input() and file_output():

file_input(x, 'y', "z")

## [1] "x" "y" "z"

The option to drop quotes would be super nice to have, though it may be tricky because of how codetools::findGlobals() finds dependencies.
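
A minimal sketch of that behavior using substitute(), not drake's eventual implementation: deparse unquoted symbols and pass strings through unchanged.

file_input <- function(...) {
  args <- as.list(substitute(list(...)))[-1] # unevaluated arguments
  vapply(
    args,
    function(a) if (is.character(a)) a else deparse(a),
    character(1)
  )
}
file_input(x, 'y', "z")

## [1] "x" "y" "z"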

@wlandau (Member) commented Feb 14, 2018

Putting this issue together with #7, #9, and #190, I think drake needs its own custom static code analysis. The current approach is to call codetools::findGlobals() and then work around all the exceptions. That makes it hard to depart from what codetools already does, and with file_input() and file_output(), I think we hit a wall. For example, the following three workflow plan commands mean the same thing, but they present problems for drake's current approach.

knit('report.Rmd', output = "report.md")                 # Current `drake`.
knit(file_input("report.Rmd"), file_output("report.md")) # After this issue is solved.
knit(file_input(report.Rmd), file_output(report.md))     # Quoting should be optional here.

Example setbacks:

  • In knitr.R, the file_input() would interfere with the detection of "report.Rmd".
  • In the last line above, codetools thinks report.Rmd is an R variable.

Drake's goals for static code analysis are so different that I think we need to depart from the landscape of existing tools such as codetools, CodeDepends by @gmbecker and @duncantl, and globals by @HenrikBengtsson. Not only do we need to get all the symbols and strings from an expression, we also need to know how they are nested. I suggest we:

  1. Get the abstract syntax tree as a sensible data structure. A list or igraph would be ideal. It's a pity that pryr::ast() just outputs text that needs to be parsed all over again. Drake version 1 used pryr::ast(), and the text processing made it extremely prone to errors.
  2. Rely heavily on the explicit hierarchy in the AST to detect the file outputs and the dependencies, as well as the decisions knitr and rmarkdown will make.
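
As an illustration of step 1, a sketch (not drake's code) that flattens an expression into an edge list, which graph_from_data_frame() can turn into an igraph:

ast_edges <- function(expr, parent = "<root>", counter = new.env()) {
  if (is.null(counter$i)) counter$i <- 0
  counter$i <- counter$i + 1
  id <- paste0(counter$i, ": ", deparse(expr)[1]) # unique node label
  edges <- data.frame(from = parent, to = id, stringsAsFactors = FALSE)
  if (is.call(expr)) {
    # Recurse into the call's head and arguments.
    for (child in as.list(expr)) {
      edges <- rbind(edges, ast_edges(child, parent = id, counter = counter))
    }
  }
  edges
}
e <- quote(knit(file_input("report.Rmd"), output = file_output("report.md")))
ast_edges(e)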

@violetcereza (Contributor) commented Feb 14, 2018

Language objects in R are just nested symbols:

rlang::expr(knit(file_input(report.Rmd), file_output(report.md))) %>% View()

[screenshot: View() of the nested language object]

Couldn't you just traverse the language object, search for . == rlang::sym("file_output") in the target code, and extract that function from the parse tree while leaving its contents untouched?

This makes file_input and file_output not real functions (and up for dangerous misinterpretation by users), but it's a simple workaround.
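
A sketch of exactly that traversal (find_file_output() is a hypothetical name):

find_file_output <- function(expr) {
  if (!is.call(expr)) return(character(0))
  if (identical(expr[[1]], as.name("file_output"))) {
    # Deparse the arguments but leave them unevaluated.
    return(vapply(as.list(expr)[-1], deparse, character(1)))
  }
  unlist(lapply(as.list(expr), find_file_output))
}
find_file_output(quote(knit(file_input(report.Rmd), file_output(report.md))))

## [1] "report.md"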

@wlandau (Member) commented Feb 14, 2018

Seems like that would work in this situation. Still, I think it's a good opportunity to think about how to re-represent the syntax tree in a way that's easier to work with. There may be more complicated edge cases we want to handle in the future. Currently, when I traverse an expression, I have to check cryptic things like is.function() and is.recursive() along the way. pryr::ast() even checks is.pairlist() and keeps track of a level argument (depth of the current node in the tree). Why not just convert the whole AST into an igraph or maybe even a ggraph first? Maybe I just need to educate myself more on rlang. I am having trouble finding vignettes and tutorials on non-standard evaluation that explain how to do this sort of thing correctly.

@wlandau (Member) commented Feb 15, 2018

And then there's a whole separate walk to look for namespaced function calls.
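
That walk is simple enough on its own: a namespaced call like pkg::fun is itself a call whose head is the `::` operator, so a recursive walker can pick those out. A sketch with a hypothetical helper:

find_namespaced <- function(expr) {
  if (!is.call(expr)) return(character(0))
  if (deparse(expr[[1]]) %in% c("::", ":::")) return(deparse(expr))
  unlist(lapply(as.list(expr), find_namespaced))
}
find_namespaced(quote(knitr::knit(file_input("report.Rmd"))))

## [1] "knitr::knit"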

@wlandau (Member) commented Feb 15, 2018

Another advantage of focusing on the graph: we could

  1. Create the igraph for each imported function and workflow plan command. Not every symbol in a function body is a global object, so these graphs would need some trimming.
  2. Take the union of all the import/command graphs.

This could be a whole lot faster and more elegant than build_drake_graph(), the current way drake creates the dependency network from a workflow plan and an environment of imports.
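
A sketch of the union idea with toy target names (not drake internals):

library(igraph)
g1 <- graph_from_data_frame(data.frame(from = "raw.csv", to = "data"))
g2 <- graph_from_data_frame(data.frame(from = "data", to = "report.md"))
g  <- igraph::union(g1, g2) # vertices with the same name are merged
sort(V(g)$name)

## [1] "data"      "raw.csv"   "report.md"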

@wlandau (Member) commented Feb 15, 2018

@dapperjapper I somehow missed this explicit guidance on walking expressions. After reading it, my thinking aligns more with yours. It seems practical to

  1. Walk the expression to find namespaced functions (containing :: and :::), file inputs, file outputs, knitr source files, and dependencies in knitr code chunks. This should be more unified than how drake currently does it, and it should take into account how things are nested.
  2. Call codetools::findGlobals() to find the rest.

We might still return igraphs from the code analyses so that igraph::union() can put it all together in build_drake_graph().

@wlandau (Member) commented Feb 15, 2018

FYI: I just started an implementation in the i232-file-api branch. The internal special_dependencies() function walks through the code only once and detects namespaced functions, file inputs, file outputs, and source files passed to rmarkdown::render() and knitr::knit(). Remaining challenges:

  • Write file_input() and file_output() as exported functions with help files. They need to work in commands as-is.
  • Before drake_config() does any code analysis, preprocess workflow plans so that the file output is the target name.
  • Make special_dependencies() collect candidates for global symbols. Since quotes in file_input() and file_output() are optional, it is important to exclude some symbols before checking against codetools::findGlobals().
  • Dive into codetools::findGlobals() and knitr code chunks to complete the code analysis.
  • Return a partial igraph to build_drake_graph().
  • Take the igraph::union() of these pieces to build up the full graph.
  • Tell drake that file outputs in the graph are not imported files.
  • Repair workflow network visualizations.
  • Update all the tests and test cases.
  • Repair the tests for all the parallel backends.
  • Convert the README, examples, and vignettes to the new API.
  • Update NEWS.md.
  • Update RawGit image hashes after merging to master.
  • Update pkgdown documentation website after merging to master.

@wlandau (Member) commented Feb 15, 2018

By the way: because of special_dependencies(), you can specify multiple files in file_input() and file_output(), and quotes are totally optional. This is really exciting! cc @noamross

f <- function(){
  knitr::knit(file_input("file.Rmd"), output = file_output("file.md"))
  render(file_output(a, b, c), input = file_input(x, y, z))
}
special_dependencies(f)

## $namespaced_functions
## [1] "knitr::knit"
## 
## $file_inputs
## [1] "\"file.Rmd\"" "\"x\""        "\"y\""       
## [4] "\"z\""       
## 
## $file_outputs
## [1] "\"file.md\"" "\"a\""       "\"b\""       "\"c\""      
## 
## $knitr_sources
## [1] "\"file.Rmd\"" "\"x\""        "\"y\""       
## [4] "\"z\"" 

@wlandau (Member) commented Feb 15, 2018

Another thing: @dapperjapper, file_input() and file_output() are now actual functions that will work in-place inside commands.

file_output(my_report.html)

## [1] "my_report.html"

file_input(data.csv, 'spreadsheet.xlsx', "knitr_source.Rnw")

## [1] "data.csv"         "spreadsheet.xlsx" "knitr_source.Rnw"

@wlandau (Member) commented Feb 15, 2018

Hmm... thinking about a special knitr_input() function. Right now, drake has to do some gymnastics to guess your knitr sources, and it is easily fooled:

disguise <- function(input){
  knit(input, output = "file.md")
}
drake_plan(file.md = disguise('file.Rmd')) # old API: the knit() call is hidden

This will also make special_dependencies() more reliable.
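
Under this proposal, the user would mark the knitr source explicitly inside the command, so the guessing disappears. A sketch of the proposed API, reusing disguise() from above (target naming as in the old API):

drake_plan(file.md = disguise(knitr_input("file.Rmd")))
# knitr_input() both declares the file dependency and tells drake to scan the
# active code chunks of file.Rmd, no matter how deeply the knit() call is buried.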

@violetcereza (Contributor) commented:

Does it understand the following? I think it's fine for now if it doesn't...

"file.csv" %>% file_input() %>% read_csv()
x <- "filename"
file_output(str_c(x, ".md"))

@wlandau (Member) commented Feb 15, 2018

Good point. I was not planning on it for the first iteration, but we can definitely work on it. Kicking off this feature is already super complicated.

@wlandau (Member) commented Feb 15, 2018

Another preview, this time using knitr_input().

one_command <- quote({
  x <- y + z
  # The drake:: prefix is optional for knitr_input(), file_input(), and file_output().
  rmarkdown::render(drake::knitr_input(report.Rmd), output_format = "all")
  file_output(report.md, report.pdf, report.html) # Outputs we care about.
  ## The actual value of the target is always the command's return value.
  readRDS(file_input(data.rds))
})
special_dependencies(one_command)

## $candidate_globals
##  [1] "{"         "<-"        "x"         "+"         "y"        
##  [6] "z"         "::"        "rmarkdown" "render"    "drake"    
## [11] "\"all\""   "readRDS"  
## 
## $namespaced_functions
## [1] "rmarkdown::render"
## 
## $knitr_input
## [1] "\"report.Rmd\""
## 
## $file_output
## [1] "\"file_output\"" "\"report.md\""   "\"report.pdf\"" 
## [4] "\"report.html\""
## 
## $file_input
## [1] "\"file_input\"" "\"data.rds\""  

wlandau mentioned this issue Feb 15, 2018
@wlandau (Member) commented Feb 15, 2018

I no longer think we should create a new target for every file_output(). We don't want to launch a new HPC job just to check that a file was generated. If we communicate the file output information to conclude_build(), then we can store each file_output() with the same metadata as the concomitant target. However, we should still add file_output()s to the graph for the sake of visualization.

@wlandau (Member) commented Feb 16, 2018

File outputs are trickier than I thought. Just like any other target, we need to check if a file output has changed or needs processing. To maintain the parallelism and the checking, maybe that does mean we need to submit each file output to its own parallel job. I think that's what I'll implement as a first go-round. We can revisit these and other efficiency issues later.

@wlandau (Member) commented Feb 16, 2018

Also: internally, I think all files should be labeled with double quotes: "\"report.md\"", "\"report.Rmd\"", etc. That helps us avoid name conflicts between files and non-file targets (for example, the function read.csv() and an imported file named "read.csv"). I am using igraph attributes to help distinguish between file inputs and file outputs without too much loss of speed.
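
A minimal sketch of the labeling convention (hypothetical helpers, not drake's internal functions):

as_file_label <- function(x) paste0("\"", x, "\"")
is_file_label <- function(x) grepl("^\".*\"$", x)
as_file_label("report.md")

## [1] "\"report.md\""

is_file_label(c("\"read.csv\"", "read.csv")) # imported file vs. function

## [1]  TRUE FALSE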

@wlandau (Member) commented Feb 16, 2018

Something counterintuitive I just realized: suppose we have a plan with file inputs and outputs:

drake_plan(
  return_value = {
    contents <- read.csv(file_input(input.csv))
    write.csv(contents, file_output(output.csv))
  }
)

output.csv shares all the dependencies of return_value. That's clear enough. However, output.csv is neither upstream nor downstream of return_value. Why?

  • Not upstream: output.csv cannot be fingerprinted and cached until return_value is computed.
  • Not downstream: if output.csv changes, we need to re-run the command that produced return_value.

But I think I figured it out:

  1. Treat output.csv as upstream anyway.
  2. Trigger a build on return_value, but not output.csv.
  3. In conclude_build() for return_value, check for file outputs like output.csv and process them (sketched below).
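
In code, step 3 might look like the following sketch, with toy stand-ins for drake internals (all names hypothetical):

store_target    <- function(target, value) message("cached ", target)
store_file_hash <- function(file) message("fingerprinted ", file)
conclude_build  <- function(target, value, file_outputs) {
  store_target(target, value)  # cache the command's return value first
  for (file in file_outputs) { # then process the files the command wrote
    store_file_hash(file)
  }
}
conclude_build("return_value", value = 42, file_outputs = "output.csv")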

@wlandau (Member) commented Feb 16, 2018

I was wrong: file outputs are both upstream and downstream of the target. output.csv needs to be processed and fingerprinted beforehand to see if return_value needs building. Then after return_value is computed, output.csv needs to be fingerprinted and tracked because it was just generated. That means we should look at the file outputs twice: once during file_trigger() (which needs refactoring) and once at the very beginning of conclude_build().

Also, to make sure the command is actually run before downstream targets use the file outputs, the file outputs should lie downstream of the target / return value in the graph.

@AlexAxthelm (Collaborator) commented:

What if, when a target has a file_output() as part of the command, we reassign the command to <<target>>_temp and then generate something along the lines of:

myplan <- drake_plan(
  return_value_temp = {
    contents <- read.csv(file_input(input.csv))
    write.csv(contents, file_output(output.csv))
    contents
  },
  output.csv = file_output_of(return_value_temp),
  return_value = return_value_temp
)

where file_output_of() would signal to the parser that the target is a file, and the argument is the dependency.

[sketch: dependency graph for the rewritten plan]

Just a sketch of an idea, but I'm guessing that something like this would allow for interpreting intent and translating it into the existing paradigm.

@wlandau (Member) commented Feb 16, 2018

Are you saying we need a way to match output.csv with return_value_temp? I agree. I actually ran into that problem last night, but I think my work this morning fixed it. The only thing that points to output.csv should be return_value_temp. So we already know the file_output_of(return_value_temp) just by looking at the graph. We now have a function called concomitant_file_output() that gets the downstream adjacent files, and the metadata of return_value_temp keeps track of them.

@AlexAxthelm (Collaborator) commented:

That sounds like a great plan. My thought with the _temp target was to keep the current paradigm of a one-to-one correspondence between hashed objects (targets) and the commands that build them. Do you have an example of the graph for this example plan? Or is that not finished yet?

@AlexAxthelm (Collaborator) commented:

And to answer the actual question: yes, I was thinking that there should be a single explicit dependency on return_value_temp for the output.csv file, because we want return_value to rebuild under two conditions:

  1. output.csv changes. Because the file is a dependency of the final object, this works.
  2. input.csv (or whatever other dependencies) change. This will cause return_value_temp to rebuild, which then propagates downstream.

We want output.csv to rebuild under those same two conditions. The second is easy and built into the graph. The tricky part would be making file_output_of() mark return_value_temp as outdated whenever output.csv is outdated or missing (propagating outdated-ness upstream). I can see why this would be a problem and require some weird, fragile exceptions.

@violetcereza (Contributor) commented:

Why is all this extra complexity needed? Because there can now be more than one tracked output file per target? If this starts to get too complicated, I would be in favor of just enforcing one file output per target.

@wlandau (Member) commented Feb 16, 2018

@dapperjapper I am starting to agree. The overall concept is slowly gaining clarity, but my attempt to depart from bijectivity is not going well. Drake is fundamentally designed to have one target per command, and the i232-file-api branch has ugly code and broken build logic. On the other hand, the static code analysis is great. So I think we should roll out file_input() etc. first and then think about non-bijectivity later.

@violetcereza (Contributor) commented:

Another option would be

drake_plan(
  `file_output(target.csv)` = write_csv(data, "target.csv")
)

Clunky to write, but it's clearer what's going on.

@AlexAxthelm (Collaborator) commented:

I would be in favor of something like this, especially if it helps move away from the single/double-quote issues that I have.

@wlandau (Member) commented Feb 16, 2018

I thought about that too, but it requires repeating the target name on the left-hand side as well. Remake had a special target_name keyword for this. Both options may be a little unclear for new users who are unfamiliar with Make-like build management. For now, my favorite option is to just have drake pick the target names for you. I have a new branch now, and this is the current behavior. The touching-up happens inside sanitize(), which is called early and often to try to minimize surprises.

drake_plan(
  {
    function_inside_unnamed_command()
    x <- read.csv('input.csv') # old API
    write.csv(summarize(x), file_output(output.csv))
  },
  no_files_here(but_no_target_name_either),
  overwritten_target_name = knit(knitr_input(report.Rmd), file_output(report.md)),
  true_target_name = my_function(file_input(values.rds) * 3.14)
)

## # A tibble: 4 x 2
##   target           command                                                      
##   <chr>            <chr>                                                        
## 1 "\"output.csv\"" "{\n    function_inside_unnamed_command()\n    x <- read.csv…
## 2 drake_target_2   no_files_here(but_no_target_name_either)
## 3 "\"report.md\""  knit(knitr_input(report.Rmd), file_output(report.md))        
## 4 true_target_name my_function(file_input(values.rds) * 3.14) 

All files are still quoted (double-quoted), but this happens automatically on the backend. We avoid potential name conflicts with non-file targets that way, and it's easy to differentiate between files and non-files. I am open to suggestions about doing away with any kind of quoting, but it would require a lot of error-prone refactoring internally.

@gmbecker commented:

Hi all,

Sorry I took a bit to chime in here...

> Putting this issue together with #7, #9, and #190, I think drake needs its own custom static code analysis. The current approach is to call codetools::findGlobals() and then work around all the exceptions. That makes it hard to depart from what codetools already does, and with file_input() and file_output(), I think we hit a wall. For example, the following three workflow plan commands mean the same thing, but they present problems for drake's current approach.
>
> knit('report.Rmd', output = "report.md")                 # Current drake.
> knit(file_input("report.Rmd"), file_output("report.md")) # After this issue is solved.
> knit(file_input(report.Rmd), file_output(report.md))     # Quoting should be optional here.

My somewhat solicited advice (I was tagged in here, and I work in this space extensively) is to really, really consider how important optional quotes are to you. I can almost promise you it's not worth the loss of clarity, especially when, as in this case, the thing without quotes around it is actually a string value. It's a path; just have them put quotes around it.

That said (and again, I really, honestly think you shouldn't), you can easily allow things inside file_input() and file_output() to go without quotes via a simple custom handler for calls to those functions within the CodeDepends framework.

> Example setbacks:
>
>   • In knitr.R, the file_input() would interfere with the detection of "report.Rmd".
>   • In the last line above, codetools thinks report.Rmd is an R variable.
>
> Drake's goals for static code analysis are so different that I think we need to depart from the landscape of existing tools such as codetools, CodeDepends by @gmbecker and @duncantl, and globals by @HenrikBengtsson. Not only do we need to get all the symbols and strings from an expression, we also need to know how they are nested. I suggest we:

Sorry, I'm playing catch-up a bit here. Can you give me an example where you need the nesting and what you're using it for?

As for knowing that the stuff inside file_input() is a file: the function-handler framework I put into CodeDepends is designed explicitly to be able to do this kind of thing. codetools has the problems you mention above, but they'd be very easy to avoid in the CodeDepends framework.

>   1. Get the abstract syntax tree as a sensible data structure. A list or igraph would be ideal. It's a pity that pryr::ast() just outputs text that needs to be parsed all over again. Drake version 1 used pryr::ast(), and the text processing made it extremely prone to errors.
>   2. Rely heavily on the explicit hierarchy in the AST to detect the file outputs and the dependencies, as well as the decisions knitr and rmarkdown will make.

> Something counterintuitive I just realized: suppose we have a plan with file inputs and outputs:
>
> drake_plan(
>   return_value = {
>     contents <- read.csv(file_input(input.csv))
>     write.csv(contents, file_output(output.csv))
>   }
> )
>
> output.csv shares all the dependencies of return_value. That's clear enough. However, output.csv is neither upstream nor downstream of return_value. Why?

It's a side effect, though thinking in CodeDepends terms (i.e., inputs and outputs of an expression), you could argue that the file itself (not the symbol output.csv) is an output of that expression; then any other expressions that have it (again, the file, not the symbol) as an input would have a dependency on that expression.

That said, even though conceptually I made a very strong (and important) distinction between the symbol and the file itself, I'd have to think a bit more about how far you could get by just spoofing that symbol as an output here and an input elsewhere. You'd need to know it ultimately is a file so you could check its existence, but you mention elsewhere that that is pretty easy, and for the actual dependency modeling I don't know if that's important so long as it's handled consistently throughout the entire static-analysis pass.

@wlandau (Member) commented Feb 17, 2018

Thanks, @gmbecker.

> My somewhat solicited advice (I was tagged in here, and I work in this space extensively) is to really, really consider how important optional quotes are to you. I can almost promise you it's not worth the loss of clarity, especially when, as in this case, the thing without quotes around it is actually a string value. It's a path; just have them put quotes around it.

You have convinced me. And for full paths with "/", users will need quotes anyway.

> Sorry, I'm playing catch-up a bit here. Can you give me an example where you need the nesting and what you're using it for?

Actually, the nesting didn't turn out to be as much of a problem as I thought. Previously, I assumed we would have to detect whether a file_input() was called inside a call to knit() or render(). But we're getting around the problem and adding more robustness by just having users write knitr_input(). That tells drake whether to dive into the active code chunks of a knitr source file to look for dependencies.

@wlandau (Member) commented Feb 17, 2018

I am ready to propose an implementation! Please see #258.

wlandau pushed a commit that referenced this issue Feb 20, 2018
@wlandau (Member) commented Feb 20, 2018

Another big thanks to @krlmlr and @gmbecker for steering us in the right direction! I merged #258 into master and I am closing this issue, but let's keep talking about and refining this feature going forward.
