-
Notifications
You must be signed in to change notification settings - Fork 26
/
plans.Rmd
530 lines (406 loc) · 20.6 KB
/
plans.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
# `drake` plans {#plans}
```{r, message = FALSE, warning = FALSE, echo = FALSE}
knitr::opts_knit$set(root.dir = fs::dir_create(tempfile()))
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
options(
drake_make_menu = FALSE,
drake_clean_menu = FALSE,
warnPartialMatchArgs = FALSE,
crayon.enabled = FALSE,
readr.show_progress = FALSE,
tidyverse.quiet = TRUE
)
```
```{r, message = FALSE, warning = FALSE, echo = FALSE}
library(drake)
library(glue)
library(purrr)
library(rlang)
library(tidyverse)
invisible(drake_example("main", overwrite = TRUE))
invisible(file.copy("main/raw_data.xlsx", ".", overwrite = TRUE))
invisible(file.copy("main/report.Rmd", ".", overwrite = TRUE))
tmp <- suppressWarnings(drake_plan(x = 1, y = 2))
```
Most data analysis workflows consist of several steps, such as data cleaning, model fitting, visualization, and reporting. A `drake` plan is the high-level catalog of all these steps for a single workflow. It is the centerpiece of every `drake`-powered project, and it is always required. However, the plan is almost never the first thing we write. A typical plan rests on a foundation of carefully-crafted custom functions.
## Functions
A function is a reusable instruction that accepts some inputs and returns a single output. After we define a function once, we can easily call it any number of times.
```{r}
root_square_term <- function(l, w, h) {
half_w <- w / 2
l * sqrt(half_w ^ 2 + h ^ 2)
}
root_square_term(1, 2, 3)
root_square_term(4, 5, 6)
```
In practice, functions are vocabulary. They are concise references to complicated ideas, and they help us write instructions of ever increasing complexity.
```{r}
# right rectangular pyramid
volume_pyramid <- function(length_base, width_base, height) {
area_base <- length_base * width_base
term1 <- root_square_term(length_base, width_base, height)
term2 <- root_square_term(width_base, length_base, height)
area_base + term1 + term2
}
volume_pyramid(3, 5, 7)
```
The `root_square_term()` function is custom shorthand that makes `volume_pyramid()` easier to write and understand. `volume_pyramid()`, in turn, helps us crudely approximate the total square meters of stone eroded from the Great Pyramid of Giza ([dimensions from Wikipedia](https://en.m.wikipedia.org/wiki/Great_Pyramid_of_Giza)).
```{r}
volume_original <- volume_pyramid(230.4, 230.4, 146.5)
volume_current <- volume_pyramid(230.4, 230.4, 138.8)
volume_original - volume_current # volume eroded
```
This function-oriented code is concise and clear. Contrast it with the cumbersome mountain of [imperative](https://en.wikipedia.org/wiki/Imperative_programming) arithmetic that would have otherwise daunted us.
```{r}
# Don't try this at home!
width_original <- 230.4
length_original <- 230.4
height_original <- 146.5
# We supply the same lengths and widths,
# but we use different variable names
# to illustrate the general case.
width_current <- 230.4
length_current <- 230.4
height_current <- 138.8
area_original <- length_original * width_original
term1_original <- length_original *
sqrt((width_original / 2) ^ 2 + height_original ^ 2)
term2_original <- width_original *
sqrt((length_original / 2) ^ 2 + height_original ^ 2)
volume_original <- area_original + term1_original + term2_original
area_current <- length_current * width_current
term1_current <- length_current *
sqrt((width_current / 2) ^ 2 + height_current ^ 2)
term2_current <- width_current *
sqrt((length_current / 2) ^ 2 + height_current ^ 2)
volume_current <- area_current + term1_current + term2_current
volume_original - volume_current # volume eroded
```
Unlike [imperative](https://en.wikipedia.org/wiki/Imperative_programming) scripts, functions break down complex ideas into manageable pieces, and they gradually build up bigger and bigger pieces until an elegant solution materializes. This process of building up functions helps us think clearly, understand what we are doing, and explain our methods to others.
## Intro to plans
A `drake` plan is a data frame with columns named `target` and `command`. Each row represents a step in the workflow. Each command is a concise [expression](http://adv-r.had.co.nz/Expressions.html) that makes use of our functions, and each target is the return value of the command. (The `target` column has the *names* of the targets, not the values. These names must not conflict with the names of your functions or other global objects.)
We create plans with the `drake_plan()` function.
```{r}
plan <- drake_plan(
raw_data = readxl::read_excel(file_in("raw_data.xlsx")),
data = raw_data %>%
mutate(Ozone = replace_na(Ozone, mean(Ozone, na.rm = TRUE))),
hist = create_plot(data),
fit = lm(Ozone ~ Wind + Temp, data),
report = rmarkdown::render(
knitr_in("report.Rmd"),
output_file = file_out("report.html"),
quiet = TRUE
)
)
plan
```
The plan makes use of a custom `create_plot()` function to produce target `hist`. Functions make the plan more concise and easier to read.
```{r}
create_plot <- function(data) {
ggplot(data) +
geom_histogram(aes(x = Ozone)) +
theme_gray(24)
}
```
`drake` automatically understands the relationships among targets in the plan. It knows `data` depends on `raw_data` because the symbol `raw_data` is mentioned in the command for `data`. `drake` represents this dependency relationship with an arrow from `raw_data` to `data` in the graph.
```{r}
vis_drake_graph(plan)
```
We can write the targets in any order and `drake` still understands the dependency relationships.
```{r}
plan <- drake_plan(
raw_data = readxl::read_excel(file_in("raw_data.xlsx")),
data = raw_data %>%
mutate(Ozone = replace_na(Ozone, mean(Ozone, na.rm = TRUE))),
hist = create_plot(data),
fit = lm(Ozone ~ Wind + Temp, data),
report = rmarkdown::render(
knitr_in("report.Rmd"),
output_file = file_out("report.html"),
quiet = TRUE
)
)
vis_drake_graph(plan)
```
The `make()` function runs the correct targets in the correct order and stores the results in a hidden cache.
```{r}
library(drake)
library(glue)
library(purrr)
library(rlang)
library(tidyverse)
make(plan)
readd(hist)
```
The purpose of the plan is to identify steps we can skip in our workflow. If we change some code or data, `drake` saves time by running some steps and skipping others.
```{r}
create_plot <- function(data) {
ggplot(data) +
geom_histogram(aes(x = Ozone), binwidth = 10) + # new bin width
theme_gray(24)
}
vis_drake_graph(plan)
```
```{r}
make(plan)
readd(hist)
```
## A strategy for building up plans
Building a `drake` plan is a gradual process. You do not need to write out every single target to start with. Instead, start with just one or two targets: for example, `raw_data` in the plan above. Then, `make()` the plan and inspect the results with `readd()`. If the target's return value seems correct to you, go ahead and write another target in the plan (`data`), `make()` the bigger plan, and repeat. These repetitive `make()`s should skip previous work each time, and you will have an intuitive sense of the results as you go.
## How to choose good targets
Defining good targets is more of an art than a science, and it requires personal judgement and context specific to your use case. Generally speaking, a good target is
1. Long enough to eat up a decent chunk of runtime, and
2. Small enough that `make()` frequently skips it, and
3. Meaningful to your project, and
4. A well-behaved R object compatible with `saveRDS()`. For example, data frames behave better than database connection objects (discussions [here](https://github.com/ropensci/drake/issues/1046) and [here](https://github.com/ropensci-books/drake/issues/46)), [R6 classes](https://github.com/ropensci/drake/issues/345), and [`xgboost` matrices](https://github.com/ropensci/drake/issues/1038).
Above, "long" and "short" refer to computational runtime, not the size of the target's value. The more data you return to the targets, the more data `drake` puts in storage, and the slower your workflow becomes. If you have a large dataset, it may not be wise to copy it over several targets.
```{r}
bad_plan <- drake_plan(
raw = get_big_raw_dataset(), # We write this ourselves.
selection = select(raw, column1, column2),
filtered = filter(selection, column3 == "abc"),
analysis = my_analysis_function(filtered) # Same here.
)
```
In the above sketch, the dataset is super large, and selection and filtering are fast by comparison. It is much better to wrap up these steps in a data cleaning function and reduce the number of targets.
```{r}
munged_dataset <- function() {
get_big_raw_dataset() %>%
select(column1, column2) %>%
filter(column3 == "abc")
}
good_plan <- drake_plan(
dataset = munged_dataset(),
analysis = my_analysis_function(dataset)
)
```
## Special data formats for targets
`drake` supports custom formats for saving and loading large objects and highly specialized objects. For example, the `"fst"` and `"fst_tbl"` formats use the [`fst`](https://github.com/fstpackage/fst) package to save `data.frame` and `tibble` targets faster. Simply enclose the command and the format together with the `target()` function.
```{r, eval = FALSE}
library(drake)
n <- 1e8 # Each target is 1.6 GB in memory.
plan <- drake_plan(
data_fst = target(
data.frame(x = runif(n), y = runif(n)),
format = "fst"
),
data_old = data.frame(x = runif(n), y = runif(n))
)
make(plan)
#> target data_fst
#> target data_old
build_times(type = "build")
#> # A tibble: 2 x 4
#> target elapsed user system
#> <chr> <Duration> <Duration> <Duration>
#> 1 data_fst 13.93s 37.562s 7.954s
#> 2 data_old 184s (~3.07 minutes) 177s (~2.95 minutes) 4.157s
```
There are several formats, each with their own system requirements. These system requirements, such as the `fst` R package for the `"fst"` format, do not come pre-installed with `drake`. You will need to install them manually.
- `"file"`: Dynamic files. To use this format, simply create local files and directories yourself and then return a character vector of paths as the target's value. Then, `drake` will watch for changes to those files in subsequent calls to `make()`. This is a more flexible alternative to `file_in()` and `file_out()`, and it is compatible with dynamic branching. See <https://github.com/ropensci/drake/pull/1178> for an example.
- `"fst"`: save big data frames fast. Requires the `fst` package.
Note: this format strips non-data-frame attributes such as the
- `"fst_tbl"`: Like `"fst"`, but for `tibble` objects.
Requires the `fst` and `tibble` packages.
Strips away non-data-frame non-tibble attributes.
- `"fst_dt"`: Like `"fst"` format, but for `data.table` objects.
Requires the `fst` and `data.table` packages.
Strips away non-data-frame non-data-table attributes.
- `"diskframe"`:
Stores `disk.frame` objects, which could potentially be
larger than memory. Requires the `fst` and `disk.frame` packages.
Coerces objects to `disk.frame`s.
Note: `disk.frame` objects get moved to the `drake` cache
(a subfolder of `.drake/` for most workflows).
To ensure this data transfer is fast, it is best to
save your `disk.frame` objects to the same physical storage
drive as the `drake` cache,
`as.disk.frame(your_dataset, outdir = drake_tempfile())`.
- `"keras"`: save Keras models as HDF5 files.
Requires the `keras` package.
- `"qs"`: save any R object that can be properly serialized
with the `qs` package. Requires the `qs` package.
Uses `qsave()` and `qread()`.
Uses the default settings in `qs` version 0.20.2.
- `"rds"`: save any R object that can be properly serialized.
Requires R version >= 3.5.0 due to ALTREP.
Note: the `"rds"` format uses gzip compression, which is slow.
`"qs"` is a superior format.
## Special columns
With `target()`, you can define any kind of special column in the plan.
```{r}
drake_plan(
x = target((1 + sqrt(5)) / 2, golden = "ratio"),
y = target(pi * 3 ^ 2, area = "circle")
)
```
The following columns have special meanings, and `make()` reads and interprets them.
- `format`: already described above.
- `dynamic`: See the [chapter on dynamic branching](#dynamic).
- `transform`: Automatically processed by `drake_plan()` except for `drake_plan(transform = FALSE)`. See the [chapter on static branching](#static).
- `trigger`: rule to decide whether a target needs to run. See the [trigger chapter](#triggers) to learn more.
- `elapsed` and `cpu`: number of seconds to wait for the target to build before timing out (`elapsed` for elapsed time and `cpu` for CPU time).
- `hpc`: logical values (`TRUE`/`FALSE`/`NA`) whether to send each target to parallel workers. [Click here](https://books.ropensci.org/drake/hpc.html#selectivity) to learn more.
- `resources`: target-specific lists of resources for a computing cluster. See the advanced options in the [parallel computing](#hpc) chapter for details.
- `caching`: overrides the `caching` argument of `make()` for each target individually. Only supported in `drake` version 7.6.1.9000 and above. Possible values:
- "main": tell the main process to store the target in the cache.
- "worker": tell the HPC worker to store the target in the cache.
- NA: default to the `caching` argument of `make()`.
- `retries`: number of times to retry building a target in the event of an error.
- `seed`: For statistical reproducibility, `drake` automatically assigns a unique pseudo-random number generator (RNG) seed to each target based on the target name and the global `seed` argument to `make()`. With the `seed` column of the plan, you can override these default seeds and set your own. Any non-missing seeds in the `seed` column override `drake`'s default target seeds.
- `max_expand`: for dynamic branching only. Same as the `max_expand` argument of [make()], but on a target-by-target basis. Limits the number of sub-targets created for a given target. Only supported in `drake` >= 7.11.0.
## Static files
`drake` has special functions to declare relationships between targets and external storage on disk. [`file_in()`](https://docs.ropensci.org/drake/reference/file_in.html) is for input files and directories, [`file_out()`](https://docs.ropensci.org/drake/reference/file_out.html) is for output files and directories, and [`knitr_in()`](https://docs.ropensci.org/drake/reference/knitr_in.html) is for R Markdown reports and `knitr` source files. If you use one of these functions inline in the plan, it tells `drake` to rerun a target when a file changes (or any of the files in a directory).
All three functions appear in this plan.
```{r}
plan
```
If we break the `file_out()` file, `drake` automatically repairs it.
```{r}
unlink("report.html")
make(plan)
file.exists("report.html")
```
As for `knitr_in()`, recall what happened when we changed the `create_plot()`. Not only did `hist` rerun, `report` ran as well. Why? Because `knitr_in()` is special. It tells `drake` to look for mentions of `loadd()` and `readd()` in the code chunks. `drake` finds the targets you mention in those `loadd()` and `readd()` calls and treats them as dependencies of the report. This lets you choose to run the report either inside or outside a `drake` pipeline.
```{r}
cat(readLines("report.Rmd"), sep = "\n")
```
That is why we have an arrow from `hist` to `report` in the graph.
```{r}
vis_drake_graph(plan)
```
### URLs
`file_in()` understands URLs. If you supply a string beginning with `http://`, `https://`, or `ftp://`, `drake` watches the HTTP ETag, file size, and timestamp for changes.
```{r}
drake_plan(
external_data = download.file(file_in("http://example.com/file.zip"))
)
```
### Limitations of static files
#### Paths must be literal strings
`file_in()`, `file_out()`, and `knitr_in()` require you to mention file and directory names *explicitly*. You cannot use a variable containing the name of a file. The reason is that `drake` detects dependency relationships with [static code analysis](http://adv-r.had.co.nz/Expressions.html#ast-funs). In other words, `drake` needs to know the names of all your files ahead of time (before we start building targets in `make()`). Here is an example of a bad plan.
```{r}
prefix <- "eco_"
bad_plan <- drake_plan(
data = read_csv(file_in(paste0(prefix, "data.csv")))
)
vis_drake_graph(bad_plan)
```
```{r, include = FALSE}
file.create("eco_data.csv")
```
Instead, write this:
```{r}
good_plan <- drake_plan(
file = read_csv(file_in("eco_data.csv"))
)
vis_drake_graph(good_plan)
```
Or even the one below, which uses the `!!` ("bang-bang") tidy evaluation unquoting operator.
```{r}
prefix <- "eco_"
drake_plan(
file = read_csv(file_in(!!paste0(prefix, "data.csv")))
)
```
#### Do not use inside functions
`file_out()` and `knitr_in()` should not be used inside imported functions because `drake` does not know how to deal with functions that depend on targets. Instead of this:
```{r, eval = FALSE}
f <- function() {
render(knitr_in("report.Rmd"), output_file = file_out("report.html"))
}
plan <- drake_plan(
y = f()
)
```
Write this:
```{r, eval = FALSE}
plan <- drake_plan(
y = render(knitr_in("report.Rmd"), output_file = file_out("report.html"))
)
```
Or this:
```{r, eval = FALSE}
f <- function(input, output) {
render(input, output_file = output)
}
plan <- drake_plan(
y = f(input = knitr_in("report.Rmd"), output = file_out("report.html"))
)
```
`file_in()` can be used inside functions, but only for files that exist before you call `make()`.
#### Incompatible with dynamic branching
`file_out()` and `knitr_in()` deal with static output files, so they must not be used with [dynamic branching](#dynamic). As an alternative, consider dynamic files (described below). You can still use `file_in()`, but only for files that *all* the dynamic sub-targets depend on. (Changing a static input file dependency will invalidate all the sub-targets.)
#### Database connections
`file_in()` and friends do not help us manage database connections. If you work with a database, the most general best practice is to always [trigger](#triggers) a snapshot to make sure you have the latest data.
```{r, eval = FALSE}
plan <- drake_plan(
data = target(
get_data_from_db("my_table"), # Define yourself.
trigger = trigger(condition = TRUE) # Always runs.
),
preprocess = my_preprocessing(data) # Runs when the data change.
)
```
In specific use cases, you may be able to watch database metadata for changes, but this information is situation-specific.
```{r, eval = FALSE}
library(DBI)
# Connection objects are brittle, so they should not be targets.
# We define them up front, and we use ignore() to prevent
# drake from rerunning targets when the connection object changes.
con <- dbConnect(...)
plan <- drake_plan(
data = target(
dbReadTable(ignore(con), "my_table"), # Use ignore() for db connection objects.
trigger = trigger(change = somehow_get_db_timestamp()) # Define yourself.
),
preprocess = my_preprocessing(data) # runs when the data change
)
```
## Dynamic files
`drake` >= 7.11.0 supports dynamic files through a specialized format. With dynamic files, `drake` can watch local files without knowing them in advance. This is a more flexible alternative to `file_out()` and `file_in()`, and it is fully compatible with [dynamic branching](#dynamic).
### How to use dynamic files
1. Set format = "file" in target() within drake_plan().
2. Return the paths to local files from the target.
3. To link targets together in dependency relationships, reference target names and not literal character strings.
### Example of dynamic files
```{r}
bad_plan <- drake_plan(
upstream = target({
writeLines("one line", "my file") # Make sure the file exists.
"my file" # Must return the file path.
},
format = "file" # Necessary for dynamic files
),
downstream = readLines("my file") # Oops!
)
plot(bad_plan)
good_plan <- drake_plan(
upstream = target({
writeLines("one line", "my file") # Make sure the file exists.
"my file" # Must return the file path.
},
format = "file" # Necessary for dynamic files
),
downstream = readLines(upstream) # Use the target name.
)
plot(good_plan)
make(good_plan)
# Change how the file is generated.
good_plan <- drake_plan(
upstream = target({
writeLines("different line", "my file") # Change the file.
"my file"
},
format = "file"
),
downstream = readLines(upstream)
)
# The downstream target automatically reruns.
make(good_plan)
```
### Limitations of dynamic files
Unlike `file_in()`, dynamic files cannot handle URLs. All files and directories must have valid local paths.
## Large plans
`drake` has special interfaces to concisely define large numbers of targets. See the chapters on [static branching](#static) and [dynamic branching](#dynamic) for details.