-
Notifications
You must be signed in to change notification settings - Fork 26
/
scripts.Rmd
222 lines (170 loc) · 7.15 KB
/
scripts.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
# Script-based workflows {#scripts}
```{r, message = FALSE, warning = FALSE, echo = FALSE}
knitr::opts_knit$set(root.dir = fs::dir_create(tempfile()))
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
options(
drake_make_menu = FALSE,
drake_clean_menu = FALSE,
warnPartialMatchArgs = FALSE,
crayon.enabled = FALSE,
readr.show_progress = FALSE,
tidyverse.quiet = TRUE
)
```
```{r, message = FALSE, warning = FALSE, echo = FALSE}
library(drake)
library(tidyverse)
invisible(drake_example("script-based-workflows", overwrite = TRUE))
files <- list.files(
"script-based-workflows/R/",
pattern = "*.R",
full.names = TRUE
)
tmp <- file.copy(files, ".", recursive = TRUE)
tmp <- file.copy("script-based-workflows/raw_data.xlsx", ".")
tmp <- file.copy("script-based-workflows/report.Rmd", ".")
dir.create("data")
```
## Function-oriented workflows
`drake` works best when you write functions for data analysis. Functions break down complicated ideas into manageable pieces.
```{r}
# R/functions.R
get_data <- function(file){
readxl::read_excel(file)
}
munge_data <- function(raw_data){
raw_data %>%
mutate(Ozone = replace_na(Ozone, mean(Ozone, na.rm = TRUE)))
}
fit_model <- function(munged_data){
lm(Ozone ~ Wind + Temp, munged_data)
}
```
When we express computational steps as functions like `get_data()`, `munge_data()`, and `fit_model()`, we create special shorthand to make the rest of our code easier to read and understand.
```{r}
# R/plan.R
plan <- drake_plan(
raw_data = get_data(file_in("raw_data.xlsx")),
munged_data = munge_data(raw_data),
model = fit_model(munged_data)
)
```
This function-oriented approach is elegant, powerful, testable, scalable, and maintainable. However, it can be challenging to convert pre-existing traditional script-based analyses to function-oriented `drake`-powered workflows. This chapter describes a stopgap to retrofit `drake` to existing projects. Custom functions are still better in the long run, but the following workaround is quick and painless, and it does not require you to change your original scripts.
## Traditional and legacy workflows
It is common to express data analysis tasks as numbered scripts.
```
01_data.R
02_munge.R
03_histogram.R
04_regression.R
05_report.R
```
The numeric prefixes indicate the order in which these scripts need to run.
```{r, eval=FALSE}
# run_everything.R
source("01_data.R")
source("02_munge.R")
source("03_histogram.R")
source("04_regression.R")
source("05_report.R") # Calls rmarkdown::render() on report.Rmd.
```
## Overcoming Technical Debt
`code_to_function()` creates `drake_plan()`-ready functions from scripts like these.
```{r}
# R/functions.R
load_data <- code_to_function("01_data.R")
munge_data <- code_to_function("02_munge.R")
make_histogram <- code_to_function("03_histogram.R")
do_regression <- code_to_function("04_regression.R")
generate_report <- code_to_function("05_report.R")
```
Each function contains all the code from its corresponding script, along with a special final line to make sure we never return the same value twice.
```{r}
print(load_data)
```
## Dependencies
`drake` pays close attention to dependencies. In `drake`, a target's dependencies are the things it needs in order to build. Dependencies can include functions, files, and other targets upstream. Any time a dependency changes, the target is no longer valid. The `make()` function automatically detects when dependencies change, and it rebuilds the targets that need to rebuild.
To leverage drake's dependency-watching capabilities, we create a `drake` plan. This plan should include all the steps of the analysis, from loading the data to generating a report.
To write the plan, we plug in the functions we created from `code_to_function()`.
```{r}
simple_plan <- drake_plan(
data = load_data(),
munged_data = munge_data(),
hist = make_histogram(),
fit = do_regression(),
report = generate_report()
)
```
It's a start, but right now, `drake` has no idea which targets to run first and which need to wait for dependencies! In the following graph, there are no edges (arrows) connecting the targets!
```{r}
vis_drake_graph(simple_plan)
```
## Building the connections
Just as our original scripts had to run in a certain order, so do our targets now.
We pass targets as function arguments to express this execution order.
For example, when we write `munged_data = munge_data(data)`, we are signaling to
`drake` that the `munged_data` target depends on the function `munge_data()` and
the target `data`.
```{r}
script_based_plan <- drake_plan(
data = load_data(),
munged_data = munge_data(data),
hist = make_histogram(munged_data),
fit = do_regression(munged_data),
report = generate_report(hist, fit)
)
```
```{r}
vis_drake_graph(script_based_plan)
```
## Run the workflow
We can now run the workflow with the `make()` function. The first call to `make()` runs all the data analysis tasks we got from the scripts.
```{r}
make(script_based_plan)
```
## Keeping the results up to date
Any time we change a script, we need to run `code_to_function()` again to keep our function up to date. `drake` notices when this function changes, and `make()` reruns the updated function and the all downstream functions that rely on the output.
For example, let's fine tune our histogram. We open `03_histogram.R`, change the `binwidth` argument, and call `code_to_function("03_histogram.R")` all over again.
```{r echo = FALSE}
writeLines(
c(
"munged_data <- readRDS(\"data/munged_data.RDS\")",
"gg <- ggplot(munged_data) +",
" geom_histogram(aes(x = Ozone)) +",
" theme_gray(20)",
"ggsave(",
" filename = \"data/ozone.PNG\",",
" plot = gg,",
" width = 6,",
" height = 6",
")",
"saveRDS(gg, \"data/ozone.RDS\")"
),
"03_histogram.R"
)
```
```{r}
# We need to rerun code_to_function() to tell drake that the script changed.
make_histogram <- code_to_function("03_histogram.R")
```
Targets `hist` and `report` depend on the code we modified, so `drake` marks
those targets as outdated.
```{r message = FALSE}
outdated(script_based_plan)
vis_drake_graph(script_based_plan, targets_only = TRUE)
```
When you call `make()`, `drake` runs `make_histogram()` because the underlying
script changed, and it runs `generate_report()` because the report depends on
`hist`.
```{r}
make(script_based_plan)
```
All the targets are now up to date!
```{r}
vis_drake_graph(script_based_plan, targets_only = TRUE)
```
## Final thoughts
Countless data science workflows consist of numbered imperative scripts, and `code_to_function()` lets `drake` accommodate script-based projects too big and cumbersome to refactor.
However, for new projects, we strongly recommend that you write functions. Functions help organize your thoughts, and they improve portability, readability, and compatibility with `drake`. For a deeper discussion of functions and their
role in `drake`, consider watching the [webinar recording of the 2019-09-23 rOpenSci Community Call](https://ropensci.org/commcalls/2019-09-24).
Even old projects are sometimes pliable enough to refactor into functions, especially with the new [`Rclean`](https://github.com/provtools/rclean) package.