-
Notifications
You must be signed in to change notification settings - Fork 26
/
time.Rmd
131 lines (99 loc) · 5.68 KB
/
time.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
# Time: speed, time logging, prediction, and strategy {#time}
```{r, message = FALSE, warning = FALSE, echo = FALSE}
knitr::opts_knit$set(root.dir = fs::dir_create(tempfile()))
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.width = 6,
fig.height = 6,
fig.align = "center"
)
options(
drake_make_menu = FALSE,
drake_clean_menu = FALSE,
warnPartialMatchArgs = FALSE,
crayon.enabled = FALSE,
readr.show_progress = FALSE,
tidyverse.quiet = TRUE
)
```
## Why is `drake` so slow?
### Help us find out!
If you encounter slowness, please report it to <https://github.com/ropensci/drake/issues> and we will do our best to speed up `drake` for your use case. Please include a [reproducible example](https://github.com/tidyverse/reprex) and tell us about your operating system and version of R. In addition, flame graphs from the [`proffer`](https://github.com/r-prof/proffer) package really help us identify bottlenecks.
### Too many targets?
`make()` and friends tend to slow down if you have a huge number of targets. There are unavoidable overhead costs from storing each single target and checking if it is up to date, so please read [this advice on choosing good targets](https://books.ropensci.org/drake/plans.html#how-to-choose-good-targets) and consider dividing your work into a manageably small number of meaningful targets. [Dynamic branching](#dynamic) can also help in many cases.
### Big data?
`drake` saves the return value of each target to [on disk storage](#storage). So in addition to [dividing your work into a smaller number of targets](https://books.ropensci.org/drake/plans.html#how-to-choose-good-targets), [specialized storage formats](https://books.ropensci.org/drake/plans.html#special-data-formats-for-targets) can help speed things up. It may also be worth reflecting on how much data you really need to store. And if the cache is too big, the [storage chapter](#storage) has advice for downsizing it.
### Aggressive shortcuts
If your plan *still* needs tens of thousands of targets, you can take aggressive shortcuts to make things run faster.
```{r, eval = FALSE}
make(
plan,
verbose = 0L, # Console messages can pile up runtime.
log_progress = FALSE, # drake_progress() will be useless.
log_build_times = FALSE, # build_times() will be useless.
recoverable = FALSE, # make(recover = TRUE) cannot be used later.
history = FALSE, # drake_history() cannot be used later.
session_info = FALSE, # drake_get_session_info() cannot be used later.
lock_envir = FALSE, # See https://docs.ropensci.org/drake/reference/make.html#self-invalidation.
)
```
## Time logging
```{r, message = FALSE, warning = FALSE, echo = FALSE}
library(drake)
```
Thanks to [Jasper Clarkberg](https://github.com/dapperjapper), `drake` records how long it takes to build each target. For large projects that take hours or days to run, this feature becomes important for planning and execution.
```{r}
library(drake)
load_mtcars_example() # from https://github.com/wlandau/drake-examples/tree/main/mtcars
make(my_plan)
build_times(digits = 8) # From the cache.
## `dplyr`-style `tidyselect` commands
build_times(starts_with("coef"), digits = 8)
```
## Predict total runtime
`drake` uses these times to predict the runtime of the next `make()`. At this moment, everything is up to date in the current example, so the next `make()` should ideally take no time at all (except for preprocessing overhead).
```{r}
predict_runtime(my_plan)
```
Suppose we change a dependency to make some targets out of date. Now, the next `make()` should take longer since some targets are out of date.
```{r}
reg2 <- function(d){
d$x3 <- d$x ^ 3
lm(y ~ x3, data = d)
}
predict_runtime(my_plan)
```
And what if you plan to delete the cache and build all the targets from scratch?
```{r}
predict_runtime(my_plan, from_scratch = TRUE)
```
## Strategize your high-performance computing
Let's say you are scaling up your workflow. You just put bigger data and heavier computation in your custom code, and the next time you run `make()`, your targets will take much longer to build. In fact, you estimate that every target except for your R Markdown report will take two hours to complete. Let's write down these known times in seconds.
```{r}
known_times <- rep(7200, nrow(my_plan))
names(known_times) <- my_plan$target
known_times["report"] <- 5
known_times
```
How many parallel jobs should you use in the next `make()`? The `predict_runtime()` function can help you decide. `predict_runtime(jobs = n)` simulates persistent parallel workers and reports the estimated total runtime of `make(jobs = n)`. (See also `predict_workers()`.)
```{r}
time <- c()
for (jobs in 1:12){
time[jobs] <- predict_runtime(
my_plan,
jobs_predict = jobs,
from_scratch = TRUE,
known_times = known_times
)
}
library(ggplot2)
ggplot(data.frame(time = time / 3600, jobs = ordered(1:12), group = 1)) +
geom_line(aes(x = jobs, y = time, group = group)) +
scale_y_continuous(breaks = 0:10 * 4, limits = c(0, 29)) +
theme_gray(16) +
xlab("jobs argument of make()") +
ylab("Predicted runtime of make() (hours)")
```
We see serious potential speed gains up to 4 jobs, but beyond that point, we have to double the jobs to shave off another 2 hours. Your choice of `jobs` for `make()` ultimately depends on the runtime you can tolerate and the computing resources at your disposal.
A final note on predicting runtime: the output of `predict_runtime()` and `predict_workers()` also depends the optional `workers` column of your `drake_plan()`. If you micromanage which workers are allowed to build which targets, you may minimize reads from disk, but you could also slow down your workflow if you are not careful. See the [high-performance computing guide](#hpc) for more.