walkthrough.Rmd

# Walkthrough {#walkthrough}

```{r, message = FALSE, warning = FALSE, echo = FALSE}
knitr::opts_knit$set(root.dir = fs::dir_create(tempfile()))
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 6,
  fig.height = 6,
  fig.align = "center"
)
options(
  drake_make_menu = FALSE,
  drake_clean_menu = FALSE,
  warnPartialMatchArgs = FALSE,
  crayon.enabled = FALSE,
  readr.show_progress = FALSE,
  tidyverse.quiet = TRUE
)
```

```{r, message = FALSE, warning = FALSE, echo = FALSE}
library(drake)
library(dplyr)
library(ggplot2)
invisible(drake_example("main", overwrite = TRUE))
invisible(file.copy("main/raw_data.xlsx", ".", overwrite = TRUE))
invisible(file.copy("main/report.Rmd", ".", overwrite = TRUE))
```

A typical data analysis workflow is a sequence of data transformations. Raw data becomes tidy data, then turns into fitted models, summaries, and reports. Other analyses are usually variations of this pattern, and `drake` can easily accommodate them.

## Set the stage.

To set up a project, load your packages,

```{r}
library(drake)
library(dplyr)
library(ggplot2)
library(tidyr)
```

load your custom functions,

```{r}
create_plot <- function(data) {
  ggplot(data) +
    geom_histogram(aes(x = Ozone)) +
    theme_gray(24)
}
```

check any supporting files (optional),

```{r}
## Get the files with drake_example("main").
file.exists("raw_data.xlsx")
file.exists("report.Rmd")
```

and plan what you are going to do.

```{r}
plan <- drake_plan(
  raw_data = readxl::read_excel(file_in("raw_data.xlsx")),
  data = raw_data %>%
    mutate(Ozone = replace_na(Ozone, mean(Ozone, na.rm = TRUE))),
  hist = create_plot(data),
  fit = lm(Ozone ~ Wind + Temp, data),
  report = rmarkdown::render(
    knitr_in("report.Rmd"),
    output_file = file_out("report.html"),
    quiet = TRUE
  )
)

plan
```

Optionally, visualize your workflow to make sure you set it up correctly. The graph is interactive, so you can click, drag, hover, zoom, and explore.

```{r}
vis_drake_graph(plan)
```

## Make your results.

So far, we have just been setting the stage. Use `make()` or [`r_make()`](https://books.ropensci.org/drake/projects.html#safer-interactivity) to do the real work. Targets are built in the correct order regardless of the row order of `plan`.

```{r}
make(plan) # See also r_make().
```

Except for output files like `report.html`, your output is stored in a hidden `.drake/` folder. Reading it back is easy.

```{r}
readd(data) %>% # See also loadd().
  head()
```

The graph shows everything up to date.

```{r}
vis_drake_graph(plan) # See also r_vis_drake_graph().
```

## Go back and fix things.

You may look back on your work and see room for improvement, but it's all good! The whole point of `drake` is to help you go back and change things quickly and painlessly. For example, we forgot to give our histogram a bin width.

```{r}
readd(hist)
```

So let's fix the plotting function.

```{r}
create_plot <- function(data) {
  ggplot(data) +
    geom_histogram(aes(x = Ozone), binwidth = 10) +
    theme_gray(24)
}
```

`drake` knows which results are affected.

```{r}
vis_drake_graph(plan) # See also r_vis_drake_graph().
```

The next `make()` just builds `hist` and `report`. No point in wasting time on the data or model.

```{r}
make(plan) # See also r_make().
```

```{r}
loadd(hist)
hist
```

## History and provenance

As of version 7.5.2, `drake` tracks the history and provenance of your targets:
what you built, when you built it, how you built it, the arguments you
used in your function calls, and how to get the data back.

```{r}
history <- drake_history(analyze = TRUE)
history
```

Remarks:

- The `quiet` column appears above because one of the `drake_plan()` commands has `knit(quiet = TRUE)`.
- The `hash` column identifies all the previous versions of your targets. As long as `exists` is `TRUE`, you can recover old data.
- Advanced: if you use `make(cache_log_file = TRUE)` and put the cache log file under version control, you can match the hashes from `drake_history()` with the `git` commit history of your code.

Let's use the history to recover the oldest histogram.

```{r}
hash <- history %>%
  filter(target == "hist") %>%
  pull(hash) %>%
  head(n = 1)
cache <- drake_cache()
cache$get_value(hash)
```

## Reproducible data recovery and renaming

Remember how we made that change to our histogram? What if we want to change it back? If we revert `create_plot()`, `make(plan, recover = TRUE)` restores the original plot.

```{r}
create_plot <- function(data) {
  ggplot(data) +
    geom_histogram(aes(x = Ozone)) +
    theme_gray(24)
}

# The report still needs to run in order to restore report.html.
make(plan, recover = TRUE)

readd(hist) # old histogram
```

`drake`'s data recovery feature is another way to avoid rerunning commands. It is useful if:

- You want to revert to your old code, maybe with `git reset`.
- You accidentally `clean()`ed a target and you want to get it back.
- You want to rename an expensive target.

In version 7.5.2 and above, `make(recover = TRUE)` can salvage the values of old targets. Before building a target, `drake` checks if you have ever built something else with the same command, dependencies, seed, etc. that you have right now. If appropriate, `drake` assigns the old value to the new target instead of rerunning the command.

Caveats:

1. This feature is still experimental.
2. Recovery may not be a good idea if your external dependencies have changed a lot over time (R version, package environment, etc.).

### Undoing `clean()`

```{r}
# Is the data really gone?
clean() # garbage_collection = FALSE

# Nope!
make(plan, recover = TRUE) # The report still builds since report.md is gone.

# When was the raw data *really* first built?
diagnose(raw_data)$date
```

### Renaming

You can use recovery to rename a target. The trick is to supply the random number generator seed that `drake` used with the old target name. Also, renaming a target unavoidably invalidates downstream targets.

```{r}
# Get the old seed.
old_seed <- diagnose(data)$seed

# Now rename the data and supply the old seed.
plan <- drake_plan(
  raw_data = readxl::read_excel(file_in("raw_data.xlsx")),
  
  # Previously just named "data".
  airquality_data = target(
    raw_data %>%
      mutate(Ozone = replace_na(Ozone, mean(Ozone, na.rm = TRUE))),
    seed = !!old_seed
  ),

  # `airquality_data` will be recovered from `data`,
  # but `hist` and `fit` have changed commands,
  # so they will build from scratch.
  hist = create_plot(airquality_data),
  fit = lm(Ozone ~ Wind + Temp, airquality_data),
  report = rmarkdown::render(
    knitr_in("report.Rmd"),
    output_file = file_out("report.html"),
    quiet = TRUE
  )
)

make(plan, recover = TRUE)
```

## Try the code yourself!

Use `drake_example("main")` to download the [code files](#projects) for this example.

## Thanks

Thanks to [Kirill Müller](https://github.com/krlmlr) for originally providing this example.