forked from hadley/r4ds
-
Notifications
You must be signed in to change notification settings - Fork 1
/
workflow-scripts.Rmd
333 lines (241 loc) · 14.5 KB
/
workflow-scripts.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
# Workflow: scripts and projects {#workflow-scripts-projects}
```{r, results = "asis", echo = FALSE}
status("restructuring")
```
So far, you have used the console to run code.
That's a great place to start, but you'll find it gets cramped pretty quickly as you create more complex ggplot2 graphics and dplyr pipes.
To give yourself more room to work, it's a great idea to use the script editor.
Open it up either by clicking the File menu, and selecting New File, then R script, or using the keyboard shortcut Cmd/Ctrl + Shift + N.
Now you'll see four panes:
```{r}
#| echo: false
#| out.width: "75%"
#| fig-alt: >
#| RStudio IDE with Editor, Console, and Output highlighted.
knitr::include_graphics("diagrams/rstudio-editor.png")
```
The script editor is a great place to put code you care about.
Keep experimenting in the console, but once you have written code that works and does what you want, put it in the script editor.
RStudio will automatically save the contents of the editor when you quit RStudio, and will automatically load it when you re-open.
Nevertheless, it's a good idea to save your scripts regularly and to back them up.
## Naming files
Saving your code in a script requires creating a new file that you will need to name.
It might be tempting to name this file `code.R` or `myscript.R`, but you should think a bit harder before choosing a name for your file.
Three important principles for file naming are as follows:
1. File names should be machine readable: Avoid spaces, punctuation, symbols, and accented character. Do not rely on case sensitivity to distinguish files. Make deliberate use of delimiters.
2. File names should be human readable: Use file names that describe what is in the file.
3. File names should play well with default ordering: Start file names with numbers that allow them to be sorted in the order they get used.
Suppose you have the following files in a project folder.
run-first.R
alternative model.R
code for exploratory analysis.R
finalreport.qmd
FinalReport.qmd
fig 1.png
Figure_02.png
model_first_try.R
temp.txt
There are a variety of problems here: the files are misordered, file names contain spaces, there are two files with basically the same name but different capitalization (`finalreport` vs. `FinalReport`), and some file names don't reflect their contents (`run-first` and `temp`).
Below is an alternative way of naming and organizing the same set of files.
01-load-data.R
02-exploratory-analysis.R
03-model-approach-1.R
04-model-approach-2.R
fig-01.png
fig-02.png
notes-on-report-draft.txt
report-2022-03-20.qmd
report-2022-04-02.qmd
Numbering and descriptive names that are similarly formatted allow for a more useful organization of the R scripts.
Additionally, the figures are labelled similarly, the reports are distinguished by dates included in the file names, and `temp` is renamed to `notes-on-report-draft` to better describe its contents.
## Running code
The script editor is also a great place to build up complex ggplot2 plots or long sequences of dplyr manipulations.
The key to using the script editor effectively is to memorize one of the most important keyboard shortcuts: Cmd/Ctrl + Enter.
This executes the current R expression in the console.
For example, take the code below.
If your cursor is at █, pressing Cmd/Ctrl + Enter will run the complete command that generates `not_cancelled`.
It will also move the cursor to the next statement (beginning with `not_cancelled |>`).
That makes it easy to step through your complete script by repeatedly pressing Cmd/Ctrl + Enter.
```{r}
#| eval: false
library(dplyr)
library(nycflights13)
not_cancelled <- flights |>
filter(!is.na(dep_delay)█, !is.na(arr_delay))
not_cancelled |>
group_by(year, month, day) |>
summarize(mean = mean(dep_delay))
```
Instead of running your code expression-by-expression, you can also execute the complete script in one step: Cmd/Ctrl + Shift + S.
Doing this regularly is a great way to ensure that you've captured all the important parts of your code in the script.
I recommend that you always start your script with the packages that you need.
That way, if you share your code with others, they can easily see which packages they need to install.
Note, however, that you should never include `install.packages()` or `setwd()` in a script that you share.
It's very antisocial to change settings on someone else's computer!
When working through future chapters, I highly recommend starting in the script editor and practicing your keyboard shortcuts.
Over time, sending code to the console in this way will become so natural that you won't even think about it.
## RStudio diagnostics
The script editor will also highlight syntax errors with a red squiggly line and a cross in the sidebar:
```{r}
#| echo: false
#| out.width: NULL
#| fig-alt: >
#| Script editor with the script `x y <- 10`. A red X indicates that there is
#| syntax error. The syntax error is also highlighted with a red squiggly line.
knitr::include_graphics("screenshots/rstudio-diagnostic.png")
```
Hover over the cross to see what the problem is:
```{r}
#| echo: false
#| out.width: NULL
#| fig.alt: >
#| Script editor with the script `x y <- 10`. A red X indicates that there is
#| syntax error. The syntax error is also highlighted with a red squiggly line.
#| Hovering over the X shows a text box with the text 'unexpected token y' and
#| unexpected token <-'.
knitr::include_graphics("screenshots/rstudio-diagnostic-tip.png")
```
RStudio will also let you know about potential problems:
```{r}
#| echo: false
#| out.width: NULL
#| fig.alt: >
#| Script editor with the script `3 == NA`. A yellow exclamation park
#| indicates that there may be a potential problem. Hovering over the
#| exclamation mark shows a text box with the text 'use is.na to check
#| whether expression evaluates to NA'.
knitr::include_graphics("screenshots/rstudio-diagnostic-warn.png")
```
## Workflow: projects
One day, you will need to quit R, go do something else, and return to your analysis later.
One day, you will be working on multiple analyses simultaneously that all use R and you want to keep them separate.
One day, you will need to bring data from the outside world into R and send numerical results and figures from R back out into the world.
To handle these real life situations, you need to make two decisions:
1. What about your analysis is "real", i.e. what will you save as your lasting record of what happened?
2. Where does your analysis "live"?
## What is real?
As a beginning R user, it's OK to consider your environment (i.e. the objects listed in the environment pane) "real".
However, in the long run, you'll be much better off if you consider your R scripts as "real".
With your R scripts (and your data files), you can recreate the environment.
It's much harder to recreate your R scripts from your environment!
You'll either have to retype a lot of code from memory (inevitably, making mistakes along the way) or you'll have to carefully mine your R history.
To encourage this behavior, I highly recommend that you instruct RStudio not to preserve your workspace between sessions:
```{r}
#| echo: false
#| out.width: "75%"
#| fig.alt: >
#| RStudio preferences window where the option 'Restore .RData into workspace
#| at startup' is not checked. Also, the option 'Save workspace to .RData
#| on exit' is set to 'Never'.
knitr::include_graphics("screenshots/rstudio-workspace.png")
```
This will cause you some short-term pain, because now when you restart RStudio, it will no longer remember the results of the code that you ran last time.
But this short-term pain will save you long-term agony because it forces you to capture all important interactions in your code.
There's nothing worse than discovering three months after the fact that you've only stored the results of an important calculation in your workspace, not the calculation itself in your code.
There is a great pair of keyboard shortcuts that will work together to make sure you've captured the important parts of your code in the editor:
1. Press Cmd/Ctrl + Shift + F10 to restart RStudio.
2. Press Cmd/Ctrl + Shift + S to rerun the current script.
I use this pattern hundreds of times a week.
## Where does your analysis live?
R has a powerful notion of the **working directory**.
This is where R looks for files that you ask it to load, and where it will put any files that you ask it to save.
RStudio shows your current working directory at the top of the console:
```{r}
#| echo: false
#| out.width: "50%"
#| fig-alt: >
#| The Console tab shows the current working directory as
#| '~/Documents/r4ds/r4ds'.
knitr::include_graphics("screenshots/rstudio-wd.png")
```
And you can print this out in R code by running `getwd()`:
```{r}
#| eval: false
getwd()
#> [1] "/Users/hadley/Documents/r4ds/r4ds"
```
As a beginning R user, it's OK to let your home directory, documents directory, or any other weird directory on your computer be R's working directory.
But you're six chapters into this book, and you're no longer a rank beginner.
Very soon now you should evolve to organizing your analytical projects into directories and, when working on a project, setting R's working directory to the associated directory.
**I do not recommend it**, but you can also set the working directory from within R:
```{r}
#| eval: false
setwd("/path/to/my/CoolProject")
```
But you should never do this because there's a better way; a way that also puts you on the path to managing your R work like an expert.
## Paths and directories
Paths and directories are a little complicated because there are two basic styles of paths: Mac/Linux and Windows.
There are three chief ways in which they differ:
1. The most important difference is how you separate the components of the path.
Mac and Linux uses slashes (e.g. `plots/diamonds.pdf`) and Windows uses backslashes (e.g. `plots\diamonds.pdf`).
R can work with either type (no matter what platform you're currently using), but unfortunately, backslashes mean something special to R, and to get a single backslash in the path, you need to type two backslashes!
That makes life frustrating, so I recommend always using the Linux/Mac style with forward slashes.
2. Absolute paths (i.e. paths that point to the same place regardless of your working directory) look different.
In Windows they start with a drive letter (e.g. `C:`) or two backslashes (e.g. `\\servername`) and in Mac/Linux they start with a slash "/" (e.g. `/users/hadley`).
You should **never** use absolute paths in your scripts, because they hinder sharing: no one else will have exactly the same directory configuration as you.
3. The last minor difference is the place that `~` points to.
`~` is a convenient shortcut to your home directory.
Windows doesn't really have the notion of a home directory, so it instead points to your documents directory.
## RStudio projects
R experts keep all the files associated with a given project together --- input data, R scripts, analytical results, and figures.
This is such a wise and common practice that RStudio has built-in support for this via **projects**.
Let's make a project for you to use while you're working through the rest of this book.
Click File \> New Project, then:
```{r}
#| echo: false
#| out.width: "50%"
#| fig-alt: >
#| There are three screenshots of the New Project menu. In the first screenshot,
#| the `Create Project` window is shown and 'New Directory' is selected.
#| In the second screenshot, the `Project Type` window is shown and
#| 'Empty Project' is selected. In the third screenshot, the 'Create New Project'
#| window is shown and the directory name is given as 'r4ds' and the project
#| is being created as subdirectory of the Desktop.
knitr::include_graphics("screenshots/rstudio-project-1.png")
knitr::include_graphics("screenshots/rstudio-project-2.png")
knitr::include_graphics("screenshots/rstudio-project-3.png")
```
Call your project `r4ds` and think carefully about which *subdirectory* you put the project in.
If you don't store it somewhere sensible, it will be hard to find it in the future!
Once this process is complete, you'll get a new RStudio project just for this book.
Check that the "home" directory of your project is the current working directory:
```{r}
#| eval: false
getwd()
#> [1] /Users/hadley/Documents/r4ds/r4ds
```
Whenever you refer to a file using a relative path, R will look for it here.
Now enter the following commands in the script editor, and save the file, calling it "diamonds.R".
Next, run the complete script which will save a PDF and CSV file into your project directory.
Don't worry about the details, you'll learn them later in the book.
```{r}
#| label: toy-line
#| eval: false
library(tidyverse)
ggplot(diamonds, aes(carat, price)) +
geom_hex()
ggsave("diamonds.pdf")
write_csv(diamonds, "diamonds.csv")
```
Quit RStudio.
Inspect the folder associated with your project --- notice the `.Rproj` file.
Double-click that file to re-open the project.
Notice you get back to where you left off: it's the same working directory and command history, and all the files you were working on are still open.
Because you followed my instructions above, you will, however, have a completely fresh environment, guaranteeing that you're starting with a clean slate.
In your favorite OS-specific way, search your computer for `diamonds.pdf` and you will find the PDF (no surprise) but *also the script that created it* (`diamonds.R`).
This is a huge win!
One day, you will want to remake a figure or just understand where it came from.
If you rigorously save figures to files **with R code** and never with the mouse or the clipboard, you will be able to reproduce old work with ease!
## Summary
In summary, RStudio projects give you a solid workflow that will serve you well in the future:
- Create an RStudio project for each data analysis project.
- Keep data files there; we'll talk about loading them into R in \@ref(data-import).
- Keep scripts there; edit them, run them in bits or as a whole.
- Save your outputs (plots and cleaned data) there.
- Only ever use relative paths, not absolute paths.
Everything you need is in one place and cleanly separated from all the other projects that you are working on.
## Exercises
1. Go to the RStudio Tips Twitter account, <https://twitter.com/rstudiotips> and find one tip that looks interesting.
Practice using it!
2. What other common mistakes will RStudio diagnostics report?
Read <https://support.rstudio.com/hc/en-us/articles/205753617-Code-Diagnostics> to find out.