-
Notifications
You must be signed in to change notification settings - Fork 1
/
L06-Computing.qmd
376 lines (251 loc) · 22.4 KB
/
L06-Computing.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
# Computing with functions and arguments {#sec-data-computing}
```{r include=FALSE}
source("../_startup.R")
set_chapter(6)
```
You have already seen the R command patterns that we will use throughout these *Lessons*: pipelines composed of actions separated by `|>`, names for data frames, functions and their arguments. This Lesson recapitulates and explains those patterns, the better to help you construct your own commands. The Lesson also emphasizes a small technical vocabulary that helps in communicating with other people to share insights and identify errors.
Learning command patterns is a powerful enabler in statistical thinking. The computer is an indispensable tool for statistical thinking. In addition to doing the work of calculations, computer software is a medium for describing and sharing with others statistical techniques and concepts. For instance, three fundamental operations in statistics---randomize, repeat, collect---are easily expressed in computer languages but have no counterpart in the algebraic notation traditionally uused in mathematics education.
## Chain of operations
A typical computing task consists of a chain of operations. For instance, each wrangling operation receives a data-frame object from an incoming pipe `|>`, operates on that data frame to perform the action described by the function's name and arguments, and produces another data frame as output. Depending on the overall task, the output from the operation may be piped into another action or displayed on the screen.
There are also operations, like `point_plot()`, that translate a data frame into another kind of object: a *graphic*. Starting with Lesson [-@sec-regression], we will work with `model_train()`, a function that translates a data frame into a *model*. In this Lesson, we will meet two other ways of dealing with the output of a chain of operations: storing the output under a name for later use and formatting a data frame into a table suited to human readers.
Manufacturing processes provide a helpful analogy for understanding the step-by-step structure of a computation. Simple manufacturing processes might involve one or a handful of work steps arranged in a chain. Complex operations can involve many chains coming together in elaborate configurations. The video in @fig-pencil-video shows the steps of pencil manufacturing. These involves several inputs:
::: {#fig-pencil-video .column-margin}
{{< video https://www.youtube.com/embed/qqs3fxfmWr4
title=""
start="116"
aspect-ratio="17x11"
>}}
Manufacturing a pencil, step by step.
:::
The overall manufacturing process takes several inputs that are shaped and combined to produce the pencil: cedar wood slabs, glue, graphite, enamel paint Each input is processed in a step-by-step manner. At some steps, two partially processed components are combined. For instance, there is a step that grooves the cedar slabs (which are sourced from another production line). The next step put glue in the groves. In still another step, the graphite rods (which come from their own production process) are placed into the glue-filled groves. (Lesson [-@sec-databases] introduces the data-wrangling process, join, that combines two inputs.)
There are several different forms of conveyors in the pencil manufacturing line that carry the materials from one manufacturing step to the next. We need only one type of conveyor---the **pipe**---to connect computing steps.
Manufacturing processes often involve storage or delivery. The video in @fig-pencil-video ends before the final steps in the process: boxing the pencils, warehousing them, and the parts of the chain that deliver them to the consumer end-user.Storage, retrieval, and customer use all have their counterparts in computing processes. By default, the object produced by the computing chain is directly delivered to the customer, here by displaying it in some appropriate place, for instance directly under the computer command, or in a viewing panel or document:
```{r}
Nats |>
mutate(GDPpercap = GDP / pop) |>
filter(GDPpercap > mean(GDPpercap), .by = year)
```
In our computer notation, the storage operation can look like this:
```{r}
Nats |>
mutate(GDPpercap = GDP / pop) |>
filter(GDPpercap > mean(GDPpercap), .by=year) -> High_income_countries
```
At the very end of the pipeline chain, there is a **storage arrow** (`->`, as opposed to the pipe, `|>`) followed by a **storage name** (`High_income_countries`). The effect is to place the object at the output end of the chain to be stored in computer memory in a location identified by the storage name.
`r why_say("sec-storage-arrow")`
Retrieval from storage is even simpler: just use the storage name as an input. For instance:
```{r}
High_income_countries |>
select(-GDP, -pop) |>
filter(year == 2020) |>
kable(digits=2)
```
::: {.callout-note}
## Pointing out storage from the start
In a previous example we placed the storage arrow (`->`) at the end of a left-to-right chain of operations. In practice, programmers and authors prefer another arrangement---which we will use from now on in these *Lessons*---where the storage arrow is at the left end of the chain. The storage arrow still points to the storage name. For instance,
```{r}
High_income_countries <- Nats |>
mutate(GDPpercap = GDP / pop) |>
filter(GDPpercap > mean(GDPpercap), .by=year)
```
Using this `storage_name <-` idiom it is easier to scan code for storage names and to spot when the output of the chain is to be delivered directly to the customer.
:::
## What's in a pipe?
The pipe---that is, `|>`---carries material from one operation to another. In computer-speak, the word "**object**" describes this material. That is, pipes convey objects.
Objects come in different "**types**." Computer programmers learn to deal with dozens of object types. Fortunately, we can accomplish what we need in statistical computing with just a handful. You have already met two types:
1. data frames
2. graphics, consisting of one or more **layers**, e.g. the point plot as one layer and the annotations as another layer placed on top.
In later lessons, we will introduce two more types---(3) models and (4) simulations.
## Pipes connect to functions
At the receiving end of a pipe is an operation on the object conveyed by the pipe. A better word for such an operation is "**function**." It is easy to spot the functions in a pipeline: they **always** consist of a name---such as `summarize` or `point_plot`---followed directly by `(` and, eventually, a closing `)`. For example, in
::: {#lst-GDP-mean}
```{r results="hide"}
Nats |>
mutate(GDPpercap = GDP / pop) |>
filter(GDPpercap > mean(GDPpercap), .by = year)
```
:::
the first function is named `mutate`. The function output is being piped to a second function, named `filter`. From now on, whenever we name a function we will write the name followed by `()` to remind the reader that the name refers to a function: so `mutate()` and `filter()`. There are other things that names can refer to. For instance, `Nats` at the start of the pipeline is a data frame, and `GDP`, `GDPpercap` and `pop`, and `year` are *variables*. Such names for non-functions are **never** followed directly by `(`.
{{< include LearningChecks/L06/LC06-01.qmd >}}
## Arguments (inside the parentheses)
Almost always when using a function the human writer of a computer expression needs to specify some details of how the function is to work. These details are **always** put inside the parentheses following the name of the function. To illustrate, consider the task of plotting the data in the `SAT` data frame. The *skeleton* of the computer command is
::: {lst-sat-point-error}
```{webr-r}
SAT |> point_plot()
```
:::
This skeleton is not a complete command, as becomes evident when the (incomplete) command is evaluated:
What's missing from the erroneous command is a detail needed to complete the operation. This missing detail is what variables from `SAT` to map to y and x. This detail should be provided to `point_plot()` as an argument. As you saw in Lesson [-@sec-point-plots], the argument is written as a **tilde expression**, for instance `sat ~ frac` to map `sat` to y and `frac` to x. Once we have constructed the appropriate argument for the task at hand, we place it inside the parentheses that follow the function name.
{{< include LearningChecks/L06/LC06-02.qmd >}}
Many functions have more than one argument. Some arguments, like the tilde expression argument to `point_plot()`, may be required. When an argument is not required, the argument itself is given a name and it will have a **default value**. In the case of `point_plot()`, there is a second argument named `annot=` to specify what kind of annotation layer to add on top of the point plot. The default value of `annot=` turns off the annotation layer.
**Named arguments**, like `annot=`, will **always** be followed by a single equal sign, followed by the value to which that argument is to be set. For instance, `point_plot()` allows four different values for `annot=`:
i. the default (which turns off the annotation)
ii. `annot = "violin"` specifying a density display annotation
iii. `annot = "model"` which annotates with a graph of a model
iv. `annot = "bw"` which creates a traditional "box-and-whiskers" display of distribution. (We will not use such box-and-whiskers annotations in these Lessons, preferring violins instead. Still, they are often seen in practice.)
In these Lessons, the single `=` sign always signifies a named argument.
A closely related use for `=` is to give a name to a calculated result from `mutate()` or `summarize()`. For instance, suppose you want to calculate the mean sat score and mean fraction in the `SAT` data frame. This is easy:
```{r results=asis()}
SAT |> summarize(mean(sat), mean(frac))
```
We will often use this *unnamed* style when the results are intended for the human reader. But if such a calculation is to be fed down the pipeline to further calculations, it can be helpful to give simple names to the result. Frivolously, we'll illustrate using the names `eel` and `fish`:
```{webr-r}
SAT |> summarize(eel = mean(sat), fish = mean(frac))
```
The reason for the frivolity here is to point out that *you* get to choose the names for the results calculated by `mutate()` and `summarize()`. Needless to say, it's best to avoid frivolous or misleading names.
{{< include LearningChecks/L06/LC06-03.qmd >}}
## Variable names in arguments
Many of the functions we use are on the receiving end of a pipe carrying a data frame. Examples, perhaps already familiar to you: `filter()`, `point_plot()`, `mutate()`, and so on.
A good analogy for a data frame is a shipping box. Inside the shipping box: one or more variables. When a function receives the ~~shipping box~~ data frame, it opens it, providing access to each variable contained therein. In constructing arguments to the function, you do not have to think about the box, just the contents. You refer to the contents only by their names. `select()` provides a good example, since each argument can be simply the name of a variable, e.g.
For most uses, the arguments to a function will be an expressions constructed out of variable names. Some examples:
- `SAT |> filter(frac > 50)` where the argument checks whether each value of `frac` is greater than 50.
- `SAT |> mutate(efficiency = sat / expend)` where the argument gives a name (`efficiency`) to an arithmetic combination of `sat` and `expend`.
- `SAT |> point_plot(frac ~ expend)` where the argument to `point_plot()` is an expression involving both `frac` and `expend`.
- `SAT |> filter(expend > median(expend))` where the argument involves calculating the median expenditure across the state using the `median()` reduction function, then comparing the calculated median to the actual expenditure in each state. The overall effect is to remove any state with a below-median expenditure from the output of `filter()`.
- `SAT |> select(-state, -frac)` uses the `-` sign to exclude the variables from the output.
{{< include LearningChecks/L06/LC06-04.qmd >}}
## Exercises
```{r eval=FALSE, echo=FALSE}
# need to run this in console after each change
emit_exercise_markup(
paste0("../LSTexercises/06-Computing/",
c("Q06-101.Rmd",
"Q06-102.Rmd",
"Q06-103.Rmd",
"Q06-104.Rmd",
"Q06-105.Rmd")),
outname = "L06-exercise-markup.txt"
)
```
{{< include L06-exercise-markup.txt >}}
<!-- Still in draft
"Q06-106",
"Q06-107",
"Q06-108",
"Q06-109",
"Q06-110",
"../LSTexercises/DataComputing/Z-bee-fight-ship.qmd",
"../LSTexercises/DataComputing/Z-maple-ring-lamp.qmd",
"../LSTexercises/DataComputing/Z-buck-hang-window.qmd",
"../LSTexercises/DataComputing/Z-falcon-catch-room.qmd",
"../LSTexercises/DataComputing/Z-oak-say-painting.qmd",
"../LSTexercises/DataComputing/Z-lamb-bet-pencil.qmd",
"../LSTexercises/DataComputing/Z-cheetah-spend-chair.qmd",
-->
-----
## Enrichment topics
::: {.callout-note collapse="true" #enr-reading-data-from-web}
## Reading data from the web
Using your web browser, open this link in a new tab: <https://www.mosaic-web.org/go/datasets/engines.csv>.
Depending on how your browser is set up, you will either be directed to a web page showing a data frame about engines **or** the browser will download a file named "engines.csv" onto your computer.
The `.csv` suffix on the file name indicates that the file stored at the address <https://www.mosaic-web.org/go/datasets/engines.csv> is in a format called "comma separated values." The CSV format is a common way to store spreadsheet files.
In these *Lessons* most data frames will be accessed in a single step, by name. However, in professional work, data is stored in computer files or on the interweb. For such data, two steps are needed to access the data from within R.
Step 1. Read the file into R, translate the contents into the native R format for data frames, and store the data frame under a name. For a CSV file, an appropriate R function to read and translate the file is `readr::read_csv()`. As an argument to the function, give the address of the file, making sure to enclose the address in quotation marks: `"https://www.mosaic-web.org/go/datasets/engines.csv"`. This will cause `readr::read_csv()` to access the web address, then copy and translate the contents into an R format format for data frames. Use the *storage arrow* `<-` to store the data frame under the name `Engines`.
Step 2. Use the storage name, in this example `Engines`, to access the data frame from within R.
```{webr-r}
```
Your task: Read in the "engines.csv" file to R as a data frame, storing it as `Engines`. Then use `nrow()` to calculate the number of rows in the data frame. In addition, use `names` to see the variable names.
:::
::: {.callout-note collapse="true" #enr-quotation-marks}
## Quotation marks
Sometimes, you will see an argument written as letters and numbers *inside* quotation marks, as in `annot = "model"`. The quotation marks instruct the computer to take the contents literally instead of pointing to a function or a variable. (In computer terminology, the content of the quotation marks is called a **character string**.)
The style of R commands does not use quotations around the names of objects, functions, and variables are **not** placed in quotations. When you see quotation marks in an example in these Lessons, take note. They are needed, for instance, in saying what kind of annotation should be drawn by `point_plot()`. If you forget to use the quotation marks where they are needed, the computer will signal an error. Try it!
```{webr-r}
Penguins |> point_plot(bill_length ~ flipper, annot = model)
```
The error message is terse, but it gives hints; for example, `'arg'` suggests the error is about an argument, `annot` is the name of the problematic argument, and `character` is meant to point you to some issue involving character strings.
:::
::: {.callout-note collapse=true #enr-styling-with-space}
## Styling with space
Written English uses space to separate words. It is helpful to the human reader to follow analogous forms in R commands.
- Use spaces around storage arrows and pipes: `x <- 7 |> sqrt()` reads better than `x<-7|>sqrt()`.
- Use spaces between an argument name and its value: `mutate(percap = GDP / pop)` rather than `mutate(percap=GDP/pop)`.
- When writing long pipelines, put a newline after the pipe symbol. You can see several instances of this in previous examples in this Lesson. DO NOT, however, start a line with a pipe symbol.
:::
::: {.callout-note collapse="true" #enr-imperatives}
## Imperatives and objects
In English, a sentence like "Walk the dog!" is an imperative, a command. Similarly, in R, commands are always imperatives. The English imperative sentence, "Jane, walk the dog!" directs the imperative to a particular actor, namely Jane. The R imperative is always directed to "the computer," as in, "Computer, select the `country` and `GDP` columns for the output."
::: {#fig-star-trek-scotty .column-margin}
![Star Trek IV](www/scotty-computer.png)]
[A scene from *Star Trek IV: The Voyage Home* (1986). Click link to play movie clip.](https://www.youtube.com/embed/hShY6xZWVGE?si=bitLcj6fhMxUiO02)
:::
"Walk the dog!" has both a verb ("walk") and a noun ("the dog"). The noun in such an imperative is the **object** of the verb; the entity that the action (walk) is to be applied to.
R structures sentences/commands differently. Every sentence is a command. The actor is always the computer, there's no reason to state that explicitly. So the imperative in R looks like this:
`the_dog |> walk()`
In word order, the object of the action *preceeds* the action. In data-wrangling commands, the object is always a data frame.
Now a little about arguments .... "Walk the dog!" doesn't specify an important detail: Who is to hold the leash? An argument can fill in this detail:
`the_dog` |> walk("Carlos")`
:::
:::: {.callout-note collapse=true #enr-why-storage}
## Instructor note: "Storage arrow"
Calling the `<-` token the "*storage arrow*" is unconventional. Those experienced with computing know that the act of giving a computer object a named storage location is called "assignment." From the student's point of view, however, "assignment" has many meanings which have nothing to do with computer storage. For instance, in many courses students are obliged to hand in their work at regular intervals: "assignments." Synonyms for "assignment" are "task," "duty," "job," and "chore."
::::
:::: {.callout-note #enr-displaying-tables collapse=true}
## Displaying tables
::: {.callout-warning}
Kable() won't work meaningfully in webr-r. So do we want to include this:
:::
We are using the word "**table**" to refer specifically to a printed display intended for a human reader, as opposed to data frames which, although often readable, are oriented around computer memory.
The readability of tabular content goes beyond placing the content in neatly aligned columns and rows to include the issue of the number of "significant digits" to present. All of the functions we use for statistical computations make use of internal hardware that deals with numbers to a precision of fifteen digits. Such precision is warranted for internal calculations, which often build on one another. But fifteen digits is much more than can be readily assimilated by the human reader. To see why, let's display calculate yearly GDP growth (in percent) with all the digits that are carried along in internal calculations:
```{r results="hide"}
Growth_rate <- Nats |>
pivot_wider(country,
values_from = c(GDP, pop),
names_from = year) |>
mutate(yearly_growth =
100.*((GDP_2020 / GDP_1950)^(1/70.)-1)) |>
select(country, yearly_growth)
Growth_rate
```
```{r echo=FALSE}
Growth_rate |> mutate(yearly_growth = as.character(signif(yearly_growth, digits=15)))
```
GDP, like many quantities, can be measured only approximately. It would be generous to ascribe a precision of about 1 part in 100 to GDP. Informally, this suggests that only the first two or three digits of a calculation based on GDP can have any real meaning.
The problem of significant digits has two parts: 1) how many digits are worth displaying [We will take a statistical view of the appropriate number of digits to show in @sec-confidence-intervals.]{.aside} and 2) how to instruct the computer to display only that number of digits. Point (1) often depends on expert knowledge of a field. Point (2) is much more straightforward; use a computer function that controls the number of digits printed. There are many such functions. For simplicity, we focus on one widely used in the R community, `kable()`.
The purpose of `kable()` can be described in plain English: to format tabular output for the human reader. Whenever encountering a new function, you will want to find out what are the *inputs* and what is the *output*. The primary input to `kable()` is a data frame. Additional arguments, if any, specify details of the formatting, such as the number of digits to show. For instance:
`r originally <- options(digits=15)`
```{r}
Growth_rate |>
kable(digits = 1,
caption = "Annual growth in GDP from 1950 to 2020",
col.names = c("", "Growth rate (%)"))
```
`r options(originally)`
The output of `kable()`, perhaps surprisingly, is **not** a data frame. Instead, the output is instructions intended for the display's typesetting facility. The typesetting instructions for web-browsers are often written in a special-purpose language called HTML. So far as these *Lessons* are concerned, is not important that you understand the HTML instructions. Even so, we show them to you to emphasize an important point: You can't use the output of `kable()` as the input to data-wrangling or graphics operation.
```{html}
<table>
<caption>Annual growth in GDP from 1950 to 2020</caption>
<thead>
<tr>
<th style="text-align:left;"> </th>
<th style="text-align:right;"> Growth rate (%) </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left;"> Korea </td>
<td style="text-align:right;"> 3.1 </td>
</tr>
<tr>
<td style="text-align:left;"> Cuba </td>
<td style="text-align:right;"> 0.4 </td>
</tr>
<tr>
<td style="text-align:left;"> France </td>
<td style="text-align:right;"> 2.3 </td>
</tr>
<tr>
<td style="text-align:left;"> India </td>
<td style="text-align:right;"> 1.9 </td>
</tr>
</tbody>
</table>
```
::::
::::: {.callout-note collapse="true" #enr-computing-history}
## Statistics at the origins of computing
::: {.callout-warning}
## Under construction
Calculators, Hollerith cards, Fisher's quote, MCMC, machine learning.
:::
:::::