-
Notifications
You must be signed in to change notification settings - Fork 1
/
L09-Accounting-for-variation.qmd
316 lines (223 loc) · 16.2 KB
/
L09-Accounting-for-variation.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
# Accounting for variation {#sec-accounting-for-variation}
```{r include=FALSE}
source("../_startup.R")
set_chapter(8)
```
Lesson [-@sec-variation] listed three foundations for statistical thinking and provided instruction on the first one:
1. How to measure the amount of variation in a variable
In this Lesson, we will take on two and three:
2. Accounting for variation in a response variable.
3. Measuring what remains unaccounted for.
In business, accounting keeps track of money and value: assets, receipts, expenditures, etc. Likewise, in statistics, accounting for variation is the process of keeping track of variation. At its most basic, accounting for variation splits a variable into components: those parts which can be attributed to one or more other variables and the remaining (or `r glossary_term("residual")`) which cannot be so attributed.
The variable being split up is called the "**response variable**." The human modeler chooses a response variable, depending on her purpose for undertaking the work. The modeler also chooses *explanatory variables* that may account for some of the variability in the response variable. There is almost always something left over, variation that cannot be attributed to the explanatory variables. This left-over part is called the "**residual**."
To illustrate, we return to a historical statistical landmark, the data collected in the 1880s by Francis Galton on the heights of parents and their full-grown children. In that era, there was much excitement, uncertainty, and controversy about Charles Darwin's *theory of evolution*.
<!-- [Remarkably, Galton was Darwin's cousin.]{.aside} Today, we understand the role of DNA in encoding genetic information. But in Galton's day, the concept of "gene" was unknown. -->
Everyone knows that height varies from person to person. How much variation in height is there? We quantify variation using the *variance*. `var()` will do the calculation. The units of the output will be in *square-inches*, since `height` itself is measured in inches.
```{webr-r}
Galton |>
summarize(var(height))
```
{{< include LearningChecks/L09/LC09-01.qmd >}}
<!--
SOME MORE STUFF
[Each specimen is a full-grown child in the `Galton` data frame. The variables recorded are the `height` of the child in inches, the heights of the `mother` and `father`, and the `sex` of the child according to the conventions of Victorian England, where the data were collected.]{.aside}
AND MORE STILL
-->
Galton's overall project was to quantify the heritability of traits (such as height) from parents to offspring. Darwin had famously measured the beaks of finches on the Galapagos Islands. Galton, in London, worked with the most readily available local species: humans. Height is easy and socially acceptable to measure, a good candidate for data collection. [Galton *invented* the methods we are going to use in this example.]{.aside}
Galton sought to divide the height variation in the children into two parts: that which could be attributed to the parents and that which remained unaccounted for, which we call the "**residual**." We will start with a mathematically more straightforward project: dividing the variation in the children's heights into that part associated with the child's sex and the residual.
@lst-galton-mean shows how to calculate the average `height` for each `sex` using the techniques from Lesson [-@sec-wrangling]. The line numbers in the code listing explain each of the components of the command.
::: {#lst-galton-mean}
```{r digits=4, results = "hide"}
Galton |>
summarize( # <1>
modval = mean(height), # <2>
.by = sex # <3>
)
```
1. `summarize()` reduces many rows to a single row.
2. Calculate the mean of `height` and name the corresponding variable in the output `modval`.
3. Break down the calculation in (2) group by group.
:::
To see the results of the command, run it!
```{webr-r}
Galton |>
summarize(modval = mean(height), .by = sex)
```
Note that the output has one row for each `sex`. As always, the output from summarize is set by the arguments to the function. Here, there will be two variables in the output: `sex` and the mean height for each sex, which we're calling `modval`, short for **model value**.
This simple report indicates that the sexes differ, on average, by about 5 inches in height. But this report offers no insight into how much of the person-to-person variation in height is attributable to `sex`. For that, we need to look at the group averages *in the context of the variation of individuals*. That is, instead of computing the model value for each `sex`, we will compute it for each specimen using `mutate()`. This is simple, since all `F` specimens will have a model value of 64.1 inches, while the `M` specimens all have a model value of 69.2 inches. However, anticipating what will come later in this Lesson, we will do the calculation using `mutate()`.
::: {#lst-galton-mutate}
```{webr-r}
Galton |>
mutate(modval = mean(height), .by = sex) |>
take_sample(n = 3, .by = sex) # show just a handful
```
:::
Note: To save space on the screen, @lst-galton-mutate is written to show only a handful of the 898 rows that would otherwise be in the output.
@lst-galton-mutate2 repeats the above calculation, but summarizes to show the variance of `height` and the variance of the model values.
::: {#lst-galton-mutate2}
```{webr-r}
Galton |>
mutate(modval = mean(height), .by = sex) |>
summarize(var(height), var(modval))
```
:::
The output shows that the variability in the model values of height, is about half the variability in the height itself. This is the the answer to (2) from the introduction to this Lesson; `sex` accounts for about half of the variation in height.
Item (3) in the introduction was about measuring what remains unexplained. This "unexplained" part of height is called the **residual** from the model values. The residual is easy to calculate. @lst-galton-resid-var calculates the residuals from the simple model that accounts for `height` by `sex`.
::: {#lst-galton-resid-var}
```{webr-r}
Galton |>
mutate(modval = mean(height), .by = sex) |>
mutate(resid = height - modval) |>
summarize(var(height), var(modval), var(resid))
```
:::
It's essential to note from the output of @lst-galton-resid-var that the variance in the response variable (`height`) is exactly the sum of the the variance of the model values and the variance of the residuals. In other words, the variance in the response variable is **perfectly split into two parts**:
1. the part accounted for by the explanatory variable and
2. the part that remains unaccounted for.
{{< include LearningChecks/L09/LC09-02.qmd >}}
{{< include LearningChecks/L09/LC09-03.qmd >}}
## Numerical explanatory variables
The previous section used the categorical variable `sex` as the explanatory variable. Galton's interest, however, was in the relationship between children's and parents' height.
`sex` is a *categorical* explanatory variable. Using `mutate(..., .by = sex)` is a good method of handling categorical explanatory variables. However, this does not work for *quantitative* explanatory variables. The reason is that `.by` translates a quantitative variable into a categorical one, with a result for each unique quantitative value.
It's trivial to substitute the height of the `mother` or `father` in place of `sex` in the method introduced in the previous section. However, as we shall see, the results are not satisfactory. Galton's key discovery was the proper method for relating two *quantitative* variables such as `height` and `mother`.
First, let's try simply substituting in `mother` as the explanatory variable and using `mean()` to create the model values. Then, use `point_plot()` to show the model values as a function of `mother`.
:::: {#lst-galton-by-mother}
```{webr-r}
Galton |>
mutate(modval = mean(height),
resid = height - modval,
.by = mother) |>
point_plot(modval ~ mother) |>
gf_line(modval ~ mother, color = "blue")
```
::::
The last line of @lst-galton-by-mother connects adjacent points with lines. It's hard to see any pattern in the model values.
It is common sense that the model linking mothers' heights to children's height should be **smooth**, not the jagged pattern seen in @lst-galton-by-mother. The source of the jaggedness is the use of `.by = mother` in `mutate(modval = mean(height), .by = mother)`. There is nothing in `.by = mother` to enforce the idea that the model value for mid-height mothers should be in-between the model values for short and for tall mothers.
The solution to the problem of jagged model values is to avoid the absolute splitting into non-overlapping groups by mother's height. Instead, we want to find a smooth relationship. Galton invented the method for accomplishing this. A modern form of his method is provided by the `model_values()` function., which we shall use to construct `Model3`.
:::: {#lst-mother-model3}
```{webr-r}
Galton |>
mutate(modval = model_values(height ~ mother),
resid = height - modval) |>
point_plot(modval ~ mother)
```
:::
Notice that the `.by = mother` step has been entirely removed. Notice also that `model_values()` uses the same kind of **tilde expression** as we have employed when plotting. The response variable is listed on the left of the ![](www/tilde.png), the explanatory variable on the right side. In other words, we are modeling the child's `height` as a function of `mother`'s height.
As always, modeling splits the variance of the response variable into two parts, one associated with the explanatory variable and the other holding what's left over: the residual. Here's the split for `Model3` which uses `mother` as an explanatory variable:
:::: {#lst-mother-model4}
```{webr-r}
Galton |>
mutate(modval = model_values(height ~ mother),
resid = height - modval) |>
summarize(var(height), var(modval), var(resid))
```
::::
The mother's height doesn't account for much of the variation in the children's heights.
{{< include LearningChecks/L09/LC09-04.qmd >}}
## Multiple explanatory variables {#sec-multiple-vars1}
The models we work with in these Lessons *always* have exactly one response variable. [Note that the idea of "response" and "explanatory" variables refers to a model and are not at all intrinsic to a bare data frame. A data frame can contain many variables, any of which can be used as explanatory variables. The choice of response variable depends on the modeler's goals.]{.aside} But models can have any number of *explanatory* variables.
Whatever the number of explanatory variables and however many levels a categorical explanatory variable has the model splits the variance of the response into two complementary pieces: the variance accounted for by the explanatory variables and the part not accounted for, that is, the *residual* variance. [Many statistical terms mean something different in statistical than in everyday use. "Residual" is a pleasant exception: the statistical meaning is closely matched by its everyday dictionary definition.]{.aside}
To illustrate, here is a sequence of models of `height` with different numbers of explanatory variables.
Number of explanatory variables | tilde expression | `var(resid)`
-------------|-------------------|----------
0 | `height ~ 1` |
1 | `height ~ sex` |
2 | `height ~ sex + mother` |
3 | `height ~ sex + mother + father` |
Plugging each of these number
```{r eval = FALSE}
Galton |>
mutate(modval = # <1>
model_values( # <2>
..tilde.expression.here. # <3>
), # <2>
resid = height - modval) |> # <4>
summarize(var(height), var(modval), var(resid)) # <5>
```
**Three explanatory variables**
YOU WERE HERE: Perhaps convert this to an interactive
```{r results=asis(), eval=knitr::is_html_output()}
#| column: margin
Galton |>
mutate(modval =
model_values(height ~ sex + mother + father),
resid = height - modval) |>
summarize(var(height), var(modval), var(resid))
```
```{r results=asis(), eval=knitr::is_latex_output(), echo=FALSE}
Galton |>
mutate(modval =
model_values(height ~ sex + mother + father),
resid = height - modval) |>
summarize(var(height), var(modval), var(resid))
```
**Two explanatory variables**
```{r results=asis(), eval=knitr::is_html_output()}
#| column: margin
Galton |>
mutate(modval =
model_values(height ~ sex + mother),
resid = height - modval) |>
summarize(var(height), var(modval), var(resid))
```
```{r results=asis(), eval=knitr::is_latex_output(), echo=FALSE}
Galton |>
mutate(modval = model_values(height ~ sex + mother),
resid = height - modval) |>
summarize(var(height), var(modval), var(resid))
```
**One explanatory variable**
```{r results=asis(), eval=knitr::is_html_output()}
#| column: margin
Galton |>
mutate(modval = model_values(height ~ sex),
resid = height - modval) |>
summarize(var(height), var(modval), var(resid))
```
```{r results=asis(), eval=knitr::is_latex_output(), echo=FALSE}
Galton |>
mutate(modval = model_values(height ~ sex),
resid = height - modval) |>
summarize(var(height), var(modval), var(resid))
```
**Zero explanatory variables**
```{r results=asis(), eval=knitr::is_html_output()}
#| column: margin
Galton |>
mutate(modval = model_values(height ~ 1),
resid = height - modval) |>
summarize(var(height), var(modval), var(resid))
```
```{r results=asis(), eval=knitr::is_latex_output(), echo=FALSE}
Galton |>
mutate(modval = model_values(height ~ 1),
resid = height - modval) |>
summarize(var(height), var(modval), var(resid))
```
## Comparing models with R^2^
When selecting explanatory variables, comparing two or more different models sharing the same response variable is often helpful: a simple model and a model that adds one or more explanatory variables to the simple model. The model with *no explanatory variables*, is always the simplest possible model. For example, in @sec-multiple-vars1, the model `height ~ 1` is the simplest. Compared to the simplest model, the model `height ~ sex` has one additional explanatory variable, `sex`. Similarly, `height ~ sex + mother` has one additional explanatory variable compared to `height ~ sex`, and `height ~ sex + mother + father` adds in still another explanatory variable.
The simpler model is said to be "**nested in**" the more extensive model, analogous to a series of [Matroshka dolls](https://en.wikipedia.org/wiki/Matryoshka_doll). A simple measure of how much of the response variance is accounted for by the explanatory variables is the ratio of the variance of the model values divided by the variance of the response variable itself. This ratio is called "R^2^", pronounced "R-squared."
::: {.column-margin}
![](www/Russian-Matroshka_no_bg.jpeg)
A sequence of five nested Matroshka dolls. Each smaller doll fits inside a larger one.
:::
For instance, R^2 for the model `height ~ sex + mother` is
$$\text{R}^2 = \frac{7.21}{12.84} = 0.56$$
[R^2^ is also known as the "coefficient of determination," a little-used term we shall avoid. Still, it's worth noting the attitude behind the term; it quantifies the extent to which the response variable is "determined" by the explanatory variables.]{.aside} By comparison, R^2 for the simpler model, `height ~ sex,` is slightly smaller:
$$\text{R}^2 = \frac{6.55}{12.84} = 0.51$$
For all models, $0 \leq$ R^2^ $\leq 1$. [Instructor Note: Strictly speaking, "all" should be qualified to mean "linear least-squares models with an intercept term."]{.aside} It is tempting to believe that the "best" model in a set of nested models is the one with the highest R^2^, but statistical thinkers understand that "best" ought to depend on the purpose for which the model is being built. This matter will be a major theme in the remaining Lessons.
## Exercises
```{r eval=FALSE, echo=FALSE}
# need to run this in console after each change
emit_exercise_markup(
paste0("../LSTexercises/09-Prelude-to-modeling/",
c("Q09-101.Rmd",
"Q09-102.Rmd",
"Q09-103.Rmd",
"Q09-104.Rmd",
"Q09-105.Rmd",
"Q09-106.Rmd")),
outname = "L09-exercise-markup.txt"
)
```
{{< include L09-exercise-markup.txt >}}