# Annotating point plots with a model {#sec-model-annotation}
```{r include=FALSE}
source("../_startup.R")
set_chapter(4)
```
Lesson [-@sec-variation-and-distribution] introduced the violin-plot annotation to display graphically the "shape" of variation: which values are more common, which values rare, and which values never seen at all. In this Lesson, we turn to annotations that summarize patterns of relationships between a response variable and one or more explanatory variables. Such summaries are called **statistical models** and will be a major theme in these Lessons.
We start informally to help you develop an intuition about what a "pattern of relationship" means.
## Models with a single explanatory variable
In introducing point plots in Lesson [-@sec-point-plots] we started with a simple setting: a response variable and a single explanatory variable. For instance, we might look at the relationship between wrist circumference and ankle circumference in the `Anthro_F` data frame. Arbitrarily, we'll choose `Wrist` as the response variable and `Ankle` as the explanatory variable, as in @fig-wrist-ankle.
```{r echo=FALSE}
show_pairs <- function(nsegs = 4, seed = 104, labels = TRUE) {
  # Draw the Wrist ~ Ankle point plot, then connect `nsegs` randomly
  # chosen pairs of people with colored line segments.
  set.seed(seed)
  Tmp <- Anthro_F |> select(Wrist, Ankle)
  Tmp2 <- Tmp |> rename(WristB = Wrist, AnkleB = Ankle)
  # Pair people at random by binding together two shuffled copies of the data.
  Joined <- bind_cols(take_sample(Tmp), take_sample(Tmp2))
  Joined_small <- take_sample(Joined, n = nsegs) |>
    mutate(color = rainbow(nsegs),
           y = (Wrist + WristB)/2, x = (Ankle + AnkleB)/2,
           lab = 1:nsegs)
  P <- Tmp |> point_plot(Wrist ~ Ankle) |>
    gf_segment(Wrist + WristB ~ Ankle + AnkleB,
               data = Joined_small, color = ~color) |>
    gf_point(Wrist ~ Ankle, color = ~color, data = Joined_small) |>
    gf_point(WristB ~ AnkleB, color = ~color, data = Joined_small)
  if (labels)
    P <- P |> gf_label(y ~ x, color = ~color, label = ~lab, data = Joined_small)
  P + guides(color = "none")
}
```
::: {.column-page-right}
```{r echo=FALSE}
#| label: fig-wrist-ankle
#| fig-cap: 'Wrist circumference versus ankle circumference'
#| fig-subcap:
#| - The individual persons
#| - Connecting a handful of pairs with lines
#| - Drawing many such connecting lines
#| layout-ncol: 3
Anthro_F |> point_plot(Wrist ~ Ankle, jitter = "all")
show_pairs(nsegs = 4)
show_pairs(nsegs = 100, labels = FALSE)
```
:::
Some people can look at @fig-wrist-ankle(a) and see an "upward sloping trend" in the cloud of points. For others, let's take things more gradually. Suppose we pick, entirely at random, two different people from `Anthro_F`, such as the pair connected by line segment 1 in @fig-wrist-ankle(b). That line segment slopes upward, which is merely to say that the person at the right end of the line has both a larger wrist and a larger ankle than the person at the left end. The same is true for line 3, but not for lines 2 and 4.
Comparing two individuals is trivial. But we would like to make a statement about *all* the points in the cloud. One way to do this is to pick many pairs at random and show the line segment for each pair. This is done in @fig-wrist-ankle(c). The figure is a jumble of randomly sloping lines. But the picture as a whole is not entirely random. The general impression created is that most of the lines slope upward.
A statistical model of `Wrist ~ Ankle` replaces this "general impression" with an overall pattern: a line-like band that more-or-less averages all the pairwise lines.
Graphing the model `Wrist ~ Ankle` is merely a matter of asking `point_plot()` to add an annotation. But rather than "violin" as we used in Lesson [-@sec-variation-and-distribution], we ask for a "model" annotation. Try it!
::: {#fig-wrist-ankle-annot fig-cap-location="margin"}
```{webr-r}
Anthro_F |>
point_plot(Wrist ~ Ankle, annot = "model")
```
Annotating `Wrist ~ Ankle` point plot with a statistical model.
:::
The model annotation is shaded blue to help distinguish it from the individual data points. The band's thickness acknowledges the variation in pairwise lines seen in @fig-wrist-ankle(c). The particular band presented by `point_plot()` is the one that comes as close as possible to the data points. "As close as possible" is defined in a specific way that we will investigate later; for now it suffices to note that the band goes nicely through the cloud of data points.
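For readers who want a peek behind the annotation now, here is a minimal sketch using `lm()`, R's standard least-squares fitting function. The band drawn by `point_plot()` reflects a straight-line fit of this general kind, although we have not yet said exactly how it is computed.

```{r eval=FALSE}
# Fit a straight-line model of wrist circumference versus ankle
# circumference. lm() chooses the intercept and slope that bring the
# line as close as possible (in the least-squares sense) to the points.
mod <- lm(Wrist ~ Ankle, data = Anthro_F)
coef(mod)   # the intercept and slope of the fitted line
```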
The explanatory variable in @fig-wrist-ankle-annot is *quantitative*. Model annotations can also be drawn for *categorical* explanatory variables. To illustrate, consider the data in `Birdkeepers`, used in a study of smoking, bird-keeping, and lung cancer. The unit of observation is an individual person. The variable `YR` records the number of years that person smoked, while the categorical variable `LC` indicates whether the person had been diagnosed with lung cancer. The data and a model annotation are shown in @fig-birdkeepers-A.
```{r eval=FALSE, results="hide"}
Birdkeepers |> point_plot(YR ~ LC, annot="model")
```
```{r echo = FALSE}
#| label: fig-birdkeepers-A
#| fig-cap: "Years of cigarette smoking versus diagnostic status for lung cancer."
#| fig-cap-location: margin
Birdkeepers |> point_plot(YR ~ LC, annot="model", model_ink = 0.6)
```
For a categorical explanatory variable, the model annotation is a vertical band (or "**interval**") for each of the categorical levels. As with the band in @fig-wrist-ankle-annot, the model annotation in @fig-birdkeepers-A is vertically centered among the data points.
In later Lessons, we will discuss how `point_plot()` chooses the specific model annotation shown in any given case. But consider these closely related questions:
- Why are the model annotations shown as a band or interval, rather than as a single, simple line or single numerical value for each category?
- What do the model annotations tell us?
The bands or intervals in the model annotations shown by `point_plot()` are there as a reminder that the data are consistent with a range of models. The vertical thickness of the band or interval shows how large that range is. This is essential to drawing conclusions from the plots. For instance, in @fig-birdkeepers-A, the intervals for the two levels, "lung cancer" and "no cancer", have no vertical overlap. This tells the statistical thinker to be confident in claiming that the typical value of `YR` is genuinely different between the two levels. If the intervals had overlapped vertically, the statistical thinker would know to be skeptical of such a claim.
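A rough numerical counterpart to those intervals can be had with `lm()` and `confint()` from base R. This is offered only as a sketch; the intervals drawn by `point_plot()` are not necessarily computed this way, but the underlying idea---a range of model values consistent with the data---is the same.

```{r eval=FALSE}
# Model years of smoking (YR) by lung-cancer status (LC), then ask for
# confidence intervals on the model coefficients.
mod <- lm(YR ~ LC, data = Birdkeepers)
confint(mod)
# The first row is an interval for the typical YR in the baseline level
# of LC; the second row is an interval for the *difference* between the
# two levels. A difference interval that excludes zero supports the
# claim of a genuine difference in typical YR between the groups.
```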
Similarly, in @fig-wrist-ankle-annot the model annotation is a sloping band. The slope indicates that `Ankle` and `Wrist` are related to one another: larger ankles tend to go along with larger wrists. If the two variables were unrelated, we would expect the band to run horizontally---zero slope---meaning that the typical wrist circumference is the same for all people, regardless of ankle circumference. The vertical thickness of the band tells the statistical thinker the range of plausible slopes that match the data. In @fig-wrist-ankle-annot, no horizontal line can be drawn from end to end *within* the band. Thus, the statistical thinker can be confident that a non-zero slope describes the relationship between `Wrist` and `Ankle`.
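The range of plausible slopes can also be summarized numerically. Again as a sketch rather than the exact computation behind the band, a confidence interval on the slope coefficient does the job:

```{r eval=FALSE}
# Confidence intervals on the intercept and slope of the fitted line.
# If the interval for the Ankle coefficient (the slope) excludes zero,
# no flat, zero-slope line is a plausible model of the data.
confint(lm(Wrist ~ Ankle, data = Anthro_F))
```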
You may have encountered statistical graphics like those in @fig-point-estimate. These differ from the annotations used in these Lessons in an essential way: the models are drawn as lines rather than as bands or intervals. In @fig-point-estimate, the model annotations are simple lines that, in principle, are infinitely thin. With a quantitative explanatory variable, the model annotation is a single line that can have a non-zero slope. For a categorical explanatory variable, a horizontal line is drawn at a single vertical value for each level. Such simplified graphics do not recognize that, in reality, a range of different lines are plausible models of the data. Since graphics like @fig-point-estimate do not display the range of plausible models, they provide no guidance about whether to be confident that slopes or vertical differences are genuinely non-zero.
Statistical thinking makes extensive use of the idea that a range of plausible models is consistent with the data. Any straight line that falls within the band in @fig-wrist-ankle-annot is a plausible model of the data. In @fig-birdkeepers-A, any pair of values that falls within the two vertical intervals is a plausible model of the data.
::: {#fig-point-estimate .column-page-right}
```{r echo=FALSE}
#| fig-subcap:
#| - Straight-line
#| - Discrete values
#| layout-ncol: 2
Anthro_F |> point_plot(Wrist ~ Ankle) |>
gf_lm(Wrist ~ Ankle, color = "blue", size = 1.5)
BK <- Birdkeepers |> mutate(modvals = model_values(YR ~ LC))
BK |> point_plot(YR ~ LC) +
  geom_errorbar(data = BK |> take_sample(n = 1, .by = LC),
                aes(x = LC, ymin = modvals, ymax = modvals),
                color = "blue", width = 0.25, linewidth = 1.5)
```
Many statistical texts present models as a straight line or as discrete values. This omits essential information compared to the model annotations generated by `point_plot()`.
:::
::: {.callout-note .column-body-outset-right}
## "Trend" or "cause"
Each of the plausible models---as in Figures [-@fig-wrist-ankle-annot] or [-@fig-birdkeepers-A]---describes a specific relationship between the response and explanatory variables. For the wrist/ankle relationship, the plausible models all show a "trend" between ankle size and wrist size. For the smoking-years/lung-cancer relationship, the people with lung cancer "tend" to have smoked for more years than the no-cancer people.
The words "trend" or "tend" are weak. Often, statistical thinkers are interested in stronger statements, like these:
- Larger ankles *cause* larger wrists.
- Smoking for more years *increases the chances* of lung cancer.
We can call these **opinionated statements** because they draw on a hypothesis, held by the modeler, about how the world works, rather than being forced solely by the data. Many people think it silly to claim that "larger ankles *cause* larger wrists." It seems much more probable that "larger people have larger wrists and also larger ankles." On the other hand, many people will be sympathetic to the claim that smoking "increases the chances of lung cancer"; they have heard such things from respected sources.
Some of the techniques covered in these *Lessons* are designed to substantiate or undermine opinionated statements like these. Until we understand and use these techniques, it is dicey to quantitatively support an opinionated statement from data.
Many statisticians prefer to avoid the whole matter of opinionated statements. [But see Lesson [-@sec-experiment] for an approach approved by even the most opinion-wary statistician.]{.aside} Weak, unopinionated language like "trend" or "tend" is used instead. Those preferring more technical-sounding language might use "associated with" or "correlated with."
:::
## Independence
We use model annotations to display whether variables are related. It is also worth considering a particular situation: **independence**, where the response variable has no relationship to the explanatory variable. When the explanatory variable is categorical, the model annotation is a vertical interval for each level. When the response is independent of the explanatory variable, those intervals will overlap. For instance, in @fig-independence(a) values of `YR` near 30 fall within both vertical intervals.
For a quantitative explanatory variable, as in @fig-independence(b), independence shows up as a model band that is more-or-less horizontal. That is to say, at least one horizontal line will fall entirely within the band.
::: {#fig-independence .column-page-right}
```{r echo=FALSE}
#| fig-subcap:
#| - Categorical explanatory variable
#| - Quantitative explanatory variable
#| layout-ncol: 2
Birdkeepers |> point_plot(YR ~ BK, annot="model", model_ink=0.7) +
geom_hline(yintercept=30, color="green") + ylab("Age (years)") + xlab("Is a birdkeeper?")
Anthro_F |> point_plot(BFat ~ Height, annot="model") + geom_abline(intercept=22, slope=0, color="green") +
ylab("Body fat (%)") + xlab("Height (meters)")
```
Model annotations consistent with the response and explanatory variables being independent. Panel (a) considers whether the age of the people in `Birdkeepers` is independent of whether the person keeps a bird. Panel (b), based on `Anthro_F`, is about the possible relationship between a person's height and body fat as a percentage of overall mass.
:::
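The graphical check for independence has a numerical counterpart. As a sketch (not necessarily the computation behind the bands in the figure above), fit a straight-line model and look at the confidence interval on the slope: an interval that includes zero means a horizontal line fits within the band, so independence remains plausible.

```{r eval=FALSE}
# Is body fat independent of height? A slope interval that includes
# zero is consistent with independence.
confint(lm(BFat ~ Height, data = Anthro_F))
```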
## Multiple explanatory variables
In Lesson [-@sec-point-plots] we used color and faceting to look at the response variable in terms of up to three explanatory variables. Statistical models can also handle multiple explanatory variables.
We'll illustrate with a commentary from a political pundit about education spending in US schools:
> *[T]he 10 states with the lowest per pupil spending included four — North Dakota, South Dakota, Tennessee, Utah — among the 10 states with the top SAT scores. Only one of the 10 states with the highest per pupil expenditures — Wisconsin — was among the 10 states with the highest SAT scores. New Jersey has the highest per pupil expenditures, an astonishing $10,561, which teachers’ unions elsewhere try to use as a negotiating benchmark. New Jersey’s rank regarding SAT scores? Thirty-ninth... The fact that the quality of schools... [fails to correlate] with education appropriations will have no effect on the teacher unions’ insistence that money is the crucial variable.*—--George F. Will, (September 12, 1993), "Meaningless Money Factor," The Washington Post, C7.
The opinionated claim here is that "money is the crucial variable" in educational outcomes. George Will seeks to rebut this claim with data. Fortunately for us, actual data on SAT scores and per pupil expenditures in the mid-1990s are available in the `mosaicData::SAT` data frame. The unit of observation in `SAT` is a US state. @fig-SAT-one(a) shows an annotated point plot of state-by-state expenditures and test scores. The trend signaled by the model annotation is that SAT scores are slightly lower in high-expenditure states, consistent with George Will's observations. But ...
::: {#fig-SAT-one .column-page-right}
```{r echo=FALSE}
#| fig-subcap:
#| - expenditures as the explanatory variable
#| - fraction of students taking SAT as the explanatory variable
#| layout-ncol: 2
SAT |> point_plot(sat ~ expend, annot="model") + xlab("Per pupil expenditures ($1000)")
SAT |> point_plot(sat ~ frac, annot="model") + xlab("Participation (%)")
```
SAT scores as a function of per-pupil expenditures and of fraction taking the SAT.
:::
Education is a complicated matter and there are factors other than expenditures that may be playing a role. One of these, shown in @fig-SAT-one(b), is that participation in the SAT varies substantially from state to state. In some states, almost all students take the test. In others, fewer than 10% of students take the test. The data show a relationship between participation and scores: scores are consistently higher in low-participation states.
```{r echo=FALSE}
#| label: fig-expend-partic
#| fig-cap: "The two explanatory variables shown in @fig-SAT-one are themselves related to one another."
SAT |> point_plot(expend ~ frac, annot="model") |>
add_plot_labels(x ="Participation (%)", y = "Per pupil expenditures ($1000s)")
```
Statistical modeling techniques enable us to use *both* expenditures and participation as explanatory variables. @fig-SAT-one considered them only *one variable at a time*, but we can also use both explanatory variables *simultaneously*. Doing so is especially important when the explanatory variables are themselves related, as seen in the graph of expenditures versus participation (@fig-expend-partic).
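The command behind panel (a) of the figure below puts both explanatory variables into a single model formula, in the same way as the plots earlier in this Lesson. (The `filter()` step excludes states with per pupil expenditures of $8000 or more, as in the figure; you can also try the command without it.)

```{r eval=FALSE}
SAT |>
  filter(expend < 8) |>
  point_plot(sat ~ expend + frac, annot = "model") +
  xlab("Expenditures ($1000)")
```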
::: {#fig-SAT-expend-partic .column-page-right}
```{r echo=FALSE}
#| fig-subcap:
#| - Expenditures on the horizontal axis, participation as color.
#| - Expenditures as color, participation on the horizontal axis.
#| layout-ncol: 2
SAT |> filter(expend < 8) |>
point_plot(sat ~ expend + frac, annot="model") + xlab("Expenditures ($1000)")
SAT |> filter(expend < 8) |> point_plot(sat ~ frac + expend, annot="model") + xlab("Participation (%)")
```
A model of SAT scores as a function of both expenditures and participation. The model is the same in both (a) and (b), but the roles of the horizontal axis and color are swapped between the two panels.
:::
The two panels in @fig-SAT-expend-partic tell a consistent story, but with different graphical appearances. For instance, the clear vertical spacing between the bands in the left panel indicates that SAT scores are influenced by the participation level, even taking expenditures into account. The same pattern appears as the downward slope of the bands in the right panel.
But when we look at expenditures---taking into account participation---we see horizontal bands in the left panel. (More precisely, bands that can encompass a horizontal line.) This indicates that we cannot confidently claim that expenditures are associated with SAT scores. In the right panel, the lack of association between expenditures and SAT scores is signaled by the vertical overlap between the bands.
Many people are discomfited to hear that looking at the same data in different ways can lead to different conclusions. At this stage in the *Lessons*, that may be true for you as well. Even so, you should already have a concrete sense of how to denote the "different ways of looking at data." In modeling notation, the perspective `sat ~ expend` shows one pattern, while the perspective `sat ~ expend + frac` tells another. In Lesson [-@sec-effect-size] we will see a non-graphical way of looking at models that makes it easier to see the effect of one explanatory variable in the context of others. And in Lessons [-@sec-DAGs] and [-@sec-experiment] we will study the methods used in modern statistics to decide which of two possible models---say `sat ~ expend` or `sat ~ expend + frac`---is more appropriate for answering questions of causation.
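To make the two perspectives concrete in code, here is a minimal sketch using `lm()`, the standard R model-fitting function, with the `SAT` variable names `expend` and `frac`. (Interpreting such coefficients is a subject for Lesson [-@sec-effect-size]; the point here is only that the two formulas are different models.)

```{r eval=FALSE}
# The one-explanatory-variable perspective
coef(lm(sat ~ expend, data = SAT))

# The two-explanatory-variable perspective
coef(lm(sat ~ expend + frac, data = SAT))

# Compare the coefficient on expend between the two models: taking
# participation (frac) into account changes the story that coefficient
# tells, just as the model annotations changed between the figures above.
```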
## Exercises
```{r eval=FALSE, echo=FALSE}
# need to run this in console after each change
emit_exercise_markup(
paste0("../LSTexercises/04-Model-annotations/",
c("Q04-100.Rmd",
"Q04-101.Rmd",
"Q04-103.Rmd",
"Q04-102.Rmd")),
outname = "L04-exercise-markup.txt"
)
```
{{< include L04-exercise-markup.txt >}}