-
Notifications
You must be signed in to change notification settings - Fork 1
/
L12-Adjustment.qmd
488 lines (352 loc) · 22 KB
/
L12-Adjustment.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
# Adjustment {#sec-adjustment}
```{r label='250-Adjustment-4QgjSm', include=FALSE}
source("../_startup.R")
set_chapter(11)
```
The phrase "all other things being equal" is a critical qualifier in describing relationships. To illustrate: A simple claim in economics is that a high price for a commodity reduces the demand. For example, increasing the price of gasoline will reduce demand as people avoid unnecessary driving or purchase electric cars. Nevertheless, the claim can be considered obvious only with the qualifier *all other things being equal*. For instance, the fuel price might have increased because a holiday weekend and the attendant vacation travel has increased the demand for gasoline. Thus, higher gasoline prices may be associated with higher demand unless holding constant other variables such as vacationing.
The Latin equivalent of "all other things being equal" is sometimes used in economics: "**ceteris paribus**". The economics claim would be, "higher prices are associated with lower demand, *ceteris paribus*."
Although the phrase "all other things being equal" has a logical simplicity, it is impractical to implement "all." So instead of the blanket "all other things," it is helpful to consider just "some other things" to be held constant, being explicit about what those things are. Other phrases along the same lines are "adjusting for ...," "taking into account ...," and "controlling for ...."
## Groupwise adjustment
"**Life expectancy**" is a statistical summary familiar to many readers. Life expectancy is often the evidence provided in debates about healthcare policies or environmental conditions. For instance, consider this pull-quote from the [*Our World in Data* website](https://ourworldindata.org/us-life-expectancy-low):
> "*Americans have a lower life expectancy than people in other rich countries despite paying much more for healthcare.*"
::: {#tbl-life-expectancy .column-margin .content-visible when-format="html"}
Country | Female | Male
--------|----|-----
Japan | 87.6 | 84.5
Spain | 86.2 | 80.3
Canada | 84.7 | 80.6
United States | 80.9 | 76.0
Bolivia | 74.0 | 71.0
Russia | 78.3 | 66.9
North Korea | 75.9 | 67.8
Haiti | 68.7 | 63.3
Nigeria | 63.3 | 59.5
Somalia | 58.1 | 53.4
Life expectancy at birth for several countries and territories. [Source](https://en.wikipedia.org/wiki/List_of_countries_by_life_expectancy)
:::
<!-- for PDF version -->
::: {.column-margin .content-visible when-format="pdf"}
Country | Female | Male
--------|----|-----
Japan | 87.6 | 84.5
Spain | 86.2 | 80.3
Canada | 84.7 | 80.6
United States | 80.9 | 76.0
Bolivia | 74.0 | 71.0
Russia | 78.3 | 66.9
North Korea | 75.9 | 67.8
Haiti | 68.7 | 63.3
Nigeria | 63.3 | 59.5
Somalia | 58.1 | 53.4
Life expectancy at birth for several countries and territories. [Source](https://en.wikipedia.org/wiki/List_of_countries_by_life_expectancy)
:::
The numbers in @tbl-life-expectancy faithfully reflect the overall situation in the different countries. Yet, without adjustment, they are not well suited to inform about specific situations. For example, life expectancies are usually calculated *separately* for males and females, acknowledging a significant association of life expectancy with sex, not just the availability of medical care. We will call such a strategy "**groupwise adjustment**" because it's based on acknowledging difference between groups. You'll see similar groupwise adjustment of life expectancy on the basis of race/ethnicity.
Over many years teaching epidemiology at Macalester College, I asked students to consider life-expectancy tables and make policy suggestions for improving things. Almost always, their primary recommendations involved improving access to health care, especially for the elderly.
But life expectancy is not mainly, or even mostly, about old age. Two critical determinants are infant mortality and lethal activities by males in their late teenage and early adult years. If we want to look at conditions in the elderly, we need to consider elderly people separately, not mixed in with infants, children, and adolescents. For reasons we won't explain here, with life expectancy calculations it's routine to calculate a separate "life expectancy at age X" for each age year. @tbl-life-expectancy-at-70 shows, according to the World Health Organization, how many years longer a 70-year old can expect to live. The 30-year difference between Japan and Somalia seen in @tbl-life-expectancy is reduced, for 70-year olds, to about a decade. The differences between males and females are similarly reduced
::: {#tbl-life-expectancy-at-70 .column-margin .content-visible when-format="html"}
Country | Female | Male
--------|-----|-----
Japan | 21.3 | 17.9
Canada | 18.0 | 15.6
Spain | 17.0 | 14.0
United States | 18.3 | 16.3
Russia | 16.2 | 12.2
Bolivia | 13.6 | 13.0
Haiti | 12.9 | 12.1
Somalia | 11.6 | 9.7
Life expectancy at age 70. (Main source: [World Health Organization](https://apps.who.int/gho/data/view.main.61780?lang=en)) average of 65-74 year olds)
:::
<!-- PDF version of above -->
::: {.column-margin .content-visible when-format="pdf"}
Country | Female | Male
--------|-----|-----
Japan | 21.3 | 17.9
Canada | 18.0 | 15.6
Spain | 17.0 | 14.0
United States | 18.3 | 16.3
Russia | 16.2 | 12.2
Bolivia | 13.6 | 13.0
Haiti | 12.9 | 12.1
Somalia | 11.6 | 9.7
table: Life expectancy at age 70. (Main source: [World Health Organization](https://apps.who.int/gho/data/view.main.61780?lang=en)) average of 65-74 year olds)
:::
## Adjustment with *per* {#sec-per-adjustment}
The previous example demonstrated adjustment by comparing similar subgroups. Here, we focus on adjustment by converting quantities into `r glossary_term("**rates**", "rate")`. For instance, "10 miles" is a quantity of distance. But "10 miles per hour" is a rate that compares the distance travelled to the time taken for the trip. Rates are always *quotients*, one quantity divided by another.
Often, the denominator of a rate---recall that quotients consist of a "numerator" divided by a "denominator"---is often a duration of time, but not always. Some examples:
- "Dollars *per* gallon" for gasoline prices.
- "Pounds *per* square inch" as in measuring air pressure.
- "Euros *per* dollar" for expressing an exchange rate between two currencies.
- "Children *per* woman" is a way of expressing fertility rates.
- "Percent *per* year" is a common way of expressing a rate of growth, as with interest accumulating on a student loan.
In everyday speech, the word **per** is used to identify the quantity as a rate and serves grammatically to separate the numerator of the quotient from the denominator.
Rates are sometimes confusing to the unpracticed ear. For instance, some rates are constructed with two divisions. Physics students learn to express acceleration as "meters *per* second *per* second." Similarly, a country's birth rate can be expressed by "births *per* year *per* 1000 population."
In economics and governance, rates often involve taking a total quantity and dividing it by the size of the population. For instance, total yearly spending on chewing gum in the US is estimated to be [2.4 billion dollars](https://www.statista.com/statistics/1359942/retail-sales-of-gum-in-the-united-states-by-category/#:~:text=Retail%20sales%20of%20gum%20in,in%202020%2D2024%2C%20by%20category&text=In%202021%2C%20retail%20sales%20of,nearly%202.6%20billion%20U.S.%20dollars.). The size of the population in the US is about 340 million people. Expressed as a rate, chewing-gum spending is about 7 dollars *per* person.
Very often, such "*per* person" rates are expressed using Latin: *per capita*.
::: {#fig-CMS-spending}
> "*In 2020, children (0-18) accounted for 23 percent of the population and 10 percent of personal health care (PHC) spending, working age adults (19-64) accounted for 60 percent of the population and 53 percent of PHC, and older adults (65 and older) account for 17 percent of the population and 37 percent of PHC.*"
Personal health-care spending as a fraction of the total population for different age groups. [Source: US Centers for Medicare Studies](https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/NationalHealthExpendData/Downloads/AgeandGenderHighlights.pdf)
:::
::: {.column-margin .content-visible when-format="pdf"}
group | age span | population | PHC spending
:--------|----:|------------|----------
children | 0-18 | 23% | 10%
working age | 19-64 | 60% | 53%
older adults | 65+ | 17% | 37%
Table 3: A tabular arrangement of the data from the Centers for Medicare Studies
:::
::: {#tbl-PHC-spending .column-margin .content-visible when-format="html"}
group | age span | population | PHC spending
:--------|----:|------------|----------
children | 0-18 | 23% | 10%
working age adults| 19-64 | 60% | 53%
older adults | 65+ | 17% | 37%
A tabular arrangement of the data from the Centers for Medicare Studies
:::
The textual presentation of data in @fig-CMS-spending, presents relevant information but obscures the patterns. The author would have done better by placing the numbers that are to be compared to one another next to one another. A `r glossary_term("tabular", "table")` organization makes it much easier to compare the relative population sizes of the age groups. For instance, the population column of @tbl-PHC-spending shows at a glance that the population of children and older adults are about the same.
Similarly, the table's "PHC spending" column makes it obvious that PHC spending is much higher for working age adults than for either children or older adults.
In comparing the spending between groups, it can be helpful to *take into account* the differing population sizes. A **per capita** adjustment---spending divided by population size---accomplishes it. For instance, the per capita adjustment for children is 10% / 23%, that is, 0.43. @tbl-per-capita-spending shows the *per capita* spending for all three age groups.
::: {#tbl-per-capita-spending}
group | age | population | spending | spending *per capita*
:--------|----:|------------|----------|---------
children | 0-18 | 23% | 10% | 0.43
working age adults| 19-64 | 60% | 53% | 0.88
older adults | 65+ | 17% | 37% | 2.18
Adjusting spending for the size of the population gives a clearer indication of how spending compares between the different age groups.
:::
Including the *per capita* adjusted spending in the table makes it easy to see an important pattern: older adults have much higher health spending (per person) than the other groups.
The method of adjustment by dividing one quantity by another has much broader applications. To illustrate, let's return to the example of college grades from @sec-grade-joins. There, we calculated using simple wrangling each student's grade-point average and an instructor grade-giving average. The instructor's grade-giving average varies so much that it seems short-sighted to neglect it as a factor in determining a student's grade in that instructor's courses.
An adjustment for the instructor can be made by constructing a *per*-type index. An instructor gave each grade, but instead of considering the grade literally, let's divide the grade by the grade-giving average of the instructor involved.
::: {.column-margin .content-visible when-format="pdf"}
sid iid sessionID gradepoint
-------- --------- ------------- ------------
S32418 inst268 session2911 3
S32328 inst436 session3524 3.66
S32250 inst268 session2911 2.66
S32049 inst436 session2044 3.33
S31914 inst436 session2044 3.66
S31905 inst436 session2044 4
S31833 inst436 session3524 3.33
S31461 inst264 session1904 4
S31197 inst436 session3524 3
S31194 inst264 session1904 2
A few of the 6,124 rows in the `Extended_grades` table.
:::
We can consider the instructors' `iGPA` to calculate an instructor-adjusted GPA for students. We create a data frame with the instructor ID and numerical grade point for every grade in the `Grades` and `Sessions` tables. First, we use "joins" to bring together the tables from the database.
```{r message = FALSE}
Extended_grades <- Grades |>
left_join(Sessions) |>
left_join(Gradepoint) |>
select(sid, iid, sessionID, gradepoint)
```
```{r echo=FALSE}
# for printing purposes
set.seed(204)
Foobar <- Extended_grades |>
filter(sessionID %in%
c("session2911", "session1904", "session3524", "session2491",
"session3822", "session2044", "session2606")) |>
take_sample(n = 4, .by = sid) |> unique() |>
take_sample(n=2, .by = gradepoint) |> arrange(desc(sid)) |>
filter(!is.na(gradepoint))
```
::: {#fig-ex-grades .column-margin .content-visible when-format="html"}
```{r and_so_on="... for 6,124 rows altogether", echo = FALSE}
Foobar
# set.seed(109)
# Who_to_show <- Extended_grades |>
# filter(sid %in% !!head(Foobar$sid, 8)) |>
# take_sample(n = 10)
# Who_to_show
```
... for 364 instructors altogether
:::
Next, calculate the instructor-by-instructor "grade-giving average" (`gga`):
```{r}
Instructors <- Extended_grades |>
summarize(gga = mean(gradepoint, na.rm = TRUE), .by = iid)
```
```{r echo = FALSE}
Show_these <- Instructors |>
filter(iid %in% !!Foobar$iid)
```
::: {.column-margin .content-visible when-format="html"}
```{r}
Show_these
```
Three rows from the `Instructors` data frame.
:::
::: {.column-margin .content-visible when-format="pdf"}
iid gga
--------- -------
inst436 3.584
inst264 2.974
inst268 3.062
Three rows from the `Instructors` data frame.
:::
`r glossary_term("Joining", "join")` the `Instructors` data frame with `Extended_grades` puts the grade earned and the average grade given next to one another. The unit of observation is still a student receiving a grade in a class session.
```{r message = FALSE}
With_instructors <-
Extended_grades |>
left_join(Instructors)
```
```{r label='250-Adjustment-1vQRlK', echo=FALSE, and_so_on = "... for 364 instructors altogether", digits=3}
set.seed(1091)
Display2 <- With_instructors |>
filter(iid %in% !!Foobar$iid) |>
take_sample(2, .by = iid)
```
::: {.content-visible when-format="pdf"}
sid iid sessionID gradepoint gga
-------- --------- ------------- ------------ -------
S32310 inst436 session2193 3.66 3.584
S31794 inst436 session2541 3.66 3.584
S32289 inst264 session2235 4 2.974
S31461 inst264 session1904 4 2.974
S32211 inst268 session2650 2.33 3.062
S32250 inst268 session2911 2.66 3.062
A few rows of he result of joining `Extended_grades` with `Instructors`. The data frame has 6,124 rows in total.
:::
::: {.content-visible when-format="html"}
```{r echo=FALSE, and_so_on = "... for 6,124 rows altogether", digits=3}
Display2
```
Joining `Extended_grades` with `Instructors`.
:::
Make the *per* adjustment by dividing `gradepoint` by `gga` to create a grade index. We will then average this index for each student to create each student's instructor-adjusted GPA (`adj_gpa`), shown in @tbl-adj-gpa.
```{r}
Adjusted_gpa <-
With_instructors |>
mutate(index = gradepoint / gga) |>
summarize(adj_gpa = mean(index, na.rm = TRUE), .by = sid)
```
```{r echo=FALSE}
set.seed(109)
Display3 <- Adjusted_gpa |>
filter(sid %in% !!Foobar$sid) |>
take_sample(8)
```
::: {#tbl-adj-gpa .column-margin .content-visible when-format="html"}
```{r echo=FALSE, and_so_on = "... for 443 students altogether.", digits=3}
Display3
```
:::
::: {.column-margin .content-visible when-format="pdf"}
sid adj_gpa
-------- ---------
S31197 0.9578
S31914 1.04
S31461 1.148
S32250 0.9284
S31194 0.9978
S32418 0.9328
... for 443 rows altogether
:::
## Adjustment by modeling {#sec-adjustment-by-modeling}
```{r}
unadjusted <- FARS |> model_train(crashes ~ year) |>
model_eval(data = FARS)
adjusted <- FARS |>
model_train(crashes ~ year + vehicle_miles) |>
model_eval(data = FARS |> mutate(vehicle_miles=3000))
gf_point(.output ~ year, data = unadjusted) |>
gf_point(.output ~ year, data = adjusted, color = "blue")
```
We will use the word "**adjustment**" to name the statistical techniques by which "other things" are considered. Those other things, as they appear in data, are called "**covariates**."
"*Per*" adjusting for a covariate by creating an appropriate quotient is an important and somewhat intuitive technique. Less intuitive, but sometimes more powerful, is using a statistical model to accomplish the adjustment.
To illustrate, consider the data on auto traffic safety compiled by the US Department of Transportation. The data frame `FARS` (short for "Fatality Analysis Recording System") records data from 1994 to 2016. Did driving become safer or more dangerous over that period?
We can look at the relationship between the number of crashes each year and the year itself, as in @fig-fars-by-year.
```{r}
FARS |> mutate(rate = crashes / vehicle_miles) |>
mutate(after_adjustment = rate * 2849) |>
point_plot(after_adjustment ~ year)
```
::: {#fig-fars-by-year}
```{r}
FARS |>
point_plot(crashes ~ year, annot = "model")
```
The number of traffic crashes was more-or-less level up through 2006, then dropped off sharply until 2010. Perhaps the last two years show a return to the pre-2006 situation.
:::
An important factor in the number of crashes in a year is the amount of driving in that year. `FARS` records this as `vehicle_miles`, as if all cars shared the same odometer. But the number of crashes might be influenced by the number of licensed drivers (more inexperienced drivers leading to more crashes?), or the number of registered vehicles (older cars still on the road leading to more crashes?). We can take all these into account by building a model and evaluating it for each year at specified values for each of the covariates. We'll use the mean of each of the covariates for the model evaluation.
```{r}
FARS |>
summarize(miles = mean(vehicle_miles),
vehicles = mean(registered_vehicles),
drivers = mean(licensed_drivers))
```
Now, to build the model:
```{r}
Mod <- FARS |>
model_train(crashes ~ year + vehicle_miles +
registered_vehicles + licensed_drivers)
Evaluate_at <- FARS |>
mutate(vehicle_miles = 2849,
registered_vehicles = 240101,
licensed_drivers = 199337)
Mod |> model_eval(data = Evaluate_at) |>
point_plot(.output ~ year)
```
------
To illustrate, consider the Childhood Respiratory Disease Study which examined possible links between smoking and pulmonary capacity, as measured by "forced expiratory volume" (`FEV`). The data are recorded in the `CRDS` data frame. The unit of observation is a child. `CRDS` contains several variables including the "forced expiratory volume" (`FEV` variable) and `smoker`. Higher forced expiratory volume is a sign of better respiratory capacity. Knowing what we do today about the dangers of smoking, we might hypothesize that the FEV measurement will be related to smoking.
::: {#fig-fev-smoking}
```{r}
# IN DRAFT: Replace CRDS with FEV until the LSTbook package is updated.
CRDS |>
point_plot(FEV ~ smoker, annot = "model",
point_ink = 0.2, model_ink = 1)
```
Forced expiratory volume as a function of smoking status. A typical non-smoker has an FEV of about 2.5, while typical non-smokers have a higher FEV: about 3.25.
:::
To judge from @fig-fev-smoking, children who smoke tend to have larger FEV than their non-smoking peers. This counter-intuitive result might give pause. Is there some covariate that would clarify the picture?
Since FEV largely reflects the physical capacity of the lungs, perhaps we need to consider the child's age or height. "consider age or height" we mean "adjust for age or height."
There are two phases for modelling-based adjustment, one requiring careful thought and understanding of the specific system under study, the other---the topic of this section---involving only routine, straightforward modeling calculations.
Here's the calculation of the model `.output` as a function of `smoker`, without any adjustment. That is, `smoker` is the only explanatory variable.
::: {#tbl-unadjusted-fev}
```{r}
CRDS |>
model_train(FEV ~ smoker) |>
model_eval(data = CRDS) |>
select(.output, smoker) |>
unique()
```
The unadjusted model values, that is, the model output from `FEV ~ smoker` without any covariates.
:::
The adjusted model involves three phases:
**Phase 1**: Choose relevant covariates for adjustment. This almost always involves familiarity with the real-world context. Here, we'll use `height` and `age`, reflecting the fact that bigger kids tend to have larger FEV.
**Phase 2**: Build a model with the covariates from Phase 1 as explanatory variables.
```{r}
adjusted_model <-
CRDS |>
model_train(FEV ~ smoker + height + age)
```
**Phase 3**: Evaluate the adjusted model but don't use the actual values of the covariates. Instead, standardize the values of the covariates to constant, representative values. We'll use age 15 and height 60 inches.
::: {#tbl-with-adjustment}
```{r}
adjusted_model |>
model_eval(
data = CRDS |> mutate(age = 15, height = 60)
) |>
select(.output, smoker) |>
unique()
```
Models highlight trends in the data, but the trends in this case are different with and without adjustment. Adjustment for age and height raises the model output for non-smokers and lowers it for smokers.
This is the core of adjustment: comparing individual specimens *after* putting them on the same footing, that is, *ceteris paribus*.
## Exercises
```{r eval=FALSE, echo=FALSE}
# need to run this in console after each change
emit_exercise_markup(
paste0("../LSTexercises/12-Adjustment/",
c("Q12-101.Rmd",
"Q12-102.Rmd",
"Q12-103.Rmd",
"Q12-104.Rmd",
"Q12-105.Rmd",
"Q12-201.Rmd", # Project
"Q12-301.Rmd")),
outname = "L12-exercise-markup.txt"
)
```
{{< include L12-exercise-markup.txt >}}
## Enrichment topics
{{< include Enrichment-topics/ENR-12/Topic12-01.qmd >}}
{{< include Enrichment-topics/ENR-12/Topic12-02.qmd >}}