forked from statOmics/PSLS
-
Notifications
You must be signed in to change notification settings - Fork 0
/
07_2_cuckoo_sol.Rmd
340 lines (254 loc) · 10.9 KB
/
07_2_cuckoo_sol.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
---
title: "Exercise 7.2: ANOVA on the cuckoo dataset - solution"
author: "Lieven Clement, Jeroen Gilis and Milan Malfait"
date: "statOmics, Ghent University (https://statomics.github.io)"
---
# Cuckoo dataset
The common cuckoo does not build its own nest: it prefers
to lay its eggs in another birds' nest. It is known, since 1892,
that the type of cuckoo bird eggs are different between different
locations. In a study from 1940, it was shown that cuckoos return
to the same nesting area each year, and that they always pick
the same bird species to be a "foster parent" for their eggs.
Over the years, this has lead to the development of geographically
determined subspecies of cuckoos. These subspecies have evolved in
such a way that their eggs look as similar as possible as those
of their foster parents.
The cuckoo dataset contains information on 120 Cuckoo eggs,
obtained from randomly selected "foster" nests.
For these eggs, researchers have measured the `length` (in mm)
and established the `type` (species) of foster parent.
The type column is coded as follows:
- `type=1`: Meadow pipit
- `type=2`: Tree pipit
- `type=3`: Dunnock
- `type=4`: European robin
- `type=5`: White wagtail
- `type=6`: Eurasian wren
# Goal
The researchers want totest if the type of foster parent
has an effect on the average length of the cuckoo eggs.
In theory, they want to study this for all six species.
Previously, we looked at a single pairwise comparison
between the European robin and the Eurasian wren with a
t-test. Here, we will analyse all types simultaneously
with ANOVA.
Load the required libraries
```{r, message=FALSE}
library(tidyverse)
library(multcomp)
```
## Import the data
```{r, message=FALSE}
Cuckoo <- read_tsv("https://raw.githubusercontent.com/statOmics/PSLSData/main/Cuckoo.txt")
head(Cuckoo)
```
# Data tidying
It seems that the `tpye` column is a `double` rather than a `factor`.
Let's fix this:
```{r}
Cuckoo <- Cuckoo %>%
mutate(type = as.factor(type)) %>%
mutate(type = recode(type,
"1" = "meadow",
"2" = "tree",
"3" = "dunnock",
"4" = "robin",
"5" = "wagtail",
"6" = "wren"
))
```
# Data exploration
How many birds do we have for each type?
```{r}
Cuckoo %>%
count(type)
```
Visualize the data
```{r}
Cuckoo %>%
ggplot(aes(x = type, y = length, fill = type)) +
theme_bw() +
scale_fill_brewer(palette = "RdGy") +
geom_boxplot() +
geom_jitter(width = 0.2) +
ggtitle("Boxplot of the length of eggs per type") +
ylab("length (mm)") +
stat_summary(
fun = mean, geom = "point",
shape = 5, size = 3, color = "black"
)
```
# ANOVA
To study if the observed differences in average egg length
between the different foster bird types are significant, we can perform
an ANOVA.
## The model
We will fit the following linear model:
$$
E(Y_i) = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} +
\beta_3 X_{i3} + \beta_4 X_{i4} + \beta_5 X_{i5}
$$
where $Y_i$ is the length of egg $i$ and the coefficients $beta_j$ corresond to
the **additional expected** length of a cuckoo egg fostered by bird type $j + 1$. By
convention, the intercept is called $\beta_0$ and it corresponds to the
reference bird type, in this case bird type 1 (the Meadow pipit). The variables
$X_ij$ are **dummy variables** which take on the value 1 if egg $i$ was fostered
by bird type $j + 1$ and are 0 otherwise. Note that in this case, eggs are
always fostered by a single bird type, so only one of the X's will be equal to
1.
For example, if egg $i$ was fostered by the meadow pipit (type 1), all the $X_{i1}, \dots, X_{i5}$ would be zero and we are left with:
$$E(Y_i | X_{i1} = \dots = X_{i5} = 0) = \beta_0$$
i.e. the intercept. If egg $i$ was fostered by the European robin (type 4), we would have:
$$E(Y_i | X_{i3} = 1) = \beta_0 + \beta_3$$
So $\beta_3$ can be interpreted as the *change* in average length of cuckoo eggs
between the reference bird type (the meadow pipit) and the European robin. The
coefficients can be negative or positive, so the change can be an increase or a
decrease.
To further aid with intepretation, we can visualize the model design using the **ExploreModelMatrix** package:
```{r}
ExploreModelMatrix::VisualizeDesign(Cuckoo, ~type)$plotlist[[1]]
```
## Formulate null and alternative hypothesis
The null hypothesis of ANOVA states that:
$H0$: The mean egg length is equal between the different bird types.
In terms of the linear model:
$$
H_0:\ \beta_1 = \beta_2 = \beta_3 = \beta_4 = \beta_5 = 0
$$
The alternative hypothesis of ANOVA states that:
$HA$: The mean egg length for at least one bird type is different
from the mean egg length in at least one other bird type.
$$
H_A:\ \beta_i \neq \beta_j \text{ for at least one } i \neq j
$$
## Parameter estimation
```{r}
mod <- lm(length ~ type, Cuckoo)
summary(mod)
```
The output of the model suggests that there are indeed differences in the
*average* length of cuckoo eggs between different foster parents. Note, however,
that in the standard output of `lm()`, the p-values are not adjusted for
multiple testing (a topic which we will touch upon later). In addition, this
model only shows the differences between foster types 2-6 and the **reference
type** (1), represented by the intercept.
The p-value of the F-test, given at the bottom of the summary, corresponds to a
**one-way ANOVA** where the full model is compared to a reduced model containing
only the intercept. Thus, it is testing the *omnibus* null hypothesis that all
slope coefficients ($\beta_1$ - $\beta_5$) are equal to 0.
## Check assumptions
Before we may proceed with the analysis, we must make sure that all
assumptions for ANOVA are met. ANOVA has three assumptions:
1. The observations are independent of each other (in all groups)
2. The data (length) must be normally distributed (in all groups)
3. The variability within all groups is similar
The first assumption is met, as we may assume that there are no
specific patterns of correlation between the randomly selected nests.
To check the normality assumption, we will use QQ plots.
```{r}
plot(mod, which = 2)
```
```{r}
Cuckoo %>%
ggplot(aes(sample = length)) +
geom_qq() +
geom_qq_line() +
facet_grid(~type)
```
There seem to be no clear deviations from normality.
The third assumption of equal variances seems to be met based on the
visualization with the boxplots (see above).
As such, we may proceed with the ANOVA analysis.
### Simulate to train your skills in assessing assumptions
```{r}
set.seed(1031)
sigma <- mod %>% sigma()
dataHlp <- Cuckoo
simModels <- list()
plotList <- list()
plotListQQ1 <- list()
par(mfrow = c(3, 3))
for (i in 1:9)
{
nobs <- Cuckoo %>% nrow()
dataHlp$ySim <- mod$fit + rnorm(nobs, sd = sigma)
simModels[[i]] <- lm(ySim ~ type, dataHlp)
plot(simModels[[i]], which = 2)
plotList[[i]] <- dataHlp %>%
ggplot(aes(type, ySim)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter()
plotListQQ1[[i]] <- dataHlp %>%
filter(type == "meadow") %>%
ggplot(aes(sample = ySim)) +
geom_qq() +
geom_qq_line() +
xlab("QQ for meadow")
}
do.call(gridExtra::grid.arrange, plotList)
do.call(gridExtra::grid.arrange, plotListQQ1)
```
The most deviating QQ-plot of the residuals in the nine simulation runs:
```{r}
par(mfrow = c(1, 1))
plot(simModels[[8]], which = 2)
```
## ANOVA model
We perform the ANOVA test using the results of the linear regression model.
Essentially, we test the following null hypothesis:
$$
H_0:\ \beta_1 = \beta_2 = \beta_3 = \beta_4 = \beta_5 = 0
$$
with the alternative being that at least one of the parameters is different from
0.
```{r}
anova(mod)
```
The p-value of the ANOVA analysis is extremely significant
(p-value < 0.001),
so we reject the null hypothesis that the mean
egg length is equal between the different bird types.
We can say that the mean egg length is significantly different
between at least two bird types on the 5% significance level.
Based on this analysis, we do not yet know between which particular
bird types there is a significant difference. To study this, we will
perform a **Tukey post-hoc analysis**.
## Post-hoc analysis
We will perform a post-hoc analysis, to look at the difference
in egg length between each pairwise comparison of bird types.
Importantly, with this strategy, the p-values will be correctly
adjusted for multiple testing.
The null hypothesis for each pairwise test is that there is no
difference in the mean egg length between both bird types.
The alternative hypothesis for each pairwise test states that there
is indeed a difference in the mean egg length between both bird types.
We will also calculate the confidence interval on the mean differences.
The post-hoc analysis is carried out using the `glht()` ("General Linear
Hypotheses") function from the *multcomp* package. We apply this to our linear
model object `mod` and specify in the `linfct=` argument that we want to perform
*multiple comparisons* (`mcp`), using Tukey's method. We store the output of
`glht()` and then display the results using the `summary()` function and
calculating confidence intervals.
```{r}
mcp <- glht(mod, linfct = mcp(type = "Tukey"))
summary(mcp)
confint(mcp)
```
We can also visualize the confidence interval by calling `plot()` on the `mcp` object:
```{r}
plot(mcp)
```
## Conclusion
The association between the mean length of a cuckoo's egg and the foster bird species is extremely significant (p << 0.001).
The mean length of cuckoo's eggs in nests of the Eurasian
wren are smaller as compared to those from all other bird
species in the study:
- the meadow pipit (adjusted p-value = `r summary(mcp)$test$pvalues[names(summary(mcp)$test$tstat) == "wren - meadow"] %>% format(digits=1)`, mean difference = `r round(confint(mcp)$confint["wren - meadow",1],1)`mm, 95% CI [`r round(confint(mcp)$confint["wren - meadow",-1],1)`])
- the tree pipit (adjusted p-value = `r summary(mcp)$test$pvalues[names(summary(mcp)$test$tstat) == "wren - tree"] %>% format(digits=1)`, mean difference = `r round(confint(mcp)$confint["wren - tree",1],1)`mm, 95% CI [`r round(confint(mcp)$confint["wren - tree",-1],1)`])
- the dunnock (adjusted p-value = `r summary(mcp)$test$pvalues[names(summary(mcp)$test$tstat) == "wren - dunnock"] %>% format(digits=1)`, mean difference = `r round(confint(mcp)$confint["wren - dunnock",1],1)`mm, 95% CI [`r round(confint(mcp)$confint["wren - dunnock",-1],1)`])
- the European robin (adjusted p-value = `r summary(mcp)$test$pvalues[names(summary(mcp)$test$tstat) == "wren - robin"] %>% format(digits=1)`, mean difference = `r round(confint(mcp)$confint["wren - robin",1],1)`mm, 95% CI [`r round(confint(mcp)$confint["wren - robin",-1],1)`])
- the white wagtail (adjusted p-value = `r summary(mcp)$test$pvalues[names(summary(mcp)$test$tstat) == "wren - wagtail"] %>% format(digits=1)`, mean difference = `r round(confint(mcp)$confint["wren - wagtail",1],1)`mm, 95% CI [`r round(confint(mcp)$confint["wren - wagtail",-1],1)`])
There is no evidence taht the mean length of the cuckoo bird's eggs for all
other pairwise comparisons between foster bird species are significantly
different at the 5% level.