forked from statOmics/PSLS
-
Notifications
You must be signed in to change notification settings - Fork 0
/
07_1_lettuce_sol.Rmd
297 lines (232 loc) · 10.1 KB
/
07_1_lettuce_sol.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
---
title: "Exercise 7.1: ANOVA on the lettuce dataset - solution"
author: "Lieven Clement and Jeroen Gilis"
date: "statOmics, Ghent University (https://statomics.github.io)"
---
# Background
In agriculture, it is important to have a high yield of crops. For lettuce
plants, plants with many leaves are known to be preferred by the consumers.
One way to increase the number of leaves (or better, total
leaf weight) is by using a fertilizer. Recently, there has been
a tendency to rely more on natural fertilizers, such as compost.
Near Ghent, the institute of research for agriculture and fishery is testing
new, natural fertilization methods. One of these new fertilizers is called
biochar. Biochar is a residual product from pyrolysis, a process in which
biomass is burned under specific conditions (such as high pressure) in order
to produce energy. Biochar is similar to charcoal, but has some very useful
properties, such as retaining water in the soil. It also has a positive
influence on the soil microbiome.
# The lettuce dataset
The researchers hypothesize that biochar, compost and
a combination of both biochar and compost can have a different influence
on the growth of lettuce plants. To this end, they grew up
lettuce plants in a greenhouse. The pots were filled with
one of four soil types;
1. Soil only (control)
2. Soil supplemented with biochar (refoak)
3. Soil supplemented with compost (compost)
4. Soil supplemented with both biochar and compost (cobc)
The dataset `freshweight_lettuce.txt` contains the freshweight
(in grams) for 28 lettuce plants (7 per condition). The researchers
want to use an ANOVA test to assess if the treatments have a different effect
on the growth of lettuce plants. If so, they will use a post-hoc test
(Tuckey test) to discover which specific treatments have an effect.
Load the required libraries
```{r, message = FALSE, warning=FALSE}
library(tidyverse)
theme_set(theme_bw()) # Change ggplot2 default theme
library(multcomp)
```
# Data import
```{r,message=FALSE}
lettuce <- read_csv("https://raw.githubusercontent.com/statOmics/PSLSData/main/freshweight_lettuce.txt")
# Take a glimpse at the data
glimpse(lettuce)
```
First, we will set the `treatment` variable to a factor
```{r}
lettuce <- lettuce %>%
mutate(
## Convert treatment to factor
treatment = as.factor(treatment),
## Relevels such that "control" is the reference
treatment = relevel(treatment, ref = "control")
)
```
# Data exploration
```{r}
## Count the number of observations per treatment
lettuce %>%
count(treatment)
```
Now let's make a boxplot displaying the freshweight of each treatment condition:
```{r}
boxplot <- lettuce %>%
ggplot(aes(x = treatment, y = freshweight, fill = treatment)) +
scale_fill_brewer(palette = "RdGy") +
geom_boxplot(outlier.shape = NA) +
geom_jitter(width = 0.2) +
ggtitle("Boxplot of the freshweigth for each treatment condition") +
ylab("freshweight (gram)") +
stat_summary(
fun = mean, geom = "point",
shape = 5, size = 3, color = "black"
)
boxplot
```
Note that there are no clear outliers in the data.
We can see that the mean freshweight is very comparable
between the control and refoak treatments and between the
compost and cobc treatments. We can also see that the mean
freshweight is much higher in the cobc and control treatments
than in the control and refoak treatments. But is this
observed difference significant?
# ANOVA
To study if the observed difference between the
average freshweight values of the different treatment groups
are significant, we may perform an ANOVA.
## Formulate null and alternative hypotheses
The null hypothesis of ANOVA states that:
$H0$: The mean freshweigth is equal between the different treatment groups.
The alternative hypothesis of ANOVA states that:
$HA$: The mean freshweigth for at least one treatment group is different
from the mean freshweight in at least one other treatment group.
## Checking the assumptions for ANOVA
Before we may proceed with the analysis, we must make sure that all
assumptions for ANOVA are met. ANOVA has three assumptions:
1. The observations are independent of each other (in all groups)
2. The data (freshweigth) must be normally distributed (in all groups)
3. The variability within all groups is similar
### Assumption of independence
The first assumption is met; we started of with 28 lettuce plants and
we randomly submitted them to one of four treatment conditions. There
is no reason to believe that the plants display systematic differences
between treatment groups, other than the actual treatment.
### Assumption of normality
For the second assumption, we must check normality in each group.
```{r}
ggplot(lettuce, aes(sample = freshweight)) +
geom_qq() +
geom_qq_line(aes(col = treatment), show.legend = FALSE) +
scale_color_brewer(palette = "Dark2") +
facet_wrap(vars(treatment))
```
According to these plots, the normality assumption is met
for each group. However, we need to be careful with relying on
these plots, as they are only based on 7 observations per group.
Indeed, checking the assumptions of normality (and equal variances)
on such small datasets can be tricky. For instance, when
sampling 7 datapoints from data that has an underlying distribution
that is not normally distributed, the small sample may be normally
distributed by chance. Equivalently, when sampling 7 datapoints
from a normal distribution, the small sample may appear not normally
distributed. We can show this as follows.
Simulate from uniform distribution:
```{r}
set.seed(5)
## sample 7 observations from a uniform distribution, repeat 6 times
df <- data.frame(x = runif(7 * 6), rep = rep(1:6, each = 7))
ggplot(df, aes(sample = x)) +
geom_qq() +
geom_qq_line(col = "darkorchid") +
facet_wrap(vars(rep)) +
ggtitle("QQ plots for uniformly distributed data")
```
Even though the data originates from a uniform distribution, several
samples meet the normality requirements by chance.
Sample from normal distribution:
```{r}
set.seed(5)
## sample 7 observations from a normal distribution, repeat 6 times
df <- data.frame(x = rnorm(7 * 6), rep = rep(1:6, each = 7))
ggplot(df, aes(sample = x)) +
geom_qq() +
geom_qq_line(col = "dodgerblue") +
facet_wrap(vars(rep)) +
ggtitle("QQ plots for normally distributed data")
```
Even though the data originates from a normal distribution, some deviations
can be observed due to random chance.
Alternatively, we may generate a qq-plot of the residuals
of a linear model. These residuals should be normally distributed
if the data for each treatment condition is normally distributed.
```{r}
fit <- lm(freshweight ~ treatment, lettuce)
plot(fit, which = 2, col = fit$model$treatment)
legend("bottomright", levels(fit$model$treatment), text.col = 1:4)
```
The Q-Q plot of the residuals shows a slight skew to the left. Care should be
taken when interpreting the results for this dataset. Due to the sample size, a
better alternative might be a non-parametric approach.
### Assumption of equal variances
We can check the assumption of equal variance with a boxplot:
```{r}
boxplot
```
As a measure of variability, we may take the height
of each box. This is the interval between
the 25% and 75% quantile. Here we can see that this
interval, as well as the length of the whiskers, is
approximately equal for most groups. However, the
variability of cobc does seem to be quite a bit larger
than the variability in the refoak group.
With this little observations per group, it is very
difficult to reliably assess the assumptions of normality and equal variances.
In this tutorial, we will assume that all assumptions are met.
**In a later tutorial, "Kruskal_lettuce_plants.Rmd",**
**we will see would happen if we decide to reject these assumptions.**
## Modeling
```{r}
fit <- lm(freshweight ~ treatment, data = lettuce)
fit_anova <- anova(fit)
fit_anova
```
The p-value of the ANOVA analysis is extremely significant
(p-value = `r format(fit_anova$"Pr(>F)"[1],digits=4)`),
so we reject the null hypothesis that the mean
freshweigth is equal for the different treatment groups.
We can say that the mean freshweigth is significantly different
between at least two treatment groups on the 5% significance level.
Based on this analysis, we do not yet know between which particular
groups there is a significant difference. To assess this, we will
perfrom the Tuckey post-hoc analysis.
## Post-hoc analysis
We will perform a post-hoc analysis to look at the difference
in fresweigth between each pairwise comparison of treatment groups.
Importantly, with this strategy, the p-values will be automatically
adjusted for multiple testing.
The null hypothesis for each pairwise test is that
$H0$ there is no difference in the average freshweight values between
both groups.
The alternative hypothesis for each pairwise test states that
$HA$ there is indeed a difference in the average freshweight values
between both groups.
```{r}
mcp <- glht(fit, linfct = mcp(treatment = "Tukey"))
mcp_summary <- summary(mcp)
mcp_summary
# We will also calculate the confidence interval on the mean differences.
mcp_confint <- confint(mcp)
mcp_confint
```
## Conclusion
We have found an extremely significant dependence
(p-value = `r round(fit_anova$"Pr(>F)"[1],digits=12)`),
between the mean freshweigth and the treatment condition
on the global 5% significance level.
The mean freshweight of plants grown with compost is extremely
significantly higher as compared to in control plants (Tuckey test,
adjusted p-value < 0.0001, 20.9g higher, 95% CI: [13.2; 28.5])
and as compared to refoak plants (Tuckey test, adjusted
p-value < 0.0001, 22.7g higher, 95% CI: [15.1; 30.3]).
The mean freshweight of plants grown with cobc is extremely
significantly higher as compared to in control plants (Tuckey test,
adjusted p-value < 0.0001, 16.1g higher, 95% CI: [8.5; 23.8])
and as compared to refoak plants (Tuckey test, adjusted
p-value < 0.0001, 18.0g higher, 95% CI: [10.4; 25.6]).
We do not find enough evidence to claim a difference in mean
freshweight between refoak and control plants or between cobc
and compost plants.
We may conclude that supplementing soil with compost or with
both compost and biochar will have a positive effect on the
freshweigth of lettuce plants.