forked from statOmics/PSLS
-
Notifications
You must be signed in to change notification settings - Fork 0
/
05_1_diabetes_sol.Rmd
310 lines (220 loc) · 10.2 KB
/
05_1_diabetes_sol.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
---
title: "Exercise 5.1: Hypothesis testing on the diabetes example - solution"
author: "Lieven Clement, Jeroen Gilis and Milan Malfait"
date: "statOmics, Ghent University (https://statomics.github.io)"
---
# Aims
In this exercise, you will revisit the basics of statistical hypothesis testing.
You will acquire the skills
1. to assess the assumptions of a one-sample and a paired t-test in data exploration.
2. to conduct a paired t-test in R and to interpret the results.
3. to conduct a one-sample t-test in R and to interpret the results.
# The diabetes dataset
The `diabetes` dataset holds information on a small experiment with
8 patients that are subjected to a glucose tolerance test.
Patients had to fast for eight hours before the test.
When the patients entered the hospital their baseline glucose level was measured (mmol/l).
Patients then had to drink 250 ml of a syrupy glucose solution containing 100
grams of sugar. Two hours later, their blood glucose level was measured again.
The data consist of three variables:
- `before`: glucose concentration upon 8 hours of fasting (mmol/l)
- `after`: glucose concentration 2 hours after drinking glucose solution (mmol/l).
- `patient`: identifier for the patient
## Research questions
The goal is to answer two research questions;
1. Is the average glucose level after the sugar intake different from the average glucose level before sugar intake?
2. Is the average glucose level of the patients two hours after the sugar intake higher than 7.8 mmol/L?
## Import the data
First, load the required R libraries:
```{r, message=FALSE, warning=FALSE}
library(tidyverse)
```
```{r}
diabetes <- read_delim("https://raw.githubusercontent.com/statOmics/PSLSData/main/diabetes.txt", delim = " ")
```
# Data exploration
We will start with a data exploration.
Have a first look at the raw data. How is the data structured?
Is this data *tidy*?
```{r}
glimpse(diabetes)
head(diabetes)
```
We have 8 patients, for which glucose concentrations were measured, before and after sugar intake.
Note, that the dataset is not in the tidy format. The glucose concentration
variable is spread around 2 columns: `before` and `after`, while the "time"
variable is encoded in the column names instead of in a dedicated column. Data
in this form is also called *wide* data. Instead, we want to transform the data
to a *long* format.
To tidy the data, we can use the `gather()` function to
[pivot](https://r4ds.had.co.nz/tidy-data.html#pivoting) the data. In this case,
we want to "gather" the `time` (encoded in the column names `before` and
`after`) and `concentration` variables (which is encoded in the actual values).
The `patient` column should stay the same. We can specify this with the
following syntax.
```{r}
diabetes_tidy <- diabetes %>%
gather(key = "time", value = "concentration", -patient)
## Take a look at the new data
glimpse(diabetes_tidy)
diabetes_tidy
```
To improve the visualizations below, we can also sligthly recode the `time` variable, so that the "before" level is the first one. We can do this by converting `time` to a factor.
```{r}
diabetes_tidy <- diabetes_tidy %>%
mutate(
time = as.factor(time),
time = relevel(time, ref = "before")
)
```
We could visualize the data using boxplots. Note, however, that we lose the
paired characteristic of the data with this visualization. Nevertheless, it is
still a useful visualization to assess any differences in mean and variance
between the two time points.
```{r}
boxplot <- diabetes_tidy %>%
ggplot(aes(x = time, y = concentration, fill = time)) +
theme_bw() +
geom_boxplot(outlier.shape = NA) +
geom_jitter(width = 0.2) +
stat_summary(
fun = mean, geom = "point",
shape = 5, size = 3,
color = "black"
) +
ylab("Concentration (mmol/l)")
boxplot
```
We observe that the variability of the glucose levels are higher after than
before the sugar intake. The difference plot also shows that glucose levels in
the blood are higher for most of the patients two hours after the sugar intake.
Only for one patient the glucose concentration two hours after treatment is
lower.
To visualize the paired nature of the data, we can use a **line plot**,
connecting the `before` and `after` points for each patient.
```{r}
diabetes_tidy %>%
ggplot(aes(time, concentration)) +
geom_point() +
geom_line(aes(group = patient))
```
# Question 1
Is the glucose level after 8 hours of fasting on average different from the glucose level two hours after intake of 100g of glucose?
As the data are paired, we can expect that the measurements before and after the glucose intake are correlated.
We can illustrate this with a scatterplot.
Note that in this case, the non-tidy data is actually a more suitable format to visualize.
```{r}
diabetes %>%
ggplot(aes(x = before, y = after)) +
geom_point() +
ylab("before (mmol/l)") +
xlab("after (mmol/l)")
```
```{r}
cor(diabetes$before, diabetes$after, method = "spearman")
```
## Check the assumptions
The paired t-test has 2 assumptions:
1. The differences in glucose levels are independent of each other.
2. The differences in glucose levels are normally distributed.
The first assumption is met given the experimental design.
Upon differencing the measurements before and after the glucose intake, we obtained one difference for each patient. As the patients were sampled at random from the population we can expect them to be independent.
Secondly, we assess if the differences are normally distributed.
We can calculate the differences for each patient using a `group_by()` +
`summarize()` combination on the tidy data.
```{r}
diabetes_diff <- diabetes_tidy %>%
group_by(patient) %>%
summarize(difference = diff(concentration))
diabetes_diff
```
```{r}
diabetes_diff %>%
ggplot(aes(sample = difference)) +
geom_qq() +
geom_qq_line()
```
Based on the QQ-plot, we may assume that our data are normally distributed.
As our assumptions are met we may continue with performing the paired t-test.
## Hypothesis test
We will perform a **paired t-test**.
- The null hypothesis of the test is that the glucose levels before and two hours after
glucose intake are on average equal.
- Which will be tested against the alternative hypothesis that the glucose levels before and two hours after the glucose intake are on average different.
```{r}
paired_t <- t.test(diabetes$after,diabetes$before, paired = TRUE)
paired_t
```
## Conclusion
There is on average an significant blood increase in the blood pressure upon
administering 100g of sugar to patients (p = `r round(paired_t$p.value, 2)`).
The glucose levels two hours after administering 100g of glucose are on average `r round(paired_t$estimate,1)` mmol/l higher than upon 8 hours of fastening
(95% CI [`r round(paired_t$conf.int,1)`]mmol/l).
# Alternative solution: One-sample t-test on the difference
Since the data is paired, we can also simply calculate the differences in
glucose level before and after sugar intake for each patient. We can then
perform a one-sample t-test on these differences, testing whether they are
significantly different from zero. This is equivalent to the paired t-test we
performed above.
We can verify this with the analysis below
```{r}
t.test(difference ~ 1, data = diabetes_diff, mu = 0)
```
Indeed, the output is equivalent to that of the paired t-test.
***
# Question 2
Is the average glucose level two hours after sugar intake
higher than the threshold of 7.8 mmol/l for pre-diabetes?
We can test this hypothesis using a **one-sample t-test**.
Indeed we are interested to compare the average glucose level to a known threshold for pre-diabetes.
## Assess the assumptions
Before we can perform a one sample t-test, we must check that the required
assumptions are met!
1. The observations are independent
2. The glucose levels two hours after the treatment are normally distributed
The first assumption requires us to think about how the data were collected.
Are there dependences in the data?
The 8 glucose levels upon sugar intake are collected on 8 random patients so we can assume that the glucose levels 2 hours after sugar intake are independent.
We can assess the second assumption with a quantile-quantile plot. First we need to filter the data, so we only retain the values for the `time == "after"` level
```{r}
diabetes_after <- diabetes_tidy %>%
filter(time == "after")
```
```{r}
diabetes_after %>%
ggplot(aes(sample = concentration)) +
geom_qq() +
geom_qq_line()
```
The data seem to be nicely scattered around the quantile-quantile
line (black line).
We also do not observe large deviations in the plot.
So we will assume that our data is normally
distributed.
Note, that there are not many observations to assess normality.
## Hypothesis test
Here, we will test if mean glucose level 2 hours after sugar intake is significantly
higher than the threshold for pre-diabetes of 7.8 mmol/l. More specifically, we will test the null hypothesis;
$H_0:$ the mean glucose level two hours after 100g of sugar intake is equal to 7.8 mmol/l
versus the alternative hypothesis;
$H_1:$ the mean glucose level two hours after 100g of sugar intake is greater than 7.8 mmol/l
```{r}
t_test_after <- t.test(concentration ~ 1,
data = diabetes_after,
mu = 7.8,
alternative = "greater",
conf.level = 0.95
)
t_test_after
```
## Conclusion
When writing a conclusion on your research hypothesis,
it is very important to be precise, concise, and complete.
An example of such a conclusion for our research question
is given below:
We can conclude that the mean glucose level of patients two hours after 100g of glucose intake is not significantly higher than the threshold for pre-diabetes of 7.8 mmol/l (p = `r round(t_test_after$p.value,2)`).
The mean glucose level two hours after the intake of 100g of sugar equals `r round(t_test_after$estimate,1)` mmol/l (95% CI [`r round(t_test_after$conf.int,1)`]).
As we have seen in the theory class, the 95% confidence
interval can be interpreted as;
With 95% confidence we can conclude that the true average of glucose level of the patients in the population is above `r round(t_test_after$conf.int[1],2)`.
Note, that we only report a one sided confidence interval because we only test against the alternative hypothesis that the glucose level two hours after sugar intake is on average larger than the threshold for pre-diabetes of 7.8 mmol/l.