forked from statOmics/PSLS
-
Notifications
You must be signed in to change notification settings - Fork 0
/
08_ExperimentalDesignII_3_kpna2_sol.Rmd
251 lines (183 loc) · 8.94 KB
/
08_ExperimentalDesignII_3_kpna2_sol.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
---
title: "Experimenteel Design II: replication en power - KPNA2"
author: "Lieven Clement, Alexandre Segers and Milan Malfait"
date: "statOmics, Ghent University (https://statomics.github.io)"
---
<a rel="license" href="https://creativecommons.org/licenses/by-nc-sa/4.0"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a>
```{r setup, include=FALSE, cache=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r}
library(ggplot2)
library(tidyverse)
```
# Background
Histologic grade in breast cancer provides clinically important prognostic information. Researchers examined whether histologic grade was associated with gene expression profiles of breast cancers and whether such profiles could be used to improve histologic grading. In this tutorial we will assess the association between histologic grade and the expression of the KPNA2 gene that is known to be associated with poor BC prognosis.
The patients, however, do not only differ in the histologic grade, but also on their lymph node status.
The lymph nodes were not affected (0) or chirugically removed (1).
- Redo data analysis (you can copy the results of the tutorial on multiple linear regression)
- What is the power to pick up each of the contrasts when their real effect sizes would be equal to the effect sizes we observed in the study?
- How does the power evolves if we have 2 upto 10 repeats for each factor combination of grade and node when their real effect sizes would be equal to the ones we observed in the study?
- What is the power to pick up each of the contrasts when the FC for grade for patients with unaffected lymph nodes equals 1.5 ($\beta_g = log2(1.5)$)?
# Data analysis
## Import KPNA2 data in R
```{r}
kpna2 <- read.table("https://raw.githubusercontent.com/statOmics/SGA21/master/data/kpna2.txt", header = TRUE)
kpna2
```
Because histolic grade and lymph node status are both categorical variables, we model them both as factors.
```{r}
kpna2$grade <- as.factor(kpna2$grade)
kpna2$node <- as.factor(kpna2$node)
```
```{r}
kpna2 %>%
ggplot(aes(x = node:grade, y = gene, fill = node:grade)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter()
```
As discussed in a previous exercise, it seems that there is both an effect of histologic grade and lymph node status on the gene expression. There also seems to be a different effect of lymph node status on the gene expression for the different histologic grades.
As we saw before, we can model this with a model that contains both histologic grade, lymph node status and the interaction term between both. When checking the linear model assumptions, we see that the variance is not equal. Therefore we model the gene expression with a log2-transformation, which makes that all the assumptions of the linear model are satisfied.
```{r}
fit <- lm(log2(gene) ~ grade * node, data = kpna2)
plot(fit)
```
```{r}
library(car)
Anova(fit, type = "III")
```
When checking the significance of the interaction term, we see that it is significant on the 5% significance level. We therefore keep the full model.
As we are dealing with a factorial design, we can calculate the mean gene expression for each group by the following parameter summations.
```{r}
ExploreModelMatrix::VisualizeDesign(kpna2, ~ grade * node)$plotlist
```
The researchers want to know the power for testing following hypotheses (remark that we will have to adjust for multiple testing):
- Log fold change between histologic grade 3 and histologic grade 1 for patients with unaffected lymph nodes (=0).
$$H_0: \log_2{FC}_{g3n0-g1n0} = \beta_{g3} = 0$$
- Log fold change between histologic grade 3 and histologic grade 1 for patients with removed lymph nodes (=1).
$$H_0: \log_2{FC}_{g3n1-g1n1} = \beta_{g3} + \beta_{g3n1} = 0$$
- Log fold change between unaffected and removed lymph nodes for patients of histologic grade 1.
$$H_0: \log_2{FC}_{g1n1-g1n0} = \beta_{n1} = 0$$
- Log fold change between unaffected and removed lymph nodes for patients of histologic grade 3.
$$H_0: \log_2{FC}_{g3n1-g3n0} = \beta_{n1} + \beta_{g3n1} = 0$$
- Difference in log fold change between patients of histological grade 3 and histological grade 1 with removed lymph nodes and log fold change between patients between patients of histological grade 3 and histological grade 1 with unaffected lymph nodes.
$$H_0: \log_2{FC}_{g3n1-g1n1} - \log_2{FC}_{g3n0-g1n0} = \beta_{g3n1} = 0$$ which is an equivalent hypotheses with $$H_0: \log_2{FC}_{g3n1-g3n0} - \log_2{FC}_{g1n1-g1n0} = \beta_{g3n1} = 0$$
We can test this using multcomp, which controls for multiple testing.
```{r}
library(multcomp)
mcp <- glht(fit, linfct = c(
"grade3 = 0",
"grade3 + grade3:node1 = 0",
"node1 = 0",
"node1 + grade3:node1 = 0",
"grade3:node1 = 0"
))
summary(mcp)
```
We get a significant p-value for the first, second, third and fifth hypothesis. The fourth hypothesis is not significant at the overall 5% significance level.
# Power of the tests for each of the contrasts
## Simulation function
Function to simulate data similar to that of our experiment under our model assumptions.
```{r}
simFastMultipleContrasts <- function(form, data, betas, sd, contrasts, alpha = .05, nSim = 10000, adjust = "holm") {
ySim <- rnorm(nrow(data) * nSim, sd = sd)
dim(ySim) <- c(nrow(data), nSim)
design <- model.matrix(form, data)
ySim <- ySim + c(design %*% betas)
ySim <- t(ySim)
### Fitting
fitAll <- limma::lmFit(ySim, design)
### Inference
varUnscaled <- t(contrasts) %*% fitAll$cov.coefficients %*% contrasts
contrasts <- fitAll$coefficients %*% contrasts
seContrasts <- matrix(diag(varUnscaled)^.5, nrow = nSim, ncol = 5, byrow = TRUE) * fitAll$sigma
tstats <- contrasts / seContrasts
pvals <- pt(abs(tstats), fitAll$df.residual, lower.tail = FALSE) * 2
pvals <- t(apply(pvals, 1, p.adjust, method = adjust))
return(colMeans(pvals < alpha))
}
```
## power of current experiment
```{r}
nSim <- 20000
betas <- fit$coefficients
sd <- sigma(fit)
contrasts <- matrix(0, nrow = 4, ncol = 5)
rownames(contrasts) <- names(fit$coefficients)
colnames(contrasts) <- c("graden0", "graden1", "nodeg1", "nodeg3", "interaction")
contrasts[2, 1] <- 1
contrasts[c(2, 4), 2] <- 1
contrasts[3, 3] <- 1
contrasts[c(3, 4), 4] <- 1
contrasts[4, 5] <- 1
form <- ~ grade * node
power1 <- simFastMultipleContrasts(form, kpna2, betas, sd, contrasts, nSim = nSim)
power1
```
We observe large powers for all contrasts, exept for contrast nodeg3, which has a small effect size.
## Power for increasing sample size
```{r}
nSim <- 20000
betas <- fit$coefficients
sd <- sigma(fit)
contrasts <- matrix(0, nrow = 4, ncol = 5)
rownames(contrasts) <- names(fit$coefficients)
colnames(contrasts) <- c("graden0", "graden1", "nodeg1", "nodeg3", "interaction")
contrasts[2, 1] <- 1
contrasts[c(2, 4), 2] <- 1
contrasts[3, 3] <- 1
contrasts[c(3, 4), 4] <- 1
contrasts[4, 5] <- 1
form <- ~ grade * node
powers <- matrix(NA, nrow = 9, ncol = 6)
colnames(powers) <- c("n", colnames(contrasts))
powers[, 1] <- 2:10
dataAllComb <- data.frame(
grade = rep(c(1, 3), each = 2) %>% as.factor(),
node = rep(c(0, 1), 2) %>% as.factor()
)
for (i in 1:nrow(powers))
{
predData <- data.frame(
grade = rep(dataAllComb$grade, powers[i, 1]),
node = rep(dataAllComb$node, powers[i, 1])
)
powers[i, -1] <- simFastMultipleContrasts(form, predData, betas, sd, contrasts, nSim = nSim)
}
powers
```
```{r}
powers %>%
as.data.frame() %>%
gather(contrast, power, -n) %>%
ggplot(aes(n, power, color = contrast)) +
geom_line()
```
## Power when FC for grade in patients with unaffected lymph nodes equals 1.5
```{r}
nSim <- 20000
betas2 <- fit$coefficients
betas2["grade3"] <- log2(1.5)
sd <- sigma(fit)
contrasts <- matrix(0, nrow = 4, ncol = 5)
rownames(contrasts) <- names(fit$coefficients)
colnames(contrasts) <- c("graden0", "graden1", "nodeg1", "nodeg3", "interaction")
contrasts[2, 1] <- 1
contrasts[c(2, 4), 2] <- 1
contrasts[3, 3] <- 1
contrasts[c(3, 4), 4] <- 1
contrasts[4, 5] <- 1
form <- ~ grade * node
power3 <- simFastMultipleContrasts(form, kpna2, betas2, sd, contrasts, nSim = nSim)
power3
```
We observe a drop in the power for all the contrasts!?
We only would expect to loose power for the contrasts assessing the grade effect because both fold changes have been decreased.
The reason why all powers are affected is because we also do multiple testing and the multiple testing using the pvalue correction using Holm's method for a particular contrast is also depending on the order of significance of the tests within each simulation.
When we do both simulations without correcting for multiple testing (adjust = "none") we see that changing the size of the fold change for grade for patients with unaffected lymph nodes only impacts the power for the contrasts for assessing the grade effect.
```{r}
power1None <- simFastMultipleContrasts(form, kpna2, betas, sd, contrasts, nSim = nSim, adjust = "none")
power3None <- simFastMultipleContrasts(form, kpna2, betas2, sd, contrasts, nSim = nSim, adjust = "none")
power1None
power3None
```