08_ExperimentalDesignII_3_kpna2_sol.Rmd

---
title: "Experimenteel Design II: replication en power - KPNA2"
author: "Lieven Clement, Alexandre Segers and Milan Malfait"
date: "statOmics, Ghent University (https://statomics.github.io)"
---


<a rel="license" href="https://creativecommons.org/licenses/by-nc-sa/4.0"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a>


```{r setup, include=FALSE, cache=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```


```{r}
library(ggplot2)
library(tidyverse)
```

# Background

Histologic grade in breast cancer provides clinically important prognostic information. Researchers examined whether histologic grade was associated with gene expression profiles of breast cancers and whether such profiles could be used to improve histologic grading. In this tutorial we will assess the association between histologic grade and the expression of the KPNA2 gene that is known to be associated with poor BC prognosis.
The patients, however, do not only differ in the histologic grade, but also on their lymph node status.
The lymph nodes were not affected (0) or chirugically removed (1).

- Redo data analysis (you can copy the results of the tutorial on multiple linear regression)
- What is the power to pick up each of the contrasts when their real effect sizes would be equal to the effect sizes we observed in the study?
- How does the power evolves if we have 2 upto 10 repeats for each factor combination of grade and node when their real effect sizes would be equal to the ones we observed in the study?
- What is the power to pick up each of the contrasts when the FC for grade for patients with unaffected lymph nodes equals 1.5 ($\beta_g = log2(1.5)$)?


# Data analysis
## Import KPNA2 data in R
```{r}
kpna2 <- read.table("https://raw.githubusercontent.com/statOmics/SGA21/master/data/kpna2.txt", header = TRUE)
kpna2
```

Because histolic grade and lymph node status are both categorical variables, we model them both as factors.

```{r}
kpna2$grade <- as.factor(kpna2$grade)
kpna2$node <- as.factor(kpna2$node)
```

```{r}
kpna2 %>%
  ggplot(aes(x = node:grade, y = gene, fill = node:grade)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter()
```

As discussed in a previous exercise, it seems that there is both an effect of histologic grade and lymph node status on the gene expression. There also seems to be a different effect of lymph node status on the gene expression for the different histologic grades.

As we saw before, we can model this with a model that contains both histologic grade, lymph node status and the interaction term between both. When checking the linear model assumptions, we see that the variance is not equal. Therefore we model the gene expression with a log2-transformation, which makes that all the assumptions of the linear model are satisfied.

```{r}
fit <- lm(log2(gene) ~ grade * node, data = kpna2)
plot(fit)
```

```{r}
library(car)
Anova(fit, type = "III")
```

When checking the significance of the interaction term, we see that it is significant on the 5% significance level. We therefore keep the full model.


As we are dealing with a factorial design, we can calculate the mean gene expression for each group by the following parameter summations.

```{r}
ExploreModelMatrix::VisualizeDesign(kpna2, ~ grade * node)$plotlist
```

The researchers want to know the power for testing following hypotheses (remark that we will have to adjust for multiple testing):

- Log fold change between histologic grade 3 and histologic grade 1 for patients with unaffected lymph nodes (=0).

$$H_0: \log_2{FC}_{g3n0-g1n0} = \beta_{g3} = 0$$

- Log fold change between histologic grade 3 and histologic grade 1 for patients with removed lymph nodes (=1).

$$H_0: \log_2{FC}_{g3n1-g1n1} = \beta_{g3} + \beta_{g3n1} = 0$$

- Log fold change between unaffected and removed lymph nodes for patients of histologic grade 1.

$$H_0: \log_2{FC}_{g1n1-g1n0} = \beta_{n1} = 0$$

- Log fold change between unaffected and removed lymph nodes for patients of histologic grade 3.

$$H_0: \log_2{FC}_{g3n1-g3n0} = \beta_{n1} + \beta_{g3n1} = 0$$

- Difference in log fold change between patients of histological grade 3 and histological grade 1 with removed lymph nodes and log fold change between patients between patients of histological grade 3 and histological grade 1 with unaffected lymph nodes.

$$H_0: \log_2{FC}_{g3n1-g1n1} - \log_2{FC}_{g3n0-g1n0} = \beta_{g3n1} = 0$$ which is an equivalent hypotheses with $$H_0: \log_2{FC}_{g3n1-g3n0} - \log_2{FC}_{g1n1-g1n0} = \beta_{g3n1} = 0$$

We can test this using multcomp, which controls for multiple testing.

```{r}
library(multcomp)

mcp <- glht(fit, linfct = c(
  "grade3 = 0",
  "grade3 + grade3:node1 = 0",
  "node1 = 0",
  "node1 + grade3:node1 = 0",
  "grade3:node1 = 0"
))
summary(mcp)
```

We get a significant p-value for the first, second, third and fifth hypothesis. The fourth hypothesis is not significant at the overall 5% significance level.

# Power of the tests for each of the contrasts

## Simulation function

Function to simulate data similar to that of our experiment under our model assumptions.

```{r}
simFastMultipleContrasts <- function(form, data, betas, sd, contrasts, alpha = .05, nSim = 10000, adjust = "holm") {
  ySim <- rnorm(nrow(data) * nSim, sd = sd)
  dim(ySim) <- c(nrow(data), nSim)
  design <- model.matrix(form, data)
  ySim <- ySim + c(design %*% betas)
  ySim <- t(ySim)

  ### Fitting
  fitAll <- limma::lmFit(ySim, design)

  ### Inference
  varUnscaled <- t(contrasts) %*% fitAll$cov.coefficients %*% contrasts
  contrasts <- fitAll$coefficients %*% contrasts
  seContrasts <- matrix(diag(varUnscaled)^.5, nrow = nSim, ncol = 5, byrow = TRUE) * fitAll$sigma
  tstats <- contrasts / seContrasts
  pvals <- pt(abs(tstats), fitAll$df.residual, lower.tail = FALSE) * 2
  pvals <- t(apply(pvals, 1, p.adjust, method = adjust))

  return(colMeans(pvals < alpha))
}
```

## power of current experiment

```{r}
nSim <- 20000
betas <- fit$coefficients
sd <- sigma(fit)

contrasts <- matrix(0, nrow = 4, ncol = 5)
rownames(contrasts) <- names(fit$coefficients)
colnames(contrasts) <- c("graden0", "graden1", "nodeg1", "nodeg3", "interaction")
contrasts[2, 1] <- 1
contrasts[c(2, 4), 2] <- 1
contrasts[3, 3] <- 1
contrasts[c(3, 4), 4] <- 1
contrasts[4, 5] <- 1

form <- ~ grade * node

power1 <- simFastMultipleContrasts(form, kpna2, betas, sd, contrasts, nSim = nSim)
power1
```

We observe large powers for all contrasts, exept for contrast nodeg3, which has a small effect size.

## Power for increasing sample size

```{r}
nSim <- 20000
betas <- fit$coefficients
sd <- sigma(fit)

contrasts <- matrix(0, nrow = 4, ncol = 5)
rownames(contrasts) <- names(fit$coefficients)
colnames(contrasts) <- c("graden0", "graden1", "nodeg1", "nodeg3", "interaction")
contrasts[2, 1] <- 1
contrasts[c(2, 4), 2] <- 1
contrasts[3, 3] <- 1
contrasts[c(3, 4), 4] <- 1
contrasts[4, 5] <- 1

form <- ~ grade * node

powers <- matrix(NA, nrow = 9, ncol = 6)
colnames(powers) <- c("n", colnames(contrasts))
powers[, 1] <- 2:10

dataAllComb <- data.frame(
  grade = rep(c(1, 3), each = 2) %>% as.factor(),
  node = rep(c(0, 1), 2) %>% as.factor()
)

for (i in 1:nrow(powers))
{
  predData <- data.frame(
    grade = rep(dataAllComb$grade, powers[i, 1]),
    node = rep(dataAllComb$node, powers[i, 1])
  )
  powers[i, -1] <- simFastMultipleContrasts(form, predData, betas, sd, contrasts, nSim = nSim)
}
powers
```

```{r}
powers %>%
  as.data.frame() %>%
  gather(contrast, power, -n) %>%
  ggplot(aes(n, power, color = contrast)) +
  geom_line()
```

## Power when FC for grade in patients with unaffected lymph nodes equals 1.5

```{r}
nSim <- 20000
betas2 <- fit$coefficients
betas2["grade3"] <- log2(1.5)
sd <- sigma(fit)

contrasts <- matrix(0, nrow = 4, ncol = 5)
rownames(contrasts) <- names(fit$coefficients)
colnames(contrasts) <- c("graden0", "graden1", "nodeg1", "nodeg3", "interaction")
contrasts[2, 1] <- 1
contrasts[c(2, 4), 2] <- 1
contrasts[3, 3] <- 1
contrasts[c(3, 4), 4] <- 1
contrasts[4, 5] <- 1

form <- ~ grade * node

power3 <- simFastMultipleContrasts(form, kpna2, betas2, sd, contrasts, nSim = nSim)
power3
```

We observe a drop in the power for all the contrasts!?

We only would expect to loose power for the contrasts assessing the grade effect because both fold changes have been decreased.

The reason why all powers are affected is because we also do multiple testing and the multiple testing using the pvalue correction using Holm's method for a particular contrast is also depending on the order of significance of the tests within each simulation.

When we do both simulations without correcting for multiple testing (adjust = "none") we see that changing the size of the fold change for grade for patients with unaffected lymph nodes only impacts the power for the contrasts for assessing the grade effect.

```{r}
power1None <- simFastMultipleContrasts(form, kpna2, betas, sd, contrasts, nSim = nSim, adjust = "none")
power3None <- simFastMultipleContrasts(form, kpna2, betas2, sd, contrasts, nSim = nSim, adjust = "none")
power1None
power3None
```