-
Notifications
You must be signed in to change notification settings - Fork 16
/
StatsRNAseq_Couturier.Rmd
233 lines (187 loc) · 7.05 KB
/
StatsRNAseq_Couturier.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
---
title: "Statistical analysis of RNAseq data"
author: "D.-L. Couturier and O. Rueda"
date: '`r format(Sys.time(), "Last modified: %d %b %Y")`'
output:
html_document:
theme: united
highlight: tango
code_folding: show
toc: true
toc_depth: 2
toc_float: true
fig_width: 8
fig_height: 6
---
<!--- rmarkdown::render("/Volumes/Files/courses/cruk/RNAseqWithR/202007B/tex/sc/StatsRNAseq_Couturier.Rmd") --->
<!--- rmarkdown::render("~/courses/cruk/RNAseqWithR/202007B/tex/sc/StatsRNAseq_Couturier.Rmd") --->
```{r message = FALSE, warning = FALSE, echo = FALSE}
# IGNORE THIS: START
if(exists(".id")){
setOutputColors(normal = 17, number = 130, negnum = 21, date = 200,
string = 0, const = 13, stderror=154,
warn = c(1, 0, 1), error = c(1,15), true = 2, false = 196,
infinite = c(1,0,196), zero = c(230,230,16), verbose = TRUE,
zero.limit = 1e-12)
}
# IGNORE THIS: START
```
# Section 1: Contrast matrices
## One 2-level factor:
```{r message = FALSE, warning = FALSE, echo = TRUE}
one2levelfactor = data.frame(
sample = 1:6,
condition = rep(c("TreatmentA", "Control"), 3),
stringsAsFactors = TRUE)
# model without intercept and default levels:
X1 = model.matrix(~ condition - 1, data = one2levelfactor)
X1
# model with intercept and default levels
X2 = model.matrix(~ condition, data = one2levelfactor)
X2
# matrix multiplication: let s assume a mean of 2 for control and of 3 for group A
X1%*%c(2,3)
X2%*%c(2,1)
```
## One 3-level factor:
```{r message = FALSE, warning = FALSE, echo = TRUE}
one3levelfactor = data.frame(
sample = 1:6,
condition = rep(c("TreatmentA", "TreatmentB", "Control"), 2),
stringsAsFactors = TRUE)
# model without intercept and default levels:
X1 = model.matrix(~ condition - 1, data = one3levelfactor)
X1
# model with intercept and default levels
X2 = model.matrix(~ condition, data = one3levelfactor)
X2
# model with intercept and self-defined levels
one3levelfactor$condition2 = factor(one3levelfactor$condition,
levels=c("TreatmentB","TreatmentA","Control"))
X3 = model.matrix(~ condition2, data = one3levelfactor)
X3
# matrix multiplication: let s assume a mean of
# > 2 for control
# > 3 for group A
# > 4 for group B
# ! CHALLENGE !
# ! FIND the values of the BETA vector corresponding to X1, X2 and X3 !
```
## Two 2-level factors:
```{r message = FALSE, warning = FALSE, echo = TRUE}
# create dataset
two2levelfactor = data.frame(
sample = 1:8,
treatment = rep(c("TreatA","NoTreat"),4),
er = rep(c("+","-"),each=4),
stringsAsFactors = TRUE)
# design matrix without interaction
X1 = model.matrix(~ treatment + er, data=two2levelfactor)
X1
# design matrix with interaction
X2 = model.matrix(~ treatment * er, data=two2levelfactor)
X2
model.matrix(~ treatment + er + treatment:er, data=two2levelfactor)
# matrix multiplication: let s assume a mean of
# > 2 for control and ER-
# > 3 for group A and ER-
# > 4 for control and ER+
# > 6 for group A and ER+
# ! CHALLENGE !
# ! FIND the values of the BETA vector corresponding to X1, X2 and X3 !
```
# Section 2: DESeq2
## Introduction slide
Let's generate
* *cnts*, a toy matrix of counts of 1000 genes for 20 samples,
* *cond*, a vector indicating to which condition each sample belongs (1 for treatment 1, 2 for treatment 2),
```{r message = FALSE, warning = FALSE, echo = TRUE}
set.seed(777)
cnts <- matrix(rnbinom(n=20000, mu=100, size=1/.25), ncol=20)
cond <- factor(rep(1:2, each=10))
```
Let's
* combine the count matrix, the sample information and the assumed model in an object of class *DESeqDataSet*,
* perform the DE analysis via the function *DESeq*
* print the results
```{r message = FALSE, warning = FALSE, echo = TRUE}
library(DESeq2)
dds <- DESeqDataSetFromMatrix(cnts, DataFrame(cond), ~ cond)
dds <- DESeq(dds)
results(dds)
```
## Section 2 slides dedicated to dispersion
Let's print the relevant information to deduce the estimated NB distribution assumed for each gene and condition:
```{r message = FALSE, warning = FALSE, echo = TRUE}
mcols(dds)[,c("Intercept","cond_2_vs_1","dispGeneEst","dispFit","dispersion")]
```
Let's reproduce the plot showing the fitted probability mass functions per condition for gene 1:
```{r message = FALSE, warning = FALSE, echo = TRUE}
axe.x = seq(0,400)
f.x1 = dnbinom(axe.x, mu=2^6.90565, size=1/0.274708)
f.x2 = dnbinom(axe.x, mu=2^(6.90565-0.682067), size=1/0.274708)
par(mfrow=c(1,1),mar=c(4,4,0,0))
ylimw = max(c(f.x1,f.x2))
plot(1,1,ylim=c(0,ylimw),xlim=c(0,max(axe.x)),pch="",xlab="Counts",ylab="Probability",
axes=FALSE)
lines(axe.x,f.x1,col=.cruk$col[1])
lines(axe.x,f.x2,col=.cruk$col[3])
axis(1,pos=0)
axis(2,las=2,pos=0)
legend("topright",bg="light gray",lty=1,col=.cruk$col[c(1,3)],
legend=c("Condition 1","Condition 2"),title="Estimated distributions",box.lwd=NA)
abline(v=2^6.90565,col=.cruk$col[1],lty=3)
abline(v=2^(6.90565-0.682067),col=.cruk$col[3],lty=3)
```
# Section 3: Large Scale Hypothesis testing: FDR
When we are doing thousands of tests for differential expression, the overall significance level of a test is very difficult to control. Let's see why:
First, we simulate 40,000 genes not differentially expressed (with a mean of zero). We assume that we have 10 replicates of this experiment:
```{r}
N <- 40000
R <- 10
X <- matrix(rnorm(N* R, 0, 1), nrow=N)
```
Now we assume that we run a t-test under the null hypothesis that the mean is zero for each of these genes, that is each row in the matrix:
```{r}
t.test(X[1,])$p.value
pvals <- apply(X, 1, function(y) t.test(y)$p.value)
```
Because we have generated this data with mean zero, we know that none of these genes are differentially expressed, so we would like to be able to not reject any of the hypothesis. However, if you choose a significance level of 0.05 we get
```{r}
sum(pvals<0.05)
```
Too many rejections!!!
In fact, if we look at the distributions of the p-values obtained we get:
```{r}
hist(pvals)
```
That is, if the null hypothesis is true, the p-values will follow a uniform distribution.
This is the key to all methods that aim to control the proportion of false positives amongs the genes that we call differentially expressed. Let's add 1000 genes to our set that are really differentially expressed (mean of 1):
```{r}
df <- 1000
Y <- matrix(rnorm(df* R, 1, 1), nrow=df)
Z <- rbind(X, Y)
pvals <- apply(Z, 1, function(y) t.test(y)$p.value)
#
plot(pvals,col=rep(1:2,c(40000,1000)))
plot(p.adjust(pvals, method="BH"),col=rep(1:2,c(40000,1000)))
#
tapply(p.adjust(pvals, method="BH")<0.05,rep(1:2,c(40000,1000)),mean)
```
Let's look at the distribution of p-values now:
```{r}
hist(pvals)
```
What would be the number of false positives now? How many would we expect if we reject p-values samller than our significance level, 0.05?
```{r}
exp.sig<- (nrow(Z))*0.05
obs.sig <- sum(pvals<0.05)
FDR <- exp.sig / obs.sig
FDR
```
We can compare this with the Benjamini-Hochberg method:
```{r}
pvals.adj <- p.adjust(pvals, method="BH")
plot(pvals, pvals.adj)
abline(v=0.05, col=2)
```