forked from bioinformatics-core-shared-training/RNAseq-R
-
Notifications
You must be signed in to change notification settings - Fork 1
/
rna-seq-gene-set-testing.Rmd
581 lines (403 loc) · 21.8 KB
/
rna-seq-gene-set-testing.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
---
title: "RNA-seq analysis in R"
author: "Stephane Ballereau, Mark Dunning, Oscar Rueda, Ashley Sawle"
date: '`r format(Sys.time(), "Last modified: %d %b %Y")`'
output:
html_notebook:
toc: yes
toc_float: yes
html_document:
toc: yes
toc_float: yes
minutes: 300
layout: page
subtitle: Gene Set Testing for RNA-seq
bibliography: ref.bib
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
**Original Authors: Belinda Phipson, Anna Trigos, Matt Ritchie, Maria Doyle, Harriet Dashnow, Charity Law**
Based on the course [RNAseq analysis in R](http://combine-australia.github.io/2016-05-11-RNAseq/) delivered on May 11/12th 2016
# Data set
- Organism: mouse
- Tissue: mammary gland
- Three conditions:
- virgin
- pregnant
- lactating
- Two cell types:
- basal stem-cell enriched cells (B)
- committed luminal cells (L)
- Six groups (3 conditions x 2 cell types) with 2 biological replicates per group
- As described in:
- ['EGF-mediated induction of Mcl-1 at the switch to lactation is essential for alveolar cell survival' (Fu et al. 2015)](https://www.ncbi.nlm.nih.gov/pubmed/25730472) published in Nature Cell Biology, with both sequence and counts available from Gene Expression Omnibus database (GEO) under accession number [GSE60450](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE60450)
- [A DE-licious recipe for differential expression analyses of RNA-seq](http://www.statsci.org/smyth/pubs/QLedgeRPreprint.pdf)
# Testing relative to a threshold (TREAT)
When there is a lot of differential expression, sometimes we may want to cut-off on a fold change threshold as well as a p-value threshold so that we follow up on the most biologically significant genes. However, it is not recommended to simply rank by p-value and then discard genes with small logFC's, as this has been shown to increase the false discovery rate. In other words, you are not controlling the false discovery rate at 5\% anymore.
The test performed above tests the null hypothesis that gene expression is the same in classes compared in the contrast of interest, i.e. that fold change is 1 (and log2(FC) is 0). Rather than ranking genes per p-values and then filter on logFC, one should instead test the null hypothesis that the difference in level of expression between classes in the contrast is lower than a given threshold.
See ["Testing significance relative to a fold-change threshold is a TREAT"](https://academic.oup.com/bioinformatics/article/25/6/765/251641/Testing-significance-relative-to-a-fold-change)
## Fit the linear model
Remember how the linear model was fitted in the previous section:
```{r}
fit <- glmFit(dgeObj, design)
```
Load the DGEList object dgeObj saved in the previous session.
```{r}
library(edgeR)
load("Robjects/DE.Rdata")
```
Let's fit the model again:
```{r}
## Read the counts from the downloaded data
seqdata <- read.delim("data/GSE60450_Lactation-GenewiseCounts.txt", stringsAsFactors = FALSE)
# Remove first two columns from seqdata
countdata <- seqdata[,-(1:2)]
# Store EntrezGeneID as rownames
rownames(countdata) <- seqdata[,1]
countdata
colnames(countdata) <- substr(colnames(countdata), 1, 7)
countdata
## Calculate the Counts Per Million measure
myCPM <- cpm(countdata)
## Identify genes with at least 0.5 cpm in at least 2 samples
thresh <- myCPM > 0.5
keep <- rowSums(thresh) >= 2
# Subset the rows of countdata to keep the more highly expressed genes
counts.keep <- countdata[keep,]
## Convert to an edgeR object
dgeObj <- DGEList(counts.keep)
## Perform TMM normalisation
dgeObj <- calcNormFactors(dgeObj)
# Estimate dispersion:
dgeObj <- estimateCommonDisp(dgeObj)
dgeObj <- estimateGLMTrendedDisp(dgeObj)
dgeObj <- estimateTagwiseDisp(dgeObj)
plotBCV(dgeObj)
# Create design matrix:
# Obtain sample information
sampleinfo <- read.delim("data/SampleInfo_Corrected.txt")
group <- paste(sampleinfo$CellType,sampleinfo$Status,sep=".")
group
# Create the two variables
group <- as.character(group)
type <- sapply(strsplit(group, ".", fixed=T), function(x) x[1])
status <- sapply(strsplit(group, ".", fixed=T), function(x) x[2])
ftable(type, status, exclude=c())
# Specify a design matrix without an intercept term:
design <- model.matrix(~ type + status)
design
#colnames(design)
# Fit model:
fit <- glmFit(dgeObj, design)
names(fit)
head(coef(fit))
```
Conduct likelihood ratio tests for the luminal vs basal contrast and show the top genes:
```{r}
# Conduct likelihood ratio tests for luminal vs basal and show the top genes:
# Remember that contrast names are kept in
# - the coef(), see head(coef(fit)),
# - the design matrix, see colnames(design)
lrt.BvsL <- glmLRT(fit, coef=2)
#class((lrt.BvsL)) # "DGELRT"
lrt.BvsL$comparison
# Show top genes:
topTags(lrt.BvsL)
# Access test results kept in the 'table' slot:
head(lrt.BvsL$table)
```
Check counts per million for the most significant genes:
```{r}
# Check counts per million for the most significant genes:
# Order genes by increasing order of p-value
o <- order(lrt.BvsL$table$PValue)
# Show counts per million for the top 10 genes for all samples
round(cpm(dgeObj)[o[1:10],])
# Remember the 'sampleinfo' dataframe keeps samples description
# Maybe plot counts per million for the top gene
barplot(cpm(dgeObj)[o[1],])
```
## MD plot
Create mean-difference plot to show log fold changes (differences) versus average log values (means):
```{r}
# lfc threshold: abs(logFC > 1)
# Create mean-difference ('MD') plot to show log fold changes (differences) versus average log values (means)
# Differentially exressed genes (DEGs) may be color-coded using the 'status' argument.
# Let's color genes according to the outcome of the LRT test.
# Classify the differential expression statistics as up, down or not significant using decideTestsDGE()
# ?decideTestsDGE
tmpDec <- decideTestsDGE(lrt.BvsL, lfc=1)
# Show the number of genes in each of the three classes:
summary(tmpDec)
# Create MD plot with extra blue lines to show log fold-change threshold:
edgeR::plotMD.DGELRT(lrt.BvsL, status=tmpDec); abline(h=c(-1, 1), col="blue")
```
## Testing relative to a threshold (TREAT) with edgeR
Read the manual on glmTreat() to "Conduct genewise statistical tests for a given coefficient or contrast relative to a specified fold-change threshold."
- "‘glmTreat’ implements a test for differential expression relative
to a minimum required fold-change threshold. Instead of testing
for genes which have log-fold-changes different from zero, it
tests whether the log2-fold-change is greater than ‘lfc’ in
absolute value. ‘glmTreat’ is analogous to the TREAT approach
developed by McCarthy and Smyth (2009) for microarrays."
- "‘glmTreat’ detects whether ‘glmfit’ was produced by ‘glmFit’ or
‘glmQLFit’. In the former case, it conducts a modified likelihood
ratio test (LRT) against the fold-change threshold. In the latter
case, it conducts a quasi-likelihood (QL) F-test against the
threshold."
```{r}
?glmTreat
```
We will use the "output from ‘glmFit’" created above.
### Test for abs(logFC) > 0
Let's test for abs(logFC) > 0:
```{r}
# > colnames(design)
#[1] "(Intercept)" "typeluminal" "statuspregnant" "statusvirgin"
res.treat.tlum.fc0 <- glmTreat(fit, "typeluminal", contrast = NULL, lfc = 0, null = "interval")
topTags(res.treat.tlum.fc0)
#coef(res.treat.tlum.fc0)
#class(res.treat.tlum.fc0$table)
#colnames(res.treat.tlum.fc0$table)
head(res.treat.tlum.fc0$table)
```
Number of genes either up- or down-regulated, or not showing significant difference in expression between the two groups:
```{r}
summary(decideTestsDGE(res.treat.tlum.fc0))
```
Draw MD plot:
```{r}
tmpDec.fc0 <- decideTestsDGE(res.treat.tlum.fc0)
edgeR::plotMD.DGELRT(res.treat.tlum.fc0, status=tmpDec.fc0)
abline(h=c(-1, 1), col="blue")
```
### Test for abs(logFC) > 1
Let's test for abs(logFC) > 1:
```{r}
# > colnames(design)
#[1] "(Intercept)" "typeluminal" "statuspregnant" "statusvirgin"
res.treat.tlum.fc1 <- glmTreat(fit, "typeluminal", contrast = NULL, lfc = 1, null = "interval")
# Glance at the top genes:
topTags(res.treat.tlum.fc1)
# Number of genes either up- or down-regulated, or not showing significant difference in expression between the two groups:
tmpDec.fc1 <- decideTestsDGE(res.treat.tlum.fc1)
summary(tmpDec.fc1)
# Draw MD plot:
edgeR::plotMD.DGELRT(res.treat.tlum.fc1, status=tmpDec.fc1)
abline(h=c(-1, 1), col="blue")
```
### Visualise effect of TREAT on DEG list
You may have noticed that fewer genes are highlighted in the MAplot for 'abs(logFC) > 1' than for 'abs(logFC) > 0'.
Let's identify genes flagged as DEG with 'abs(logFC) > 0 then abs(logFC)>1' but not 'abs(logFC) > 1'.
```{r}
# Draw list of DEGs with standard abs(logFC) > 0 followed by logFC filtering:
tmpDec.noTreat.fc1 <- decideTestsDGE(lrt.BvsL, lfc=1)
summary(tmpDec.noTreat.fc1)
# compare decisions between the two tests:
ftable(as.vector(tmpDec.noTreat.fc1), as.vector(tmpDec.fc1), exclude=c())
# Identify genes 'in the abs(FC)>0 set but not in the abs(FC)>1 set':
tmpDec.diff <- as.logical(tmpDec.noTreat.fc1) & ! as.logical(tmpDec.fc1)
table(tmpDec.diff)
edgeR::plotMD.DGELRT(res.treat.tlum.fc1, status=tmpDec.diff)
abline(h=c(-1, 1), col="blue")
```
```{r, results='hide'}
# compare decisions between the two tests:
ftable(as.vector(tmpDec.fc0), as.vector(tmpDec.fc1), exclude=c())
# Identify genes 'in the abs-FC>0 set but not in the abs-FC>1 set':
tmpDec.inFc0outFc1 <- as.logical(tmpDec.fc0) & ! as.logical(tmpDec.fc1)
table(tmpDec.inFc0outFc1)
edgeR::plotMD.DGELRT(res.treat.tlum.fc1, status=tmpDec.inFc0outFc1)
abline(h=c(-1, 1), col="blue")
```
> ## Challenge {.challenge}
>
> Change the cut-off so that we are interested in genes that change at least 50\% on the fold change scale.
>
> HINT: what is the corresponding logFC value of 50\% fold change? Assume basal.pregnant is 50\% higher than basal.lactate
>
# Gene Set Testing
Sometimes there is quite a long list of differentially expressed genes to interpret after a differential expression analysis, and it is usually infeasible to go through the list one gene at a time trying to understand its biological function. A common downstream procedure is gene set testing, which aims to understand which pathways/gene networks the differentially expressed genes are implicated in.
There are a number of different ways to go about testing for enrichment of biological pathways, and the test you choose usually depends on the question you're asking. There are two kinds of tests: competitive and self-contained gene set tests.
Competitive gene set tests, such as those implemented in `GOseq` and `camera` ask the question whether the differentially expressed genes tend to be over-represented in the gene set, compared to all the other genes in the experiment.
Self-contained tests, which include the `ROAST` procedure, ask the question "Are the genes in the set/pathway differentially expressed as a whole?"
## Gene Set Testing - competitive gene set tests
### GOseq analysis
GOseq is a method to conduct Gene Ontology (GO) analysis suitable for RNA-seq data as it accounts for the gene length bias in detection of over-representation ([GOseq article](https://genomebiology.biomedcentral.com/articles/10.1186/gb-2010-11-2-r14))
From the [GOseq vignette](https://www.bioconductor.org/packages/release/bioc/vignettes/goseq/inst/doc/goseq.pdf):
- GOseq first needs to quantify the length bias present in the dataset under consideration.
- This is done by calculating a Probability Weighting Function or PWF which can be thought of as a function which gives the probability that a gene will be differentially expressed (DE), based on its length alone.
- The PWF is calculated by fitting a monotonic spline to the binary data series of differential expression (1=DE, 0=Not DE) as a function of gene length.
- The PWF is used to weight the chance of selecting each gene when forming a null distribution for GO category membership.
- The fact that the PWF is calculated directly from the dataset under consideration makes this approach robust, only correcting for the length bias present in the data.
"GO analysis of RNA-seq data requires the use of random sampling in order to generate a suitable null distribution for GO category membership and calculate each category's significance for over representation amongst DE genes. ... In most cases, the Wallenius
distribution can be used to approximate the true null distribution, without any significant loss in accuracy. The goseq package implements this approximation as its default option."
Create list of DEGs:
```{r}
# Retrieve list of all genes tested:
results <- as.data.frame(topTags(lrt.BvsL, n = Inf))
results
# Derive list of DEGs by filtering on FDR:
genes <- results$FDR < 0.01
# Add gene names to that list:
names(genes) <- rownames(results)
```
Fit the Probability Weighting Function (PWF):
```{r}
library(goseq)
supportedGeneIDs()
supportedGenomes()
pwf <- nullp(genes, "mm10","knownGene")
```
Conduct gene set enrichment analysis:
```{r}
?goseq
go.results <- goseq(pwf, "mm10","knownGene")
go.results
```
### fgsea analysis
From the fgsea [vignette](http://www.bioconductor.org/packages/release/bioc/vignettes/fgsea/inst/doc/fgsea-tutorial.html) "fast preranked gene set enrichment analysis (GSEA)":
This analysis is performed by:
- (i) ranking all genes in the data set based on their correlation to the chosen phenotype,
- (ii) identifying the rank positions of all members of the gene set, and
- (iii) calculating an enrichment score (ES) that represents the difference between the observed rankings and that which would be expected assuming a random rank distribution.
"After establishing the ES for each gene set across the phenotype, GSEA reiteratively randomizes the sample labels and retests for enrichment across the random classes. By performing repeated class label randomizations, the ES for each gene set across the true classes can be compared to the ES distribution from the random classes. Those gene sets that significantly outperform iterative random class permutations are considered significant." [commentary on GSEA](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1266131/). The article describing the original software is available [here](http://www.pnas.org/content/102/43/15545.long).
```{r}
library(fgsea)
```
Create ranks:
```{r}
results.ord <- results[ order(-results[,"logFC"]), ]
head(results.ord)
ranks <- results.ord$logFC
names(ranks) <- rownames(results.ord)
head(ranks)
```
```{r}
#plot(ranks)
barplot(ranks)
```
Load the pathways:
```{r}
load("data/mouse_H_v5.rdata")
pathways <- Mm.H
```
Conduct analysis:
```{r}
?fgsea
fgseaRes <- fgsea(pathways, ranks, minSize=15, maxSize = 500, nperm=1000)
class(fgseaRes)
dim(fgseaRes)
head(fgseaRes)
```
Glance at results:
```{r}
head(fgseaRes[order(padj), ])
```
Plot outcome for the 'HALLMARK_MYOGENESIS' pathway:
First find rank of the 'HALLMARK_MYOGENESIS' pathway genes in the sorted genes:
```{r}
# We will create a barplot of logFC for the sorted genes and add one vertical red bar for each gene in the 'HALLMARK_MYOGENESIS' pathway
#pathways[["HALLMARK_MYOGENESIS"]]
tmpInd <- match(pathways[["HALLMARK_MYOGENESIS"]],names(ranks))
tmpInd <- tmpInd[!is.na(tmpInd)]
tmpInd
barplot(ranks)
abline(v=tmpInd, col="red")
```
Create enrichment score plot:
```{r}
plotEnrichment(pathways[["HALLMARK_MYOGENESIS"]],
ranks)
```
Remember to check the [GSEA article](http://www.pnas.org/content/102/43/15545.full) for the complete explanation.
Select top pathways and plot outcome for all these:
```{r}
topPathwaysUp <- fgseaRes[ES > 0][head(order(pval), n=10), pathway]
topPathwaysDown <- fgseaRes[ES < 0][head(order(pval), n=10), pathway]
topPathways <- c(topPathwaysUp, rev(topPathwaysDown))
plotGseaTable(pathways[topPathways], ranks, fgseaRes,
gseaParam = 0.5)
?plotGseaTable
```
### CAMERA gene set testing using the Broad's curated gene sets
Other databases of gene sets that are available come from the Broad Institute's Molecular Signatures Database ([MSigDB](http://software.broadinstitute.org/gsea/msigdb)). [CAMERA](https://academic.oup.com/nar/article/40/17/e133/2411151/Camera-a-competitive-gene-set-test-accounting-for) is good option for testing a very large number of gene sets such as the MSigDB sets, as it is very fast. It has the advantage of accounting for inter-gene correlation within each gene set [@wu2012camera].
Here we will be using the C2 gene sets for mouse, available as .rdata files from the WEHI bioinformatics page [http://bioinf.wehi.edu.au/software/MSigDB/index.html](http://bioinf.wehi.edu.au/software/MSigDB/index.html). The C2 gene sets contain 4725 curated gene sets collected from a variety of places: BioCarta, KEGG, Pathway Interaction Database, Reactome as well as some published studies. It doesn't include GO terms.
```{r}
# ?camera.DGEList
# Load in the mouse c2 gene sets
# The R object is called Mm.c2
load("data/mouse_c2_v5.rdata")
# Have a look at the first few gene sets
names(Mm.c2)[1:5]
# Number of gene sets in C2
length(Mm.c2)
```
The gene identifiers are Entrez Gene ID, as are the rownames of our DGEList object 'dgeObj'. We need to map the Entrez gene ids between the list of gene sets and our DGEList object. We can do this using the `ids2indices` function.
```{r}
c2.ind <- ids2indices(Mm.c2, rownames(dgeObj$counts))
```
CAMERA takes as input the DGEList object `dgeObj`, the indexed list of gene sets `c2.ind`, the design matrix, the contrast being tested, as well as some other arguments. By default, CAMERA can estimate the correlation for each gene set separately. However, in practise, it works well to set a small inter-gene correlation of about 0.05 using the `inter.gene.cor` argument.
```{r}
# Conduct analysis for the luminal-vs-basal contrast:
# Check contrasts:
colnames(design)
# Run analysis:
gst.camera <- camera.DGEList(dgeObj,index=c2.ind,design=design,contrast=2,inter.gene.cor=0.05)
```
CAMERA outputs a dataframe of the resulting statistics, with each row denoting a different gene set. The output is ordered by p-value so that the most significant should be at the top. Let's look at the top 5 gene sets:
```{r}
gst.camera[1:5,]
```
The total number of significant gene sets at 5\% FDR is:
```{r}
table(gst.camera$FDR < 0.05)
```
You can write out the camera results to a csv file to open in excel.
```{r}
write.csv(gst.camera,file="gst_LacLumVsBas.csv")
```
> ## Challenge {.challenge}
>
> 1. Run `camera` on the pregnant vs lactating contrast.
> 1. Run `camera` on a different set of MSigDB gene sets, the hallmark datasets, `mouse_H_v5.rdata`.
> You will need to load in the hallmark gene sets, and the object will be called `Mm.H` in R.
>
## Gene Set Testing - self-contained gene set tests
### ROAST gene set testing
[ROAST](https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btq401) is an example of a self-contained gene set test [@wu2010roast]. It asks the question, "Do the genes in my set tend to be differentially expressed between my conditions of interest?". ROAST does not use information on the other genes in the experiment, unlike `camera`. ROAST is a good option for when you're interested in a specific set, or a few sets. It is not really used to test thousands of sets at one time.
From the Hallmark gene sets, two MYC pathways were most significant.
```{r}
H.camera[1:10,]
```
Let's see if there are any MYC signalling pathways in MsigDB C2 collection. We can do this with the `grep` command on the names of the gene sets.
```{r}
grep("MYC_",names(c2.ind))
# Let's save these so that we can subset c2.ind to test all gene sets with MYC in the name
myc <- grep("MYC_",names(c2.ind))
# What are these pathways called?
names(c2.ind)[myc]
```
Let's use ROAST to see if these MYC related gene sets tend to be differentially expressed. Note that the syntax for `camera` and `roast` is almost identical.
```{r}
myc.rst <- roast(dgeObj,index=c2.ind[myc],design=design,contrast=3,nrot=999)
myc.rst[1:15,]
```
Each row corresponds to a single gene set.
The NGenes column gives the number of genes in each set.
The PropDown and PropUp columns contain the proportions of genes in the set that are down- and up-regulated, respectively, with absolute fold changes greater than 2.
The net direction of change is determined from the significance of changes in each direction, and is shown in the Direction column.
The PValue provides evidence for whether the majority of genes in the set are DE in the specified direction, whereas the PValue.Mixed tests for differential expression in any direction.
FDRs are computed from the corresponding p-values across all sets.
> ## Challenge {.challenge}
>
> 1. Test whether the MYC signalling pathways tend to be differentially expressed between basal virgin vs lactating.
> 1. Look for gene sets containing "WNT" in the name and see whether they tend to be differentially expressed in basal pregnant vs lactating.
>
Notes
* A common application of ROAST is to use a set of DE genes that was defined from an analysis of an independent data set. ROAST can then determine whether similar changes are observed in the contrast of interest for the current data set.
* Even for GO-defined gene sets, goana and ROAST have different behaviours. In goana, the significance of differential expression for a GO term is determined relative to other DE genes that are not annotated with that term. In ROAST, only differential expression for the genes in the set are relevant to the significance of that set and its corresponding term. goana depends on a significance cutoff to choose DE genes, whereas ROAST does not require a cutoff and evaluates all genes in the set.
* ROAST estimates p-values by simulation, so the results may change slightly between runs. More precise p-values can be obtained by increasing the number of rotations, albeit at the cost of increased computational time.
* The smallest p-value that can be reported is 1/(2nrot + 1) where nrot is the number of rotations. This lower bound can be decreased by increasing nrot.
References