forked from bioinformatics-core-shared-training/RNAseq-R
-
Notifications
You must be signed in to change notification settings - Fork 1
/
align-and-count_wiSolution.Rmd
334 lines (237 loc) · 15.6 KB
/
align-and-count_wiSolution.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
---
title: "RNA-seq analysis in R"
author: "Stephane Ballereau, Mark Dunning, Oscar Rueda, Ashley Sawle"
date: '`r format(Sys.time(), "Last modified: %d %b %Y")`'
output:
html_notebook:
toc: yes
toc_float: yes
html_document:
toc: yes
toc_float: yes
minutes: 300
layout: page
subtitle: Alignment and Counting
bibliography: ref.bib
---
**Authors: Belinda Phipson, Maria Doyle, Harriet Dashnow**
```{r, include=FALSE}
source("tools/chunk-options.R")
opts_chunk$set(fig.path = "fig-07/")
```
This material has been created using the following sources:
http://www.statsci.org/smyth/pubs/QLedgeRPreprint.pdf [@Lun2016]
http://bioinf.wehi.edu.au/RNAseqCaseStudy/
Packages used:
Rsubread
Data files needed:
Mouse chromosome 1 Rsubread index files (~400MB).
Targets2.txt
The 12 fastq.gz files for the mouse dataset.
Mouse mammary data (fastq files): [https://figshare.com/s/f5d63d8c265a05618137](https://figshare.com/s/f5d63d8c265a05618137)
You should download these files and place them in your `/data` directory.
GEO entry for the dataset:
http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE60450
The raw reads were downloaded from SRA from the link given in GEO for the dataset (ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP%2FSRP045%2FSRP045534). These files are in .sra format. The sra toolkit from NCBI was used to convert the .sra files to .fastq files using the fastq-dump command.
## Downloading genome files
We have provided the index files for chromosome 1 for the mouse genome build mm10 for this workshop in order to save time on building the index. However, full genome fasta files for a number of different genomes are available to download from the UCSC genome browser, see http://hgdownload.soe.ucsc.edu/downloads.html; from NCBI: http://www.ncbi.nlm.nih.gov/genome; or from ENSEMBL: http://asia.ensembl.org/info/data/ftp/index.html.
## Introduction and data import
For the purpose of this workshop, we are going to be working with a small part of the mouse reference genome (chromosome 1) to demonstrate how to do read alignment and counting using R. Mapping reads to the genome is a very important task, and many different aligners are available, such as bowtie [@Langmead2012], topHat [@trapnell2009tophat], STAR [@Dobin2013] and Rsubread [@liao2013subread]. Rsubread is the only aligner that can run in R. Most alignment tools are run in a linux environment, and they are very computationally intensive. Most mapping tasks require larger computers than an average laptop, so usually read mapping is done on a server in a linux-like environment. Here we are only going to be mapping 1000 reads from each sample from our mouse lactation dataset [@Fu2015], and we will only be mapping to chromosome 1. This is so that everyone can have a go at alignment and counting on their laptops using RStudio.
First, let's load the Rsubread package into R.
```{r}
library(Rsubread)
```
Earlier we put all the sequencing read data (.fastq.gz files) in the data directory.
Now we need to find them in order to tell the Rsubread aligner which files to look at.
We can search for all .fastq.gz files in the data directory using the `list.files` command.
The pattern argument takes a regular expression.
In this case we are using the `$` to mean the end of the string, so we will only get files that end in ".fastq.gz"
```{r}
fastq.files <- list.files(path = "./data", pattern = ".fastq.gz$", full.names = TRUE)
fastq.files
```
## Alignment
### Build the index
Read sequences are stored in compressed (gzipped) FASTQ files. Before the differential expression analysis can proceed, these reads must be aligned to the mouse genome and counted into annotated genes. This can be achieved with functions in the Rsubread package.
The first step in performing the alignment is to build an index. In order to build an index you need to have the fasta file (.fa), which can be downloaded from the UCSC genome browser. Here we are building the index just for chromosome 1. This may take several minutes to run. Building the full index using the whole mouse genome usually takes about 30 minutes to an hr on a server. *We won't be building the index in the workshop due to time constraints, we have provided the index files for you*. The command below assumes the chr1 genome information for mm10 is stored in the "chr1.fa" file.
```{r,eval=FALSE}
# See above paragraph: "we have provided the index files for you". You do not need to run command below.
buildindex(basename="chr1_mm10",reference="chr1.fa")
```
The above command will generate the index files in the working directory. In this example, the prefix for the index files is chr1_mm10. You can see the additional files generated using the `dir` command, which will list every file in your current working directory.
```{r,results="hide"}
dir()
```
### Aligning reads to chromosome 1 of reference genome
Now that we have generated our index, we can align our reads using the `align` command. There are often numerous mapping parameters that we can specify, but usually the default mapping parameters for the `align` function are fine. If we had paired end data, we would specify the second read file/s using the `readfile2` argument. Our mouse data comprises 100bp single end reads.
We can specify the output files, or we can let `Rsubread` choose the output file names for us. The default output file name is the filename with ".subread.BAM" added at the end.
Now we can align our 12 fastq.gz files using the `align` command.
```{r,eval=FALSE}
align(index="data/chr1_mm10",readfile1=fastq.files)
```
This will align each of the 12 samples one after the other. As we're only using a subset of 1000 reads per sample, aligning should just take a minute or so for each sample. To run the full samples from this dataset would take several hours per sample. The BAM files are saved in the working directory.
To see how many parameters you can change try the `args` function:
```{r}
args(align)
```
In this example we have kept many of the default settings, which have been optimised to work well under a variety of situations. The default setting for `align` is that it only keeps reads that uniquely map to the reference genome. For testing differential expression of genes, this is what we want, as the reads are unambigously assigned to one place in the genome, allowing for easier interpretation of the results. Understanding all the different parameters you can change involves doing a lot of reading about the aligner that you are using, and can take a lot of time to understand! Today we won't be going into the details of the parameters you can change, but you can get more information from looking at the help:
```{r, eval=FALSE}
?align
```
We can get a summary of the proportion of reads that mapped to the reference genome using the `propmapped` function.
```{r}
bam.files <- list.files(path = "./data", pattern = ".BAM$", full.names = TRUE)
bam.files
```
```{r}
props <- propmapped(files=bam.files)
props
```
> ## Challenge {.challenge}
>
> 1. Try aligning the fastq files allowing multi-mapping reads (set `unique = FALSE`), and allowing for up to 6 "best" locations to be reported (`nBestLocations = 6`). Specify the output file names (bam.files.multi) by substituting ".fastq.gz" with ".multi.bam" so we don't overwrite our unique alignment bam files.
> 1. Look at the proportion of reads mapped and see if we get any more reads mapping by specifying a less stringent criteria.
>
## Challenge 1 - solution
1. Try aligning the fastq files allowing multi-mapping reads (set `unique = FALSE`), and allowing for up to 6 "best" locations to be reported (`nBestLocations = 6`). Specify the output file names (bam.files.multi) by substituting ".fastq.gz" with ".multi.bam" so we don't overwrite our unique alignment bam files.
```{r, eval=FALSE}
# make vector for output file names
bam.files.multi <- gsub(".fastq.gz", ".multi.bam", as.character(fastq.files))
# use the align() function
# allow for multi-mapping: 'unique = FALSE'
# allowing for up to 6 "best" locations to be reported: `nBestLocations = 6`
align(index="data/chr1_mm10",
readfile1=fastq.files,
unique = FALSE,
nBestLocations = 6,
output_format="BAM",
output_file=bam.files.multi
)
# Check files were produced:
dir("./data", pattern="*.multi.bam")
```
2. Look at the proportion of reads mapped and see if we get any more reads mapping by specifying a less stringent criteria.
```{r}
props.multimap <- propmapped(files=bam.files.multi)
props.multimap
# show proportion of reads mapped when multi-mapping is not allowed
props
# Fewer reads are reported as mapping by specifying a less stringent criteria
```
## Quality control
We can have a look at the quality scores associated with each base that has been called by the sequencing machine using the `qualityScores` function in *Rsubread*.
Let's first extract quality scores for 100 reads for the file "SRR1552450.fastq.gz".
```{r}
# Extract quality scores
qs <- qualityScores(filename="data/SRR1552450.fastq.gz",nreads=100)
# Check dimension of qs
dim(qs)
# Check first few elements of qs with head
head(qs)
```
A quality score of 30 corresponds to a 1 in 1000 chance of an incorrect base call. (A quality score of 10 is a 1 in 10 chance of an incorrect base call.) To look at the overall distribution of quality scores across the 100 reads, we can look at a boxplot
```{r}
boxplot(qs)
```
> ## Challenge {.challenge}
>
> 1. Extract quality scores for SRR1552451.fastq.gz for 50 reads.
> 1. Plot a boxplot of the quality scores for SRR1552451.fastq.gz.
>
## Challenge 2 - solution
1. Extract quality scores for SRR1552451.fastq.gz for 50 reads.
```{r}
# Extract quality scores
fastqFileName <- "data/SRR1552451.fastq.gz"
nReads <- 50
qs <- qualityScores(filename=fastqFileName,nreads=nReads)
rm(fastqFileName, nReads)
# Check dimension of qs
dim(qs)
# Check first few elements of qs with head
head(qs)
```
2. Plot a boxplot of the quality scores for SRR1552451.fastq.gz.
```{r}
boxplot(qs)
```
## Counting
Now that we have figured out where each read comes from in the genome, we need to summarise the information across genes or exons. The alignment produces a set of BAM files, where each file contains the read alignments for each library. In the BAM file, there is a chromosomal location for every read that mapped uniquely. The mapped reads can be counted across mouse genes by using the `featureCounts` function. `featureCounts` contains built-in annotation for mouse (mm9, mm10) and human (hg19) genome assemblies (NCBI refseq annotation).
The code below uses the exon intervals defined in the NCBI refseq annotation of the mm10 genome. Reads that map to exons of genes are added together to obtain the count for each gene, with some care taken with reads that span exon-exon boundaries. `featureCounts` takes all the BAM files as input, and outputs an object which includes the count matrix. Each sample is a separate column, each row is a gene.
```{r, eval=FALSE}
fc <- featureCounts(bam.files, annot.inbuilt="mm10")
```
```{r}
# See what slots are stored in fc
names(fc)
```
The statistics of the read mapping can be seen with fc$stats. This reports the numbers of unassigned reads and the reasons why they are not assigned (eg. ambiguity, multi-mapping, secondary alignment, mapping quality, fragment length, chimera, read duplicate, non-junction and so on), in addition to the number of successfully assigned reads for each library. See [subread documentation](http://bioinf.wehi.edu.au/subread-package/SubreadUsersGuide.pdf) ('Program output' section). (We know the real reason why the majority of the reads aren't mapping - they're not from chr 1!)
```{r}
## Take a look at the featurecounts stats
fc$stat
```
The counts for the samples are stored in fc$counts. Take a look at that.
```{r}
## Take a look at the dimensions to see the number of genes
dim(fc$counts)
## Take a look at the first 6 lines
head(fc$counts)
```
The row names of the fc$counts matrix represent the Entrez gene identifiers for each gene and the column names are the output filenames from calling the `align` function. The `annotation` slot shows the annotation information that `featureCounts` used to summarise reads over genes.
```{r}
head(fc$annotation)
```
> ## Challenge {.challenge}
>
> 1. Redo the counting over the exons, rather than the genes (specify `useMetaFeatures = FALSE`). Use the bam files generated doing alignment reporting only unique reads, and call the `featureCounts` object `fc.exon`. Check the dimension of the counts slot to see how much larger it is.
> 1. Using your ".multi.bam" files, redo the counting over genes, allowing for multimapping reads (specify `countMultiMappingReads = TRUE`), calling the object `fc.multi`. Check the stats.
>
## Challenge 3 - solution
1. Redo the counting over the exons, rather than the genes (specify `useMetaFeatures = FALSE`). Use the bam files generated doing alignment reporting only unique reads, and call the `featureCounts` object `fc.exon`. Check the dimension of the counts slot to see how much larger it is.
```{r}
# count over exons, not genes: `useMetaFeatures = FALSE`
fc.exon <- featureCounts(bam.files, annot.inbuilt="mm10",
useMetaFeatures = FALSE,
)
## Take a look at the dimensions to see the number of genes
dim(fc.exon$counts)
## Take a look at the first 6 lines
head(fc.exon$counts)
```
2. Using your ".multi.bam" files, redo the counting over genes, allowing for multimapping reads (specify `countMultiMappingReads = TRUE`), calling the object `fc.multi`. Check the stats.
```{r}
# count over exons, not genes: `useMetaFeatures = FALSE`
fc.exon.multimap <- featureCounts(bam.files.multi, annot.inbuilt="mm10",useMetaFeatures = FALSE,countMultiMappingReads = TRUE)
## Take a look at the dimensions to see the number of genes
dim(fc.exon.multimap$counts)
## Take a look at the first 6 lines
head(fc.exon.multimap$counts)
# TODO: find exon names
```
Notes
* If you are sequencing your own data, the sequencing facility will almost always provide fastq files.
* For publicly available sequence data from GEO/SRA, the files are usually in the Sequence Read Archive
(SRA) format. Prior to read alignment, these files need to be converted into the
FASTQ format using the fastq-dump utility from the SRA Toolkit. See http:
//www.ncbi.nlm.nih.gov/books/NBK158900 for how to download and use the
SRA Toolkit.
* By default, alignment is performed with `unique=TRUE`. If a read can be aligned to
two or more locations, *Rsubread* will attempt to select the best location using a
number of criteria. Only reads that have a unique best location are reported as
being aligned. Keeping this default is recommended, as it avoids spurious signal
from non-uniquely mapped reads derived from, e.g., repeat regions.
* The Phred offset determines the encoding for the base-calling quality string in the
FASTQ file. For the Illumina 1.8 format onwards, this encoding is set at +33.
However, older formats may use a +64 encoding. Users should ensure that the
correct encoding is specified during alignment. If unsure, one can examine the
first several quality strings in the FASTQ file. A good rule of thumb is to check
whether lower-case letters are present (+64 encoding) or absent (+33).
* `featureCounts` requires gene annotation specifying the genomic start and end
position of each exon of each gene. *Rsubread* contains built-in gene annotation
for mouse and human. For other species, users will need to read in a data frame
in GTF format to define the genes and exons. Users can also specify a custom annotation file in SAF format. See the Rsubread users guide for more information, or try `?featureCounts`, which has an example of what an SAF file should like like.
# Package versions used
```{r}
sessionInfo()
```
# References