Anton Enright & Dimitrios Vitsios '13 June, 2017'
In this practical we will be taking 8 samples of Solexa Sequencing data from the course samples that were prepared for small-RNA sequencing from total RNA.
It is important for next-gen sequence analysis to know the exact architecture of your sequencing reads including the 5' sequencing adapters and the sequences of barcoding tags and which samples each refers to
For our course data, the read architecture is as follows:
5' adpt | genomic RNA | 3' adpt |
---|---|---|
<5p----------> | NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN | <---------3p> |
Since miRNAs are shorter than the average length of a Solexa read (35-80nt), we will usually sequence through a microRNA sequence into the 3' adapter sequence. This needs to be detected and cleaned from our reads before they can be mapped. Additionally, each sequence will need to be sorted according to which barcode is present at the 5' end of the read into appropriate sample files.
Usually small RNA sequencing results in millions of reads for processing. The vast majority of tools for dealing with this kind of data require large amounts of memory and significant computing resources. However, on this course we will be testing a new ultra-fast read processor from the Enright lab, called Reaper which will be available in BioConductor shortly.
Most sequencing results can be obtained in a format called fastq, reflecting that they are fasta files with quality scores. These files contain the actual nucleotide bases for every sequence as well as a quality score which reflects how sure we are that each base is actually the correct one.
The quality score is usually given by a Phred score, which is related logarithmically to the probability of making an error when base calling. The following table describes this:
Phred score | Probability that the base is incorrect | Precision of the base |
---|---|---|
10 | 1 in 10 | 90 % |
20 | 1 in 100 | 99 % |
30 | 1 in 1000 | 99.9 % |
40 | 1 in 10000 | 99.99 % |
Getting back to our fastq file, the first few lines look like this:
@GAII06_0003:3:1:1040:19544#0/1
CCAAAGTGCTGGAATCACAGGCGTGAGCTACCTCGCCTGGCCTGAATTATT
+
bb__Zbb^^^]Z_]]aT]`^TbbYb^_`T_cc`aac^Y\`b`\`L]_]\YX</tt>
The first line is the identifier (starting with <@>), the second the sequence. The third is an extra spacer line (which sometimes also has the identifier repeated). The fourth are ascii encoded quality scores, where each letter represents a different number.
This is the standard ascii table that was used to encode these numbers:
0 nul 1 soh 2 stx 3 etx 4 eot 5 enq 6 ack 7 bel
8 bs 9 ht 10 nl 11 vt 12 np 13 cr 14 so 15 si
16 dle 17 dc1 18 dc2 19 dc3 20 dc4 21 nak 22 syn 23 etb
24 can 25 em 26 sub 27 esc 28 fs 29 gs 30 rs 31 us
32 sp 33 ! 34 " 35 # 36 $ 37 % 38 & 39 '
40 ( 41 ) 42 * 43 + 44 , 45 - 46 . 47 /
48 0 49 1 50 2 51 3 52 4 53 5 54 6 55 7
56 8 57 9 58 : 59 ; 60 < 61 = 62 > 63 ?
64 @ 65 A 66 B 67 C 68 D 69 E 70 F 71 G
72 H 73 I 74 J 75 K 76 L 77 M 78 N 79 O
80 P 81 Q 82 R 83 S 84 T 85 U 86 V 87 W
88 X 89 Y 90 Z 91 [ 92 \ 93 ] 94 ^ 95 _
96 ` 97 a 98 b 99 c 100 d 101 e 102 f 103 g
104 h 105 i 106 j 107 k 108 l 109 m 110 n 111 o
112 p 113 q 114 r 115 s 116 t 117 u 118 v 119 w
120 x 121 y 122 z
From this table, we can see that the very frequent quality score "b" actually represents a numerical value of 98. One small detail: since the first 0-32 ascii codes represent strange things (e.g. bell, new line, backspace) we cannot use them for encoding. Thus, in order to encode real quality scores (0-40) we first need to shift the quality scores to avoid these strange characters. Unfortunately, there are two current standards, one which shifts the quality scores by adding 33, another by adding 64. The file we'll be using has been shifted by 64. This means that "b" actually represents the quality score of 34 (98 - 64).
Some ascii characters are unprintable so the entire table is shifted by 33 giving a final lookup table as follows, where each symbol represents a unique quality score.
0 ! 1 " 2 # 3 $ 4 % 5 & 6 '
7 ( 8 ) 9 * 10 + 11 , 12 - 13 . 14 /
15 0 16 1 17 2 18 3 19 4 20 5 21 6 22 7
23 8 24 9 25 : 26 ; 27 < 28 = 29 > 30 ?
31 @ 32 A 33 B 34 C 35 D 36 E 37 F 38 G
39 H 40 I 41 J 42 K 43 L 44 M 45 N 46 O
47 P 48 Q 49 R 50 S 51 T 52 U 53 V 54 W
55 X 56 Y 57 Z 58 [ 59 \ 60 ] 61 ^ 62 _
63 ` 64 a 65 b 66 c 67 d 68 e 69 f 70 g
71 h 72 i 73 j 74 k 75 l 76 m 77 n 78 o
79 p 80 q 81 r 82 s 83 t 84 u 85 v 86 w
87 x 88 y 89 z
The module is available here If you want to install this in your own R at home.
Lets install the module manually. Please open up a new terminal window.
cd /BGA2017/Course_material/June14
/BGA2017/R-3.4.0/bin/R CMD INSTALL Reaper_1.5.tar.gz
This will install the module into R manually.
* installing to library \u2018/home/aje/R/x86_64-pc-linux-gnu-library/3.4\u2019
* installing *source* package \u2018Reaper\u2019 ...
** libs
gcc -std=gnu99 -I/BGA2017/R-3.4.0/lib64/R/include -DNDEBUG -I/usr/local/include -D_USE_KNETFILE -D_FILE_OFFSET_BITS=64 -DBUILD_R_BINDINGS -fpic -g -O2 -c reaper.c -o reaper.o
gcc -std=gnu99 -I/BGA2017/R-3.4.0/lib64/R/include -DNDEBUG -I/usr/local/include -D_USE_KNETFILE -D_FILE_OFFSET_BITS=64 -DBUILD_R_BINDINGS -fpic -g -O2 -c sw.c -o sw.o
gcc -std=gnu99 -I/BGA2017/R-3.4.0/lib64/R/include -DNDEBUG -I/usr/local/include -D_USE_KNETFILE -D_FILE_OFFSET_BITS=64 -DBUILD_R_BINDINGS -fpic -g -O2 -c slib.c -o slib.o
gcc -std=gnu99 -I/BGA2017/R-3.4.0/lib64/R/include -DNDEBUG -I/usr/local/include -D_USE_KNETFILE -D_FILE_OFFSET_BITS=64 -DBUILD_R_BINDINGS -fpic -g -O2 -c table.c -o table.o
gcc -std=gnu99 -I/BGA2017/R-3.4.0/lib64/R/include -DNDEBUG -I/usr/local/include -D_USE_KNETFILE -D_FILE_OFFSET_BITS=64 -DBUILD_R_BINDINGS -fpic -g -O2 -c trint.c -o trint.o
The fastq files, pdata file and mircounts file are also available but preinstalled on these machines.
library(Reaper)
library(gplots)
library(RColorBrewer)
Now we will set our working directory to where the solexa FASTQ files (zipped) are stored
setwd("~/Desktop/course_data/solexa")
list.files()
## [1] "1.lane.clean.gz" "1.lane.clean.uniq.gz"
## [3] "1.lane.report.clean.len" "1.lane.report.clean.nt"
## [5] "1.lane.report.input.nt" "1.lane.report.input.q"
## [7] "1.lint.gz" "1.sumstat"
## [9] "10.lane.clean.gz" "10.lane.clean.uniq.gz"
## [11] "10.lane.report.clean.len" "10.lane.report.clean.nt"
## [13] "10.lane.report.input.nt" "10.lane.report.input.q"
## [15] "10.lint.gz" "10.sumstat"
## [17] "11.lane.clean.gz" "11.lane.clean.uniq.gz"
## [19] "11.lane.report.clean.len" "11.lane.report.clean.nt"
## [21] "11.lane.report.input.nt" "11.lane.report.input.q"
## [23] "11.lint.gz" "11.sumstat"
## [25] "12.lane.clean.gz" "12.lane.clean.uniq.gz"
## [27] "12.lane.report.clean.len" "12.lane.report.clean.nt"
## [29] "12.lane.report.input.nt" "12.lane.report.input.q"
## [31] "12.lint.gz" "12.sumstat"
## [33] "2.lane.clean.gz" "2.lane.clean.uniq.gz"
## [35] "2.lane.report.clean.len" "2.lane.report.clean.nt"
## [37] "2.lane.report.input.nt" "2.lane.report.input.q"
## [39] "2.lint.gz" "2.sumstat"
## [41] "3.lane.clean.gz" "3.lane.clean.uniq.gz"
## [43] "3.lane.report.clean.len" "3.lane.report.clean.nt"
## [45] "3.lane.report.input.nt" "3.lane.report.input.q"
## [47] "3.lint.gz" "3.sumstat"
## [49] "4.lane.clean.gz" "4.lane.clean.uniq.gz"
## [51] "4.lane.report.clean.len" "4.lane.report.clean.nt"
## [53] "4.lane.report.input.nt" "4.lane.report.input.q"
## [55] "4.lint.gz" "4.sumstat"
## [57] "5.lane.clean.gz" "5.lane.clean.uniq.gz"
## [59] "5.lane.report.clean.len" "5.lane.report.clean.nt"
## [61] "5.lane.report.input.nt" "5.lane.report.input.q"
## [63] "5.lint.gz" "5.sumstat"
## [65] "6.lane.clean.gz" "6.lane.clean.uniq.gz"
## [67] "6.lane.report.clean.len" "6.lane.report.clean.nt"
## [69] "6.lane.report.input.nt" "6.lane.report.input.q"
## [71] "6.lint.gz" "6.sumstat"
## [73] "7.lane.clean.gz" "7.lane.clean.uniq.gz"
## [75] "7.lane.report.clean.len" "7.lane.report.clean.nt"
## [77] "7.lane.report.input.nt" "7.lane.report.input.q"
## [79] "7.lint.gz" "7.sumstat"
## [81] "8.lane.clean.gz" "8.lane.clean.uniq.gz"
## [83] "8.lane.report.clean.len" "8.lane.report.clean.nt"
## [85] "8.lane.report.input.nt" "8.lane.report.input.q"
## [87] "8.lint.gz" "8.sumstat"
## [89] "9.lane.clean.gz" "9.lane.clean.uniq.gz"
## [91] "9.lane.report.clean.len" "9.lane.report.clean.nt"
## [93] "9.lane.report.input.nt" "9.lane.report.input.q"
## [95] "9.lint.gz" "9.sumstat"
## [97] "clean.sh" "lane1Brain2_sequence.fastq.gz"
## [99] "lane1Brain5_sequence.fastq.gz" "lane1Brain6_sequence.fastq.gz"
## [101] "lane1Heart1_sequence.fastq.gz" "lane1Heart11_sequence.fastq.gz"
## [103] "lane1heart3_sequence.fastq.gz" "lane1Heart8_sequence.fastq.gz"
## [105] "lane1Liver10_sequence.fastq.gz" "lane1Liver12x2_sequence.fastq.gz"
## [107] "lane1Liver4_sequence.fastq.gz" "lane1Liver7_sequence.fastq.gz"
## [109] "lane1Liver9_sequence.fastq.gz" "metadata1.txt"
## [111] "metadata10.txt" "metadata11.txt"
## [113] "metadata12.txt" "metadata2.txt"
## [115] "metadata3.txt" "metadata4.txt"
## [117] "metadata5.txt" "metadata6.txt"
## [119] "metadata7.txt" "metadata8.txt"
## [121] "metadata9.txt" "mircounts.txt"
## [123] "pdata.txt" "Reaper_1.5.tar.gz"
## [125] "reaper.pdf" "reaperlog.txt"
Hopefully, you will see a compressed FASTQ txt file for each of the 4 lanes
It is important that we also load information for reaper that tells it the following:
- Which FASTQ files are present.
- Which Barcode sequences correspond to which sample names.
- What 5' and 3' sequencing adapters were used in library generation.
You should see a sample table loaded into R:
pdata <- read.table("pdata.txt",header=TRUE,check.names=FALSE)
pdata
## filename 3p-ad tabu
## 1 lane1Brain2_sequence.fastq.gz AGATCGGAAGAGCACACGTCTGAACTCC NA
## 2 lane1Brain5_sequence.fastq.gz AGATCGGAAGAGCACACGTCTGAACTCC NA
## 3 lane1Brain6_sequence.fastq.gz AGATCGGAAGAGCACACGTCTGAACTCC NA
## 4 lane1Heart11_sequence.fastq.gz AGATCGGAAGAGCACACGTCTGAACTCC NA
## 5 lane1Heart1_sequence.fastq.gz AGATCGGAAGAGCACACGTCTGAACTCC NA
## 6 lane1heart3_sequence.fastq.gz AGATCGGAAGAGCACACGTCTGAACTCC NA
## 7 lane1Heart8_sequence.fastq.gz AGATCGGAAGAGCACACGTCTGAACTCC NA
## 8 lane1Liver10_sequence.fastq.gz AGATCGGAAGAGCACACGTCTGAACTCC NA
## 9 lane1Liver12x2_sequence.fastq.gz AGATCGGAAGAGCACACGTCTGAACTCC NA
## 10 lane1Liver4_sequence.fastq.gz AGATCGGAAGAGCACACGTCTGAACTCC NA
## 11 lane1Liver7_sequence.fastq.gz AGATCGGAAGAGCACACGTCTGAACTCC NA
## 12 lane1Liver9_sequence.fastq.gz AGATCGGAAGAGCACACGTCTGAACTCC NA
## samplename
## 1 Brain
## 2 Brain
## 3 Brain
## 4 Heart
## 5 Heart
## 6 Heart
## 7 Heart
## 8 Liver
## 9 Liver
## 10 Liver
## 11 Liver
## 12 Liver
Next we will start the Reaper algorithm. It will perform the following functions on all the lanes we have provided:
- Splitting up reads according to provided barcodes
- Detection of 3' or 5' adapter contamination using Smith-Waterman local alignment
- Detection of low-complexity sequences, such as Poly-As or Poly-Ns
- Quality score thresholding and trimming if required
- Collapsing of reads according to depth in a summary FASTA result file per barcode per lane.
- Generation of Quality Control plots for assessing sequencing quality
Reaper is started by passing it our samples table and telling it which "mode" to run in, in this case the mode is set to: no-bc.
# For the sake of time we'll only run on the first 100000 reads
reaper(pdata,"no-bc",c("do"="10000"));
# This is the command to do all reads
# reaper(pdata,"no-bc"); We comment it out here
Cleaning many millions of reads will take some time, the method processes around 2M-4M reads per minute.
[1] "Starting Reaper for file: lane1Brain2_sequence.fastq.gz"
fastq geom
"lane1Brain2_sequence.fastq.gz" "no-bc"
meta basename
"metadata1.txt" "1"
Passing to reaper: dummy-internal --R -fastq lane1Brain2_sequence.fastq.gz -geom no-bc -meta metadata1.txt -basename 1
---
mRpm million reads per minute
mNpm million nucleotides per minute
mCps million alignment cells per second
lint total removed reads (per 10K), sum of columns to the left
25K reads per dot, 1M reads per line seconds mr mRpm mNpm mCps {error qc low len NNN tabu nobc cflr cfl lint OK} per 10K
........................................ 26 1 2.4 115 59 0 0 0 0 0 0 0 0 0 0 10000
Reaper is designed to be fast and memory-efficient so it should run on any machine with 500MB of RAM or more. The time taken to complete the run depends on how fast the processors in your machine are.
Let's take a look at the Quality Control Metrics generated
reaperQC(pdata)
## [1] "Processing Reaper Results for: lane1Brain2_sequence.fastq.gz lane"
## [1] "Processing Reaper Results for: lane1Brain5_sequence.fastq.gz lane"
## [1] "Processing Reaper Results for: lane1Brain6_sequence.fastq.gz lane"
## [1] "Processing Reaper Results for: lane1Heart1_sequence.fastq.gz lane"
## [1] "Processing Reaper Results for: lane1Heart11_sequence.fastq.gz lane"
## [1] "Processing Reaper Results for: lane1heart3_sequence.fastq.gz lane"
## [1] "Processing Reaper Results for: lane1Heart8_sequence.fastq.gz lane"
## [1] "Processing Reaper Results for: lane1Liver10_sequence.fastq.gz lane"
## [1] "Processing Reaper Results for: lane1Liver12x2_sequence.fastq.gz lane"
## [1] "Processing Reaper Results for: lane1Liver4_sequence.fastq.gz lane"
## [1] "Processing Reaper Results for: lane1Liver7_sequence.fastq.gz lane"
## [1] "Processing Reaper Results for: lane1Liver9_sequence.fastq.gz lane"
Lets also make a nice PDF of the results and explore the QC metrics for the data
pdf("reaper.pdf",width=12)
reaperQC(pdata)
dev.off()
## [1] "Processing Reaper Results for: lane1Brain2_sequence.fastq.gz lane"
## [1] "Processing Reaper Results for: lane1Brain5_sequence.fastq.gz lane"
## [1] "Processing Reaper Results for: lane1Brain6_sequence.fastq.gz lane"
## [1] "Processing Reaper Results for: lane1Heart1_sequence.fastq.gz lane"
## [1] "Processing Reaper Results for: lane1Heart11_sequence.fastq.gz lane"
## [1] "Processing Reaper Results for: lane1heart3_sequence.fastq.gz lane"
## [1] "Processing Reaper Results for: lane1Heart8_sequence.fastq.gz lane"
## [1] "Processing Reaper Results for: lane1Liver10_sequence.fastq.gz lane"
## [1] "Processing Reaper Results for: lane1Liver12x2_sequence.fastq.gz lane"
## [1] "Processing Reaper Results for: lane1Liver4_sequence.fastq.gz lane"
## [1] "Processing Reaper Results for: lane1Liver7_sequence.fastq.gz lane"
## [1] "Processing Reaper Results for: lane1Liver9_sequence.fastq.gz lane"
## png
## 2
You should get one plot back for each lane processed.
Here is a list of files generated during the Reaper run
10.lane.clean.gz
10.lane.clean.uniq.gz
10.lane.report.clean.len
10.lane.report.clean.nt
10.lane.report.input.nt
10.lane.report.input.q
10.lint.gz
10.sumstat
11.lane.clean.gz
11.lane.clean.uniq.gz
11.lane.report.clean.len
11.lane.report.clean.nt
11.lane.report.input.nt
11.lane.report.input.q
11.lint.gz
11.sumstat
12.lane.clean.gz
12.lane.clean.uniq.gz
12.lane.report.clean.len
12.lane.report.clean.nt
12.lane.report.input.nt
12.lane.report.input.q
12.lint.gz
12.sumstat
1.lane.clean.gz
1.lane.clean.uniq.gz
1.lane.report.clean.len
1.lane.report.clean.nt
1.lane.report.input.nt
1.lane.report.input.q
1.lint.gz
1.sumstat
2.lane.clean.gz
2.lane.clean.uniq.gz
2.lane.report.clean.len
2.lane.report.clean.nt
2.lane.report.input.nt
2.lane.report.input.q
2.lint.gz
2.sumstat
3.lane.clean.gz
3.lane.clean.uniq.gz
3.lane.report.clean.len
3.lane.report.clean.nt
3.lane.report.input.nt
3.lane.report.input.q
3.lint.gz
3.sumstat
4.lane.clean.gz
4.lane.clean.uniq.gz
4.lane.report.clean.len
4.lane.report.clean.nt
4.lane.report.input.nt
4.lane.report.input.q
4.lint.gz
4.sumstat
5.lane.clean.gz
5.lane.clean.uniq.gz
5.lane.report.clean.len
5.lane.report.clean.nt
5.lane.report.input.nt
5.lane.report.input.q
5.lint.gz
5.sumstat
6.lane.clean.gz
6.lane.clean.uniq.gz
6.lane.report.clean.len
6.lane.report.clean.nt
6.lane.report.input.nt
6.lane.report.input.q
6.lint.gz
6.sumstat
7.lane.clean.gz
7.lane.clean.uniq.gz
7.lane.report.clean.len
7.lane.report.clean.nt
7.lane.report.input.nt
7.lane.report.input.q
7.lint.gz
7.sumstat
8.lane.clean.gz
8.lane.clean.uniq.gz
8.lane.report.clean.len
8.lane.report.clean.nt
8.lane.report.input.nt
8.lane.report.input.q
8.lint.gz
8.sumstat
9.lane.clean.gz
9.lane.clean.uniq.gz
9.lane.report.clean.len
9.lane.report.clean.nt
9.lane.report.input.nt
9.lane.report.input.q
9.lint.gz
9.sumstat
We will now use a web-utility to map cleaned and filtered reads against miRBase known mature miRNA sequences. This server will take each cleaned unique read and compare it to precursor sequences downloaded from miRBase. In the case of mismatches, a read will still be assigned if it has less than two mismatches assuming both sequences are each others best hit. The system supports compressed (GZ or ZIP) files as well as FASTQ text files.
Click here To use Chimira
For this web-server, choose each of the .clean.uniq fastq files that were produced by reaper. Make sure you choose Human as the reference species.
We now have raw counts of reads on microRNAs ready for QC and differential analysis.