- Introduction to basic NGS pipelines
- Quality Assessment
fastqc
/multiqc
trimmomatic
- Mapping
bwa mem
- Example pipeline: Variant Calling
samtools
/bcftools
- Quality Assessment
-
09:00 - 10:00 Intro to NGS pipelines
-
10:00 - 11:00 Short read quality and trimming (part 1)
-
11:00 - 11:30 Coffee break
-
11:30 - 12:30 Short read quality and trimming (part 2)
-
12:30 - 14:00 Lunch break
-
14:00 - 15:00 Mapping
-
15:00 - 16:00 Variant calling (part 1)
-
16:00 - 16:15 Coffee break
-
16:15 - 18:00 Variant calling (part 1)
- Use
fastqc
andmultiqc
- Visualize read quality
- Quality filter and trim reads
- Distinguishing good/bad quality reads
The slides of the introductory talk are available here
The following material and exercises are based on the Next-Gen Sequence Analysis Workshop that took place during the Data Intensive Biology Summer Institute at UC Davis, USA, 2-14/07/2018 (link here)
Learning objectives
- Use
fastqc
andmultiqc
- Visualize read quality
- Quality filter and trim reads
- Distinguishing good/bad quality reads
We will use the following dataset. Run the command to download it locally:
curl -L -o ecoli_ref-5m.fastq.gz https://osf.io/frdz5/download
This is a FASTQ file -- let's take a look at it:
less ecoli_ref-5m.fastq.gz
(use the spacebar to scroll down, and type q
to exit less
)
Question:
- where does the filename come from?
Links:
We're going to use FastQC to summarize the data. We already installed fastqc
above, with the conda command.
Now, run FastQC on the file:
fastqc ecoli_ref-5m.fastq.gz
Now type ls
:
ls -d *fastqc.zip*
to list the files, and you should see:
ecoli_ref-5m_fastqc.zip
Inside the fastqc
directory you will find the report from the fastqc
program. You can also download my versions of the html and zip outputs.
Questions:
- What should you pay attention to in the FastQC report?
Links:
There are several caveats about FastQC - the main one is that it only calculates certain statistics (like duplicated sequences) for subsets of the data (e.g. duplicate sequences are only analyzed for the first 100,000 sequences in each file
Now we're going to do some trimming! We'll be using Trimmomatic, which (as with fastqc
) we've already installed via conda
.
The first thing we'll need are the adapters to trim off (you might need to change the following command to reflect your own setup):
cp /home/fpsom/miniconda3/pkgs/trimmomatic-0.38-1/share/trimmomatic-0.38-1/adapters/TruSeq2-PE.fa .
You can look at the contents of this file with cat TruSeq2-PE.fa
(you can download the same file from here)
Now, to run Trimmomatic on both of them:
trimmomatic SE ecoli_ref-5m.fastq.gz \
ecoli_ref-5m.qc.fq.gz \
ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
LEADING:2 TRAILING:2 \
SLIDINGWINDOW:4:2 \
MINLEN:25
You should see output that looks like this:
...
Input Reads: 5000000 Surviving: 4996146 (99.92%) Dropped: 3854 (0.08%)
TrimmomaticSE: Completed successfully
Questions:
- How do you figure out what the parameters mean?
- How do you figure out what parameters to use?
- What adapters do you use?
- What version of Trimmomatic are we using here? (And FastQC?)
- Do you think parameters are different for RNAseq and genomic data sets?
- What's with these annoyingly long and complicated filenames?
For a discussion of optimal trimming strategies, see MacManes, 2014 -- it's about RNAseq but similar arguments should apply to metagenome assembly.
Links:
Run FastQC again on the trimmed files:
fastqc ecoli_ref-5m.qc.fq.gz
Let's take a look at the output files:
less ecoli_ref-5m.qc.fq.gz
(again, use spacebar to scroll, q
to exit less).
You can also download my versions of the html and zip outputs.
MultiQC aggregates results across many samples into a single report for easy comparison.
Run mulitqc
on both the untrimmed and trimmed files
multiqc .
And now you should see output that looks like this:
[INFO ] multiqc : This is MultiQC v1.6
[INFO ] multiqc : Template : default
[INFO ] multiqc : Searching '.'
Searching 7 files.. [####################################] 100%
[INFO ] fastqc : Found 2 reports
[INFO ] multiqc : Compressing plot data
[INFO ] multiqc : Report : multiqc_report.html
[INFO ] multiqc : Data : multiqc_data
[INFO ] multiqc : MultiQC complete
Questions:
- is the quality trimmed data "better" than before?
- Does it matter that you still have adapters!?
Exercise 1
Try running trimmomatic
with different paramaters in order to further improve / salvage your dataset. Try with different setups and identify the impact they have.
Exercise 2
Run the same steps for the following dataset that comes from the Schurch et al, 2016 yeast RNAseq study:
curl -L https://osf.io/5daup/download -o ERR458493.fastq.gz
curl -L https://osf.io/8rvh5/download -o ERR458494.fastq.gz
curl -L https://osf.io/2wvn3/download -o ERR458495.fastq.gz
curl -L https://osf.io/xju4a/download -o ERR458500.fastq.gz
curl -L https://osf.io/nmqe6/download -o ERR458501.fastq.gz
curl -L https://osf.io/qfsze/download -o ERR458502.fastq.gz
Learning objectives
- Explore coverage
- Understand the basics of mapping-based variant calling
- Learn basics of actually calling variants & visualizing
We will use the dataset you cleaned (trimmed and filtered) during the previous exercise. You can make sure you have the correct data loaded if you get the following output:
ls -la
-rwxrwxrwx 1 fpsom fpsom 59532325 Aug 19 11:56 ERR458493.fastq.gz
-rwxrwxrwx 1 fpsom fpsom 59459449 Aug 19 12:02 ERR458493.qc.fq.gz
-rwxrwxrwx 1 fpsom fpsom 58566854 Aug 19 11:57 ERR458494.fastq.gz
-rwxrwxrwx 1 fpsom fpsom 58496906 Aug 19 12:02 ERR458494.qc.fq.gz
-rwxrwxrwx 1 fpsom fpsom 58114810 Aug 19 11:58 ERR458495.fastq.gz
-rwxrwxrwx 1 fpsom fpsom 58048466 Aug 19 12:03 ERR458495.qc.fq.gz
-rwxrwxrwx 1 fpsom fpsom 102201086 Aug 19 11:59 ERR458500.fastq.gz
-rwxrwxrwx 1 fpsom fpsom 102160038 Aug 19 12:03 ERR458500.qc.fq.gz
-rwxrwxrwx 1 fpsom fpsom 101222099 Aug 19 11:59 ERR458501.fastq.gz
-rwxrwxrwx 1 fpsom fpsom 101182595 Aug 19 12:04 ERR458501.qc.fq.gz
-rwxrwxrwx 1 fpsom fpsom 100585843 Aug 19 12:01 ERR458502.fastq.gz
-rwxrwxrwx 1 fpsom fpsom 100545339 Aug 19 12:10 ERR458502.qc.fq.gz
curl -O https://downloads.yeastgenome.org/sequence/S288C_reference/orf_dna/orf_coding.fasta.gz
gunzip orf_coding.fasta.gz
and look at it:
head orf_coding.fasta
bwa index orf_coding.fasta
bwa mem -t 4 orf_coding.fasta ERR458493.qc.fq.gz > ERR458493.sam
head ERR458493.sam
what does all this mean??
Goal: make it possible to go look at a specific bit of the genome.
samtools faidx orf_coding.fasta
samtools import orf_coding.fasta.fai ERR458493.sam ERR458493.bam
samtools sort ERR458493.bam -o ERR458493.sorted.bam
samtools index ERR458493.sorted.bam
samtools tview ERR458493.sorted.bam orf_coding.fasta
tview
commands of relevance:
-
left and right arrows scroll
-
q
to quit -
CTRL-h and CTRL-l do "big" scrolls
-
Typing
g
allows you to go to a specific location, in this format chromosome:location. Here are some locations you can try out:YLR162W:293
(impressive pileup, shows two clear variants and three other less clear)YDR034C-A:98
(impressive pileup, shows two clear variants)YDR366C:310
(impressive pileup, less clear variants)YLR256W:4420
(impressive pileup, less clear variants)YBL105C:2179
(less depth, shows two clear variants)YDR471W:152
(impressive pileup, shows one clear variant)
Get some summary statistics as well:
samtools flagstat ERR458493.sorted.bam
The output should look like this:
1092728 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
13 + 0 supplementary
0 + 0 duplicates
989419 + 0 mapped (90.55% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
Goal: find places where the reads are systematically different from the genome.
Now we can call variants using samtools mpileup:
samtools mpileup -u -t DP -f orf_coding.fasta ERR458493.sorted.bam | \
bcftools call -mv -Ov > variants.vcf
You can download my version of the variants.vcf
from here.
To look at the entire variants.vcf
file you can do cat variants.vcf
; all of the lines starting with #
are comments. You can use tail variants.vcf
to see the last ~10 lines, which should be all of the called variants.
Exercise 1
Try to run the variant calling pipeline using one of the other ERR458***
samples (i.e. bwa mem
, samtools
etc). Can you identify differences in the new vcf
file?
Exercise 2
In a realistic scenario, you don't want the variants per sample, but rather all the samples together (the mapping process is still done per sample). What would be the new samtools mpileup
command that will incorporate multiple files in the variant calling process?
solution
samtools mpileup -D -u -t DP -f orf_coding.fasta ERR458493.sorted.bam ERR458494.sorted.bam ERR458495.sorted.bam ERR458500.sorted.bam ERR458501.sorted.bam ERR458502.sorted.bam | bcftools call -mv -Ov > variantsAll.vcf
You can download the new "full" vcf
file from here.
- What are the drawbacks to mapping-based variant calling? What are the positives?
- Where do reference genomes come from?