Day 2: Introduction to basic NGS pipelines

Overview

Lead Instructor

Fotis Psomopoulos | @fopsom | email

Co-Instructor(s)

Amel Ghouila | @AmelGhouila | email

Helper(s)

Maria Tsagiopoulou | @tsayo7 email

General Topics

Introduction to basic NGS pipelines
- Quality Assessment
  - fastqc / multiqc
  - trimmomatic
- Mapping
  - bwa mem
- Example pipeline: Variant Calling
  - samtools / bcftools

Schedule

09:00 - 10:00 Intro to NGS pipelines
10:00 - 11:00 Short read quality and trimming (part 1)
11:00 - 11:30 Coffee break
11:30 - 12:30 Short read quality and trimming (part 2)
12:30 - 14:00 Lunch break
14:00 - 15:00 Mapping
15:00 - 16:00 Variant calling (part 1)
16:00 - 16:15 Coffee break
16:15 - 18:00 Variant calling (part 1)

Learning Objectives

Use fastqc and multiqc
Visualize read quality
Quality filter and trim reads
Distinguishing good/bad quality reads

Introduction to basic NGS pipelines

The slides of the introductory talk are available here

Basic DNASeq analysis/Variant calling workflow

The following material and exercises are based on the Next-Gen Sequence Analysis Workshop that took place during the Data Intensive Biology Summer Institute at UC Davis, USA, 2-14/07/2018 (link here)

Quality Assessment

Learning objectives

Use fastqc and multiqc
Visualize read quality
Quality filter and trim reads
Distinguishing good/bad quality reads

We will use the following dataset. Run the command to download it locally:

curl -L -o ecoli_ref-5m.fastq.gz https://osf.io/frdz5/download

This is a FASTQ file -- let's take a look at it:

less ecoli_ref-5m.fastq.gz

(use the spacebar to scroll down, and type q to exit less)

Question:

where does the filename come from?

Links:

FASTQ Format

FastQC

We're going to use FastQC to summarize the data. We already installed fastqc above, with the conda command.

Now, run FastQC on the file:

fastqc ecoli_ref-5m.fastq.gz

Now type ls:

ls -d *fastqc.zip*

to list the files, and you should see:

ecoli_ref-5m_fastqc.zip

Inside the fastqc directory you will find the report from the fastqc program. You can also download my versions of the html and zip outputs.

Questions:

What should you pay attention to in the FastQC report?

Links:

FastQC
FastQC tutorial video

There are several caveats about FastQC - the main one is that it only calculates certain statistics (like duplicated sequences) for subsets of the data (e.g. duplicate sequences are only analyzed for the first 100,000 sequences in each file

Trimmomatic

Now we're going to do some trimming! We'll be using Trimmomatic, which (as with fastqc) we've already installed via conda.

The first thing we'll need are the adapters to trim off (you might need to change the following command to reflect your own setup):

cp /home/fpsom/miniconda3/pkgs/trimmomatic-0.38-1/share/trimmomatic-0.38-1/adapters/TruSeq2-PE.fa .

You can look at the contents of this file with cat TruSeq2-PE.fa (you can download the same file from here)

Now, to run Trimmomatic on both of them:

trimmomatic SE ecoli_ref-5m.fastq.gz \
        ecoli_ref-5m.qc.fq.gz \
        ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
        LEADING:2 TRAILING:2 \
        SLIDINGWINDOW:4:2 \
        MINLEN:25

You should see output that looks like this:

...
Input Reads: 5000000 Surviving: 4996146 (99.92%) Dropped: 3854 (0.08%)
TrimmomaticSE: Completed successfully

Questions:

How do you figure out what the parameters mean?
How do you figure out what parameters to use?
What adapters do you use?
What version of Trimmomatic are we using here? (And FastQC?)
Do you think parameters are different for RNAseq and genomic data sets?
What's with these annoyingly long and complicated filenames?

For a discussion of optimal trimming strategies, see MacManes, 2014 -- it's about RNAseq but similar arguments should apply to metagenome assembly.

Links:

Trimmomatic

FastQC again

Run FastQC again on the trimmed files:

fastqc ecoli_ref-5m.qc.fq.gz

Let's take a look at the output files:

less ecoli_ref-5m.qc.fq.gz

(again, use spacebar to scroll, q to exit less).

You can also download my versions of the html and zip outputs.

MultiQc

MultiQC aggregates results across many samples into a single report for easy comparison.

Run mulitqc on both the untrimmed and trimmed files

multiqc .

And now you should see output that looks like this:

[INFO   ]         multiqc : This is MultiQC v1.6
[INFO   ]         multiqc : Template    : default
[INFO   ]         multiqc : Searching '.'
Searching 7 files..  [####################################]  100%
[INFO   ]          fastqc : Found 2 reports
[INFO   ]         multiqc : Compressing plot data
[INFO   ]         multiqc : Report      : multiqc_report.html
[INFO   ]         multiqc : Data        : multiqc_data
[INFO   ]         multiqc : MultiQC complete

Questions:

is the quality trimmed data "better" than before?
Does it matter that you still have adapters!?

Exercise 1

Try running trimmomatic with different paramaters in order to further improve / salvage your dataset. Try with different setups and identify the impact they have.

Exercise 2

Run the same steps for the following dataset that comes from the Schurch et al, 2016 yeast RNAseq study:

curl -L https://osf.io/5daup/download -o ERR458493.fastq.gz
curl -L https://osf.io/8rvh5/download -o ERR458494.fastq.gz
curl -L https://osf.io/2wvn3/download -o ERR458495.fastq.gz
curl -L https://osf.io/xju4a/download -o ERR458500.fastq.gz
curl -L https://osf.io/nmqe6/download -o ERR458501.fastq.gz
curl -L https://osf.io/qfsze/download -o ERR458502.fastq.gz

Mapping and Variant Calling

Learning objectives

Explore coverage
Understand the basics of mapping-based variant calling
Learn basics of actually calling variants & visualizing

We will use the dataset you cleaned (trimmed and filtered) during the previous exercise. You can make sure you have the correct data loaded if you get the following output:

ls -la

-rwxrwxrwx 1 fpsom fpsom  59532325 Aug 19 11:56 ERR458493.fastq.gz
-rwxrwxrwx 1 fpsom fpsom  59459449 Aug 19 12:02 ERR458493.qc.fq.gz
-rwxrwxrwx 1 fpsom fpsom  58566854 Aug 19 11:57 ERR458494.fastq.gz
-rwxrwxrwx 1 fpsom fpsom  58496906 Aug 19 12:02 ERR458494.qc.fq.gz
-rwxrwxrwx 1 fpsom fpsom  58114810 Aug 19 11:58 ERR458495.fastq.gz
-rwxrwxrwx 1 fpsom fpsom  58048466 Aug 19 12:03 ERR458495.qc.fq.gz
-rwxrwxrwx 1 fpsom fpsom 102201086 Aug 19 11:59 ERR458500.fastq.gz
-rwxrwxrwx 1 fpsom fpsom 102160038 Aug 19 12:03 ERR458500.qc.fq.gz
-rwxrwxrwx 1 fpsom fpsom 101222099 Aug 19 11:59 ERR458501.fastq.gz
-rwxrwxrwx 1 fpsom fpsom 101182595 Aug 19 12:04 ERR458501.qc.fq.gz
-rwxrwxrwx 1 fpsom fpsom 100585843 Aug 19 12:01 ERR458502.fastq.gz
-rwxrwxrwx 1 fpsom fpsom 100545339 Aug 19 12:10 ERR458502.qc.fq.gz

Download and gunzip the reference:

curl -O https://downloads.yeastgenome.org/sequence/S288C_reference/orf_dna/orf_coding.fasta.gz
gunzip orf_coding.fasta.gz

and look at it:

head orf_coding.fasta

Prepare it for mapping:

bwa index orf_coding.fasta

Map!

bwa mem -t 4 orf_coding.fasta ERR458493.qc.fq.gz  > ERR458493.sam

Observe!

head ERR458493.sam

what does all this mean??

Visualize mapping

Goal: make it possible to go look at a specific bit of the genome.

Index the reference genome:

samtools faidx orf_coding.fasta

Convert the SAM file into a BAM file:

samtools import orf_coding.fasta.fai ERR458493.sam ERR458493.bam

Sort the BAM file by position in genome:

samtools sort ERR458493.bam -o ERR458493.sorted.bam

Index the BAM file so that we can randomly access it quickly:

samtools index ERR458493.sorted.bam

Visualize with `tview`:

samtools tview ERR458493.sorted.bam orf_coding.fasta

tview commands of relevance:

left and right arrows scroll
q to quit
CTRL-h and CTRL-l do "big" scrolls
Typing g allows you to go to a specific location, in this format chromosome:location. Here are some locations you can try out:
- YLR162W:293 (impressive pileup, shows two clear variants and three other less clear)
- YDR034C-A:98 (impressive pileup, shows two clear variants)
- YDR366C:310 (impressive pileup, less clear variants)
- YLR256W:4420 (impressive pileup, less clear variants)
- YBL105C:2179 (less depth, shows two clear variants)
- YDR471W:152 (impressive pileup, shows one clear variant)

Get some summary statistics as well:

samtools flagstat ERR458493.sorted.bam

The output should look like this:

1092728 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
13 + 0 supplementary
0 + 0 duplicates
989419 + 0 mapped (90.55% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

Call variants!

Goal: find places where the reads are systematically different from the genome.

Now we can call variants using samtools mpileup:

samtools mpileup -u -t DP -f orf_coding.fasta ERR458493.sorted.bam | \
    bcftools call -mv -Ov > variants.vcf

You can download my version of the variants.vcf from here.

To look at the entire variants.vcf file you can do cat variants.vcf; all of the lines starting with # are comments. You can use tail variants.vcf to see the last ~10 lines, which should be all of the called variants.

Exercise 1

Try to run the variant calling pipeline using one of the other ERR458*** samples (i.e. bwa mem, samtools etc). Can you identify differences in the new vcf file?

Exercise 2

In a realistic scenario, you don't want the variants per sample, but rather all the samples together (the mapping process is still done per sample). What would be the new samtools mpileup command that will incorporate multiple files in the variant calling process?

solution

samtools mpileup -D -u -t DP -f orf_coding.fasta ERR458493.sorted.bam ERR458494.sorted.bam ERR458495.sorted.bam ERR458500.sorted.bam ERR458501.sorted.bam ERR458502.sorted.bam | bcftools call -mv -Ov > variantsAll.vcf

You can download the new "full" vcf file from here.

Discussion points / extra things to cover

What are the drawbacks to mapping-based variant calling? What are the positives?
Where do reference genomes come from?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

2.Day2.md

2.Day2.md

Day 2: Introduction to basic NGS pipelines

Overview

Lead Instructor

Co-Instructor(s)

Helper(s)

General Topics

Schedule

Learning Objectives

Introduction to basic NGS pipelines

Basic DNASeq analysis/Variant calling workflow

Quality Assessment

FastQC

Trimmomatic

FastQC again

MultiQc

Mapping and Variant Calling

Download and gunzip the reference:

Prepare it for mapping:

Map!

Observe!

Visualize mapping

Index the reference genome:

Convert the SAM file into a BAM file:

Sort the BAM file by position in genome:

Index the BAM file so that we can randomly access it quickly:

Visualize with `tview`:

Call variants!

Discussion points / extra things to cover

Files

2.Day2.md

Latest commit

History

2.Day2.md

File metadata and controls

Day 2: Introduction to basic NGS pipelines

Overview

Lead Instructor

Co-Instructor(s)

Helper(s)

General Topics

Schedule

Learning Objectives

Introduction to basic NGS pipelines

Basic DNASeq analysis/Variant calling workflow

Quality Assessment

FastQC

Trimmomatic

FastQC again

MultiQc

Mapping and Variant Calling

Download and gunzip the reference:

Prepare it for mapping:

Map!

Observe!

Visualize mapping

Index the reference genome:

Convert the SAM file into a BAM file:

Sort the BAM file by position in genome:

Index the BAM file so that we can randomly access it quickly:

Visualize with tview:

Call variants!

Discussion points / extra things to cover

Visualize with `tview`: