Skip to content

Latest commit

 

History

History
124 lines (85 loc) · 5.88 KB

File metadata and controls

124 lines (85 loc) · 5.88 KB

Workshop: Misc

Hands-on

Assembly statistics

Use assembly-stats to check the quality of some of your assemblies produced during the workshop. Which assembly you produced for Salmonella (considering Illumina and Nanopore data) has the best N50?

Code & README

mamba create -y -p envs/assembly-stats assembly-stats
conda activate envs/assembly-stats
# Just an example, for the E. coli flye assembly
assembly-stats flye_output/assembly.fasta

Assembly comparison

Use quast to compare different statistics for selected assemblies. For example, chose the assemblies you produced with Illumina and Nanopore as an input. Investigate the output and reports. Do you agree with quast regarding what the "best" assembly is?

Check the --help and chose appropriate parameters to run the tool.

Code & REAMDE

mamba create -y -p envs/quast quast
conda activate envs/quast
#quast --help

Investigation of fragmented ORFs via IDEEL

A neat way to do so is using so-called IDEEL plots as initially proposed by Mick Watson. Different people implemented this approach, for example github.com/phiweger/ideel.

Here, we will combine various tasks to get this approach running.

Lets first install the code from this repository and the necessary dependencies via mamba. Then, we generate the protein database index, prepare the input genome files and run the workflow.

mamba create -y -p envs/ideel snakemake prodigal diamond r-ggplot2 r-readr
conda activate envs/ideel

wget "ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz"

git clone https://github.com/phiweger/ideel.git
cd ideel

# get a reference database of protein sequences
# generate an index with diamond
mv ../uniprot_sprot.fasta.gz .
diamond makedb --in uniprot_sprot.fasta.gz -d database_uniprot

# The output of the workflow will be written to --directory. 
mkdir ideel-results
# In there, make a directory called "genomes"
mkdir ideel-results/genomes
# put assemblies in there with .fa file extension
cp ../ecoli-reference.fna ideel-results/genomes/eco-reference.fa
cp ../flye_output/assembly.fasta ideel-results/genomes/eco-consensus-flye.fasta
cp ../eco-consensus-racon.fasta ideel-results/genomes/eco-consensus-racon.fasta
cp ../eco-medaka/consensus.fasta ideel-results/genomes/eco-consensus-medaka.fasta
# The workflow wants the files to have .fa instead of .fasta!
rename 's/\.fasta$/.fa/' ideel-results/genomes/*.fasta

# run the workflow
snakemake --configfile config.json --config db=../database_uniprot.dmnd --directory ideel-results/ --cores 4
cd ..

Investigate the final output plots. How fragmented are your ORFs? Run the workflow on an assembly produced with Illumina and Nanopore. How does the fragmentation of your ORFs change when you polish your assembly?

Hybrid assembly

When you have both Nanopore and Illumina data available for the same sample it can be worth to assemble the data "together". Commonly used for this task is (was) Unicycler.

Exercise

  • make a mamba environment and install Unicycler
  • check out the Unicycler code repository
  • .. and the --help to learn how to use the tool
  • provide short and long reads as input
    • you can decide to use the raw or length-filtered long reads
    • In the section above about Illumina short-read assembly you will find the paths to the Illumina reads
  • how does the resulting assembly compare to the short- and long-read-only assemblies?

Note that Unicycler internally uses SPAdes as a first step. That means that the assembly process will start from short reads. This can introduce problems early that can be also not solved at later steps. Because of that, it can be better to start with an initial long-read assembly (e.g. flye) and then introduce the short-read data later for polishing (e.g. polypolish) instead of using Unicycler.

Remember that you can compare different assemblies via quast!

Nowadays, Nanopore sequencing improved that much that in many cases it does not make sense anymore to run short-read-first assemblies. Ryan Wick, the original developer of Unicycler, Polypolish, Trycycler, ... wrote some very good explanations about this.

Check out his Trycycler tool!

Exercise

  • use Trycycler to combine the assembly results you have from different runs of the same sample into a consensus long-read assembly

Taxonomic classification with Sourmash

We can use sourmash to classify the assembled contigs taxonomically using reference sequences. An alternative could be kraken2 on read level, for example.

  • compute sequence signatures for inputs (compute)
  • select 10000 hashes of input k-mers (--scaled 10000)
  • select 31 as k-mer size (--k 31)
mamba create -y -p envs/sourmash sourmash
conda activate envs/sourmash

# download sourmash GTDB index v202 w/ taxonomy
# GTDB genomic representatives (47.8k genomes), LCA, kmer 31 --> https://sourmash.readthedocs.io/en/latest/databases.html
wget --no-check-certificate https://osf.io/ypsjq/download -O gtdb-rs202.genomic-reps.k31.lca.json.gz 
DB='gtdb-rs202.genomic-reps.k31.lca.json.gz'

# calculate signatures for your genome
sourmash compute --scaled 10000 -k 31 eco-medaka/consensus.fasta -o sourmash.sig
sourmash lca classify --db $DB --query sourmash.sig -o taxonomic-classification.txt

#check results
cat taxonomic-classification.txt

Additional information on Sourmash: Code | Publication