Skip to content

Latest commit

 

History

History
393 lines (256 loc) · 21.1 KB

Thursday_Morning.md

File metadata and controls

393 lines (256 loc) · 21.1 KB

The epidemiology of a clonal pathogen

Co-ordinator: Dr Sion Bayliss

Dependencies

  • Roary
  • Prokka
  • PhyMl
  • FastTree

###Taito

Before we begin log into the taito in a non-interactive shell using ssh username@taito.csc.fi

# Make an interactive instance.
screen -S WorkingSession
sinteractive

# Load the appropriate modules
module load biokit
module load prokka  

If you need to close/suspend your computer and keep and processes running on the node you can press Ctrl+A+D to detach the screen.

To reattach to your session use the command screen -r WorkingSession

Summary

In the previous session you were introduced to whole genome assembly and associated QC. During the next session you will focus on how the results of NGS assembly can be used to study the epidemiology of your sample.

You will be using an expanded version of the Renibacterium salmoninarum dataset you produced during the Wednesday afternoon session.

You will:

  • Identify the pangenome of your sample
  • Identify notable genes from the pangenome
  • Build a phylogenetic tree from the core genes in your pangenome.
  • Compare phylogenetic tree building methods.
  • Visualise the pangenome and phylogenetic trees and learn how they might be used to understand the epidemiology of a pathogen.

NOTE: Mapping to a reference genome vs. aligning assemblies

Today we will be applying a number of analyses to the contigs produced by assembly of raw reads. You should be made aware that there are other equally valid and complementary approachs to studying phylogeny.

One approach, typically referred to as mapping, involves aligning your raw reads onto a reference genome in order to call variants. This is suitable for resequencing projects, phylogeny of clonal populations and is sometimes used to study closely related species.

Mapping is usually considered more accurate than assembly and is less time consuming. I will discuss mapping in more depth during the brief lecture. However, today we will be focusing on how to study epidemiology using core gene concatenates.

Pangenomics

The pangenome is defined as 'describes the full complement of genes in a clade'.

We can sub divide the pangenome into two broad categories, the core and accessory genomes:

  • Core Genome - The core genome is often the first step in population genomics studies and the core genome can be defined in different ways. Often it is considered genes that are present in 100% of isolates within your sample. Alternatively, we may include genes that are missing from only individual, exceptional isolates or those that are present in 100% of genomes but are know to be horizontally acquired. Core genes are often important for a number of reasons such as understanding the typical metabolic pathways within your organism or for the identification of drug or vaccine targets.

  • Accessory Genome - Genes that are present only in a subset of isolates from your sample. Accessory elements can include virulence, resistance or adaptation elements such as phages or plasmids but can also include gene duplications or other horizontally acquired elements such as insertion sequences.

Identifying the pangenome

Firstly, we will identify the pangenome of our sample using a tool called Roary

In order to run Roary we need to a gff file ( general feature format ) for each of the genomes we are interested in studying.

(General feature format)[http://www.sequenceontology.org/gff3.shtml].
This is a file format that is comprised of header lines (denoted by #) and feature lines. 
Each feature line contains information about a specific sequence feature ( e.g. SNP,Repeat,Gene,Promoter,etc ).
Features lines contain nine columns:
Column 1: "seqid"	The ID of the landmark used to establish the coordinate system for the current feature. 
Column 2: "source"	The source is a free text qualifier intended to describe the algorithm or operating procedure that generated this feature. 
Column 3: "type"	The type of the feature
Columns 4 & 5:	"start" and "end" (location of a feature on the seqid)
Column 6: "score"	The score of the feature, a floating point number. 
Column 7: "strand"	The strand of the feature. + for positive strand (relative to the landmark), - for minus strand, and . for features that are not stranded. 
Column 8: "phase"	For features of type "CDS", the phase indicates where the feature begins with reference to the reading frame.
Column 9: "attributes"	A list of feature attributes in the format tag=value. 

These are generated by Prokka (which you used yesterday) and have been provided for you for each subset of isolates in directory /Thursday19thMay/Renibacterium_Example/GFFs/ that you will have copied to your working directory. Navigate to the GFF folder.

Now lets run Roary:

# Taito specific
module load biokit
module load prokka

# Navigate to the folder containing all of the gff files.
cd GFFs

# Run Roary (it will make a results directory for you)
# The *.gff lists all files with the .gff extension, in this case our files of interest. 
roary -p 4 -o pangenome_default -f ./pangenome_default -v *.gff 
Command Breakdown

-p INT    number of threads 
-o STR    clusters output filename 
-f STR    output directory 
-v        verbose output to STDOUT

When checking in the output folder we will find a number of important files including:

summary_statistics.txt - a brief summary of the pangenome

gene_presence_absence.csv - a comprehensive summary of the genes that have been identified in your sample, the clusters to which they have been assigned and other pertinent information.

Using your sftp client (e.g. WinSCP or Filezilla) copy summary_statistics.txt and gene_presence_absence.csv to your local computer.

We could have also created a number of useful plots by passing the -r argument to Roary. Unfortunately, this option is not available on taito so I have provided the plots for you in /Thursday19thMay/Renibacterium_Example/Pangenome/. Using your sftp client (e.g. WinSCP or Filezilla) copy Rplots.pdf to your local computer.

Exploring the Pangenome

Our first place to look is the summary_statistics.txt file.

This will have the format:

Core genes	(99% <= strains <= 100%)	3266
Soft core genes	(95% <= strains < 99%)		0
Shell genes	(15% <= strains < 95%)		147
Cloud genes	(0% <= strains < 15%)		130
Total genes	(0% <= strains <= 100%)		3543

We can see that the Renibacterium isolates have very few accessory genes. Renibacterium salmoninarum has a stable pangenome and a highly clonal population structure. It doesn’t often acquire many (if any) genes through horizontal genes transfer.

If we open Rplots.pdf we can look at a number of informative plots. Most of the tabs have error bars that were generated by bootstrapping (isolates were subsampled in a random order to estimate this impacts on the identification of the pangenome). Some important plots include:

  • Number of new genes - After the introduction of the first genome very few 'new' genes are introduced.
  • Number of conserved genes - The number of conserved genes drops very slowly, beginning to reach its asymptote, indicated that we have identified a stable core genome.
  • Number of genes in the pangenome - As above it has reached its asymptote, it is likely we have discovered the majority of genes in the pangenome.

We can explore the individual pangenome of our sample in more depth by opening the gene_presence_absence.csv in excel or libreoffice

The file has the format:

Gene	Non-unique Gene name	Annotation	No. isolates	No. sequences	Avg sequences per isolate	Genome Fragment	Order within Fragment	Accessory Fragment	Accessory Order with Fragment	QC	Min group size nuc	Max group size nuc	Avg group size nuc	ERR327907	ERR327908	ERR327920	ERR327945	ERR327946	ERR327952	ERR327956	ERR327957	ERR327959	ERR327960	ERR327961	ERR327962	ERR327963	ERR327965	ERR327966	ERR327970
yidD		Putative membrane protein insertion efficiency factor	16	16	1	3	144				386	386	386	PROKKA_00021	PROKKA_00021___41	PROKKA_00021___6929	PROKKA_00021___13743	PROKKA_00021___20585	PROKKA_00021___27463	PROKKA_00021___34295	PROKKA_00021___41115	PROKKA_00021___47933	PROKKA_00021___54765	PROKKA_00043___61633	PROKKA_00021___68499	PROKKA_00021___75321	PROKKA_00021___82161	PROKKA_00021___88981	PROKKA_00021___95825
group_482		hypothetical protein	1	1	1	6	107	168	33		671	671	671											PROKKA_00505___62557

Simplistically, each gene has been clustered on amino acid % identity. Each cluster/gene/CDS has an entry in the file.

We can see summary information such as number of genomes containing the CDS, the order of the gene in the genome/contig, average length of the gene in nucleotides etc.

If we scroll down to the bottom of the file (or sort ascending on column 'No. Isolates') we can see the accessory genes (those not present in all 16 isolates).

Questions:

  • What do you notice about the entries in the Annotation column?
  • Is there any over-represented annotations? Why might this be?

Rs has a large number of insertion sequences (short tracks of self replicating DNA). These repetitive sequences can be a problem for assemblers so only some of them will be assembled correctly in each genome.

Roary, by default, looks for 'true' paralogs i.e. one copy of a gene in one genome that matches one copy of a gene in another (it looks for matching context/synteny).

We can switch off paralog matching using the -s option. What does the pangenome look like with paralog matching switched off?

roary -p 4 -s -o pangenome_noparalogs -f ./pangenome_noparalogs -v *.gff

Using your sftp client (e.g. WinSCP or Filezilla) copy summary_statistics.txt and gene_presence_absence.csv to your local computer.

Questions:

  • Have the core or accessory genes been effected more by the introduction of paralogous genes to our gene clusters?
  • Sort the column on No. Sequences. Do you notice anything about the top hits?

Core gene alignment

Roary can also be use to create an alignment of core genes. We will use this core gene alignment to produce phylogenetic trees.

We can run Roary using the following command BUT this takes a considerable amount of time as all the genes will be individually aligned.

So I have provided you with the files from this a Roary run using the commands below in the directory Thursday19thMay/Renibacterium/Pangenome/:

roary -p 4 -e -z -r -o pangenome_default -f ./pangenome -v *.gff 
Command Breakdown

-e        create a multiFASTA alignment of core genes using PRANK
-z        dont delete intermediate files (does not delete individual gene files)

We can also explore the per gene alignments (because we included the -z option to retain the intermediate files).

If you have software to open/visualise alignments, such as Mega or Seaview, you can follow the section below. Otherwise, I will use these examples as a group.

Navigate to /Thursday19thMay/Renibacterium/Pangenome/pan_genome_sequences/ and search for genes alpha_LP.fa.aln. You will notice the alignment is truncated at the start of the gene in some isolates.This truncation could be due to real genetic variance or assembly and annotation error. 

The gene apbE has a region where a misalignment has introduced a base difference. This is a case where an artifactual base changed has been introduced. 

These alignments will be incorporated into our core gene alignment and may be a source of artefactual variation. We will not take this further today but it is something to consider when building a core gene alignment with Roary.

We have made a pangenome for our sample and have also highlighted a few of the concerns and considerations that you should keep in mind when performing this sort of analysis. However, in summary we have a a huge table as our output. This is quite hard to interpret. Next we are going to generate phylogenetic trees and incorporate these into our analyses.

Phylogeny

In directory /Thursday19thMay/Renibacterium/Pangenome/ you will find a file called core_gene_alignment.aln.

core_gene_alignment.aln - a multifasta file containing an aligned set of core genes for each isolate that was passed to Roary

We are going to build a phylogenetic tree from the sequences contained in this file.

NOTE - a quicker and easier approach to generating a tree from the core genome using contigs would be to align the filtered contigs and building a tree from this alignment. A collection of tools called the Harvest suite can be used to do this very quickly for a large number of genomes. However, Harvest is not available for Windows computers (used by a majority of the working group) so we will not use this today.

There are multiple types of tree building software available for working with genes or whole genome sequences. There is also a range of different tree building methods. These vary from fast and inaccurate (neighbour-joining) to extremely slow and computer intensive methods with a high degree of accuracy (maximum likelihood).

Typically we might use a neighbour joining (NJ) tree to quickly assess the phylogenetic structure of a dataset or apply it to to a very large dataset that would take a prohibitive length of time to analyse with maximum likelihood (ML) or Bayesian methods.

We are going to use FastTree and for demonstration purposes. It should be installed on your local computer. It can generate an approximate-ML or NJ trees quite quickly.

Firstly, we will generate a neighbour joining tree.

# Make a tree/alignment specific folder
mkdir Phylogeny && cd Phylogeny

# copy core_gene_alignment.aln to this folder. 

# Make a nj tree with fastree
fasttree -nosupport -noml -nome -nt core_gene_alignment.aln > fastree_nj.tre
Command Breakdown
  -noml to turn off maximum-likelihood
  -nome to turn off minimum-evolution NNIs and SPRs
  -nt 	input nucleotide alignment. 
  -nosupport to not compute support values

We will visualise the phylogentic tree using a program called Figtree

If we go to Tree -> Midpoint Root we can make our tree easier to visualise by rooting on the midpoint between our two most distant isolates.

As we can see in our data there are two major, deep branching clades and further subclusters within these.

We will produce an ML tree to see if there are any differences in the phylogenetic inference. We will use a program called PhyML on taito.

# Taito 
module purge
module load gcc/4.8.2
module load intelmpi/4.1.3
module load phyml/20150217

# Convert multi-line fasta to a one-line fasta
awk '!/^>/ { printf "%s", $0; n = "\n" } /^>/ { print n $0; n = "" } END { printf "%s", n }' core_gene_alignment.aln > temp.fasta && mv temp.fasta core_gene_alignment.aln

# replace - characters with N characters (PhyML only acccepts ATCGN)
sed 's/\-/N/g' core_gene_alignment.aln > temp.fasta && mv temp.fasta core_gene_alignment.aln

# Convert input to phylip format
perl /wrk/mirossi/shared_all/Thursday19thMay/Renibacterium_Example/fasta2phylip.pl core_gene_alignment.aln core_gene_alignment.phylip

# Run with GTR+Cat 
phyml -i core_gene_alignment.phylip -s BOTH -o tlr -m GTR -c 4 -b 100 -v 0  --no_memory_check --print_site_lnl --run_id gtr_cat4

This should take ~12 minutes to run.

Command Breakdown
-i	input
-s	data type (must be phylip) 
-o	tlr = optimise starting tree topology, the branch lengths and rate parameters (transition/transversion ratio, proportion of invariant sites, gamma distribution parameter). 
-m 	GTR = general time reversible
-c 	4 rate categories for the gamma distribution
-b 	100 bootstraps - PhyML then returns the bootstrap tree with branch lengths and bootstrap values.

Alternatively we can make an approximate ML tree using FastTree on your local machine. This can be 10-100x faster on datasets containing many genomes or datasets that contain a high number of variants per genome.

# We will want to run this in the background using screen 
screen -S ML_tree

fasttree -gtr -gamma -cat 4 -nt < core_gene_alignment.fasta > fastree_ml.tre

# press Ctrl+A+D to detach from the screen and screen -r ML_tree to re-attach the screen. 
# This should take ~55 mins
Command Breakdown

  -gtr -- generalized time-reversible model (nucleotide alignments only)
  -cat # to specify the number of rate categories of sites (default 20)
      or -nocat to use constant rates
  -gamma -- after optimizing the tree under the CAT approximation,
      rescale the lengths to optimize the Gamma20 likelihood
  -nt 	input nucleotide alignment. 

NOTE: For project or publication purposes you will want to generate a ML tree using the gold standards of tree building such as RAXML or PhyML.

Lets compare our consensus ML and NJ trees trees using a script developed by by Morgan N. Price in Adam Arkin's group at Lawrence Berkeley National Lab. You can find the perl script in the directory scripts. For interpreting the results please check http://meta.microbesonline.org/fasttree/treecmp.html

perl <path/directory/scripts>/CompareTree.pl -tree treeFile1 -versus treeFile2

Copy and paste the results into libreOffice/Excel.

The value after frac: is the number of splits that are shared between both trees. The value is quite low. We have differences between our ML and NJ trees.

Lets explore why that might be. Using your sftp client (e.g. WinSCP or Filezilla) copy fastree_nj.tre and core_gene_alignment.gtr_cat4 to your local computer. Open your phyML and NJ tree in Figtree. Root both at the midpoint.

Question:

  • Which splits are different between the trees?
On the PhyML tree expand the Node Labels Tab -> Display -> Labels

This gives the bootstrap support for each of our branches (we bootstrapped our ML tree 100 times) 

As we can see the split containing ERR327960, ERR327940 and ERR327926 has low bootstrap support value. The split was only well supported at that point in ~44 out of 100 replicate bootstrap trees.

To investigate our phylogenetic inference further we might want to:

  • Delve deeper into the data and look at the SNP data per site (are any of the sites/genes unreliable?)
  • Does our data have significant recombination that could effect the inference (ClonalFrameML/BratNextGen/Gubbins)?
  • Build a tree using another approach such as a Bayesian tree (MrBayes).
  • Prepare the data using another approach such as mapping (BWA->samtools) or assembly alignment (Harvest Suite/Parsnp). Have we lost informative SNPs by removing the intergenic sites?

We don't have time to take the analysis further today, but these are considerations for future projects.

We are going to use NJ tree, as it is easiest to interpret, and annotate it with our isolate metadata.

Open the NJ tree in Figtree.

Go to File -> Import Annotation -> Open Metadata.tab

Expand on the Tip Label Option Box to the left of the tree -> Display -> Country 

Questions:

  • Focus on the largest cluster of isolates. What was the historical transmission route of Renibacterium salmoninarum?
  • Using the same procedure Display -> Host answer the question 'how host-specific is Renibacterium salmoninarum?'

Visualising Pangenomes Alongside Phylogeny.

Pangenomes provide another layer of information about bacterial genomes and may help with genealogical inference. For example, two isolates may have the same core genome sequence but one may have acquired a mobile genetic element.

In order to visualise this information we are going to use a recently released tool called Phandango. This tool is very useful as it allows us to interact with our data. Follow the link to open Phandango.

Now we can drag and drop our tree file (ensure the file extension is .tre) and our metadata file (metadata.csv) onto the Phandango window.

Now we can use our scroll wheel to zoom into any region on the figure. We can also subset out tree. More instructions are available at the appropriate tab at the top of the screen.

We can also use Phandago to provide us with more information about our phylogeny. Drag your gene_presence_absence.csv from Roary onto the Phandango window. This allows us to visualise the pangenome alongside the phylogenetic tree for the sample.

Questions:

  • Sample ERR327961 has a large connected stretch of accessory genome. What is it?
  • The four deep branching samples that form a separate lineage have a number distinct set of accessory genes. Identify them using the scripts on the Roary homepage.

Conclusion

This morning you have learned how to identify the pangenome of a cohort of bacterial genomes, how to take the core gene alignment, create phylogenetic trees from this alignment and how to visualise your results at each step.

During the afternoon session you will learn how to identify and remove the signal of recombination from your sample.

A tutorial that takes the analysis on the Renibacterium isolates a few steps further can be found here. It uses some of the software you will be applying this afternoon. This section is optional for those of you that would like to try it in your own time.