The idea of this program is threefold:
Generate a workflow to perform a multiple sequence alignment of specific genes. The data is provided by prokka annotation and based on metagenomes. The tree based on this data helps to search for clusters within gene groups.
In a metagenomic dataset the taxonomic annotation is often lacking or unreliable. If we provide reference genes of the same group from isolated organisms, we can use sequence similarity as indication. Reference genes can be chosen based on the function you investigate and/or from the organisms you know which are present (e.g. by 16S data). It should be reminded that this is not a phylogenetic analysis: functional genes undergo evolutional selection and might be similar for this reason, not due to common ancestors.
If the experimental setup allows for it, we can later combine the tree information with abundance information. This allows us in the end to check whether specific gene clusters show a specific behaviour in abundance.
The phnJ gene is responsible for degradation of phosphonates in many bacteria. But not all bacteria containing it are capable of degrading the phosphonate glyphosate. To figure out if my glyphosate-inpacted metagenomes contain the "right" version of phnJ, we build trees of the phnJ gene and include reference genes from known glyphosate degraders. This tree is then linked to the phnJ abundance data to see if specific clusters become more abundant after glyphosate addition (arguebly those capable of degradation).
Install the conda environment with
conda env create -f msa_heatmap.yml
The workflow is written in NextFlow. To run it on the test data set, run
nextflow run .
# or to override default parameters
nextflow run . --gene_name phnM --heatmap_scale logarithmic
To run the example, the following files need to be present in the input_files
directory:
- the list with the tax ids of the reference organisms (no empty lines allowed, file comes with the repository)
- a
gff
- file with annotations of your metagenomic samples (glyph.gff
, add manually) - a
faa
- file with the protein fasta sequences of the genes (glyph.faa
, add manually) - a kallisto table containing the abundance of your genes (
kallisto_abundance_matrix.tsv.gz
, add manually)
you can run the workflow without the faa
and gff
file, this will result in a tree containing only genes from reference organisms.
Additional parameters can also be set (see nextflow.config
for a complete list of available parameters).
First, all protein sequences for a gene of interest (defined in the config file) are extracted from the prokka protein
fasta file. Subsequently, the gene sequence of interest (parameter: gene_name
) are downloaded from UniProt for the
strains/organisms of interest based on the list of tax IDs (parameter: tax_list
). Duplicates in the concatenated
sequences are removed and the remaining sequences are aligned with MAFFT. Afterwards, the alignment gets transformed
to phylip format. Eventually, the phylogenetic tree of the proteins is calculated with RAxML.