07. Visualizing GCFs Across Phylogenies
The program lsaBGC-See.py is a relatively see-mple program which takes in a list of BGCs belonging to a GCF (a so-called GCF listing file) and produces visuals of how these BGCs look across a phylogeny. Unlike the other programs, it does not generate a report; it only produces plots in PDF format or track files for the Interactive Tree Of Life (iTOL), allowing phylogeny-based visualization of the gene structure of the BGCs in the GCF.
lsaBGC-See.py can either take in a species phylogeny in Newick format or construct a GCF-specific phylogeny. Depending on the options specified, the GCF-specific phylogeny is built from single nucleotide positions which are nearly core (found in 90% of samples) or from single-copy-core genes. It then uses the phylogeny to structure and visualize the gene architectures of BGCs across samples, flipping them where appropriate to make homology easier to identify. A key feature of the framework is that if some samples have multiple BGCs (e.g., segments of a single BGC broken up due to assembly fragmentation), it modifies the phylogenetic tree using the ete3 toolkit to create sister leaves for such samples, so that the segments of the BGC can be displayed alongside each other on the phylogenetic tree.
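The sister-leaf idea described above can be sketched in plain Python. This is only an illustration and not how lsaBGC-See.py itself is implemented (it uses ete3): the nested-tuple tree representation, the function names, the sample and BGC names, and the `sample|bgc` label format are all assumptions made for this example.

```python
def expand_multi_bgc_leaves(tree, bgcs_by_sample):
    """Replace each leaf whose sample has several BGCs with a clade of sister leaves.

    A leaf is a string (sample name); an internal node is a tuple of children.
    """
    if isinstance(tree, str):
        bgcs = bgcs_by_sample.get(tree, [])
        if len(bgcs) > 1:
            # One sister leaf per BGC segment, labelled "sample|bgc" (hypothetical format).
            return tuple(f"{tree}|{bgc}" for bgc in bgcs)
        return tree
    return tuple(expand_multi_bgc_leaves(child, bgcs_by_sample) for child in tree)


def to_newick(tree):
    """Serialize the nested-tuple tree as a topology-only Newick string (no ';')."""
    if isinstance(tree, str):
        return tree
    return "(" + ",".join(to_newick(child) for child in tree) + ")"


# Hypothetical species tree and BGC assignments.
species_tree = (("sampleA", "sampleB"), "sampleC")
bgcs_by_sample = {
    "sampleA": ["bgc1"],
    "sampleB": ["bgc2_part1", "bgc2_part2"],  # one BGC split across two scaffolds
    "sampleC": [],
}

expanded = expand_multi_bgc_leaves(species_tree, bgcs_by_sample)
print(to_newick(expanded) + ";")
```

Here only sampleB is turned into a small clade, so its two BGC segments can be drawn on adjacent rows next to the tree.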
Additionally, as can be seen in the figure below, lsaBGC-See.py is a great way to visually explore and identify population-level differences in BGC carriage or gene content. Such analyses can be explored in more depth using lsaBGC-PopGene.py.
Fun Fact: lsaBGC-See.py was the original idea behind developing software for lineage-specific analysis of BGCs. It was really difficult to give up the pun name BGSee for the software.
usage: lsaBGC-See.py [-h] -g GCF_LISTING -m ORTHOFINDER_MATRIX -o OUTPUT_DIRECTORY [-i GCF_ID] [-s SPECIES_PHYLOGENY]
[-p BGC_PREDICTION_SOFTWARE] [-c CPUS] [-k SAMPLE_SET] [-y] [-f]
Program: lsaBGC-See.py
Author: Rauf Salamzade
Affiliation: Kalan Lab, UW Madison, Department of Medical Microbiology and Immunology
This program will create automatic visuals depicting genes across a species or BGC-specific phylogeny as well as
iTol tracks visualizing BGCs from a single GCF across a species tree. Alternatively, if a species tree is not
available, it will also create a phylogeny based on single copy core genes of the GCF.
optional arguments:
-h, --help show this help message and exit
-g GCF_LISTING, --gcf_listing GCF_LISTING
BGC listings file for a gcf. Tab delimited: 1st column lists sample name while the 2nd column is the path to a BGC prediction in Genbank format.
-m ORTHOFINDER_MATRIX, --orthofinder_matrix ORTHOFINDER_MATRIX
OrthoFinder matrix.
-o OUTPUT_DIRECTORY, --output_directory OUTPUT_DIRECTORY
Output directory.
-i GCF_ID, --gcf_id GCF_ID
GCF identifier.
-s SPECIES_PHYLOGENY, --species_phylogeny SPECIES_PHYLOGENY
The species phylogeny in Newick format.
-p BGC_PREDICTION_SOFTWARE, --bgc_prediction_software BGC_PREDICTION_SOFTWARE
Software used to predict BGCs (Options: antiSMASH, DeepBGC, GECCO).
Default is antiSMASH.
-c CPUS, --cpus CPUS Number of cpus to use for MCL step.
-k SAMPLE_SET, --sample_set SAMPLE_SET
Sample set to keep in analysis. Should be file with one sample id per line.
-y, --create_gcf_phylogeny
Create phylogeny from sequences of homolog groups in GCF.
-f, --only_scc Use only single-copy-core homolog groups for constructing GCF phylogeny.
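The GCF listing file expected by `-g` is described above as tab-delimited, with the sample name in the first column and the path to a BGC GenBank file in the second. A minimal sketch of reading such a file is shown below. The function name and the example paths are hypothetical, and this is not the parser lsaBGC uses internally.

```python
import csv
import io
from collections import defaultdict


def read_gcf_listing(handle):
    """Group BGC GenBank paths by sample from a two-column, tab-delimited listing."""
    bgcs_by_sample = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or incomplete lines
        sample, genbank_path = row[0], row[1]
        bgcs_by_sample[sample].append(genbank_path)
    return dict(bgcs_by_sample)


# Hypothetical listing; in practice this would be open("GCF_listing.txt").
listing = io.StringIO(
    "sampleA\t/path/to/sampleA/region001.gbk\n"
    "sampleB\t/path/to/sampleB/region004.gbk\n"
    "sampleB\t/path/to/sampleB/region009.gbk\n"
)

bgcs_by_sample = read_gcf_listing(listing)
for sample, paths in bgcs_by_sample.items():
    print(sample, len(paths))
```

A sample that appears on more than one line, such as sampleB here, is one that has multiple BGCs in the GCF.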
lsaBGC-ComprehenSeeIve.py creates a phylogenetic heatmap. As with lsaBGC-See.py, the phylogeny is either a user-provided species tree or a GCF-specific phylogeny. Whereas lsaBGC-See.py only showcases BGCs identified as belonging to a certain GCF, lsaBGC-ComprehenSeeIve.py assesses the OrthoFinder homolog group by sample presence/absence matrix to show whether samples have homolog groups found in the focal GCF, even if they were not deemed to possess an instance of the GCF.
In contrast to lsaBGC-See.py, lsaBGC-ComprehenSeeIve.py takes advantage of the comprehensive homolog group inference performed upfront. It is intended as a supplement to lsaBGC-Easy.py when lsaBGC-AutoExpansion.py is skipped (the current default), allowing users to identify whether some instances of the GCF might have been missed (e.g., because they were fragmented in a sample's genome).
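The presence/absence assessment can be illustrated with a short sketch. It assumes the OrthoFinder matrix is a tab-delimited table with a header row of sample names and one row per homolog group, where an empty cell means the sample has no gene in that group. The function name, the homolog group IDs, the gene IDs, and the sample names are hypothetical, and the real program may process the matrix differently.

```python
import csv
import io


def gcf_presence_by_sample(matrix_handle, gcf_homolog_groups):
    """Return, per sample, the set of focal-GCF homolog groups it carries."""
    reader = csv.reader(matrix_handle, delimiter="\t")
    samples = next(reader)[1:]  # first header column holds the homolog group ID
    present = {sample: set() for sample in samples}
    for row in reader:
        if not row or row[0] not in gcf_homolog_groups:
            continue
        for sample, cell in zip(samples, row[1:]):
            if cell.strip():  # non-empty cell: at least one gene in this group
                present[sample].add(row[0])
    return present


# Hypothetical matrix with three samples and three homolog groups.
matrix = io.StringIO(
    "Orthogroup\tsampleA\tsampleB\tsampleC\n"
    "OG0000001\tA_001\tB_010, B_011\t\n"
    "OG0000002\tA_002\t\tC_005\n"
    "OG0000003\tA_003\tB_012\tC_006\n"
)
gcf_homolog_groups = {"OG0000001", "OG0000002"}  # groups found in the focal GCF
samples_with_gcf = {"sampleA"}                   # samples with a called GCF instance

present = gcf_presence_by_sample(matrix, gcf_homolog_groups)

# Samples without a called instance that still carry focal homolog groups
# are candidates for a missed (e.g., fragmented) instance of the GCF.
candidates = sorted(s for s, hgs in present.items() if hgs and s not in samples_with_gcf)
print(candidates)
```

Samples flagged this way would be the ones worth inspecting in the heatmap for a partial or fragmented copy of the GCF.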
Usage is identical to lsaBGC-See.py.