07. Visualizing GCFs Across Phylogenies

Rauf Salamzade edited this page Apr 17, 2023 · 13 revisions

lsaBGC-See.py

The program lsaBGC-See.py is a relatively see-mple program that takes in a list of BGCs belonging to a GCF (a so-called GCF listing file) and produces visuals of how these BGCs look across a phylogeny. It differs fundamentally from the other programs in that it does not generate a report. It only produces plots in PDF format or track files for the Interactive Tree of Life (iTOL), allowing phylogeny-based visualization of the gene structure of BGCs in the GCF.

lsaBGC-See.py can either take in a species phylogeny in Newick format or construct a GCF-specific phylogeny from single-nucleotide positions that are nearly core (found in 90% of samples) or from single-copy core genes, depending on the options specified. It then uses the phylogeny to order and visualize the gene architectures of BGCs across samples, flipping them where appropriate so that homology is easier to identify. A key feature of the framework is that when a sample has multiple BGCs (e.g., segments of a single BGC broken up by assembly fragmentation), the program modifies the phylogenetic tree using the ete3 toolkit to create sister leaves for that sample, so the segments of the BGC are displayed alongside each other on the tree.
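The sister-leaf idea can be illustrated with a minimal sketch. This is not the program's implementation: lsaBGC-See.py uses the ete3 toolkit, whereas the stand-in below rewrites a plain Newick string with a regular expression so that it runs without extra dependencies. The function name `add_sister_leaves`, the `_1`, `_2` suffixes, and the zero-length branches are all assumptions made for illustration, and the sketch assumes unquoted leaf names with no whitespace.

```python
import re

def add_sister_leaves(newick, bgc_counts):
    """Replace each leaf that has n > 1 BGCs with a clade of n sister leaves.

    Hypothetical helper: the suffix scheme and zero-length branches are
    illustrative choices, not taken from lsaBGC-See.py.
    """
    def expand(match):
        name = match.group(1)
        n = bgc_counts.get(name, 1)
        if n <= 1:
            return name  # single-BGC samples keep their original leaf
        sisters = ",".join(f"{name}_{i}:0" for i in range(1, n + 1))
        return f"({sisters})"

    # In Newick, a leaf name directly follows "(" or ","; internal node
    # labels follow ")" and branch lengths follow ":", so neither matches.
    return re.sub(r"(?<=[(,])([^(),:;]+)", expand, newick)

tree = "((sampleA:1,sampleB:1):1,sampleC:2);"
expanded = add_sister_leaves(tree, {"sampleB": 2})
print(expanded)
```

Here sampleB carries two BGC segments, so its leaf becomes a two-leaf clade that keeps the original branch length, and each segment can be drawn on its own row next to the other.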

Additionally, as can be seen in the figure below, lsaBGC-See.py is a great way to visually explore and identify population-level differences in BGC carriage or gene content. Such differences can be explored in more depth using lsaBGC-PopGene.py.

Fun Fact: lsaBGC-See.py was the original idea behind developing software for lineage-specific analysis of BGCs. It was really difficult to give up the pun name BGSee for the software.

Usage

usage: lsaBGC-See.py [-h] -g GCF_LISTING -m ORTHOFINDER_MATRIX -o OUTPUT_DIRECTORY [-i GCF_ID] [-s SPECIES_PHYLOGENY]
                     [-p BGC_PREDICTION_SOFTWARE] [-c CPUS] [-k SAMPLE_SET] [-y] [-f]

        Program: lsaBGC-See.py
        Author: Rauf Salamzade
        Affiliation: Kalan Lab, UW Madison, Department of Medical Microbiology and Immunology

        This program will create automatic visuals depicting genes across a species or BGC-specific phylogeny as well as
        iTol tracks visualizing BGCs from a single GCF across a species tree. Alternatively, if a species tree is not
        available, it will also create a phylogeny based on single copy core genes of the GCF.


optional arguments:
  -h, --help            show this help message and exit
  -g GCF_LISTING, --gcf_listing GCF_LISTING
                        BGC listings file for a gcf. Tab delimited: 1st column lists sample name while the 2nd column is the path to a BGC prediction in Genbank format.
  -m ORTHOFINDER_MATRIX, --orthofinder_matrix ORTHOFINDER_MATRIX
                        OrthoFinder matrix.
  -o OUTPUT_DIRECTORY, --output_directory OUTPUT_DIRECTORY
                        Output directory.
  -i GCF_ID, --gcf_id GCF_ID
                        GCF identifier.
  -s SPECIES_PHYLOGENY, --species_phylogeny SPECIES_PHYLOGENY
                        The species phylogeny in Newick format.
  -p BGC_PREDICTION_SOFTWARE, --bgc_prediction_software BGC_PREDICTION_SOFTWARE
                        Software used to predict BGCs (Options: antiSMASH, DeepBGC, GECCO).
                        Default is antiSMASH.
  -c CPUS, --cpus CPUS  Number of cpus to use for MCL step.
  -k SAMPLE_SET, --sample_set SAMPLE_SET
                        Sample set to keep in analysis. Should be file with one sample id per line.
  -y, --create_gcf_phylogeny
                        Create phylogeny from sequences of homolog groups in GCF.
  -f, --only_scc        Use only single-copy-core homolog groups for constructing GCF phylogeny.
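The help text above describes the GCF listing file as tab-delimited, with the sample name in the first column and the path to a BGC Genbank file in the second. The following is a minimal sketch of reading such a file and finding samples with more than one BGC. The function name `read_gcf_listing`, the sample names, and the paths are hypothetical, and the sketch assumes the file has no header line.

```python
import csv
import io
from collections import defaultdict

def read_gcf_listing(handle):
    """Return {sample: [genbank_path, ...]} from a two-column, tab-delimited listing."""
    bgcs_by_sample = defaultdict(list)
    for line_number, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != 2:
            raise ValueError(f"line {line_number}: expected 2 columns, found {len(row)}")
        sample, genbank_path = row
        bgcs_by_sample[sample].append(genbank_path)
    return dict(bgcs_by_sample)

# Hypothetical listing: sampleB contributes two BGC segments to the GCF.
example = io.StringIO(
    "sampleA\t/path/to/sampleA_region1.gbk\n"
    "sampleB\t/path/to/sampleB_region1.gbk\n"
    "sampleB\t/path/to/sampleB_region2.gbk\n"
)
listing = read_gcf_listing(example)
multi_bgc_samples = sorted(s for s, paths in listing.items() if len(paths) > 1)
print(multi_bgc_samples)
```

A check like this is a quick way to see which samples appear more than once in the listing before running the program.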

lsaBGC-ComprehenSeeIve.py

This program creates a phylogenetic heatmap. As with lsaBGC-See.py, the phylogeny is either a user-provided species tree or a GCF-specific phylogeny. Unlike lsaBGC-See.py, which only showcases BGCs identified as belonging to a certain GCF, lsaBGC-ComprehenSeeIve.py uses the OrthoFinder homolog-group-by-sample presence/absence matrix to show whether samples have homolog groups found in the focal GCF, even if they were not deemed to possess an instance of the GCF.
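The presence/absence idea can be sketched roughly as follows. This is an illustration rather than the program's code. It assumes the matrix is laid out like OrthoFinder's Orthogroups.tsv: a header row with sample names, one homolog group per row, and an empty cell where a sample has no gene in that group. The function name `gcf_presence_absence` and all identifiers in the example data are hypothetical.

```python
import csv
import io

def gcf_presence_absence(matrix_handle, gcf_homolog_groups):
    """Return {sample: {homolog_group: bool}} for homolog groups in the focal GCF."""
    reader = csv.reader(matrix_handle, delimiter="\t")
    samples = next(reader)[1:]  # first header cell labels the homolog group column
    presence = {sample: {} for sample in samples}
    for row in reader:
        if not row or row[0] not in gcf_homolog_groups:
            continue
        # Pad short rows so trailing empty cells count as absent.
        cells = row[1:] + [""] * (len(samples) - len(row) + 1)
        for sample, cell in zip(samples, cells):
            presence[sample][row[0]] = bool(cell.strip())
    return presence

# Hypothetical matrix: sampleB lacks OG0000002 but carries OG0000001.
matrix = io.StringIO(
    "Orthogroup\tsampleA\tsampleB\tsampleC\n"
    "OG0000001\tA_1\tB_7\tC_3\n"
    "OG0000002\tA_2, A_9\t\tC_4\n"
    "OG0000003\tA_5\tB_8\t\n"
)
presence = gcf_presence_absence(matrix, {"OG0000001", "OG0000002"})
print(presence["sampleB"])
```

A sample such as sampleB, which carries some but not all of the GCF's homolog groups, is the kind of case a heatmap makes visible even when no BGC from that sample was assigned to the GCF.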

lsaBGC-ComprehenSeeIve.py thus takes advantage of the comprehensive homolog group inference performed upfront. It is intended as a supplement to lsaBGC-Easy.py when lsaBGC-AutoExpansion.py is skipped (the current default), helping users identify whether some instances of the GCF might have been missed (e.g., because they were fragmented in a sample's genome).

Usage is identical to lsaBGC-See.py.
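Because the two programs share the same interface, one set of arguments can drive both. The sketch below assembles the command lines using only flags shown in the lsaBGC-See.py usage text. The helper `build_command` and every file and directory name are hypothetical placeholders, and the commands are only printed, not executed.

```python
import shlex

def build_command(program, gcf_listing, orthofinder_matrix, output_directory,
                  gcf_id=None, species_phylogeny=None, create_gcf_phylogeny=False):
    """Assemble an argument list using flags from the usage text."""
    command = [program, "-g", gcf_listing, "-m", orthofinder_matrix,
               "-o", output_directory]
    if gcf_id is not None:
        command += ["-i", gcf_id]
    if species_phylogeny is not None:
        command += ["-s", species_phylogeny]
    if create_gcf_phylogeny:
        command.append("-y")
    return command

# Placeholder inputs shared by both programs.
shared = dict(gcf_listing="GCF_1.txt", orthofinder_matrix="Orthogroups.tsv",
              gcf_id="GCF_1", species_phylogeny="species.tre")
see_cmd = build_command("lsaBGC-See.py", output_directory="See_GCF_1", **shared)
heatmap_cmd = build_command("lsaBGC-ComprehenSeeIve.py",
                            output_directory="ComprehenSeeIve_GCF_1", **shared)
for command in (see_cmd, heatmap_cmd):
    print(shlex.join(command))
```

To actually run a command, the list could be passed to `subprocess.run(command, check=True)`, assuming the lsaBGC programs are on the `PATH`.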
