
12. The lsaBGC AutoAnalyze Workflow

Rauf Salamzade edited this page Aug 2, 2022 · 2 revisions

About the lsaBGC-AutoAnalyze.py Workflow

lsaBGC-AutoAnalyze.py is a workflow that automatically runs lsaBGC-See.py, lsaBGC-PopGene.py, lsaBGC-Divergence.py, and optionally lsaBGC-DiscoVary.py for every GCF of interest, then produces consolidated result files and visualizations across all GCFs.

Required Inputs:

  • -i / --input_listing : Path to a tab-delimited file listing: (1) sample name, (2) path to Prokka Genbank, and (3) path to Prokka predicted proteome. This file is produced by lsaBGC-Process.py; for example, the Sample_Annotation_Files.txt file produced by lsaBGC-AutoExpansion.py can be used.
  • -g / --gcf_listing_dir : Directory with GCF listing files. E.g. the Updated_GCF_Listings/ directory produced by lsaBGC-AutoExpansion.py.
  • -m / --orthofinder_matrix : OrthoFinder homolog group by sample matrix. E.g. the Orthogroups.expanded.tsv file produced by lsaBGC-AutoExpansion.py.
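  • -w / --expected_similarities : Path to a file listing expected similarities between genomes/samples; the usage text below marks this argument as required. It is most easily computed by running lsaBGC-Ready.py with '-t' specified.
  • -o / --output_directory : The path to the output/workspace directory.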
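
The input listing described above is a three-column, tab-delimited file, so a malformed row or a stale path is easy to catch before starting a long run. The following is a minimal sketch of such a check; read_listing and missing_files are hypothetical helper names used for illustration and are not part of lsaBGC.

```python
import csv
from pathlib import Path


def read_listing(path, expected_columns=3):
    """Read a tab-delimited listing and check the column count of every row.

    Hypothetical helper for illustration; it is not part of lsaBGC.
    """
    rows = []
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        for line_number, row in enumerate(reader, start=1):
            if not row:
                continue  # skip completely blank lines
            if len(row) != expected_columns:
                raise ValueError(
                    f"{path}: line {line_number} has {len(row)} columns, "
                    f"expected {expected_columns}"
                )
            rows.append(tuple(field.strip() for field in row))
    return rows


def missing_files(rows):
    """Return every path (column 2 onward) that does not exist on disk."""
    return [p for row in rows for p in row[1:] if not Path(p).is_file()]
```

Calling read_listing on the listing and then missing_files on the result gives the Genbank or proteome paths that cannot be found. Only use missing_files on listings whose later columns are all file paths.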

Order of Operations:

For each GCF, run:

  1. lsaBGC-See.py
  2. lsaBGC-PopGene.py
  3. lsaBGC-Divergence.py
  4. lsaBGC-DiscoVary.py (optional; run only if metagenomic/raw short-read sequencing data is provided)
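
The order of operations above can be sketched as a small planning function. This is an illustration only: it does not run any program and is not how the wrapper is implemented. Treating every file in the GCF listing directory as one GCF, using the file stem as its identifier, and keying the DiscoVary step on whether an analysis name is given are assumptions made for the sketch; plan_gcf_steps is a hypothetical name.

```python
from pathlib import Path

# Programs run for every GCF, in the order given in the list above.
PER_GCF_STEPS = ["lsaBGC-See.py", "lsaBGC-PopGene.py", "lsaBGC-Divergence.py"]


def plan_gcf_steps(gcf_listing_dir, discovary_analysis_name=None):
    """Return {gcf_id: [programs to run, in order]} for each listing file.

    Illustration only: nothing is executed here. Using the file stem as
    the GCF identifier is an assumption, not lsaBGC behaviour.
    """
    steps = list(PER_GCF_STEPS)
    if discovary_analysis_name is not None:
        # Optional fourth step, only when a DiscoVary analysis is requested.
        steps.append("lsaBGC-DiscoVary.py")
    plan = {}
    for listing in sorted(Path(gcf_listing_dir).iterdir()):
        if listing.is_file():
            plan[listing.stem] = list(steps)
    return plan
```

Printing the returned dictionary before a run gives a quick view of how many GCFs will be processed and whether the DiscoVary step is included.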

At the end, the wrapper generates consolidated reports (across all GCFs) of the lsaBGC-PopGene.py and lsaBGC-Divergence.py results and creates overview visualizations. Final results are written to the Final_Results/ subdirectory, which is described on the Wiki.

Usage:

usage: lsaBGC-AutoAnalyze.py [-h] -o OUTPUT_DIRECTORY [-i INPUT_LISTING] -g GCF_LISTING_DIR -m ORTHOFINDER_MATRIX [-k SAMPLE_SET] [-s SPECIES_PHYLOGENY] -w EXPECTED_SIMILARITIES [-p BGC_PREDICTION_SOFTWARE] [-u POPULATIONS] [-l DISCOVARY_INPUT_LISTING] [-n DISCOVARY_ANALYSIS_NAME]
                             [-c CPUS]

        Program: lsaBGC-AutoAnalyze.py
        Author: Rauf Salamzade
        Affiliation: Kalan Lab, UW Madison, Department of Medical Microbiology and Immunology

        Wrapper program to automate running lsaBGC analytical programs for each GCF.

        Iteratively runs lsaBGC-See.py, lsaBGC-PopGene.py, lsaBGC-Divergence.py, and optionally lsaBGC-DiscoVary.py for
        each GCF in a GCF listings directory, produced by lsaBGC-Ready, lsaBGC-Cluster, or lsaBGC-AutoExpansion.py.


optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT_DIRECTORY, --output_directory OUTPUT_DIRECTORY
                        Parent output/workspace directory.
  -i INPUT_LISTING, --input_listing INPUT_LISTING
                        Path to tab delimited file listing: (1) sample name
                        (2) path to whole-genome Genbank and (3) path to whole-genome predicted proteome
                        (an output of lsaBGC-Ready.py or lsaBGC-AutoExpansion.py).
  -g GCF_LISTING_DIR, --gcf_listing_dir GCF_LISTING_DIR
                        Directory with GCF listing files.
  -m ORTHOFINDER_MATRIX, --orthofinder_matrix ORTHOFINDER_MATRIX
                        OrthoFinder homolog group by sample matrix.
  -k SAMPLE_SET, --sample_set SAMPLE_SET
                        Sample set to keep in analysis. Should be file with one sample id per line.
  -s SPECIES_PHYLOGENY, --species_phylogeny SPECIES_PHYLOGENY
                        Path to species phylogeny. If not provided a FastANI based neighborjoining tree will be constructed and used.
  -w EXPECTED_SIMILARITIES, --expected_similarities EXPECTED_SIMILARITIES
                        Path to file listing expected similarities between genomes/samples. This is
                        computed most easily by running lsaBGC-Ready.py with '-t' specified, which will estimate
                        sample to sample similarities based on alignment used to create species phylogeny.
  -p BGC_PREDICTION_SOFTWARE, --bgc_prediction_software BGC_PREDICTION_SOFTWARE
                        Software used to predict BGCs (Options: antiSMASH, DeepBGC, GECCO).
                        Default is antiSMASH.
  -u POPULATIONS, --populations POPULATIONS
                        Path to user defined populations/groupings file. Tab delimited with 2 columns: (1) sample name and (2) group identifier.
  -l DISCOVARY_INPUT_LISTING, --discovary_input_listing DISCOVARY_INPUT_LISTING
                        Sequencing readsets for DiscoVary analysis. Tab delimited file listing: (1) sample name, (2) forward readset, (3) reverse readset for metagenomic/isolate sequencing data.
  -n DISCOVARY_ANALYSIS_NAME, --discovary_analysis_name DISCOVARY_ANALYSIS_NAME
                        Identifier/name for DiscoVary. Not providing this parameter will avoid running lsaBGC-DiscoVary step.
  -c CPUS, --cpus CPUS  Total number of cpus to use.
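
When the workflow is launched from a script, it can help to assemble the argument list in one place. The sketch below uses only flags that appear in the usage text above; the function name build_autoanalyze_command and its keyword names are hypothetical, and the function only builds the list without running anything.

```python
def build_autoanalyze_command(output_directory, gcf_listing_dir,
                              orthofinder_matrix, expected_similarities,
                              input_listing=None, species_phylogeny=None,
                              populations=None, discovary_input_listing=None,
                              discovary_analysis_name=None, cpus=None):
    """Build an argument list for lsaBGC-AutoAnalyze.py (hypothetical helper).

    The first four arguments are the ones shown without brackets in the
    usage line; the rest are added only when a value is supplied.
    """
    cmd = [
        "lsaBGC-AutoAnalyze.py",
        "-o", str(output_directory),
        "-g", str(gcf_listing_dir),
        "-m", str(orthofinder_matrix),
        "-w", str(expected_similarities),
    ]
    optional = [
        ("-i", input_listing),
        ("-s", species_phylogeny),
        ("-u", populations),
        ("-l", discovary_input_listing),
        ("-n", discovary_analysis_name),
        ("-c", cpus),
    ]
    for flag, value in optional:
        if value is not None:
            cmd.extend([flag, str(value)])
    return cmd
```

If lsaBGC is installed and on the PATH, the returned list can be passed to subprocess.run(cmd, check=True); shlex.join(cmd) gives a copy-pasteable command string for logging.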