12. The lsaBGC AutoAnalyze Workflow
Rauf Salamzade edited this page Aug 2, 2022 · 2 revisions
lsaBGC-AutoAnalyze.py is a workflow that automatically runs lsaBGC-See.py, lsaBGC-PopGene.py, lsaBGC-Divergence.py, and optionally lsaBGC-DiscoVary.py for all GCFs of interest, and produces a few consolidated result files and visualizations at the end.
- -i/--input_listing: Path to a tab-delimited file listing: (1) sample name, (2) path to Prokka GenBank, and (3) path to Prokka predicted proteome. This file is produced by lsaBGC-Process.py. E.g. the Sample_Annotation_Files.txt file produced by lsaBGC-AutoExpansion.py.
- -g/--gcf_listing_dir: Directory with GCF listing files. E.g. the Updated_GCF_Listings/ directory produced by lsaBGC-AutoExpansion.py.
- -m/--orthofinder_matrix: OrthoFinder homolog group by sample matrix. E.g. the Orthogroups.expanded.tsv file produced by lsaBGC-AutoExpansion.py.
- -o/--output_directory: The path to the output/workspace directory.
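The three-column layout of the input listing can be checked before launching the workflow. The following is a minimal sketch and not part of lsaBGC: the helper name `read_input_listing` and the example sample names and paths are hypothetical, and the only format assumption is the tab-delimited, three-column layout described above.

```python
import csv
import io


def read_input_listing(handle):
    """Parse a tab-delimited listing: sample name, GenBank path, proteome path.

    Hypothetical helper for illustration; not part of lsaBGC.
    Returns a dict mapping sample name -> (genbank_path, proteome_path).
    """
    samples = {}
    for line_no, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not row or all(not field.strip() for field in row):
            continue  # skip blank lines
        if len(row) != 3:
            raise ValueError(f"line {line_no}: expected 3 columns, found {len(row)}")
        sample, genbank, proteome = (field.strip() for field in row)
        if sample in samples:
            raise ValueError(f"line {line_no}: duplicate sample name {sample!r}")
        samples[sample] = (genbank, proteome)
    return samples


# Example with made-up sample names and paths.
example = io.StringIO(
    "sampleA\t/data/sampleA.gbk\t/data/sampleA.faa\n"
    "sampleB\t/data/sampleB.gbk\t/data/sampleB.faa\n"
)
listing = read_input_listing(example)
print(sorted(listing))
```

To check a real listing, open the file and pass the handle, for example `read_input_listing(open("Sample_Annotation_Files.txt", newline=""))`.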
For each GCF, run:
- lsaBGC-See.py
- lsaBGC-PopGene.py
- lsaBGC-Divergence.py
- lsaBGC-DiscoVary.py (optional; run only if metagenomic/raw short-read sequencing data is provided)
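The per-GCF iteration above can be pictured as a simple nested loop. This is only an illustrative sketch of the ordering, not the wrapper's actual implementation. It assumes that every regular file in the GCF listing directory is one GCF listing and that the file stem can serve as the GCF identifier; the file names in the demonstration are made up, and no sub-program arguments are shown because they are not covered here.

```python
from pathlib import Path
import tempfile

# Programs named in the list above, in the order given there.
CORE_STEPS = ["lsaBGC-See.py", "lsaBGC-PopGene.py", "lsaBGC-Divergence.py"]
OPTIONAL_STEP = "lsaBGC-DiscoVary.py"


def plan_steps(gcf_listing_dir, run_discovary=False):
    """Return (gcf_id, program) pairs in the order they would be run.

    Assumption: every regular file in the directory is one GCF listing and
    the file stem serves as the GCF identifier.
    """
    steps = CORE_STEPS + ([OPTIONAL_STEP] if run_discovary else [])
    plan = []
    for listing in sorted(Path(gcf_listing_dir).iterdir()):
        if listing.is_file():
            plan.extend((listing.stem, program) for program in steps)
    return plan


# Demonstration on a throwaway directory with two made-up listing files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("GCF_1.txt", "GCF_2.txt"):
        (Path(tmp) / name).write_text("")
    demo_plan = plan_steps(tmp, run_discovary=True)

for gcf_id, program in demo_plan:
    print(gcf_id, program)
```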
At the end, the wrapper will additionally generate consolidated reports (for all GCFs) of lsaBGC-PopGene.py and lsaBGC-Divergence.py results and create overview visualizations. Final results can be found in the subdirectory Final_Results/, which is described on the Wiki here.
usage: lsaBGC-AutoAnalyze.py [-h] -o OUTPUT_DIRECTORY [-i INPUT_LISTING] -g GCF_LISTING_DIR -m ORTHOFINDER_MATRIX
                             [-k SAMPLE_SET] [-s SPECIES_PHYLOGENY] -w EXPECTED_SIMILARITIES
                             [-p BGC_PREDICTION_SOFTWARE] [-u POPULATIONS] [-l DISCOVARY_INPUT_LISTING]
                             [-n DISCOVARY_ANALYSIS_NAME] [-c CPUS]
Program: lsaBGC-AutoAnalyze.py
Author: Rauf Salamzade
Affiliation: Kalan Lab, UW Madison, Department of Medical Microbiology and Immunology
Wrapper program to automate running lsaBGC analytical programs for each GCF.
Iteratively runs lsaBGC-See.py, lsaBGC-PopGene.py, lsaBGC-Divergence.py, and optionally lsaBGC-DiscoVary.py for
each GCF in a GCF listings directory, produced by lsaBGC-Ready, lsaBGC-Cluster, or lsaBGC-AutoExpansion.py.
optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT_DIRECTORY, --output_directory OUTPUT_DIRECTORY
                        Parent output/workspace directory.
  -i INPUT_LISTING, --input_listing INPUT_LISTING
                        Path to tab delimited file listing: (1) sample name
                        (2) path to whole-genome Genbank and (3) path to whole-genome predicted proteome
                        (an output of lsaBGC-Ready.py or lsaBGC-AutoExpansion.py).
  -g GCF_LISTING_DIR, --gcf_listing_dir GCF_LISTING_DIR
                        Directory with GCF listing files.
  -m ORTHOFINDER_MATRIX, --orthofinder_matrix ORTHOFINDER_MATRIX
                        OrthoFinder homolog group by sample matrix.
  -k SAMPLE_SET, --sample_set SAMPLE_SET
                        Sample set to keep in analysis. Should be file with one sample id per line.
  -s SPECIES_PHYLOGENY, --species_phylogeny SPECIES_PHYLOGENY
                        Path to species phylogeny. If not provided a FastANI based neighborjoining tree will be constructed and used.
  -w EXPECTED_SIMILARITIES, --expected_similarities EXPECTED_SIMILARITIES
                        Path to file listing expected similarities between genomes/samples. This is
                        computed most easily by running lsaBGC-Ready.py with '-t' specified, which will estimate
                        sample to sample similarities based on alignment used to create species phylogeny.
  -p BGC_PREDICTION_SOFTWARE, --bgc_prediction_software BGC_PREDICTION_SOFTWARE
                        Software used to predict BGCs (Options: antiSMASH, DeepBGC, GECCO).
                        Default is antiSMASH.
  -u POPULATIONS, --populations POPULATIONS
                        Path to user defined populations/groupings file. Tab delimited with 2 columns: (1) sample name and (2) group identifier.
  -l DISCOVARY_INPUT_LISTING, --discovary_input_listing DISCOVARY_INPUT_LISTING
                        Sequencing readsets for DiscoVary analysis. Tab delimited file listing: (1) sample name, (2) forward readset, (3) reverse readset for metagenomic/isolate sequencing data.
  -n DISCOVARY_ANALYSIS_NAME, --discovary_analysis_name DISCOVARY_ANALYSIS_NAME
                        Identifier/name for DiscoVary. Not providing this parameter will avoid running lsaBGC-DiscoVary step.
  -c CPUS, --cpus CPUS  Total number of cpus to use.
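As a rough illustration of how the flags in the usage text fit together, the sketch below assembles an argument list for lsaBGC-AutoAnalyze.py using only short flags that appear in the help output above. The function name `build_autoanalyze_command`, the output directory name, and the expected-similarities file name are hypothetical; the other example paths reuse the file names mentioned earlier on this page. The command is only printed here, not executed.

```python
import shlex


def build_autoanalyze_command(output_directory, gcf_listing_dir, orthofinder_matrix,
                              expected_similarities, input_listing=None,
                              discovary_input_listing=None,
                              discovary_analysis_name=None, cpus=None):
    """Assemble an argument list using only flags shown in the usage text.

    Hypothetical helper for illustration; not part of lsaBGC.
    """
    # Arguments shown without brackets in the usage line.
    cmd = [
        "lsaBGC-AutoAnalyze.py",
        "-o", str(output_directory),
        "-g", str(gcf_listing_dir),
        "-m", str(orthofinder_matrix),
        "-w", str(expected_similarities),
    ]
    # Bracketed (optional) arguments are appended only when supplied.
    if input_listing is not None:
        cmd += ["-i", str(input_listing)]
    if discovary_input_listing is not None:
        cmd += ["-l", str(discovary_input_listing)]
    if discovary_analysis_name is not None:
        # Per the help text, omitting -n skips the lsaBGC-DiscoVary step.
        cmd += ["-n", str(discovary_analysis_name)]
    if cpus is not None:
        cmd += ["-c", str(cpus)]
    return cmd


cmd = build_autoanalyze_command(
    output_directory="AutoAnalyze_Results/",          # hypothetical name
    gcf_listing_dir="Updated_GCF_Listings/",
    orthofinder_matrix="Orthogroups.expanded.tsv",
    expected_similarities="expected_similarities.txt",  # hypothetical name
    input_listing="Sample_Annotation_Files.txt",
    cpus=8,
)
print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell quoting problems with paths; the list can be handed to `subprocess.run(cmd, check=True)` once the inputs exist.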