01. Installation

Installation Guide

Installation should take ~5-10 minutes. If you run into any issues, please let us know by opening a case in the repository's Git Issues.

We support Unix and Mac systems, but not Windows.

Quick and Easy Step by Step Installation using Conda

Installation can be performed via conda, should take ~5-10 minutes, and has been tested on both Unix (specifically Ubuntu) and macOS. If any installation issues arise, please open a Git Issues case and we will be happy to help. To install:

# 1. clone Git repo and cd into it!
git clone https://github.com/Kalan-Lab/lsaBGC
cd lsaBGC/

# 2. create conda environment using yaml file and activate it!
conda env create -f lsaBGC_env.yml -p /path/to/lsaBGC_conda_env/
conda activate /path/to/lsaBGC_conda_env/

# 3. complete python installation with the following commands:
python setup.py install
pip install -e .
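
Step 3 installs into whichever Python is currently active, so it helps to confirm that the environment from step 2 is the active one first. The following is a minimal sketch, not part of lsaBGC: env_is_active is a hypothetical helper name, and it relies only on the CONDA_PREFIX variable that conda sets when an environment is activated.

```shell
# Hypothetical helper: succeeds if the active conda environment is the one at path $1.
# conda sets CONDA_PREFIX to the active environment's location on activation.
env_is_active() {
  # Trailing slashes are stripped from both paths before comparing.
  [ -n "${CONDA_PREFIX:-}" ] && [ "${CONDA_PREFIX%/}" = "${1%/}" ]
}

# Example use before running step 3:
# env_is_active /path/to/lsaBGC_conda_env/ || echo "activate the environment first" >&2
```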

Setting up optional annotation databases

Set up the database(s) used for annotation by lsaBGC-Ready.py. Currently these are just the KOfam + PGAP profile HMMs (~5GB). To set up the databases, simply run the script:

setup_annotation_dbs.py

If you prefer to cluster BGCs into GCFs using BiG-SCAPE rather than lsaBGC-Cluster.py, set up BiG-SCAPE using the following:

setup_bigscape.py

Using Docker (for major workflows only)

A Docker image is provided for the lsaBGC-Easy.py and lsaBGC-Euk-Easy.py workflows, together with a wrapper script. The image is large (~21GB) but includes all the databases and dependencies needed for lsaBGC, BiG-SCAPE, antiSMASH, and GECCO analysis. For lsaBGC, the KOfam database is not included to save space. For antiSMASH, MEME is not included, so RODEO and CASSIS analyses are not available.

To use the latest Docker image, please: (1) install Docker and (2) download the wrapper script:

# download wrapper script
wget https://raw.githubusercontent.com/Kalan-Lab/lsaBGC/main/docker/run_LSABGC.sh

# change its permissions
chmod +x run_LSABGC.sh

# run the wrapper script 
./run_LSABGC.sh

Current version information in latest lsaBGC docker image:

lsaBGC = 1.53 (+ PGAP HMM database downloaded on 05/31/2024)
GECCO = 0.9.8 (DEFAULT SOFTWARE FOR BGC PREDICTIONS)
antiSMASH = 7.0.1 (+ databases; REQUEST-ABLE AND AUTOMATIC!)
BiG-SCAPE = 1.1.9 (+ Pfam database downloaded on 07/31/2024)
OrthoFinder = 2.5.4
Panaroo = 1.5.0

Automated Installation

Here is some code to copy and paste into a text file, which you can call auto_install.sh. After editing the second line to reflect the location where you want the installation to take place, execute the file by running bash auto_install.sh in a terminal. If you hit permission issues, run chmod 777 auto_install.sh and retry the execution command.

Note, you must have conda already installed.

Note, it will install ~12GB of database material into the location, so you should ideally have 30GB of free space available there to avoid any problems.

At the end of the program, you should be presented with a command that looks something like conda activate /path/to/lsaBGC_conda_env/ which is what you would issue to activate the virtual conda environment anytime you want to run lsaBGC.

#!/bin/bash
INSTALL_DIR=/path/to/installation_location/ # CHANGE THIS LINE!!!

# make `conda activate` usable inside a non-interactive script
source "$(conda info --base)/etc/profile.d/conda.sh"

cd "$INSTALL_DIR"
TMP_DIR="$INSTALL_DIR/TMP/"
mkdir -p "$TMP_DIR"

# clone the repo, then create and activate the conda environment
git clone https://github.com/Kalan-Lab/lsaBGC
conda env create -f "$INSTALL_DIR/lsaBGC/lsaBGC_env.yml" -p "$INSTALL_DIR/lsaBGC_conda_env/"
conda activate "$INSTALL_DIR/lsaBGC_conda_env"

# complete the python installation
cd "$INSTALL_DIR/lsaBGC/"
python setup.py install
pip install .

# download annotation databases and set up BiG-SCAPE
setup_annotation_dbs.py
setup_bigscape.py

echo "$INSTALL_DIR"
conda deactivate
echo $'To activate conda environment and use lsaBGC in the future, simply type:\nconda activate '"$INSTALL_DIR"'/lsaBGC_conda_env/'
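
Before running the script above, a quick pre-flight check can catch the two requirements noted earlier: conda being installed and enough free disk space. The following is a minimal sketch under stated assumptions: have_cmds and free_gb are hypothetical helper names, not part of lsaBGC, and the free-space figure is read from the fourth column of POSIX-format df output.

```shell
#!/bin/bash
# Pre-flight sketch; the function names are hypothetical helpers, not part of lsaBGC.

# Succeeds only if every named command is found on PATH.
have_cmds() {
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || { echo "missing: $cmd" >&2; return 1; }
  done
}

# Prints the free space, in whole GB, on the filesystem holding directory $1.
free_gb() {
  df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Example use (30 is the free space in GB suggested in the notes above):
# have_cmds git conda || exit 1
# [ "$(free_gb /path/to/installation_location/)" -ge 30 ] || echo "less than 30GB free" >&2
```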

Testing the installation

We have included a small test dataset with the code, which can be run from within the lsaBGC Git repo after activating the conda environment using the following command:

bash run_tests.sh

To further confirm that your installation was successful, we recommend running these test cases and checking that the results match the expected results we obtained previously. For a big-picture understanding of how lsaBGC is used and how the different programs connect together, we recommend checking out the tutorial.

Dependency versions for Salamzade et al. 2022

As described in the Installation section above, dependencies can be set up easily through the use of a Conda environment and the provided yaml file. Check it for the most up to date information!

The set of dependencies for the core lsaBGC programs and auxiliary scripts is extensive, but all are easy to install and mutually compatible via Conda. The dependencies, with the versions used for testing and pinned in the yaml file, are:

name: lsaBGC
channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - python=3.8.10
  - pip=21.3.1
  - biopython=1.79
  - pysam=0.16.0.1
  - bioconda::bowtie2=2.4.4
  - bioconda::samtools=1.12
  - bioconda::mafft=7.487
  - bioconda::mcl=14.137
  - bioconda::pal2nal=14.1
  - bioconda::hmmer=3.3.2
  - bioconda::fasttree=2.1.10
  - conda-forge::ete3=3.1.2
  - bioconda::fastani=1.32
  - bioconda::mash=2.3
  - bioconda::diamond=2.0.8
  - conda-forge::r-base=4.0.3
  - conda-forge::r-ggplot2=3.3.5
  - conda-forge::r-cowplot=1.1.1
  - conda-forge::r-phytools=0.7_80
  - conda-forge::r-ape=5.5
  - conda-forge::r-scatterpie=0.1.6
  - conda-forge::r-dplyr=1.0.4
  - conda-forge::r-gggenes=0.4.1
  - bioconda::bioconductor-ggtree=2.4.0
  - conda-forge::r-ggalluvial=0.12.3
  - conda-forge::r-rvcheck=0.1.8
  - conda-forge::pomegranate=0.13.3
  - conda-forge::r-data.table=1.14.0
  - conda-forge::r-plyr=1.8.6
  - bioconda::comparem=0.1.2
  - bioconda::orthofinder=2.5.4
  - bioconda::prodigal=2.6.3
  - conda-forge::pandas=1.4.2

Other Conda Environments Potentially Needed if Using Original Workflow for Processing used in Salamzade et al. 2023

The following is generally no longer recommended and will be deprecated!

You might have noticed that a few key dependencies, needed to create the input data for lsaBGC, are missing. These are the software packages antiSMASH, Prokka, and OrthoFinder. While they are certainly needed to use lsaBGC as intended, we assume that the target audience of this software may already have these bulky prerequisites installed on their servers. Even if they are not pre-installed, installing them in a single Conda environment alongside the other lsaBGC requirements leads to conflicts that are difficult to resolve, so keeping them in separate environments is likely the cleaner solution anyhow. As such, instructions are provided here for setting up a separate environment for each of these three software packages. The environment paths can then be provided to lsaBGC-Process.py so that it automatically produces all the input needed for the lsaBGC framework.

antiSMASH Environment - Identification of Biosynthetic Gene Clusters

We used antiSMASH version 6.0.0 for development and testing of lsaBGC. To install antiSMASH as a Conda environment, please refer to the software's documentation: https://docs.antismash.secondarymetabolites.org/install/

Briefly, to install such a Conda environment (which lsaBGC-Process.py will need), one can issue the following command:

conda create -p /path/to/antismash_env/ antismash

Alternatively, to install the version we used for development and testing, issue the following command:

conda create -p /path/to/antismash_env/ antismash=6.0.0

Next, antiSMASH requires key databases that it uses to identify BGCs in genomes. These can be installed as described in the documentation:

conda activate /path/to/antismash_env/
download-antismash-databases
conda deactivate

Warning: the databases are roughly 20GB in size, so download them somewhere with sufficient space.

Prokka Environment - Gene Calling, Basic Annotation, and Genbank Construction

To install a Conda environment with the latest version of Prokka, one could issue the following command:

conda create -p /path/to/prokka_env/ prokka

To install a Conda environment with the version of Prokka we used for testing and development of lsaBGC, version 1.13, please use the following command:

conda create -p /path/to/prokka_env/ prokka=1.13

Note, the version of Prokka we used unfortunately had installation issues via Conda. Essentially, some Perl libraries were installed under one version of Perl, whereas others were installed under a different version. This causes confusion over which Perl instance contains which libraries. To fix this, symlinks can be set up in each Perl instance's libraries directory for the libraries it is missing, pointing to the copies in the other Perl instance's libraries directory. This fix is described in more detail here: https://github.com/tseemann/prokka/issues/448 (see comment by mroach-avri made on Feb 28th, 2021).

OrthoFinder Environment - De Novo Identification of Protein Homolog Groups

OrthoFinder2 can also be set up within a discrete Conda environment, similar to antiSMASH or Prokka.

To install a Conda environment with the latest version of OrthoFinder, one could issue the following command:

conda create -p /path/to/orthofinder_env/ orthofinder

To install a Conda environment with the version of OrthoFinder2 we used for testing and development of lsaBGC, version 2.5.4, please use the following command:

conda create -p /path/to/orthofinder_env/ orthofinder=2.5.4

Using Alternate Software

While we developed and tested lsaBGC using antiSMASH and OrthoFinder2, analogous software could be used in their place.

For instance, GECCO, a machine learning approach to detect biosynthetic gene clusters from the Zeller lab (Carroll et al. 2021, bioRxiv), could be used instead of antiSMASH and can produce Genbank formatted result files. In theory this should be a simple switch to incorporate. However, as is often the case, things tend to break unexpectedly when they are taken off the tested trails. If you are adventurous, test this out, and it doesn't work, please open an issue ticket and I will be happy to assist in getting it to work.

Similarly, instead of OrthoFinder2, other software designed to delineate protein homolog groups can be used, of which there are plenty. If you find OrthoFinder rather slow, an alternative of interest might be SynerClust, developed by the Earl Bacterial Genomics Group at the Broad Institute (Georgescu et al. 2018). SynerClust avoids all vs all blast of all genomes by using a guiding phylogenetic tree, and further uses syntenic information to split orthologous gene clusters from related paralogs where possible.

Whichever software you use for the task, you would just need to format the resulting homology information into a sample by homolog group matrix with the following format:

X         <tab>           Sample 1                          <tab>     Sample 2   ....
Homolog_Group_1_ID <tab>  Sample1_Gene1, Sample1_Gene34     <tab>     Sample2_Gene1  ....
Homolog_Group_2_ID <tab>  Sample1_Gene45                    <tab>     Sample2_Gene559  ....

Thus, if a sample has multiple genes for a single homolog group, the gene identifiers should be provided as a comma+single-space separated list.

Note, it is critical for any alternative software you use that the gene identifiers in the BGC genbanks match the gene identifiers in the homolog group by sample matrix.
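
One way to check this is to compare the identifiers in the BGC Genbank files against those in the matrix. The following is a minimal sketch, assuming gene identifiers are stored as /locus_tag="..." qualifiers in the Genbank files; check which qualifier your files actually use. The function name ids_missing_from_matrix is a hypothetical helper, not part of lsaBGC.

```shell
# Hypothetical helper: list locus tags found in Genbank files but absent from the matrix.
# Usage: ids_missing_from_matrix matrix.tsv bgc1.gbk [bgc2.gbk ...]
ids_missing_from_matrix() {
  matrix=$1; shift
  gbk_ids=$(mktemp); mat_ids=$(mktemp)
  # Assumption: gene identifiers appear as /locus_tag="..." qualifiers.
  grep -ho '/locus_tag="[^"]*"' "$@" | cut -d'"' -f2 | LC_ALL=C sort -u > "$gbk_ids"
  # Skip the header row and first column, then split each cell on ", ".
  awk -F'\t' 'NR > 1 { for (i = 2; i <= NF; i++) { n = split($i, g, ", ")
      for (k = 1; k <= n; k++) if (g[k] != "") print g[k] } }' "$matrix" |
    LC_ALL=C sort -u > "$mat_ids"
  # Print identifiers that occur only in the Genbank list.
  LC_ALL=C comm -23 "$gbk_ids" "$mat_ids"
  rm -f "$gbk_ids" "$mat_ids"
}
```

An empty result means every locus tag in the Genbank files was found in the matrix.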

Special note on installation of DESMAN for allelic typing in lsaBGC-DiscoVary

As of writing this installation guide, the Conda installation of DESMAN appears broken, at least for certain functions. We thus installed it manually on our server and set the environment variable $PATH to include the directory containing the DESMAN executable programs. We recommend doing this for now, until the Conda environment is updated or fixed. Alternatively, use lsaBGC-DiscoVary without DESMAN-based allelic phasing, which results in a consensus sequence being determined for a homolog group in a metagenome.

To install DESMAN manually, please visit the program's GitHub page and follow the Installation guide!
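
After a manual install, the executables directory can be added to $PATH as follows. This is a minimal sketch: prepend_path is a hypothetical helper, the directory shown is a placeholder for wherever your manual install put the executables, and the executable name desman in the usage comment is an assumption to verify against your install.

```shell
# Hypothetical helper: put directory $1 at the front of PATH unless it is already listed.
prepend_path() {
  case ":$PATH:" in
    *":$1:"*) ;;                        # already present, leave PATH unchanged
    *) PATH="$1:$PATH"; export PATH ;;
  esac
}

# Example use (placeholder path; point it at your own manual install):
# prepend_path /path/to/DESMAN/executables
# command -v desman >/dev/null || echo "desman not found on PATH" >&2
```

To make the change persistent, the same call can be placed in your shell startup file.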
