Push v1.37 from develop branch to main #21

Merged: 39 commits, May 13, 2023
Commits
d27af02
update to v1.37
raufs May 11, 2023
1c39322
Update CHANGES.md
raufs May 11, 2023
e761c7d
Update Dockerfile
raufs May 11, 2023
8497891
Update Dockerfile
raufs May 11, 2023
62d6884
Update CHANGES.md
raufs May 12, 2023
4929f7c
Update LSABGC
raufs May 12, 2023
33c6d53
Update LSABGC
raufs May 12, 2023
30701a1
Update LSABGC
raufs May 12, 2023
5a5592d
Update run_LSABGC.sh
raufs May 12, 2023
5b973a9
Update lsaBGC-Easy.py
raufs May 12, 2023
853cd70
Update lsaBGC-Euk-Easy.py
raufs May 12, 2023
ada24de
Update run_LSABGC.sh
raufs May 12, 2023
a3b63c5
Update GCF.py
raufs May 12, 2023
a01bcd4
Update lsaBGC-PopGene.py
raufs May 12, 2023
542986a
Update run_LSABGC.sh
raufs May 12, 2023
723f9d6
Update Dockerfile
raufs May 12, 2023
ab12166
Update LSABGC
raufs May 12, 2023
b547e24
Update GCF.py
raufs May 12, 2023
a9238fb
Update CHANGES.md
raufs May 12, 2023
d71877a
Update GCF.py
raufs May 12, 2023
4623ad5
Update Dockerfile
raufs May 12, 2023
97752af
Update LSABGC
raufs May 12, 2023
090f19b
Update Dockerfile
raufs May 12, 2023
eedcafe
Update Dockerfile
raufs May 13, 2023
e10369e
Update lsaBGC-Easy.py
raufs May 13, 2023
22df494
Update README.md
raufs May 13, 2023
ed038d6
Update README.md
raufs May 13, 2023
e0524b2
Update lsaBGC-Easy.py
raufs May 13, 2023
fee37cd
Update lsaBGC-Euk-Easy.py
raufs May 13, 2023
f632438
Update lsaBGC-Easy.py
raufs May 13, 2023
f4c1131
Update lsaBGC-Euk-Easy.py
raufs May 13, 2023
117b448
Update Dockerfile
raufs May 13, 2023
4755c56
Update README.md
raufs May 13, 2023
263dd3e
Update README.md
raufs May 13, 2023
032b5ef
Update util.py
raufs May 13, 2023
5eb8bcf
Update util.py
raufs May 13, 2023
c446615
Update LSABGC
raufs May 13, 2023
0966285
Update Dockerfile
raufs May 13, 2023
2777060
Update README.md
raufs May 13, 2023
2 changes: 2 additions & 0 deletions CHANGES.md
@@ -1,4 +1,6 @@
# Major Updates
* May 11, 2023 - v1.37 - Introduces changes to eventually have Docker working for lsaBGC-(Euk)-Easy workflows. Implemented less problematic parsing of BGC positions along genomes through prediction-software specific parsing of BGC GenBanks - important for BGC rich taxa. Switched from MAGUS to MUSCLE super5 for rapid multiple sequence alignment.
* May 5, 2023 - v1.36 - Improve checks for genome count in lsaBGC-(Euk)-Easy workflows, correct for issue related to PGAP database, simplify conda environment requirements.
* Apr 17, 2023 - Set AutoExpansion to off by default in lsaBGC-(Euk)-Easy, introduced lsaBGC-ComprehenSeeIve, changed handling of primary genomes being rerun through lsaBGC-AutoExpansion.
* Apr 14, 2023 - Introduced lsaBGC-MIBiGMapper.py, lsaBGC-Euk-Easy.py, visualize_BGC-ome.py, slight updates to plots, added automatic color formatting to spreadsheet, simplified README.
* Mar 2, 2023 - Added GSeeF analysis to the end of the lsaBGC-Easy workflow and added support for parsing annotations from DeepBGC and GECCO into GSeeF.
23 changes: 19 additions & 4 deletions README.md
@@ -55,9 +55,7 @@ If clustering of BGCs into GCFs using BiG-SCAPE is preferred to lsaBGC-Cluster.p
```
setup_bigscape.py
```

Additional information pertaining to installation can be found at: [Installation Guide](https://github.com/Kalan-Lab/lsaBGC/wiki/01.-Installation)


A small test case is provided here and can be run after installation by simply issuing (takes about 7 minutes using 4 CPUs/threads):

```
@@ -68,13 +66,30 @@ bash run_tests.sh
There are also additional [test cases](https://github.com/Kalan-Lab/lsaBGC_Ckefir_Testing_Cases) to demonstrate usage of individual programs along with expected outputs from commands. We also have a [walk-through tutorial Wiki page](https://github.com/Kalan-Lab/lsaBGC/wiki/03.-Quick-Start-&-In-Depth-Tutorial:-Exploring-BGCs-in-Cutibacterium) to showcase the use of the suite and relations between core programs.

The major outputs of the final `lsaBGC-AutoAnalyze.py` run are in the resulting folder `test_case/lsaBGC_AutoAnalyze_Results/Final_Results/` and described on [this wiki page](https://github.com/Kalan-Lab/lsaBGC/wiki/13.-Overview-of-lsaBGC-AutoAnalyze's-Final-Results). Examples for the final AutoAnalyze results from an `lsaBGC-Easy.py` run on Cutibacterium avidum can be found [here on Google Drive](https://drive.google.com/drive/u/1/folders/1jHFFOUTd4SbIO-xiGG8MWTZaP1U4RF1j).

## Quick Start - using `lsaBGC-Easy.py` and `lsaBGC-Euk-Easy.py`

Check out how to use `lsaBGC-Easy.py` and `lsaBGC-Euk-Easy.py` on [their wiki page](https://github.com/Kalan-Lab/lsaBGC/wiki/14.-lsaBGC-Easy-Tutorial)!

![image](https://user-images.githubusercontent.com/4260723/181613839-df183cdc-1103-403f-b5d1-889484f52be9.png)

### Using Docker

A Docker image is provided for the `lsaBGC-Easy.py` and `lsaBGC-Euk-Easy.py` workflows together with a wrapper script. The image is large (~21 GB) but includes all the databases and dependencies needed for lsaBGC, BiG-SCAPE, antiSMASH, and GECCO analysis. For lsaBGC, to save space, the KOfam database is not included. For antiSMASH, MEME is not included, thus RODEO and CASSIS analyses are not available.

To use the latest Docker image, please: (1) install Docker and (2) download the wrapper script:

```
# download wrapper script
wget https://raw.githubusercontent.com/Kalan-Lab/lsaBGC/main/docker/run_LSABGC.sh

# change its permissions
chmod +x run_LSABGC.sh

# run the wrapper script
./run_LSABGC.sh
```

## Acknowledgements:

We would like to thank members of the Kalan lab, Currie lab, Kwan lab, Anantharaman, and Pepperell labs at UW Madison for feedback on the development of lsaBGC.
10 changes: 5 additions & 5 deletions bin/lsaBGC-PopGene.py
@@ -63,7 +63,7 @@ def create_parser():
parser.add_argument('-k', '--sample_set', help="Sample set to keep in analysis. Should be file with one\nsample id per line.", required=False)
parser.add_argument('-u', '--population_classification', help='Population classifications for each sample. Tab delimited: 1st column lists sample\nname while the 2nd column is an identifier for the population the sample\nbelongs to.', required=False, default=None)
parser.add_argument('-p', '--bgc_prediction_software', help='Software used to predict BGCs (Options: antiSMASH, DeepBGC, GECCO)\n[Default is antiSMASH].', default='antiSMASH', required=False)
-parser.add_argument('-d', '--regular_mafft', action='store_true', help="Run mafft --linsi and not the MAGUS divide-and-conquer approach which\nallows for scalability and more efficient computing.", default=False, required=False)
+parser.add_argument('-d', '--regular_mafft', action='store_true', help="Run 'mafft --linsi' instead of MUSCLE in super5 mode (default). Should lead to higher accuracy in MSA at the expense of efficiency.", default=False, required=False)
parser.add_argument('-e', '--each_pop', action='store_true', help='Run analyses individually for each population as well.', required=False, default=False)
parser.add_argument('-t', '--filter_for_outliers', action='store_true', help='Filter instances of homolog groups which deviate too much from\nthe median gene length observed for the initial set of proteins.', required=False, default=False)
parser.add_argument('-w', '--expected_similarities', help="Path to file listing expected similarities between genomes/samples. This is\ncomputed most easily by running lsaBGC-Ready.py with '-t' specified,\nwhich will estimate\nsample to sample similarities based on alignment used to create\nspecies phylogeny.", required=False, default=None)
@@ -138,7 +138,7 @@ def lsaBGC_PopGene():
"BGC Prediction Software", "Populations Specification/Listing File", "Sample Retention Set",
"Run Analysis for Each Population", "Filter for Outlier Homolog Group Instances",
"File with Expected Amino Acid Differences Between Genomes/Samples",
-"AAI Similarity Instead of ANI", "Use Regular MAFFT - not MAGUS?", "cpus"]
+"AAI Similarity Instead of ANI", "Use Regular MAFFT - not MUSCLE super5?", "cpus"]
util.logParametersToFile(parameters_file, parameter_names, parameter_values)
logObject.info("Done saving parameters!")

@@ -188,10 +188,10 @@ def lsaBGC_PopGene():

# Step 5: Create codon alignments if not provided a directory with them (e.g. one produced by lsaBGC-See.py)
logObject.info("User requested construction of phylogeny from SCCs in BGC! Beginning phylogeny construction.")
-logObject.info("Beginning process of creating protein alignments for each homolog group using mafft, then translating these to codon alignments using PAL2NAL.")
+logObject.info("Beginning process of creating protein alignments for each homolog group using MAFFT or MUSCLE super5, then translating these to codon alignments using PAL2NAL.")
GCF_Object.constructCodonAlignments(outdir, only_scc=False, cpus=cpus, list_alignments=True, filter_outliers=False)
if filter_for_outliers:
-GCF_Object.constructCodonAlignments(outdir, only_scc=False, cpus=cpus, list_alignments=True, filter_outliers=True, use_magus=(not regular_mafft))
+GCF_Object.constructCodonAlignments(outdir, only_scc=False, cpus=cpus, list_alignments=True, filter_outliers=True, use_ms5=(not regular_mafft))
logObject.info("All codon alignments for SCC homologs now successfully achieved!")

# Step 6: Analyze codon alignments and parse population genetics and conservation stats
@@ -223,4 +223,4 @@ def lsaBGC_PopGene():
sys.exit(0)

if __name__ == '__main__':
-lsaBGC_PopGene()
+lsaBGC_PopGene()
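The renamed keyword makes the default explicit: unless `-d/--regular_mafft` is passed, alignments are built with MUSCLE super5. The following is a minimal sketch of that mapping, not the code in `GCF.py`. The helper `build_alignment_command` and the file names are hypothetical, and the command lines assume MUSCLE v5's `-super5` mode and MAFFT's L-INS-i settings (`--localpair --maxiterate 1000`).

```python
import argparse


def build_alignment_command(in_fasta, out_fasta, use_ms5=True, cpus=1):
    """Return an argv list for the chosen aligner (hypothetical helper)."""
    if use_ms5:
        # MUSCLE v5 in super5 mode writes the alignment to the -output file.
        return ["muscle", "-super5", in_fasta, "-output", out_fasta,
                "-threads", str(cpus)]
    # MAFFT L-INS-i settings; MAFFT prints the alignment to stdout,
    # so a caller would redirect stdout to out_fasta.
    return ["mafft", "--localpair", "--maxiterate", "1000",
            "--thread", str(cpus), in_fasta]


parser = argparse.ArgumentParser()
parser.add_argument("-d", "--regular_mafft", action="store_true", default=False)

default_args = parser.parse_args([])      # flag absent -> MUSCLE super5
mafft_args = parser.parse_args(["-d"])    # flag present -> MAFFT

default_cmd = build_alignment_command(
    "hg.faa", "hg.msa.faa", use_ms5=(not default_args.regular_mafft))
mafft_cmd = build_alignment_command(
    "hg.faa", "hg.msa.faa", use_ms5=(not mafft_args.regular_mafft))
```

Keeping the flag as an opt-in for MAFFT means existing command lines that never passed `-d` switch aligners without any change on the user's side.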
3 changes: 2 additions & 1 deletion bin/lsaBGC-Ready.py
@@ -367,7 +367,8 @@ def lsaBGC_Ready():
util.setupReadyDirectory([deepbgc_split_directory])
bgc_genbank_listing_file = util.splitDeepBGCGenbank(bgc_genbank_listing_file, deepbgc_split_directory, outdir, logObject)

-bgc_mappings = util.mapBGCtoGenomeBySequence(bgc_genbank_listing_file, sample_genomes, outdir, logObject, cpus=cpus)
+bgc_mappings = util.mapBGCtoGenomeBySequence(bgc_genbank_listing_file, sample_genomes, outdir,
+bgc_prediction_software, logObject, cpus=cpus)
sample_bgcs, bgc_to_sample = util.processBGCGenbanks(bgc_genbank_listing_file, bgc_mappings, bgc_prediction_software,
sample_genomes, bgcs_directory, proteomes_directory, logObject)

40 changes: 40 additions & 0 deletions docker/Dockerfile
@@ -0,0 +1,40 @@
# This code is adapted from BiG-SCAPE's Dockerfile. If you are using lsaBGC, you should definitely check out
# BiG-SCAPE/CORASON in case it suits your needs better (https://www.nature.com/articles/s41589-019-0400-9), e.g.
# if you are interested in clustering diverse gene clusters based on protein domain similarity or are interested
# in investigating the variability of contexts for a single reference gene!

FROM continuumio/miniconda3
LABEL maintainer="Rauf Salamzade - Kalan Lab, UW-Madison"

WORKDIR /usr/src
SHELL ["/bin/bash", "-c"]


# Install system packages, clone the lsaBGC GitHub repo, create the lsaBGC and antiSMASH conda environments,
# download the antiSMASH databases, and remove MEME from the antiSMASH environment
RUN apt-get update && apt-get install -y git wget && \
git clone -b develop https://github.com/Kalan-Lab/lsaBGC && rm -rf lsaBGC/test_case.tar.gz && \
conda install -c conda-forge mamba && \
mamba env create -f /usr/src/lsaBGC/lsaBGC_env.yml -p /usr/src/lsaBGC_conda_env/ && \
mamba create -p /usr/src/antismash_conda_env/ -c bioconda -c conda-forge -c defaults antismash -y && \
source activate /usr/src/antismash_conda_env/ && download-antismash-databases && \
conda remove --force meme && conda deactivate && \
conda clean --all -y && conda remove mamba && \
echo "source activate /usr/src/lsaBGC_conda_env/" > ~/.bashrc && source ~/.bashrc && \
apt-get clean -y && apt-get autoclean -y && apt-get autoremove -y && \
rm -rf /var/lib/apt/lists/*

# Install lsaBGC
WORKDIR /usr/src/lsaBGC/
ENV PATH /usr/src/lsaBGC_conda_env/bin:$PATH
RUN python setup.py install && pip install -e . && setup_annotation_dbs.py -nk -dsh && setup_bigscape.py && \
chmod -R 777 /usr/src/lsaBGC/ && chmod 777 /home

USER 1000:1000
RUN mkdir /home/input /home/output
WORKDIR /home
ENV LANG C.UTF-8
ENV LC_ALL C.UTF-8

ENTRYPOINT ["LSABGC"]
CMD ["--help"]
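With exec-form `ENTRYPOINT ["LSABGC"]` and `CMD ["--help"]`, any arguments given after the image name in `docker run` replace `CMD`, while the entrypoint stays in front. The sketch below only models that documented Docker rule; `effective_command` is a hypothetical name, and overrides such as `--entrypoint` are ignored.

```python
def effective_command(entrypoint, cmd, run_args):
    """Model how Docker combines exec-form ENTRYPOINT and CMD.

    Arguments passed after the image name replace CMD entirely;
    the ENTRYPOINT is always kept as the leading element.
    """
    return list(entrypoint) + (list(run_args) if run_args else list(cmd))


ENTRYPOINT = ["LSABGC"]
CMD = ["--help"]

# `docker run raufs/lsabgc:latest` with no extra arguments
no_args = effective_command(ENTRYPOINT, CMD, [])

# `docker run raufs/lsabgc:latest lsaBGC-Easy.py -n None ...`
easy_run = effective_command(
    ENTRYPOINT, CMD,
    ["lsaBGC-Easy.py", "-n", "None", "-g", "/home/input/Genomes",
     "-o", "/home/output/Results"])
```

This is why running the image without arguments prints the wrapper's help text, and why the workflow name must be the first argument handed to `LSABGC`.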
76 changes: 76 additions & 0 deletions docker/LSABGC
@@ -0,0 +1,76 @@
#!/usr/bin/env python3

### Program: LSABGC
### Author: Rauf Salamzade
### Kalan Lab
### UW Madison, Department of Medical Microbiology and Immunology

# BSD 3-Clause License
#
# Copyright (c) 2023, Kalan-Lab
# All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

import os
import sys
from lsaBGC import util

prog_set = set(['lsaBGC-Easy.py', 'lsaBGC-Euk-Easy.py'])

version = util.parseVersionFromSetupPy()

if __name__ == '__main__':
    args = sys.argv[1:]
    if len(args) == 0 or '--help' == args[0] or '-h' == args[0] or '-v' == args[0] or '--version' == args[0] or not (args[0] in prog_set):
        print("""
Program: LSABGC
Version: %s""" % (version))
        print("""
Author: Rauf Salamzade
Affiliation: Kalan Lab, UW Madison, Department of Medical Microbiology and Immunology

Wrapper for running 2 main workflows of the lsaBGC suite within Docker:

- lsaBGC-Easy: Run lsaBGC-Easy for bacterial genomes from a common taxon (e.g. lineage, species or genus).
    * NOTE: Only works if you have genomes downloaded - e.g. GTDB automatic download part does not work within Docker!
      The argument -n "None" must therefore be included in commands.

- lsaBGC-Euk-Easy: Run lsaBGC-Easy but specifically designed for fungal [small eukaryotic] genomes (experimental!)

Usage example:
    ./run_LSABGC.sh lsaBGC-Easy.py -n "None" -g Directory_of_Genomes_in_FASTA/ -o Results/ -c 10
    ./run_LSABGC.sh lsaBGC-Euk-Easy.py -g Directory_of_Genomes_in_GenBank/ -o Results/ -c 10

----------------------------------------------------------------------------------------------------------------------------
KEY NOTES:
----------------------------------------------------------------------------------------------------------------------------
* For lsaBGC-Easy.py (but not lsaBGC-Euk-Easy.py), the -n "None" is currently required.
* MEME (used in antiSMASH for CASSIS and RODEO) has been manually uninstalled in the antiSMASH conda environment within the
  lsaBGC Docker image because it requires a commercial license.
""")
    else:
        os.system(' '.join(args))
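The dispatch above shows help unless the first argument names a supported workflow, then joins the arguments with spaces and hands them to a shell. A space-joined string loses quoting, so a path containing a space would be split into two arguments. The sketch below restates the check and shows one possible alternative using the standard library's `shlex.join`; the function names are illustrative and this is not what the committed script does.

```python
import shlex

PROG_SET = {"lsaBGC-Easy.py", "lsaBGC-Euk-Easy.py"}


def should_show_help(args):
    """True when no supported workflow is named as the first argument.

    Help/version flags are not in PROG_SET, so this single membership
    test covers them as well.
    """
    return len(args) == 0 or args[0] not in PROG_SET


def to_shell_command(args):
    """Quote each argument so embedded spaces survive a shell."""
    return shlex.join(args)


example = ["lsaBGC-Easy.py", "-g", "/home/input/My Genomes"]
naive = " ".join(example)           # the space splits the path in a shell
quoted = to_shell_command(example)  # the path stays a single argument
```

Passing the argument list directly to `subprocess.run(args)` would avoid the shell altogether, at the cost of changing how the wrapper launches the workflows.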
129 changes: 129 additions & 0 deletions docker/run_LSABGC.sh
@@ -0,0 +1,129 @@
#!/bin/bash

# AUTHOR: Rauf Salamzade
# AFFILIATION: Kalan Lab, UW-Madison
# run_LSABGC.sh - wrapper to run lsaBGC-Easy.py and lsaBGC-Euk-Easy.py

# function: resolve a (possibly relative) file or directory path to an absolute path
get_abs_filename() {
# $1 : relative filename
filename=$1
parentdir=$(dirname "${filename}")

if [ -d "${filename}" ]; then
echo "$(cd "${filename}" && pwd)"
elif [ -d "${parentdir}" ]; then
echo "$(cd "${parentdir}" && pwd)/$(basename "${filename}")"
fi
}

## The following is also adapted from the analogous file in BiG-SCAPE
if [[ $# -eq 0 || $1 == "-h" || $1 == "--help" || $1 == "-v" || $1 == "--version" ]]; then
docker pull raufs/lsabgc:latest
docker run \
--detach=false \
--rm \
--user=$(id -u):$(id -g) \
raufs/lsabgc:latest \

elif [[ $1 == 'lsaBGC-Easy.py' ]]; then
set -o errexit
set -o nounset

# Links within the container
readonly CONTAINER_INPUT_DIR=/home/input
readonly CONTAINER_OUTPUT_DIR=/home/output

# variables for updating/input paths to account for Docker mounting

DOCKER_VOLUME_ARGS=""
EASY_ARGS=""
OUTPUT_PARENT_DIR="NA"
while [[ ! $# -eq 0 ]]; do
if [[ "$1" == '-g' || "$1" == '--user_genomes_directory' ]]; then
shift
ABS_VALUE=$(get_abs_filename $1)
INPUT_DIR=$(basename $ABS_VALUE)
INPUT_PARENT_DIR=$(dirname $ABS_VALUE)
EASY_ARGS+="-g $CONTAINER_INPUT_DIR/$INPUT_DIR "
DOCKER_VOLUME_ARGS+="--volume $INPUT_PARENT_DIR:$CONTAINER_INPUT_DIR:ro "
shift
elif [[ "$1" == '-o' || "$1" == '--output_directory' ]]; then
shift
ABS_VALUE=$(get_abs_filename $1)
OUTPUT_DIR=$(basename $ABS_VALUE)
OUTPUT_PARENT_DIR=$(dirname $ABS_VALUE)
EASY_ARGS+="-o $CONTAINER_OUTPUT_DIR/$OUTPUT_DIR "
DOCKER_VOLUME_ARGS+="--volume $OUTPUT_PARENT_DIR:$CONTAINER_OUTPUT_DIR:rw "
shift
else
EASY_ARGS+="$1 "
shift
fi
done
EASY_ARGS+="--docker_mode"


if [[ ! -d ${OUTPUT_PARENT_DIR} && $OUTPUT_PARENT_DIR != "NA" ]]; then
mkdir ${OUTPUT_PARENT_DIR}
fi

# run workflow
docker pull raufs/lsabgc:latest
docker run ${DOCKER_VOLUME_ARGS} --detach=false --rm --user=$(id -u):$(id -g) raufs/lsabgc:latest ${EASY_ARGS}

elif [[ $1 == 'lsaBGC-Euk-Easy.py' ]]; then
set -o errexit
set -o nounset

# Links within the container
readonly CONTAINER_INPUT_DIR=/home/input
readonly CONTAINER_OUTPUT_DIR=/home/output

# variables for updating/input paths to account for Docker mounting

DOCKER_VOLUME_ARGS=""
EASY_ARGS=""
OUTPUT_PARENT_DIR="NA"
while [[ ! $# -eq 0 ]]; do
if [[ "$1" == '-g' || "$1" == '--user_genomes_directory' ]]; then
shift
ABS_VALUE=$(get_abs_filename $1)
INPUT_DIR=$(basename $ABS_VALUE)
INPUT_PARENT_DIR=$(dirname $ABS_VALUE)
EASY_ARGS+="-g $CONTAINER_INPUT_DIR/$INPUT_DIR "
DOCKER_VOLUME_ARGS+="--volume $INPUT_PARENT_DIR:$CONTAINER_INPUT_DIR:ro "
shift
elif [[ "$1" == '-o' || "$1" == '--output_directory' ]]; then
shift
ABS_VALUE=$(get_abs_filename $1)
OUTPUT_DIR=$(basename $ABS_VALUE)
OUTPUT_PARENT_DIR=$(dirname $ABS_VALUE)
EASY_ARGS+="-o $CONTAINER_OUTPUT_DIR/$OUTPUT_DIR "
DOCKER_VOLUME_ARGS+="--volume $OUTPUT_PARENT_DIR:$CONTAINER_OUTPUT_DIR:rw "
shift
else
EASY_ARGS+="$1 "
shift
fi
done
EASY_ARGS+="--docker_mode"


if [[ ! -d ${OUTPUT_PARENT_DIR} && $OUTPUT_PARENT_DIR != "NA" ]]; then
mkdir ${OUTPUT_PARENT_DIR}
fi

# run workflow
docker pull raufs/lsabgc:latest
docker run ${DOCKER_VOLUME_ARGS} --detach=false --rm --user=$(id -u):$(id -g) raufs/lsabgc:latest ${EASY_ARGS}

else
docker pull raufs/lsabgc:latest
docker run \
--detach=false \
--rm \
--user=$(id -u):$(id -g) \
raufs/lsabgc:latest \

fi
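The two workflow branches of the wrapper perform the same rewrite: the host directory given to `-g` or `-o` is resolved to an absolute path, its parent is mounted at `/home/input` (read-only) or `/home/output` (read-write), and the argument is replaced by the matching in-container path. A rough Python sketch of that logic follows, assuming POSIX paths. `rewrite_args` is a hypothetical name, and unlike `get_abs_filename` it does not check that the directory or its parent exists.

```python
import os

CONTAINER_INPUT_DIR = "/home/input"
CONTAINER_OUTPUT_DIR = "/home/output"


def rewrite_args(args):
    """Split wrapper arguments into docker volume flags and workflow args."""
    volume_args, easy_args = [], []
    it = iter(args)
    for arg in it:
        if arg in ("-g", "--user_genomes_directory"):
            # abspath also drops a trailing slash, like cd + pwd would.
            parent, name = os.path.split(os.path.abspath(next(it)))
            volume_args += ["--volume", f"{parent}:{CONTAINER_INPUT_DIR}:ro"]
            easy_args += ["-g", f"{CONTAINER_INPUT_DIR}/{name}"]
        elif arg in ("-o", "--output_directory"):
            parent, name = os.path.split(os.path.abspath(next(it)))
            volume_args += ["--volume", f"{parent}:{CONTAINER_OUTPUT_DIR}:rw"]
            easy_args += ["-o", f"{CONTAINER_OUTPUT_DIR}/{name}"]
        else:
            easy_args.append(arg)
    # The wrapper always appends this flag before calling docker run.
    easy_args.append("--docker_mode")
    return volume_args, easy_args


volumes, workflow = rewrite_args(
    ["lsaBGC-Easy.py", "-n", "None", "-g", "/data/Genomes/",
     "-o", "/data/Results/", "-c", "10"])
```

Because the parent directory is what gets mounted, the genomes and results directories keep their own names inside the container, which is how the in-container paths are rebuilt from `basename`.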