Written by Diana Lin.
rAMPage is an in silico anti-microbial peptide (AMP) discovery pipeline that takes in bulk RNA-seq reads and outputs a FASTA file of annotated, confident, short, and charged putative AMPs.
- Clone this repository:
  git clone https://github.com/bcgsc/rAMPage.git
- Download and install the dependencies (specified in the Dependencies section below) into rAMPage/src.
  - Some of these dependencies need to be configured: SignalP, ProP, SABLE, EnTAP (see configurations).
  - Install AMPlify using conda (required: biopython and pandas are dependencies for scripts other than AMPlify):
    cd rAMPage
    conda create --prefix src/AMPlify python=3.6
    conda activate src/AMPlify
    conda install -c bioconda amplify
- Update all the paths in rAMPage/scripts/config.sh to reflect dependencies in rAMPage/src and dependencies pre-installed elsewhere.
- Source scripts/config.sh in the root of the repository:
  source scripts/config.sh
- Create working directories for each dataset using this convention: taxonomic-class/species/tissue-or-condition
  - NOTE: the top-level parent directory must correspond to the taxonomic class of the dataset. This class is used to choose which file in amp_seqs to use for homology search.
  - e.g. M. gulosa: insecta/mgulosa/venom-gland
  - e.g. P. toftae: amphibia/ptoftae/skin-liver
- Move all reads and reference FASTA files to the respective working directories for each dataset. See below for an example.
- Create a 2- or 3-column space-delimited text file called input.txt, as specified by the Input section below, in the working directory of each dataset.
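The naming convention matters because the top-level directory is read as the taxonomic class and, per the option defaults listed later, the second level as the species. As a minimal sketch, the helper below splits a working-directory path (relative to the repository root) into its three components. The name describe_workdir is hypothetical and not part of rAMPage.

```shell
# Hypothetical helper (not part of rAMPage): split a relative working directory
# of the form taxonomic-class/species/tissue-or-condition into its components.
describe_workdir() {
    local workdir="${1%/}" ok=0 class rest species tissue   # drop a trailing slash
    case "$workdir" in
        /*|*/*/*/*) ;;          # absolute paths or more than three levels: rejected
        */*/*) ok=1 ;;          # exactly three levels
    esac
    if [ "$ok" -ne 1 ]; then
        echo "expected class/species/tissue-or-condition, got: $1" >&2
        return 1
    fi
    class=${workdir%%/*}        # text before the first slash
    rest=${workdir#*/}
    species=${rest%%/*}         # text between the first and second slash
    tissue=${rest#*/}           # text after the second slash
    printf 'class=%s species=%s tissue=%s\n' "$class" "$species" "$tissue"
}
```

For example, describe_workdir insecta/mgulosa/venom-gland prints class=insecta species=mgulosa tissue=venom-gland, and a path with only two levels is rejected.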
At the end of setup, you should have a directory structure similar to the one below (other directories, like scripts/, are excluded):
rAMPage
├── amphibia
│   └── ptoftae
│       └── skin-liver
│           ├── input.txt
│           └── raw_reads
│               ├── SRR8288040_1.fastq.gz
│               ├── SRR8288040_2.fastq.gz
│               ├── SRR8288041_1.fastq.gz
│               ├── SRR8288041_2.fastq.gz
│               ├── SRR8288056_1.fastq.gz
│               ├── SRR8288056_2.fastq.gz
│               ├── SRR8288057_1.fastq.gz
│               ├── SRR8288057_2.fastq.gz
│               ├── SRR8288058_1.fastq.gz
│               ├── SRR8288058_2.fastq.gz
│               ├── SRR8288059_1.fastq.gz
│               ├── SRR8288059_2.fastq.gz
│               ├── SRR8288060_1.fastq.gz
│               ├── SRR8288060_2.fastq.gz
│               ├── SRR8288061_1.fastq.gz
│               └── SRR8288061_2.fastq.gz
└── insecta
    └── mgulosa
        └── venom
            ├── input.txt
            ├── raw_reads
            │   ├── SRR6466797_1.fastq.gz
            │   └── SRR6466797_2.fastq.gz
            └── tsa.GGFG.1.fsa_nt.gz
Dependency | Tested Version |
---|---|
GNU bash | v5.0.11(1) |
GNU awk | v5.0.1 |
GNU sed | v4.8 |
GNU grep | v3.4 |
GNU make | v4.3 |
GNU column | 2.36 |
Miller mlr | 5.4.0 |
bc | v1.07.1 |
gzip | v1.10 |
python | v3.7.7 |
Rscript * | v4.0.2 |
*requires tidyverse v1.3.0, glue v1.4.2, and docopt v0.7.1.
Dependency | Tested Version |
---|---|
SRA toolkit | v2.10.5 |
EDirect | v13.8 |
fastp | v0.20.0 |
RNA-Bloom | v1.3.1 |
salmon | v1.3.0 |
TransDecoder | v5.5.0 |
HMMER | v3.3.1 |
cd-hit | v4.8.1 |
seqtk | v1.1-r91 |
SignalP | v3.0 |
ProP | v1.0c |
AMPlify | v1.1.0 |
EnTAP | v0.10.7-beta |
Exonerate | v2.4.0 |
SABLE | v4.0 |
Clustal Omega | v1.2.4 |
To download SignalP, you must enter your email address and institution. Afterwards, a download link valid for 4 hours will be emailed to you. Clicking on the link will show you one link for each system (e.g. Linux). Click the link to download, or right-click to copy the link and download on the command line using curl or wget.
After moving the downloaded signalp-3.0.Linux.tar.Z file to src, decompress it:
cd src/
cat signalp-3.0.Linux.tar.Z | uncompress | tar xvf -
The file to edit is src/signalp-3.0/signalp:
Before | After |
---|---|
SIGNALP=/usr/opt/signalp-3.0 | SIGNALP=$ROOT_DIR/src/signalp-3.0 |
AWK=nawk | AWK=awk |
Note: More changes may need to be made according to what executables are accessible in your PATH variable and on your system. For FULL installation instructions, please read src/signalp-3.0/signalp-3.0.readme in detail.
The experimental scripts/helpers/install_prop.sh can be used to install SignalP with the changes listed above, but more changes may be required. Make sure that SignalP works with the test datasets in its directory before running rAMPage, e.g.
cd src/signalp-3.0
./signalp -t euk test/test.seq
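The two edits in the Before/After table above can also be applied non-interactively. The following is a minimal sketch using sed, assuming the lines in your copy of the signalp script begin exactly as shown in the Before column. The name patch_signalp is hypothetical, and this is not the experimental installer script mentioned above.

```shell
# Hypothetical helper (not part of rAMPage): apply the two Before/After edits
# to the signalp script, keeping a backup copy next to it.
patch_signalp() {
    local script=$1
    cp "$script" "$script.orig"     # backup of the unmodified script
    # Single quotes keep $ROOT_DIR literal, so the patched script contains
    # SIGNALP=$ROOT_DIR/src/signalp-3.0 exactly as in the After column.
    sed -e 's|^SIGNALP=/usr/opt/signalp-3\.0|SIGNALP=$ROOT_DIR/src/signalp-3.0|' \
        -e 's|^AWK=nawk|AWK=awk|' \
        "$script.orig" > "$script"
}
```

It could be run as patch_signalp src/signalp-3.0/signalp from the root of the repository. ROOT_DIR must then be set in the environment whenever SignalP is run, and further manual changes may still be needed, as the note above says.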
To download ProP, you must enter your email address and institution. Afterwards, a download link valid for 4 hours will be emailed to you. Clicking on the link will show one link for each system (e.g. Linux). Click the link to download, or right-click to copy the link and download on the command line using curl or wget.
After moving the downloaded prop-1.0c.Linux.tar.Z file to src, decompress it:
cd src/
cat prop-1.0c.Linux.tar.Z | uncompress | tar xvf -
The file to edit is src/prop-1.0c/prop:
Before | After |
---|---|
setenv PROPHOME /usr/cbs/packages/prop/1.0c/prop-1.0c | setenv PROPHOME $ROOT_DIR/src/prop-1.0c |
*setenv SIGNALP /usr/cbs/bio/bin/signalp | setenv SIGNALP $ROOT_DIR/src/signalp-3.0/signalp |
*edit the one corresponding to your system, Linux used in the example
Note: More changes may need to be made according to what executables are accessible in your PATH variable and on your system. For FULL installation instructions, please read src/prop-1.0c/prop-1.0c.readme in detail.
The experimental scripts/helpers/install_prop.sh can be used to install ProP with the changes listed above, but more changes may be required. Make sure that ProP works with the test datasets in its directory before running rAMPage, e.g.
cd src/prop-1.0c
./prop -s test/EDA_HUMAN.fsa
Download and decompress the following databases:
Database | Example Download Code |
---|---|
RefSeq: Non-mammalian Vertebrates (for amphibia) | wget -O vertebrate_other_protein.faa.gz ftp://ftp.ncbi.nlm.nih.gov/refseq/release/vertebrate_other/vertebrate_other.*.protein.faa.gz |
RefSeq: Invertebrates (for insecta) | wget -O invertebrate_protein.faa.gz ftp://ftp.ncbi.nlm.nih.gov/refseq/release/invertebrate/invertebrate.*.protein.faa.gz |
SwissProt | wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz |
NCBI nr | wget -O nr.fasta.gz ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz |
After decompression, the databases can be configured using scripts/config-entap.sh:
scripts/config-entap.sh -t 8 invertebrate_protein.faa vertebrate_other_protein.faa uniprot_sprot.fasta nr.fasta
The script configures all the databases in the EnTAP-0.10.7-beta/bin directory.
The file to edit is src/sable_v4_distr/run.sable:
Before | After |
---|---|
remDir=$PWD; | remDir=$PWD; THREADS=$1; |
export SABLE_DIR="/users/radamcza/work/newSable/sable_distr"; | export SABLE_DIR="$ROOT_DIR/src/sable_v4_distr"; |
export BLAST_DIR="/usr/local/blast/2.2.28/bin"; | export BLAST_DIR=$BLAST_DIR |
*export NR_DIR="/database/ncbi/nr" | export NR_DIR=$ROOT_DIR/src/EnTAP-0.10.7-beta/bin/nr |
export PRIMARY_DATABASE="/users/radamcza/work/newSable/sable_distr/GI_indexes/pfam_index" | export PRIMARY_DATABASE="$ROOT_DIR/src/sable_v4_distr/GI_indexes/pfam_index" |
export SECONDARY_DATABASE="/users/radamcza/work/newSable/sable_distr/GI_indexes/swissprot_index" | export SECONDARY_DATABASE="$ROOT_DIR/src/sable_v4_distr/GI_indexes/swissprot_index"; |
mkdir $PBS_JOBID | mkdir -p $PBS_JOBID |
/usr/bin/perl ${SABLE_DIR}/sable.pl | perl ${SABLE_DIR}/sable.pl $THREADS |
*After downloading the nr FASTA file (see the database table above), it will need to be configured using BLAST+'s makeblastdb.
Dependency | Tested Version |
---|---|
GNU wget | v1.20.3 |
curl | v7.72.0 |
pigz | v2.4 |
A 2- or 3-column space-delimited text file named input.txt, located in the working directory of each dataset.
Column | Attribute |
---|---|
1 | Pooling ID: generally a condition, tissue, or sex, etc. |
2 | Path to read 1 |
3 | Path to read 2 (if paired-end reads) |
Read paths in this input text file should be relative to the location of the input text file.
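Because the read paths are resolved relative to input.txt, a quick check before launching the pipeline can catch typos early. The following is a minimal sketch of such a check. The name check_input_txt is hypothetical and not part of rAMPage, and it only tests the column count and whether each file exists.

```shell
# Hypothetical helper (not part of rAMPage): check that each line of an input.txt
# has 2 or 3 space-delimited columns and that every read path exists relative
# to the directory containing input.txt. Returns non-zero if a problem is found.
check_input_txt() {
    local input=$1 dir status=0 lineno=0 pool r1 r2 extra read_path
    dir=$(dirname "$input")
    # The "|| [ -n ... ]" clause also handles a last line without a newline.
    while read -r pool r1 r2 extra || [ -n "$pool" ]; do
        lineno=$((lineno + 1))
        [ -z "$pool" ] && continue              # skip blank lines
        if [ -z "$r1" ] || [ -n "$extra" ]; then
            echo "line $lineno: expected 2 or 3 columns"
            status=1
            continue
        fi
        for read_path in "$r1" $r2; do          # $r2 is empty for single-end reads
            if [ ! -e "$dir/$read_path" ]; then
                echo "line $lineno: missing $read_path"
                status=1
            fi
        done
    done < "$input"
    return $status
}
```

For example, check_input_txt insecta/mgulosa/venom/input.txt prints nothing and returns 0 when every listed read is present under insecta/mgulosa/venom.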
Need help downloading reads? The scripts/helpers/get-reads.sh script can be used to download reads. These dependencies are required:
Dependency | Tested Version |
---|---|
SRA toolkit | v2.10.5 |
EDirect | v13.8 |
The input runs.txt should have one SRR accession on each line.
POOLING ID | READ 1 | READ 2 |
---|---|---|
venom | raw_reads/SRR6466797_1.fastq.gz | raw_reads/SRR6466797_2.fastq.gz |
insecta/mgulosa/venom/input.txt:
venom raw_reads/SRR6466797_1.fastq.gz raw_reads/SRR6466797_2.fastq.gz
Using scripts/helpers/get-reads.sh:
scripts/helpers/get-reads.sh -o insecta/mgulosa/venom/raw_reads -p insecta/mgulosa/venom/runs.txt
insecta/mgulosa/venom/runs.txt:
SRR6466797
POOLING ID | READ 1 | READ 2 |
---|---|---|
liver | raw_reads/SRR8288040_1.fastq.gz | raw_reads/SRR8288040_2.fastq.gz |
skin | raw_reads/SRR8288041_1.fastq.gz | raw_reads/SRR8288041_2.fastq.gz |
liver | raw_reads/SRR8288056_1.fastq.gz | raw_reads/SRR8288056_2.fastq.gz |
skin | raw_reads/SRR8288057_1.fastq.gz | raw_reads/SRR8288057_2.fastq.gz |
liver | raw_reads/SRR8288058_1.fastq.gz | raw_reads/SRR8288058_2.fastq.gz |
skin | raw_reads/SRR8288059_1.fastq.gz | raw_reads/SRR8288059_2.fastq.gz |
liver | raw_reads/SRR8288060_1.fastq.gz | raw_reads/SRR8288060_2.fastq.gz |
skin | raw_reads/SRR8288061_1.fastq.gz | raw_reads/SRR8288061_2.fastq.gz |
amphibia/ptoftae/skin-liver/input.txt:
liver raw_reads/SRR8288040_1.fastq.gz raw_reads/SRR8288040_2.fastq.gz
skin raw_reads/SRR8288041_1.fastq.gz raw_reads/SRR8288041_2.fastq.gz
liver raw_reads/SRR8288056_1.fastq.gz raw_reads/SRR8288056_2.fastq.gz
skin raw_reads/SRR8288057_1.fastq.gz raw_reads/SRR8288057_2.fastq.gz
liver raw_reads/SRR8288058_1.fastq.gz raw_reads/SRR8288058_2.fastq.gz
skin raw_reads/SRR8288059_1.fastq.gz raw_reads/SRR8288059_2.fastq.gz
liver raw_reads/SRR8288060_1.fastq.gz raw_reads/SRR8288060_2.fastq.gz
skin raw_reads/SRR8288061_1.fastq.gz raw_reads/SRR8288061_2.fastq.gz
Using scripts/helpers/get-reads.sh:
scripts/helpers/get-reads.sh -o amphibia/ptoftae/skin-liver/raw_reads -p amphibia/ptoftae/skin-liver/runs.txt
amphibia/ptoftae/skin-liver/runs.txt:
SRR8288040
SRR8288041
SRR8288056
SRR8288057
SRR8288058
SRR8288059
SRR8288060
SRR8288061
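In the examples above, runs.txt and input.txt describe the same accessions, so one can be derived from the other. The following is a minimal sketch that lists the accessions referenced by an input.txt, assuming reads are named accession_1.fastq.gz and accession_2.fastq.gz (or accession.fastq.gz for single-end reads) as in these examples. The name runs_from_input is hypothetical and not part of rAMPage.

```shell
# Hypothetical helper (not part of rAMPage): print the unique accessions found
# in column 2 (read 1) of an input.txt, one per line, in order of appearance.
runs_from_input() {
    awk 'NF >= 2 {
        n = split($2, parts, "/")               # keep only the file name of read 1
        acc = parts[n]
        sub(/(_[12])?\.fastq\.gz$/, "", acc)    # strip the _1/_2 suffix and extension
        if (!seen[acc]++) print acc
    }' "$1"
}
```

For instance, runs_from_input amphibia/ptoftae/skin-liver/input.txt > amphibia/ptoftae/skin-liver/runs.txt would recreate the eight-line runs.txt shown above.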
To use a reference transcriptome for the assembly stage with RNA-Bloom, put the reference in the working directory or use the -r option of scripts/rAMPage.sh.
insecta/mgulosa/venom
├── input.txt
├── raw_reads
│   ├── SRR6466797_1.fastq.gz
│   └── SRR6466797_2.fastq.gz
└── tsa.GGFG.1.fsa_nt.gz
In this case, the reference transcriptome, downloaded from ftp://ftp.ncbi.nlm.nih.gov/genbank/tsa/G/tsa.GGFG.1.fsa_nt.gz, is a Transcriptome Shotgun Assembly for M. gulosa.
Multiple references can be used as long as they are placed in the working directory.
Representative Genomes can be found by searching the Genome database on NCBI, using these search terms (A. mellifera, for example):
"Apis mellifera"[orgn]
Transcriptome Shotgun Assemblies can be found by searching the Nucleotide database on NCBI, using these search terms:
tsa-master[prop] "Apis mellifera"[orgn] midgut[All Fields]
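The search terms above follow a fixed pattern, so they can be generated for any species. The following is a minimal sketch that builds the Nucleotide query string. The name tsa_query is hypothetical and not part of rAMPage.

```shell
# Hypothetical helper (not part of rAMPage): build the Transcriptome Shotgun
# Assembly search term for a species and an optional keyword (e.g. a tissue).
tsa_query() {
    local species=$1 keyword=${2:-} query
    query="tsa-master[prop] \"$species\"[orgn]"
    if [ -n "$keyword" ]; then
        query="$query ${keyword}[All Fields]"
    fi
    printf '%s\n' "$query"
}
```

The result can be pasted into the NCBI search box. With EDirect installed (listed in the dependencies above), it could also be passed on the command line, for example esearch -db nuccore -query "$(tsa_query "Apis mellifera" midgut)".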
The rAMPage.sh script in scripts/ runs the pipeline using a Makefile.
PROGRAM: rAMPage.sh
DESCRIPTION:
      Runs the rAMPage pipeline, using the Makefile.
USAGE(S):
      rAMPage.sh [-a <address>] [-b] [-c <taxonomic class>] [-d] [-f] [-h] [-m] [-n <species name>] [-o <output directory>] [-p] [-r <FASTA.gz>] [-s] [-t <int>] [-v] <input reads TXT file>
OPTIONS:
      -a <address>         email address for alerts
      -c <class>           taxonomic class of the dataset (default = top-level directory in $outdir)
      -d                   debug mode of Makefile
      -f                   force characterization even if no AMPs found
      -h                   show help menu
      -m <target>          Makefile target (default = exonerate)
      -n <species>         taxonomic species or name of the dataset (default = second-level directory in $outdir)
      -o <directory>       output directory (default = directory of input reads TXT file)
      -p                   run processes in parallel
      -r <FASTA.gz>        reference transcriptome (accepted multiple times, *.fna.gz *.fsa_nt.gz)
      -s                   strand-specific library construction (default = false)
      -t <int>             number of threads (default = 48)
      -v                   print version number
      -E <e-value>         E-value threshold for homology search (default = 1e-5)
      -S <3.0103 to 80>    AMPlify score threshold for amphibian AMPs (default = 10)
      -L <int>             Length threshold for AMPs (default = 30)
      -C <int>             Charge threshold for AMPs (default = 2)
      -R                   Disable redundancy removal during transcript assembly
EXAMPLE(S):
      rAMPage.sh -a user@example.com -c class -n species -p -s -t 8 -o /path/to/output/directory -r /path/to/reference.fna.gz -r /path/to/reference.fsa_nt.gz /path/to/input.txt
INPUT EXAMPLE:
      tissue /path/to/readA_1.fastq.gz /path/to/readA_2.fastq.gz
      tissue /path/to/readB_1.fastq.gz /path/to/readB_2.fastq.gz
MAKEFILE TARGETS:
      01) check          08) homology
      02) reads          09) cleavage
      03) trim           10) amplify
      04) readslist      11) annotation
      05) assembly       12) exonerate
      06) filtering      13) sable
      07) translation    14) all
The best way to choose score, length, and charge thresholds is to plot the distributions of these values for the reference AMPs.
scripts/helpers/plot-dist.sh -a amphibianAMPs.faa -i insectAMPs.faa -t 8 -o /path/to/output/dir -r
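To look at the numbers behind such a plot, the length and an approximate net charge of each reference AMP can be tabulated directly. The following is a minimal sketch and not what plot-dist.sh does internally. The name fasta_length_charge is hypothetical, and the charge is a simple estimate that counts +1 for each K or R and -1 for each D or E, ignoring histidine and the termini.

```shell
# Hypothetical helper (not part of rAMPage): print "id<TAB>length<TAB>charge"
# for each sequence in a protein FASTA file.
# Assumption: charge = (number of K and R) - (number of D and E).
fasta_length_charge() {
    awk '
        function emit(    s, pos, neg) {
            if (id == "") return
            s = seq; pos = gsub(/[KRkr]/, "", s)    # gsub returns the match count
            s = seq; neg = gsub(/[DEde]/, "", s)
            printf "%s\t%d\t%d\n", id, length(seq), pos - neg
        }
        /^>/ { emit(); id = substr($1, 2); seq = ""; next }    # new record
        { gsub(/[ \t\r*]/, ""); seq = seq $0 }      # join wrapped lines, drop stop codons
        END { emit() }
    ' "$1"
}
```

For example, fasta_length_charge amphibianAMPs.faa | cut -f2 | sort -n | uniq -c gives a count of reference AMPs at each length, which can be compared against the -L threshold.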
Example: M. gulosa (stranded library construction)
scripts/rAMPage.sh -v -s -o insecta/mgulosa/venom -r insecta/mgulosa/venom/tsa.GGFG.1.fsa_nt.gz -c insecta -n mgulosa insecta/mgulosa/venom/input.txt
In the example above, the -o insecta/mgulosa/venom argument is optional, since the output directory defaults to the parent directory of the input.txt file. This option is a safeguard for the scenario where input.txt is not located in the working directory. In this case, the -o option will move input.txt and provided references to the working directory.
rAMPage will use all *.fsa_nt* and *.fna* files located in the working directory as references in the assembly stage, regardless of whether the -r option is used. The -r option is a safeguard for the scenario where the references provided are not located in the working directory. In this case, the -r option will move the references to the working directory.
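Since every matching file in the working directory is used as a reference, it can help to list them before a run. The following is a minimal sketch using the two patterns named above. The name list_references is hypothetical, and this is not rAMPage's own detection code.

```shell
# Hypothetical helper (not part of rAMPage): list files in a working directory
# whose names match the *.fsa_nt* or *.fna* patterns.
list_references() {
    local workdir=$1 f
    for f in "$workdir"/*.fsa_nt* "$workdir"/*.fna*; do
        if [ -e "$f" ]; then        # an unmatched pattern stays literal; skip it
            printf '%s\n' "$f"
        fi
    done
}
```

For the M. gulosa example, list_references insecta/mgulosa/venom would print insecta/mgulosa/venom/tsa.GGFG.1.fsa_nt.gz.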
Example: M. gulosa (stranded library construction)
$ROOT_DIR/scripts/rAMPage.sh -s -r tsa.GGFG.1.fsa_nt.gz -c insecta -n mgulosa input.txt
To run rAMPage on multiple datasets, you can use the stAMPede.sh wrapper script. By default, stAMPede.sh will run rAMPage on the datasets consecutively. If the -s option is invoked, they will be run simultaneously. The -p option allows parallelization of certain processes within each dataset, such as read trimming.
Note: This script is experimental and has fewer options than running rAMPage.sh.
PROGRAM: stAMPede.sh
DESCRIPTION:
      A wrapper around rAMPage.sh to allow running of multiple assemblies.
USAGE(S):
      stAMPede.sh [-a <address>] [-d] [-h] [-m] [-p] [-s] [-t <int>] [-v] <accessions TXT file>
OPTION(S):
      -a <address>         email address for alerts
      -d                   debug mode
      -h                   show help menu
      -m <target>          Makefile target (default = exonerate)
      -p                   allow parallel processes for each dataset
      -s                   simultaneously run rAMPage on all datasets (default if SLURM available)
      -t <int>             number of threads (default = 48)
      -v                   verbose (uses /usr/bin/time -pv to time each rAMPage run)
      -E <e-value>         E-value threshold for homology search (default = 1e-5)
      -S <3.0103 to 80>    AMPlify score threshold for amphibian AMPs (default = 10)
      -L <int>             Length threshold for AMPs (default = 30)
      -C <int>             Charge threshold for AMPs (default = 2)
ACCESSIONS TXT FORMAT:
      CLASS/SPECIES/TISSUE_OR_CONDITION/input.txt strandedness
      amphibia/ptoftae/skin-liver/input.txt nonstranded
      insecta/mgulosa/venom/input.txt stranded
EXAMPLE(S):
      stAMPede.sh -a user@example.com -p -s -v accessions.txt
For running multiple datasets, the multi-input text file should be a 2-column text file:
Column | Attribute |
---|---|
1 | path to input.txt file |
2 | stranded or nonstranded |
Example: P. toftae and M. gulosa
input.txt | strandedness |
---|---|
amphibia/ptoftae/skin-liver/input.txt | nonstranded |
insecta/mgulosa/venom/input.txt | stranded |
multi-input.txt:
amphibia/ptoftae/skin-liver/input.txt nonstranded
insecta/mgulosa/venom/input.txt stranded
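A malformed multi-input file is easy to catch before handing it to stAMPede.sh. The following is a minimal sketch of a format check based only on the two-column layout described above. The name check_accessions is hypothetical and not part of rAMPage.

```shell
# Hypothetical helper (not part of rAMPage): check that every non-blank line of
# a multi-input file has two columns and that column 2 is stranded or nonstranded.
check_accessions() {
    awk '
        NF == 0 { next }                                    # ignore blank lines
        NF != 2 { printf "line %d: expected 2 columns\n", NR; bad = 1; next }
        $2 != "stranded" && $2 != "nonstranded" {
            printf "line %d: unknown strandedness %s\n", NR, $2; bad = 1
        }
        END { exit (bad ? 1 : 0) }                          # non-zero if any problem
    ' "$1"
}
```

For example, check_accessions multi-input.txt prints nothing and returns 0 for the two-line file shown above.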
For reproducibility, clustering AMPs across datasets and choosing AMPs for synthesis are included in the scripts/stAMPede.sh script, but manual clustering can be done using scripts/cluster.sh:
scripts/cluster.sh -o /path/to/outdir amphibia/ptoftae/skin-liver/exonerate insecta/mgulosa/venom/exonerate
These are the 90 AMPs selected for synthesis, but only a subset of 21 have been validated in vitro thus far.
Example directory structure:
rAMPage
├── amphibia
│   └── ptoftae
│       └── skin-liver
├── amp_seqs
├── insecta
│   └── mgulosa
│       └── venom
├── scripts
└── src
Lin, D.; Sutherland, D.; Aninta, S.I.; Louie, N.; Nip, K.M.; Li, C.; Yanai, A.; Coombe, L.; Warren, R.L.; Helbing, C.C.; et al. Mining Amphibian and Insect Transcriptomes for Antimicrobial Peptide Sequences with rAMPage. Antibiotics 2022, 11, 952. https://doi.org/10.3390/antibiotics11070952