
Toxin genes annotation in venom gland transcriptome assembly


pedronachtigall/ToxCodAn


ToxCodAn

Published in Briefings in Bioinformatics

ToxCodAn is a computational tool designed to detect and annotate toxin genes in transcriptome assemblies.

The guide for venom gland transcriptomics is available here

Getting Started

Installation

Download the master folder and follow the steps below:

unzip ToxCodAn-master.zip
export PATH=$PATH:path/to/ToxCodAn-master/bin/

OR git clone the ToxCodAn repository and add the bin folder to your PATH:

git clone https://github.com/pedronachtigall/ToxCodAn.git
export PATH=$PATH:path/to/ToxCodAn/bin/

Requirements

Ensure that all requirements are working properly.

⚠️ To install ToxCodAn and all of its dependencies in a Conda environment, follow the steps below:

  • Create the environment:

    • conda create -n toxcodan_env -c bioconda python=3.6 biopython=1.69 codan blast hmmer
  • Git clone the ToxCodAn repository and add to your PATH:

    • git clone https://github.com/pedronachtigall/ToxCodAn.git
    • export PATH=$PATH:path/to/ToxCodAn/bin/
  • Download the SignalP-4.1, decompress and add it to your PATH:

    • tar -xzf signalp-4.1g.Linux.tar.gz
    • export PATH=$PATH:path/to/signalp-4.1/
    • Change line 13 of the "signalp" script (path/to/signalp-4.1/signalp) to:
      • $ENV{SIGNALP} = 'path/to/signalp-4.1/';
  • You may need to grant execute permission to all executables in "CodAn/bin/" and "ToxCodAn/bin/":

    • chmod 777 path/to/ToxCodAn/bin/*
  • Then, run ToxCodAn as described in the "Usage" section.

  • To activate the environment to run ToxCodAn just use the command: conda activate toxcodan_env

  • To deactivate the environment just use the command: conda deactivate

  • ⚠️Tip⚠️ Ensure that you have added all conda channels properly:

    • conda config --add channels defaults && conda config --add channels bioconda && conda config --add channels conda-forge
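
Once the environment is active, it can help to confirm that the expected executables are reachable before starting a run. The sketch below uses Python's shutil.which for that. The executable names are assumptions: toxcodan.py, codan.py and signalp are named in the steps above, while blastp and hmmsearch are the usual entry points of the blast and hmmer packages and may not be the exact programs ToxCodAn calls.

```python
import shutil

# Executable names are assumptions (see the note above); adjust them to
# whatever your installation actually provides.
TOOLS = ["toxcodan.py", "codan.py", "signalp", "blastp", "hmmsearch"]


def find_tools(names=TOOLS):
    """Map each executable name to its full path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_tools(names=TOOLS):
    """Return the names that could not be found on PATH."""
    return [name for name, path in find_tools(names).items() if path is None]


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name:12s} {path or 'NOT FOUND'}")
```

If anything is reported as not found, revisit the corresponding export PATH step above.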

Models

The model folder contains specific gHMM models and the toxinDB used in the ToxCodAn pipeline.

Download the models.zip file, uncompress it (unzip models.zip), and pass the resulting folder to the -m option of the ToxCodAn command line (-m path/to/models/).

Usage

Usage: toxcodan.py [options]

Options:
  -h, --help            show this help message and exit
  -s string, --sample=string
                        Optional - sample ID to be used in the output files
                        [default=toxcodan]
  -t fasta, --transcripts=fasta
                        Mandatory - transcripts in FASTA format,
                        /path/to/transcripts.fasta
  -o folder, --output=folder
                        Optional - output folder, /path/to/output_folder; if
                        not defined, the output folder will be set in the
                        current directory [ToxCodAn_output]
  -m path, --model=path
                        Mandatory - path to model folder, /path/to/models
  -p boolean value, --signalp=boolean value
                        Optional - turn on/off the signalP filtering step, use
                        True to turn on or False to turn off [default=True]
  -P boolean value, --partial=boolean value
                        Optional - turn on/off the partial filtering step, use
                        True to turn on or False to turn off [default=False]
  -n path, --nontoxinannotation=path
                        Optional - path to folder containing the protein DB
                        and CodAn model to be used in the NonToxin Annotation
                        pipeline [default=None]
  -c int, --cpu=int     Optional - number of threads to be used in each step
                        [default=1]
  -f int, --covprefilter=int
                        Optional - threshold value used as the minimum
                        coverage in the pre-filter step [default=90]
  -F int, --covtoxinfilter=int
                        Optional - threshold value used as the minimum
                        coverage in the toxin filter step [default=80]

Basic usage:

toxcodan.py -t transcripts.fa -m path/to/models
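
For scripted or batch runs, the command line can be assembled from Python. This is a minimal sketch, not part of ToxCodAn: the function names build_toxcodan_command and run_toxcodan are hypothetical, and only flags listed in the help text above are used. It assumes toxcodan.py is on your PATH.

```python
import subprocess


def build_toxcodan_command(transcripts, models, sample=None, output=None,
                           cpu=None, signalp=None):
    """Assemble a toxcodan.py argument list from the documented options."""
    # -t and -m are the two mandatory options.
    cmd = ["toxcodan.py", "-t", str(transcripts), "-m", str(models)]
    if sample is not None:
        cmd += ["-s", str(sample)]
    if output is not None:
        cmd += ["-o", str(output)]
    if cpu is not None:
        cmd += ["-c", str(int(cpu))]
    if signalp is not None:
        # The help text documents the literal values True and False.
        cmd += ["-p", "True" if signalp else "False"]
    return cmd


def run_toxcodan(**kwargs):
    """Run ToxCodAn; subprocess raises CalledProcessError on a non-zero exit."""
    return subprocess.run(build_toxcodan_command(**kwargs), check=True)
```

For example, build_toxcodan_command("transcripts.fa", "path/to/models") yields the same arguments as the basic usage line above.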

Check our tutorial to learn how to use ToxCodAn.

Inputs

ToxCodAn has the following inputs as mandatory:

  • Transcripts in FASTA format, through the -t option.
  • The uncompressed models folder, through the -m option.

Outputs

ToxCodAn outputs the following files:

SampleID_Toxins_cds.fasta
SampleID_Toxins_pep.fasta
SampleID_Toxins_annotation.gtf
SampleID_Toxins_contigs.fasta
SampleID_PutativeToxins_cds.fasta
SampleID_PutativeToxins_contigs.fasta
SampleID_NonToxins_contigs.fasta

SampleID_Toxins_cds_SPfiltered.fasta (optional step)
SampleID_Toxins_pep_SPfiltered.fasta (optional step)
SampleID_Toxins_contigs_SPfiltered.fasta (optional step)
SampleID_Toxins_cds_SPfiltered_RedundancyFiltered.fasta (optional step)

signalp_annotation.gff (optional step)
RemoveRedundancy.log

Description of the output files:

cds -> coding sequence of the predicted toxins
pep -> protein sequence of the predicted toxins
contigs -> whole contigs containing the predicted CDSs
Toxins -> sequences with very high probability of being toxins
PutativeToxins -> sequences with medium/high probability of being toxins
NonToxins -> sequences with very low probability of being toxins
RedundancyFiltered -> CDSs with 100% identity filtered
SPfiltered -> signalP filtered sequences (optional step)
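
To get a quick overview of a finished run, the sketch below builds the expected file names for a sample ID and counts the FASTA records in each file. It assumes that "SampleID" in the names above is replaced by the value given to -s and that the files sit directly in the output folder; both are assumptions, and the helper names are hypothetical.

```python
from pathlib import Path

# File name suffixes copied from the list of output files above.
CORE_SUFFIXES = [
    "Toxins_cds.fasta",
    "Toxins_pep.fasta",
    "Toxins_annotation.gtf",
    "Toxins_contigs.fasta",
    "PutativeToxins_cds.fasta",
    "PutativeToxins_contigs.fasta",
    "NonToxins_contigs.fasta",
]


def expected_outputs(sample_id):
    """Return the core output file names for one sample ID."""
    return [f"{sample_id}_{suffix}" for suffix in CORE_SUFFIXES]


def count_fasta_records(path):
    """Count FASTA records by counting header lines that start with '>'."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def summarize(output_dir, sample_id):
    """Return {file name: record count, or None if missing} for FASTA outputs."""
    summary = {}
    for name in expected_outputs(sample_id):
        if not name.endswith(".fasta"):
            continue  # skip the GTF annotation file
        path = Path(output_dir) / name
        summary[name] = count_fasta_records(path) if path.is_file() else None
    return summary
```

A call such as summarize("ToxCodAn_output", "toxcodan") uses the default output folder and sample ID from the help text.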

Annotation of Non Toxin transcripts

A simple script, NonToxinAnnotation.py, is available to annotate Non Toxin transcripts. Follow the steps below:

  • First, perform the CDS prediction with the "VERT_full" model using CodAn (reference Nachtigall et al. (2020))
    • codan.py -t path/to/NonToxins_contigs.fasta -m path/to/VERT_full/ -o path/to/output/NonToxins_codan/ -c N
    • We have a copy of the "VERT_full" in the "non_toxin_models" folder: cd path/to/non_toxin_models/ and gzip -d VERT_full
  • Then, use the NonToxinAnnotation.py on the predicted CDSs.
  • This script performs blast search (mandatory) and hmm search using BUSCO and Pfam models (optional).
  • Set the protein DB, either pre-compiled or built with makeblastdb, with the -d option.
    • The user can use a DB such as Swissprot and/or the protein DB available in the "non_toxin_models" folder (uncompress it with tar xjf pepDB.tar.bz2).
    • The user can set any number of DBs (from 1 to N) by separating them with commas ",".
  • Optionally, the user can set any of the BUSCO models to perform hmm search by using the option -b.
  • Optionally, the user can set the Pfam models to perform hmm search by using the option -p. Download the models from ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz and decompress them with gunzip Pfam-A.hmm.gz.
    • Note that you may need to build the auxiliary files for the Pfam models before first use: hmmpress Pfam-A.hmm
  • This script takes advantage of MultiThreading by using the option -c.
  • Usage: NonToxinAnnotation.py -t path/to/output/NonToxins_codan/ORF_sequences.fasta -d path/to/db1,...,path/to/dbN -b path/to/busco/odb -p path/to/pfam.hmm -c N
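
The usage line above can be assembled programmatically, which makes the comma-separated DB list less error-prone. This is a sketch under the assumption that NonToxinAnnotation.py is on your PATH; the function name build_nontoxin_command is hypothetical, and only the flags described in the steps above are used.

```python
def build_nontoxin_command(cds_fasta, databases, busco=None, pfam=None, cpu=None):
    """Assemble a NonToxinAnnotation.py argument list.

    `databases` is a list of one or more protein DB paths; they are joined
    with commas, as the -d option expects.
    """
    if not databases:
        # The blast search is described as mandatory, so require one DB.
        raise ValueError("at least one protein DB is required")
    cmd = ["NonToxinAnnotation.py", "-t", str(cds_fasta),
           "-d", ",".join(str(db) for db in databases)]
    if busco is not None:
        cmd += ["-b", str(busco)]  # optional BUSCO models
    if pfam is not None:
        cmd += ["-p", str(pfam)]   # optional Pfam models
    if cpu is not None:
        cmd += ["-c", str(int(cpu))]
    return cmd
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the standard library.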

⚠️ [Attention 1] To speed up the NonToxin annotation by using the DIAMOND tool, follow the steps below:

  • The diamond tool can be installed through the command: conda install -c bioconda diamond.
  • Build the diamond DB from a set of protein sequences: diamond makedb --in proteins.fasta -d diamondDB.
  • Then, run NonToxinAnnotation.py on the predicted CDSs, setting the option -s diamond and passing the diamond DB to the -d option.
    • NonToxinAnnotation.py -s diamond -t path/to/output/NonToxins_codan/ORF_sequences.fasta -d path/to/diamondDB -c N.
    • Keep the -b path/to/busco/odb -p path/to/pfam.hmm options to perform the hmm search using BUSCO and Pfam models as described above.
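
The two DIAMOND steps can be chained in a short script. The sketch below mirrors the commands shown above; the helper names diamond_commands and run_all are hypothetical, and it assumes both diamond and NonToxinAnnotation.py are on your PATH.

```python
import subprocess


def diamond_commands(proteins_fasta, diamond_db, cds_fasta, cpu=1):
    """Return the makedb command followed by the annotation command."""
    makedb = ["diamond", "makedb", "--in", str(proteins_fasta),
              "-d", str(diamond_db)]
    annotate = ["NonToxinAnnotation.py", "-s", "diamond",
                "-t", str(cds_fasta), "-d", str(diamond_db),
                "-c", str(int(cpu))]
    return [makedb, annotate]


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Building the DB is only needed once; later runs can reuse the same diamondDB and skip the first command.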

⚠️ [Attention 2] Alternatively, if the user wants to directly perform the NonToxins annotation within the ToxCodAn pipeline just follow the steps below:

  • Enter the "non_toxin_models" folder:
    • cd path/to/toxcodan/non_toxin_models/
  • Uncompress the proteinDB (tar xjf pepDB.tar.bz2) and the CodAn model for Vertebrates (gzip -d VERT_full.zip)
  • Then, use the option -n in the ToxCodAn command line to automatically perform the NonToxin annotation:
    • toxcodan.py -s sampleID -t assembly.fasta -o out_toxcodan -m /path/to/models -c 4 -n path/to/non_toxin_models/

Reference

If you use or discuss ToxCodAn, its guide, or any script available at this repository, please cite:

Nachtigall et al. (2021) ToxCodAn: a new toxin annotator and guide to venom gland transcriptomics. Briefings in Bioinformatics. DOI:https://doi.org/10.1093/bib/bbab095

License

GNU GPLv3

Contact

🐛🆘💬

To report bugs, to ask for help and to give any feedback, please contact Pedro G. Nachtigall: pedronachtigall@gmail.com

Frequently Asked Questions (FAQ)

[Q1] What Operating System (OS) do I need to use ToxCodAn?

  • We tested ToxCodAn on Linux Ubuntu 16 and 18, and on macOS Mojave and Catalina. However, we believe that ToxCodAn should work on any UNIX OS on which all of its dependencies can be installed.

[Q2] How long does ToxCodAn take to finish the analysis?

  • On a personal computer (6-core i7 with 16 GB of memory) using 6 threads (-c 6), ToxCodAn took 55 minutes to analyze a de novo dataset with 146,077 sequences. If more threads are available, the running time should decrease.

[Q3] Is ToxCodAn only available for snake species? 🐍

  • Unfortunately, we have only acquired sufficient training data for snake toxins. We are working to obtain training data for other venomous taxa and to make the corresponding models available soon. Stay tuned!
  • If you are working with other venomous taxa and believe that your research group has enough training data to design specific models, please contact me. I will be happy to collaborate and make it happen.

[Q4] When were ToxCodAn's databases last updated?

  • Our models and databases used for annotations were last updated in September 2020.
