GitHub - PedroMTQ/refdb_generator: Automated creation of a few different HMM and diamond reference

Provides automated creation of multiple databases:

Rhea HMMs (implementation not finished)
Reactome diamond database
Swissprot diamond database
Trembl diamond database
Enzyme EC HMMs
BIGG genes diamond database
Metacyc proteins diamond database

BIGG is generally used for metabolic modelling, so the BIGG reference can be linked to metabolic models and metabolic model generation tools.

All databases contain cross-linking metadata in a format compatible with Mantis - https://github.com/PedroMTQ/mantis

Arguments:

-db or --database, to choose which reference to create: rhea, reactome, swissprot, trembl, ec, bigg_genes, or metacyc
-o or --output_folder, to specify the directory to save the database in
-ms or --min_seqs, minimum sequences per HMM. Default is 10
-rf or --remove_files, to remove files from previous runs
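
As a rough illustration of the arguments listed above, here is a minimal argparse sketch. It is not the code in Reference_Generator.py: the function name build_parser is hypothetical, and treating --remove_files as a boolean switch and restricting --database to a fixed set of choices are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Sketch of the refdb_generator command line"
    )
    parser.add_argument(
        "-db", "--database",
        # Names taken from the README; restricting to this set is an assumption.
        choices=["rhea", "reactome", "swissprot", "trembl",
                 "ec", "bigg_genes", "metacyc"],
        required=True,
        help="reference to create",
    )
    parser.add_argument(
        "-o", "--output_folder",
        required=True,
        help="directory to save the database in",
    )
    parser.add_argument(
        "-ms", "--min_seqs",
        type=int,
        default=10,  # documented default
        help="minimum sequences per HMM",
    )
    parser.add_argument(
        "-rf", "--remove_files",
        action="store_true",  # assumption: a switch that takes no value
        help="remove files from previous runs",
    )
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["-db", "ec", "-o", "ec_reference"])
print(args.database, args.output_folder, args.min_seqs, args.remove_files)
```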

The reference_generator.yml file provides a conda environment recipe with all the required packages.

To run this tool:

git clone git@github.com:PedroMTQ/hmm_updater.git
Go to the cloned folder and run conda env create -f reference_generator.yml
Run conda activate reference_generator
Run python Reference_Generator.py with the required arguments

To generate the Metacyc diamond database you need the academic license from Metacyc:

Request Metacyc license
Download the Metacyc flat files
Copy the protseq.fsa and proteins.dat files from the Metacyc data folder to an output_folder
Run python Reference_Generator.py -db metacyc -o output_folder
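
Before launching the Metacyc run, it can help to confirm that both flat files are in place. The check below is only a convenience sketch; the helper name missing_metacyc_files is hypothetical and is not part of the tool.

```python
from pathlib import Path

# File names taken from the steps above.
REQUIRED_METACYC_FILES = ("protseq.fsa", "proteins.dat")


def missing_metacyc_files(output_folder):
    """Return the required Metacyc files that are absent from output_folder."""
    folder = Path(output_folder)
    return [name for name in REQUIRED_METACYC_FILES
            if not (folder / name).is_file()]


# "output_folder" is a placeholder path; replace it with your own directory.
missing = missing_metacyc_files("output_folder")
if missing:
    print("Missing Metacyc files:", ", ".join(missing))
else:
    print("Both Metacyc files found")
```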

How are these references created?

The reference HMMs were created by clustering protein sequences that share a database ID (e.g., for enzyme ECs, we created HMMs from the protein sequences annotated with a given EC number). The diamond databases, in contrast, were created to support a sequence homology search: the functions associated with a match are then transferred to the unknown sequence.

Specifically, the reference HMMs were created in the following manner:

1. Designate the clustering ID type (e.g., Rhea reactions)
2. Extract all protein sequences from the database (extracted directly or predicted with Prodigal)
3. Create fasta files where each file contains the protein sequences associated with a certain ID (e.g., all protein sequences associated with a certain Rhea reaction)
4. Cluster the protein sequences in each fasta using MMSeqs2
5. Create new fasta files based on the clustering results
6. Run a multiple sequence alignment on the previous fasta files (MUSCLE for fasta files with 500 or fewer sequences, and Clustal Omega for files with more than 500 sequences)
7. Create HMMs using HMMER
8. Index the HMMs and create a corresponding metadata.tsv file
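
The grouping and aligner-selection steps described above can be sketched in plain Python. This is an illustration under stated assumptions, not the tool's implementation: all function names are hypothetical, the MMSeqs2 clustering step is omitted (each ID group stands in for a cluster), applying the min_seqs filter at this point is an assumption, and the IDs and sequences are made-up placeholders. No external tools are called; the sketch only decides which aligner would be used.

```python
from collections import defaultdict


def group_sequences_by_id(records):
    """Group (name, sequence, ids) records into {id: [(name, sequence), ...]}."""
    groups = defaultdict(list)
    for name, sequence, ids in records:
        for db_id in ids:
            # A sequence annotated with several IDs lands in several groups.
            groups[db_id].append((name, sequence))
    return dict(groups)


def to_fasta(entries):
    """Render (name, sequence) pairs as fasta text, one file's worth per ID."""
    return "".join(f">{name}\n{sequence}\n" for name, sequence in entries)


def choose_aligner(n_seqs, threshold=500):
    """MUSCLE for 500 or fewer sequences, Clustal Omega above that."""
    return "muscle" if n_seqs <= threshold else "clustalo"


def plan_hmm_builds(groups, min_seqs=10):
    """Map each sufficiently large group to the aligner that would be used."""
    plan = {}
    for db_id, entries in groups.items():
        if len(entries) < min_seqs:
            continue  # assumption: groups below min_seqs yield no HMM
        plan[db_id] = choose_aligner(len(entries))
    return plan


# Placeholder data: IDs and sequences are invented for the example.
records = [
    ("prot_a", "MKTAYIAK", ["ID_1"]),
    ("prot_b", "MKTAYLAK", ["ID_1", "ID_2"]),
    ("prot_c", "MSTNPKPQ", ["ID_1"]),
]
groups = group_sequences_by_id(records)
print(to_fasta(groups["ID_2"]))
print(plan_hmm_builds(groups, min_seqs=2))
```

In a real run, each fasta file would then be passed to the chosen aligner and the resulting alignment to HMMER.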

Diamond databases were created by extracting the SwissProt/Trembl sequences, and then creating an associated metadata.tsv file.
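
To give a feel for the metadata step, the sketch below pulls the accession and protein name out of UniProt-style fasta headers and writes them as tab-separated rows. The two-column layout is hypothetical and is not the Mantis metadata format; the function names and the example header are also made up for illustration.

```python
import csv
import io


def parse_uniprot_header(header):
    """Split a '>db|ACCESSION|ENTRY_NAME Protein name OS=...' header.

    Returns (accession, protein_name).
    """
    identifier, _, rest = header.lstrip(">").partition(" ")
    _db, accession, _entry_name = identifier.split("|")
    # The protein name runs up to the organism field.
    protein_name = rest.split(" OS=")[0]
    return accession, protein_name


def metadata_tsv(headers):
    """Return tab-separated text with one accession/name row per header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for header in headers:
        writer.writerow(parse_uniprot_header(header))
    return buffer.getvalue()


# Invented header that follows the UniProt fasta header layout.
example = ">sp|P00001|EXAMPLE_HUMAN Example protein OS=Homo sapiens OX=9606 PE=1 SV=1"
print(metadata_tsv([example]))
```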

Software used

HMMER to build and press HMMs
MUSCLE for MSAs with 500 or fewer sequences
Clustal Omega for MSAs with more than 500 sequences
Diamond to build Diamond databases

Disclaimer: The author does not own any of the third-party tools or data. Please cite the respective tools and databases.
