Skip to content

heitorsampaio/DeepPM

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

DeepPM

Deep convolutional neural network for mapping protein sequences to folds

Test Environment

MacOS Mojave & Xubuntu

Installation Steps

(A) Download and Unzip DeepPM source package

Create a working directory called 'DeepPM' where all scripts, programs and databases will reside:

cd ~
mkdir DeepPM_package

Download the DeepPM code:

cd ~/DeepPM_package/
git clone https://github.com/heitorsampaio/DeepPM.git
cd DeepPM

(B) Download feature dataset for training only

cd ~/DeepPM_package/DeepPM 
cd datasets 
mkdir features
cd features
wget http://sysbio.rnet.missouri.edu/bdm_download/DeepSF/datasets/features/Feature_aa_ss_sa.tar.gz
tar -zxf Feature_aa_ss_sa.tar.gz
rm Feature_aa_ss_sa.tar.gz

wget http://sysbio.rnet.missouri.edu/bdm_download/DeepSF/datasets/features/PSSM_Fea.tar.gz
tar -zxf PSSM_Fea.tar.gz
rm PSSM_Fea.tar.gz

(C) Download software package for structure prediction (~14G)

cd ~/DeepPM_package/DeepPM  
wget http://sysbio.rnet.missouri.edu/bdm_download/DeepSF/software.tar.gz
tar -zxf software.tar.gz
rm software.tar.gz

(D) Install theano, Keras, and h5py and Update keras.json

(a) Create python virtual environment (if not installed)

virtualenv ~/python_virtualenv_DeepPM
source ~/python_virtualenv_DeepPM/bin/activate
pip install --upgrade pip

(b) Install Keras:

pip install keras==1.2.2

(c) Install theano and numpy:

pip install numpy==1.12.1
pip install theano==0.9.0

(d) Install the h5py library:

pip install h5py

(e) Install the matplotlib library:

pip install matplotlib

(f) Add the entry [“image_dim_ordering": "tf”,] to your keras..json file at ~/.keras/keras.json. After the update, your keras.json should look like the one below:

{
"epsilon": 1e-07,
"floatx": "float32",
"image_dim_ordering":"tf",
"image_data_format": "channels_last",
"backend": "theano"
}

(E) Configuration

perl configure.pl

(F) Testing

use Sequence similarity reduction dataset as training

THEANO_FLAGS=floatX=float32,device=cpu python ./training/predict_single.py ./datasets/Traindata.list ./models/model_SimilarityReduction.json  ./models/model_SimilarityReduction.h5 datasets/features/ ./test/out1 30
The top1_acc accuracy is 0.85734 (12602/14699)
The top5_acc accuracy is 0.97524 (14335/14699)
The top10_acc accuracy is 0.98925 (14541/14699)
The top15_acc accuracy is 0.99374 (14607/14699)
The top20_acc accuracy is 0.99599 (14640/14699)

THEANO_FLAGS=floatX=float32,device=cpu python ./training/predict_single.py ./datasets/D1_SimilarityReduction_dataset/Testdata_id95againstTrain.list  ./models/model_SimilarityReduction.json  ./models/model_SimilarityReduction.h5 datasets/features/ ./test/out2 30
The top1_acc accuracy is 0.80378 (1618/2013)
The top5_acc accuracy is 0.93691 (1886/2013)
The top10_acc accuracy is 0.96225 (1937/2013)
The top15_acc accuracy is 0.97317 (1959/2013)
The top20_acc accuracy is 0.97963 (1972/2013)

THEANO_FLAGS=floatX=float32,device=cpu python ./training/predict_single.py ./datasets/D1_SimilarityReduction_dataset/Testdata_id70againstTrain.list  ./models/model_SimilarityReduction.json  ./models/model_SimilarityReduction.h5 datasets/features/ ./test/out2 30
The top1_acc accuracy is 0.78221 (1117/1428)
The top5_acc accuracy is 0.92437 (1320/1428)
The top10_acc accuracy is 0.95378 (1362/1428)
The top15_acc accuracy is 0.96639 (1380/1428)
The top20_acc accuracy is 0.97409 (1391/1428)

THEANO_FLAGS=floatX=float32,device=cpu python ./training/predict_single.py ./datasets/D1_SimilarityReduction_dataset/Testdata_id40againstTrain.list  ./models/model_SimilarityReduction.json  ./models/model_SimilarityReduction.h5 datasets/features/ ./test/out2 30
The top1_acc accuracy is 0.75812 (677/893)
The top5_acc accuracy is 0.90034 (804/893)
The top10_acc accuracy is 0.93617 (836/893)
The top15_acc accuracy is 0.95185 (850/893)
The top20_acc accuracy is 0.96193 (859/893)

THEANO_FLAGS=floatX=float32,device=cpu python ./training/predict_single.py ./datasets/D1_SimilarityReduction_dataset/Testdata_id25againstTrain.list  ./models/model_SimilarityReduction.json  ./models/model_SimilarityReduction.h5 datasets/features/ ./test/out2 30
The top1_acc accuracy is 0.66948 (476/711)
The top5_acc accuracy is 0.87623 (623/711)
The top10_acc accuracy is 0.92124 (655/711)
The top15_acc accuracy is 0.94093 (669/711)
The top20_acc accuracy is 0.95218 (677/711)


THEANO_FLAGS=floatX=float32,device=cpu python ./training/predict_single.py ./datasets/SCOP206.list ./models/model_SimilarityReduction.json  ./models/model_SimilarityReduction.h5 datasets/features/ ./test/out1 30
The top1_acc accuracy is 0.72996 (1849/2533)
The top5_acc accuracy is 0.90249 (2286/2533)
The top10_acc accuracy is 0.94512 (2394/2533)
The top15_acc accuracy is 0.95973 (2431/2533)
The top20_acc accuracy is 0.96723 (2450/2533)

*** use Three-level homology reduction dataset as training***

THEANO_FLAGS=floatX=float32,device=cpu python ./training/predict_single.py ./datasets/D2_Three_levels_dataset/test_dataset.list_family ./models/model_ThreeLevel.json  ./models/model_ThreeLevel.h5 datasets/features/ ./test/out2  36
The top1_acc accuracy is 0.76179 (969/1272)
The top5_acc accuracy is 0.94497 (1202/1272)
The top10_acc accuracy is 0.97563 (1241/1272)
The top15_acc accuracy is 0.98428 (1252/1272)
The top20_acc accuracy is 0.98978 (1259/1272)

THEANO_FLAGS=floatX=float32,device=cpu python ./training/predict_single.py ./datasets/D2_Three_levels_dataset/test_dataset.list_superfamily ./models/model_ThreeLevel.json  ./models/model_ThreeLevel.h5 datasets/features/ ./test/out2  36
The top1_acc accuracy is 0.50718 (636/1254)
The top5_acc accuracy is 0.77671 (974/1254)
The top10_acc accuracy is 0.86443 (1084/1254)
The top15_acc accuracy is 0.90431 (1134/1254)
The top20_acc accuracy is 0.92105 (1155/1254)

THEANO_FLAGS=floatX=float32,device=cpu python ./training/predict_single.py ./datasets/D2_Three_levels_dataset/test_dataset.list_fold ./models/model_ThreeLevel.json  ./models/model_ThreeLevel.h5 datasets/features/ ./test/out2  36
The top1_acc accuracy is 0.40947 (294/718)
The top5_acc accuracy is 0.70474 (506/718)
The top10_acc accuracy is 0.82451 (592/718)
The top15_acc accuracy is 0.86908 (624/718)
The top20_acc accuracy is 0.89694 (644/718)

Training

cd training
sh P1_train.sh

The top1_acc accuracy is 0.84142 (12368/14699)
The top5_acc accuracy is 0.97007 (14259/14699)
The top10_acc accuracy is 0.98782 (14520/14699)
The top15_acc accuracy is 0.99299 (14596/14699)
The top20_acc accuracy is 0.99599 (14640/14699)

Evaluation

cd training
sh P1_evaluate.sh

The top1_acc accuracy is 0.84142 (12368/14699)
The top5_acc accuracy is 0.97007 (14259/14699)
The top10_acc accuracy is 0.98782 (14520/14699)
The top15_acc accuracy is 0.99299 (14596/14699)
The top20_acc accuracy is 0.99599 (14640/14699)

The top1_acc accuracy is 0.70549 (1787/2533)
The top5_acc accuracy is 0.89380 (2264/2533)
The top10_acc accuracy is 0.94039 (2382/2533)
The top15_acc accuracy is 0.95420 (2417/2533)
The top20_acc accuracy is 0.96210 (2437/2533)

The top1_acc accuracy is 0.78341 (1577/2013)
The top5_acc accuracy is 0.92697 (1866/2013)
The top10_acc accuracy is 0.95926 (1931/2013)
The top15_acc accuracy is 0.96870 (1950/2013)
The top20_acc accuracy is 0.97466 (1962/2013)

(G) Protein fold recognition and structure prediction

(a) Download the template database (~34G)

cd ~/DeepPM_package/DeepPM 
wget http://sysbio.rnet.missouri.edu/bdm_download/DeepSF/database.tar.gz
tar -zxf database.tar.gz
rm database.tar.gz

(b) Test required softwares

cd software/pspro2
mkdir test3
bin/predict_ss_sa_cm.sh test/test.fasta  test3

cd software/SCRATCH-1D_1.1
cd doc/
../bin/run_SCRATCH-1D_predictors.sh test.fasta test.out 4

(c) Run fold recognition only

source ~/python_virtualenv_DeepPM/bin/activate
perl scripts/DeepPM_fr.pl scripts/fr_option_adv_for_DeepPM test/test.fasta  test/out1  fold_only

The ranking of top SCOP folds are saved in test/out1/fold_rank_list.SCOP
The ranking of top ECOD_X folds are saved in test/out1/fold_rank_list.ECOD_X
The ranking of top ECOD_H folds are saved in test/out1/fold_rank_list.ECOD_H

The ranking of top selected templates are saved in  test/out1/test.template.rank

(d) Run fold recognition and structure prediction

source ~/python_virtualenv_DeepPM/bin/activate
perl scripts/DeepPM_fr.pl scripts/fr_option_adv_for_DeepPM test/test.fasta  test/out2

The ranking of top selected templates are saved in  test/out2/test.template.rank
The predicted models are saved in test/out2/TOP5/

About

Deep Learning for protein fold

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published