Deep convolutional neural network for mapping protein sequences to folds
MacOS Mojave & Xubuntu
(A) Download and Unzip DeepPM source package
Create a working directory called 'DeepPM_package' where all scripts, programs, and databases will reside:
cd ~
mkdir DeepPM_package
Download the DeepPM code:
cd ~/DeepPM_package/
git clone https://github.com/heitorsampaio/DeepPM.git
cd DeepPM
(B) Download feature dataset for training only
cd ~/DeepPM_package/DeepPM
cd datasets
mkdir features
cd features
wget http://sysbio.rnet.missouri.edu/bdm_download/DeepSF/datasets/features/Feature_aa_ss_sa.tar.gz
tar -zxf Feature_aa_ss_sa.tar.gz
rm Feature_aa_ss_sa.tar.gz
wget http://sysbio.rnet.missouri.edu/bdm_download/DeepSF/datasets/features/PSSM_Fea.tar.gz
tar -zxf PSSM_Fea.tar.gz
rm PSSM_Fea.tar.gz
(C) Download the software package for structure prediction (~14 GB)
cd ~/DeepPM_package/DeepPM
wget http://sysbio.rnet.missouri.edu/bdm_download/DeepSF/software.tar.gz
tar -zxf software.tar.gz
rm software.tar.gz
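Steps (B) and (C) repeat the same pattern three times: download an archive, unpack it, delete it. As a minimal sketch, assuming a Python 3 interpreter is available, the pattern can be wrapped in one helper. The function name `fetch_and_extract` is hypothetical and is not part of DeepPM.

```python
import os
import tarfile
import urllib.request


def fetch_and_extract(url, dest_dir):
    """Download a .tar.gz archive into dest_dir, unpack it there, then delete it."""
    os.makedirs(dest_dir, exist_ok=True)
    archive_path = os.path.join(dest_dir, os.path.basename(url))
    # Same role as `wget <url>`.
    urllib.request.urlretrieve(url, archive_path)
    # Same role as `tar -zxf <archive>`.
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(dest_dir)
    # Same role as `rm <archive>`.
    os.remove(archive_path)
    return dest_dir
```

For example, `fetch_and_extract("http://sysbio.rnet.missouri.edu/bdm_download/DeepSF/software.tar.gz", os.path.expanduser("~/DeepPM_package/DeepPM"))` would stand in for the three commands of step (C). Only unpack archives from sources you trust, since `extractall` writes whatever paths the archive contains.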
(D) Install Theano, Keras, and h5py, and update keras.json
(a) Create python virtual environment (if not installed)
virtualenv ~/python_virtualenv_DeepPM
source ~/python_virtualenv_DeepPM/bin/activate
pip install --upgrade pip
(b) Install Keras:
pip install keras==1.2.2
(c) Install theano and numpy:
pip install numpy==1.12.1
pip install theano==0.9.0
(d) Install the h5py library:
pip install h5py
(e) Install the matplotlib library:
pip install matplotlib
(f) Add the entry "image_dim_ordering": "tf" to your keras.json file at ~/.keras/keras.json. After the update, your keras.json should look like the one below:
{
"epsilon": 1e-07,
"floatx": "float32",
"image_dim_ordering":"tf",
"image_data_format": "channels_last",
"backend": "theano"
}
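If you prefer not to edit the file by hand, the update can be scripted. The following is a minimal sketch, not part of DeepPM: the function name `update_keras_config` is hypothetical, and it assumes that any keys already in the file (such as "epsilon" and "floatx") should be kept unchanged.

```python
import json
import os


def update_keras_config(path=os.path.expanduser("~/.keras/keras.json")):
    """Set the Theano backend and "tf" dimension ordering, keeping other keys."""
    config = {}
    if os.path.exists(path):
        with open(path) as handle:
            config = json.load(handle)
    # Values copied from the keras.json listing above.
    config.update({
        "image_dim_ordering": "tf",
        "image_data_format": "channels_last",
        "backend": "theano",
    })
    config_dir = os.path.dirname(path)
    if config_dir and not os.path.isdir(config_dir):
        os.makedirs(config_dir)
    with open(path, "w") as handle:
        json.dump(config, handle, indent=4)
    return config
```

Calling `update_keras_config()` with no argument rewrites ~/.keras/keras.json in place, so keep a copy of the original file if you share it with other Keras projects.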
(E) Configuration
perl configure.pl
(F) Testing
Use the sequence similarity reduction dataset as the training set:
THEANO_FLAGS=floatX=float32,device=cpu python ./training/predict_single.py ./datasets/Traindata.list ./models/model_SimilarityReduction.json ./models/model_SimilarityReduction.h5 datasets/features/ ./test/out1 30
The top1_acc accuracy is 0.85734 (12602/14699)
The top5_acc accuracy is 0.97524 (14335/14699)
The top10_acc accuracy is 0.98925 (14541/14699)
The top15_acc accuracy is 0.99374 (14607/14699)
The top20_acc accuracy is 0.99599 (14640/14699)
THEANO_FLAGS=floatX=float32,device=cpu python ./training/predict_single.py ./datasets/D1_SimilarityReduction_dataset/Testdata_id95againstTrain.list ./models/model_SimilarityReduction.json ./models/model_SimilarityReduction.h5 datasets/features/ ./test/out2 30
The top1_acc accuracy is 0.80378 (1618/2013)
The top5_acc accuracy is 0.93691 (1886/2013)
The top10_acc accuracy is 0.96225 (1937/2013)
The top15_acc accuracy is 0.97317 (1959/2013)
The top20_acc accuracy is 0.97963 (1972/2013)
THEANO_FLAGS=floatX=float32,device=cpu python ./training/predict_single.py ./datasets/D1_SimilarityReduction_dataset/Testdata_id70againstTrain.list ./models/model_SimilarityReduction.json ./models/model_SimilarityReduction.h5 datasets/features/ ./test/out2 30
The top1_acc accuracy is 0.78221 (1117/1428)
The top5_acc accuracy is 0.92437 (1320/1428)
The top10_acc accuracy is 0.95378 (1362/1428)
The top15_acc accuracy is 0.96639 (1380/1428)
The top20_acc accuracy is 0.97409 (1391/1428)
THEANO_FLAGS=floatX=float32,device=cpu python ./training/predict_single.py ./datasets/D1_SimilarityReduction_dataset/Testdata_id40againstTrain.list ./models/model_SimilarityReduction.json ./models/model_SimilarityReduction.h5 datasets/features/ ./test/out2 30
The top1_acc accuracy is 0.75812 (677/893)
The top5_acc accuracy is 0.90034 (804/893)
The top10_acc accuracy is 0.93617 (836/893)
The top15_acc accuracy is 0.95185 (850/893)
The top20_acc accuracy is 0.96193 (859/893)
THEANO_FLAGS=floatX=float32,device=cpu python ./training/predict_single.py ./datasets/D1_SimilarityReduction_dataset/Testdata_id25againstTrain.list ./models/model_SimilarityReduction.json ./models/model_SimilarityReduction.h5 datasets/features/ ./test/out2 30
The top1_acc accuracy is 0.66948 (476/711)
The top5_acc accuracy is 0.87623 (623/711)
The top10_acc accuracy is 0.92124 (655/711)
The top15_acc accuracy is 0.94093 (669/711)
The top20_acc accuracy is 0.95218 (677/711)
THEANO_FLAGS=floatX=float32,device=cpu python ./training/predict_single.py ./datasets/SCOP206.list ./models/model_SimilarityReduction.json ./models/model_SimilarityReduction.h5 datasets/features/ ./test/out1 30
The top1_acc accuracy is 0.72996 (1849/2533)
The top5_acc accuracy is 0.90249 (2286/2533)
The top10_acc accuracy is 0.94512 (2394/2533)
The top15_acc accuracy is 0.95973 (2431/2533)
The top20_acc accuracy is 0.96723 (2450/2533)
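The six runs above differ only in the list file and the output directory, so they can be driven from a short script. This is a sketch under stated assumptions: `build_predict_command` and `run_all` are hypothetical helpers, the per-list output directory name (`out_<list name>`) is an illustrative choice that keeps results from different test sets apart, and the trailing integer (30) is copied from the commands above without interpretation because this README does not document it.

```python
import os
import subprocess

MODEL_JSON = "./models/model_SimilarityReduction.json"
MODEL_WEIGHTS = "./models/model_SimilarityReduction.h5"
FEATURE_DIR = "datasets/features/"


def build_predict_command(list_file, out_dir, last_arg=30):
    """Return the argument list for one predict_single.py run.

    last_arg is the trailing integer from the commands above; its meaning is
    not documented here, so it is passed through unchanged.
    """
    return [
        "python", "./training/predict_single.py",
        list_file, MODEL_JSON, MODEL_WEIGHTS, FEATURE_DIR,
        out_dir, str(last_arg),
    ]


def run_all(list_files, out_root="./test"):
    """Run predict_single.py once per list file, on CPU with float32."""
    env = dict(os.environ, THEANO_FLAGS="floatX=float32,device=cpu")
    for list_file in list_files:
        name = os.path.splitext(os.path.basename(list_file))[0]
        # Hypothetical naming scheme: one output directory per list file.
        out_dir = os.path.join(out_root, "out_" + name)
        subprocess.check_call(build_predict_command(list_file, out_dir), env=env)
```

Run it from ~/DeepPM_package/DeepPM with the virtual environment active, for example `run_all(["./datasets/Traindata.list", "./datasets/SCOP206.list"])`.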
Use the three-level homology reduction dataset as the training set:
THEANO_FLAGS=floatX=float32,device=cpu python ./training/predict_single.py ./datasets/D2_Three_levels_dataset/test_dataset.list_family ./models/model_ThreeLevel.json ./models/model_ThreeLevel.h5 datasets/features/ ./test/out2 36
The top1_acc accuracy is 0.76179 (969/1272)
The top5_acc accuracy is 0.94497 (1202/1272)
The top10_acc accuracy is 0.97563 (1241/1272)
The top15_acc accuracy is 0.98428 (1252/1272)
The top20_acc accuracy is 0.98978 (1259/1272)
THEANO_FLAGS=floatX=float32,device=cpu python ./training/predict_single.py ./datasets/D2_Three_levels_dataset/test_dataset.list_superfamily ./models/model_ThreeLevel.json ./models/model_ThreeLevel.h5 datasets/features/ ./test/out2 36
The top1_acc accuracy is 0.50718 (636/1254)
The top5_acc accuracy is 0.77671 (974/1254)
The top10_acc accuracy is 0.86443 (1084/1254)
The top15_acc accuracy is 0.90431 (1134/1254)
The top20_acc accuracy is 0.92105 (1155/1254)
THEANO_FLAGS=floatX=float32,device=cpu python ./training/predict_single.py ./datasets/D2_Three_levels_dataset/test_dataset.list_fold ./models/model_ThreeLevel.json ./models/model_ThreeLevel.h5 datasets/features/ ./test/out2 36
The top1_acc accuracy is 0.40947 (294/718)
The top5_acc accuracy is 0.70474 (506/718)
The top10_acc accuracy is 0.82451 (592/718)
The top15_acc accuracy is 0.86908 (624/718)
The top20_acc accuracy is 0.89694 (644/718)
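To compare your own output with the reference numbers listed in this section, the accuracy lines can be parsed programmatically. The sketch below assumes the output keeps exactly the line format shown above; `parse_accuracy` and `is_consistent` are hypothetical helper names.

```python
import re

ACC_LINE = re.compile(r"The top(\d+)_acc accuracy is ([0-9.]+) \((\d+)/(\d+)\)")


def parse_accuracy(text):
    """Map k -> (reported, correct, total) for each 'topK_acc' line in text."""
    results = {}
    for match in ACC_LINE.finditer(text):
        k = int(match.group(1))
        results[k] = (
            float(match.group(2)),
            int(match.group(3)),
            int(match.group(4)),
        )
    return results


def is_consistent(reported, correct, total, tol=1e-5):
    """Check that correct/total agrees with the reported value within tol."""
    return abs(float(correct) / total - reported) <= tol
```

For instance, the fold-level top-1 line above parses to `(0.40947, 294, 718)`, and 294/718 agrees with the reported value to five decimal places.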
Training
cd training
sh P1_train.sh
The top1_acc accuracy is 0.84142 (12368/14699)
The top5_acc accuracy is 0.97007 (14259/14699)
The top10_acc accuracy is 0.98782 (14520/14699)
The top15_acc accuracy is 0.99299 (14596/14699)
The top20_acc accuracy is 0.99599 (14640/14699)
Evaluation
cd training
sh P1_evaluate.sh
The top1_acc accuracy is 0.84142 (12368/14699)
The top5_acc accuracy is 0.97007 (14259/14699)
The top10_acc accuracy is 0.98782 (14520/14699)
The top15_acc accuracy is 0.99299 (14596/14699)
The top20_acc accuracy is 0.99599 (14640/14699)
The top1_acc accuracy is 0.70549 (1787/2533)
The top5_acc accuracy is 0.89380 (2264/2533)
The top10_acc accuracy is 0.94039 (2382/2533)
The top15_acc accuracy is 0.95420 (2417/2533)
The top20_acc accuracy is 0.96210 (2437/2533)
The top1_acc accuracy is 0.78341 (1577/2013)
The top5_acc accuracy is 0.92697 (1866/2013)
The top10_acc accuracy is 0.95926 (1931/2013)
The top15_acc accuracy is 0.96870 (1950/2013)
The top20_acc accuracy is 0.97466 (1962/2013)
(G) Protein fold recognition and structure prediction
(a) Download the template database (~34 GB)
cd ~/DeepPM_package/DeepPM
wget http://sysbio.rnet.missouri.edu/bdm_download/DeepSF/database.tar.gz
tar -zxf database.tar.gz
rm database.tar.gz
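Because the archive alone is about 34 GB, a quick check of free disk space before downloading can save a failed extraction. This is a minimal sketch assuming Python 3; `has_free_space` is a hypothetical helper, and the threshold you pass is your own estimate, since the extracted size is not stated here.

```python
import shutil


def has_free_space(path, required_gb):
    """Return True if the filesystem holding path has at least required_gb free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024 ** 3
```

The archive and its extracted contents coexist on disk until the `rm` step, so choose a threshold well above 34, for example `has_free_space(".", 80)` run from ~/DeepPM_package/DeepPM.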
(b) Test the required software
cd software/pspro2
mkdir test3
bin/predict_ss_sa_cm.sh test/test.fasta test3
cd ~/DeepPM_package/DeepPM/software/SCRATCH-1D_1.1
cd doc/
../bin/run_SCRATCH-1D_predictors.sh test.fasta test.out 4
(c) Run fold recognition only
source ~/python_virtualenv_DeepPM/bin/activate
perl scripts/DeepPM_fr.pl scripts/fr_option_adv_for_DeepPM test/test.fasta test/out1 fold_only
The ranking of the top SCOP folds is saved in test/out1/fold_rank_list.SCOP
The ranking of the top ECOD_X folds is saved in test/out1/fold_rank_list.ECOD_X
The ranking of the top ECOD_H folds is saved in test/out1/fold_rank_list.ECOD_H
The ranking of the top selected templates is saved in test/out1/test.template.rank
(d) Run fold recognition and structure prediction
source ~/python_virtualenv_DeepPM/bin/activate
perl scripts/DeepPM_fr.pl scripts/fr_option_adv_for_DeepPM test/test.fasta test/out2
The ranking of the top selected templates is saved in test/out2/test.template.rank
The predicted models are saved in test/out2/TOP5/