RefDNN: a reference drug based neural network for more accurate prediction of anticancer drug resistance
We introduce a Reference Drug-based Neural Detwork model (RefDNN) for predicting anticancer drug resistance and identifying biomarkers contributing drug resistance.
RefDNN can predict accurately whether a cell line is resistant to a new drug on which the model was not trained.
RefDNN requires gene expression profiles and drug molecular fingerprint data.
For more detail, please refer to Choi. et al. "RefDNN: a reference drug based neural network for more accurate prediction of anticancer drug resistance." Scientific Reports 10 (2020):1-11
- Latest update: 21 April 2022
-
RefDNN requires system memory larger than 24GB.
-
If you want to use tensorflow-gpu, GPU memory of more than 8GB is required.
-
We recommend to install via Anaconda (https://www.anaconda.com/)
-
After installing Anaconda, please create a conda environment with the following commands:
git clone https://github.com/mathcom/RefDNN.git
cd RefDNN
~ only CPU ~
conda env create -f environment.yml
~ using GPU ~
conda env create -f environment_gpu.yml
-
We provide several source codes for tutorials.
-
1_nested_cv_RefDNN.py : Nested Cross-validation for RefDNN evaluation
-
1-1_nested_cv_elasticnet.py
-
1-2_nested_cv_randomforest.py
-
-
2_lococv_on_GDSC.py : Leave-one-Cancer-out Cross-validation for evaluating RefDNN on the GDSC dataset
-
3_lodocv_on_GDSC.py : Leave-one-Drug-out Cross-validation for evaluating RefDNN on the GDSC dataset
-
4_biomarker_identification.py : Biomarker identification of reference drugs using RefDNN trained by '1_nested_cv_RefDNN.py'
-
-
These tutorial files are available for reproducibility purposes using the default values.
-
An user can open them using the following commands:
conda activate RefDNN
python 1_nested_cv_RefDNN.py data/response_GDSC.csv data/expression_GDSC.csv data/fingerprint_GDSC.csv -o output_1_GDSC
python 1-1_nested_cv_elasticnet.py data/response_GDSC.csv data/expression_GDSC.csv data/fingerprint_GDSC.csv -o output_1-1_GDSC
python 2_lococv_on_GDSC.py data/response_GDSC.csv data/expression_GDSC.csv data/fingerprint_GDSC.csv data/cell_annotation_GDSC.txt -o output_2_lococv_GDSC
python 3_lodocv_on_GDSC.py data/response_GDSC.csv data/expression_GDSC.csv data/fingerprint_GDSC.csv -o output_3_lodocv_GDSC
python 4_biomarker_identification.py output_1_GDSC -o output_4_GDSC
conda deactivate
-
For each tutorial code, the option parameter '-h' shows help message.
-
The following is the help message of '1_nested_cv_RefDNN.py':
$ python 1_nested_cv_RefDNN.py -h
usage: 1_nested_cv_RefDNN.py [-h] [-o outputdir] [-b batchsize]
[-t numtrainingsteps]
[-s numbayesiansearch] [-k outerkfold]
[-l innerkfold] [-v verbose]
responseFile expressionFile
fingerprintFile
positional arguments:
responseFile A filepath of drug response data for TRAINING
expressionFile A filepath of gene expression data for TRAINING
fingerprintFile A filepath of fingerprint data for TRAINING
optional arguments:
-h, --help show this help message and exit
-o outputdir A directory path for saving outputs
(default:'output_1')
-b batchsize A size of batch on training process. The small size is
recommended if an available size of RAM is small
(default: 64)
-t numtrainingsteps Number of training steps on training process. It is
recommended that the steps is larger than (numpairs /
batchsize) (default: 5000)
-s numbayesiansearch Number of bayesian search for hyperparameter tuning
(default: 20)
-k outerkfold K for outer k-fold cross validation (default: 5)
-l innerkfold L for inner l-fold cross validation (default: 3)
-v verbose 0:No logging, 1:Basic logging to check process, 2:Full
logging for debugging (default:1)
-
This example shows how to run the nested cross validation on the CCLE dataset.
-
Since the default argument values of source code produce very very long process time(about 24 hours), we modify some values.
$ python 1_nested_cv_RefDNN.py data/response_CCLE.csv data/expression_CCLE.csv data/fingerprint_CCLE.csv -o output_1_CCLE -s 10 -t 1000 -b 32
-
The example requires about 30 minutes if GPU is exploited.
-
This example will make a directory named 'output_1_CCLE' and save a result file named 'metrics_hyperparameters.csv' in the directory.
-
The result file contains 3 metrics and hyperparameter values computed by the nested cross validation.
-
When running the example, logging information like below is found.
[START]
[DATA] NUM_PAIRS: 5724
[DATA] NUM_DRUGS: 12
[DATA] NUM_CELLS: 491
[DATA] NUM_GENES: 18926
[DATA] NUM_SENSITIVITY: 2322
[DATA] NUM_RESISTANCE: 3402
[TIME] [1] 2019-4-12 15:16:18
[TIME] [2] 2019-4-12 15:16:18
[OUTER] [1/5] NOW TUNING THE MODEL USING BAYESIAN OPTIMIZATION...
[OUTER] [1/5] BEST_HIDDEN_UNITS : 6
[OUTER] [1/5] BEST_LEARNING_RATE_FTRL : 1.000e-01
[OUTER] [1/5] BEST_LEARNING_RATE_ADAM : 6.827e-02
[OUTER] [1/5] BEST_L1_REGULARIZATION_STRENGTH : 7.324e+00
[OUTER] [1/5] BEST_L2_REGULARIZATION_STRENGTH : 1.052e-01
[OUTER] [1/5] BEST_TRAINING_ACCURACY : -0.908
[OUTER] [1/5] NOW TRAINING THE MODEL WITH BEST PARAMETERS...
[OUTER] [1/5] BEST_TEST_ACCURACY : 0.907
[OUTER] [1/5] BEST_TEST_AUCROC : 0.926
[OUTER] [1/5] BEST_TEST_AUCPR : 0.940
[TIME] [3] 2019-4-12 15:22:11
[OUTER] [2/5] NOW TUNING THE MODEL USING BAYESIAN OPTIMIZATION...
(...intermediate omission...)
[OUTER] [5/5] NOW TRAINING THE MODEL WITH BEST PARAMETERS...
[OUTER] [5/5] BEST_TEST_ACCURACY : 0.898
[OUTER] [5/5] BEST_TEST_AUCROC : 0.944
[OUTER] [5/5] BEST_TEST_AUCPR : 0.951
[TIME] [3] 2019-4-12 15:45:36
[TIME] [4] 2019-4-12 15:45:36
[FINISH]