Knowledge of gene-disease associations (GDAs) can greatly enhance our understanding of the mechanisms underlying diseases and syndromes, and plays a crucial role in developing new approaches for diagnosis, prevention, and treatment. However, our current understanding of GDAs remains incomplete, with the majority of true associations yet to be discovered. Experimentally verifying new hypothetical GDAs is resource- and time-intensive, impeding the pace of discovery. To address this challenge, it may be beneficial to predict which GDAs are likely to exist, thereby reducing the resources expended on unsuccessful experiments.
The immense volume of genomic measurement data, combined with the vast body of medical knowledge, has surpassed the capacity for individual comprehension. Consequently, machine learning-based methods have emerged as essential tools in biomedical research. Among these methods, graph neural networks have gained prominence. These models can learn from examples and predict the presence or absence of meaningful relationships between pairs of entities.
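As a concrete illustration of how such models frame the task, here is a minimal sketch (assuming PyTorch Geometric; the class and tensor names are illustrative and this is not the exact architecture used in this repository): a GraphSAGE encoder embeds the nodes of the gene-disease graph, and a small MLP decoder scores whether a candidate pair is associated.

```python
# Minimal link-prediction sketch (illustrative, not this repository's exact model).
# Assumes PyTorch and PyTorch Geometric are installed.
import torch
import torch.nn as nn
from torch_geometric.nn import SAGEConv

class LinkPredictor(nn.Module):
    def __init__(self, in_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.conv1 = SAGEConv(in_dim, hidden_dim)
        self.conv2 = SAGEConv(hidden_dim, hidden_dim)
        self.decoder = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, x, edge_index, pairs):
        # Two rounds of neighbourhood aggregation over the association graph.
        h = torch.relu(self.conv1(x, edge_index))
        h = self.conv2(h, edge_index)
        # Score each (gene, disease) candidate pair from concatenated embeddings.
        src, dst = pairs
        return self.decoder(torch.cat([h[src], h[dst]], dim=-1)).squeeze(-1)
```

A model of this kind is typically trained with a binary cross-entropy loss on known associations and randomly sampled negative pairs, which matches the evaluation setup in Table 1 below.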
Method | ROC AUC | Recall@90% Specificity | Recall@99% Specificity | Recall@99.5% Specificity | Recall@99.9% Specificity | Recall@99.95% Specificity | Recall@99.99% Specificity |
---|---|---|---|---|---|---|---|
NCN SAGE | 98.5±0.1 | 96.93±0.3 | 33.7±3.2 | 23.9±3.4 | 10.7±3.4 | 9.8±3.1 | 9.1±2.0 |
NCN Het MLP | 98.908±0.07 | 98.02±0.2 | 32.2±3.3 | 24.2±3.0 | 13.5±1.9 | 13.1±1.4 | 12.53±1. |
MF | 95.694±0.09 | 91.3±0.1 | 41.4±2.2 | 30.4±3.3 | 13.7±2.4 | 11.6±2.8 | 7.3±3.8 |
RWR | 98.516±0.08 | 98.19±0.1 | 83.66±0.9 | 71.5±1.2 | 28.8±5.4 | 11.3±2.3 | 1.9±1.3 |
AA | 86.68±0.1 | 100.0±0.0 | 16.2±2.8 | 10.4±3.4 | 3.5±1.9 | 3.5±1.9 | 3.5±1.9 |
Het MLP nofeat | 97.54±0.2 | 94.24±0.7 | 72.5±1.0 | 60.6±1.1 | 32.4±1.8 | 23.1±1.8 | 9.6±2.1 |
MLP | 98.10±0.2 | 95.85±0.5 | 76.7±1.3 | 65.6±2.4 | 37.4±2.2 | 28.2±2.9 | 12.5±3.8 |
SAGE | 98.644±0.06 | 97.47±0.2 | 81.36±0.8 | 70.5±1.3 | 37.7±4.5 | 22.9±6.7 | 3.7±5.4 |
Het SAGE | 98.51±0.1 | 97.13±0.3 | 79.0±1.3 | 67.7±2.1 | 39.7±3.6 | 27.1±2.9 | 12.4±4.0 |
Het MLP DistMult | 98.993±0.04 | 98.286±0.09 | 81.71±0.9 | 70.7±1.2 | 40.4±2.6 | 29.6±3.5 | 13.8±3.1 |
Het MLP | 98.85±0.2 | 97.90±0.5 | 82.3±1.5 | 72.±1.8 | 42.6±3.8 | 32.6±4.4 | 17.5±5.1 |
Het MLP + RWR | 98.736±0.08 | 98.33±0.2 | 89.24±0.5 | 81.1±1.3 | 50.9±2.6 | 36.1±5.6 | 10.±4.9 |
Table 1: Binary classification performance on known positive gene-disease associations versus randomly sampled negative associations, measured on a dataset of 36425 positives and 36425 negatives. Mean±stddev over ten experiments with different cross-validation splits is shown. All numbers are percentages. For example, the Recall = 50.9±2.6 @ Specificity = 99.9 entry, also known as Hits@36, corresponds to the following confusion matrix: TP = 18540±947, FN = 17885±947, FP = 36, TN = 36389.
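The Recall@Specificity columns can be reproduced from raw prediction scores by thresholding at the appropriate quantile of the negative scores. A small sketch follows; the score arrays are placeholders, and this is not the repository's evaluation code.

```python
import numpy as np

def recall_at_specificity(pos_scores, neg_scores, specificity):
    """Recall on positives when the score threshold rejects `specificity` of negatives."""
    # Threshold below which the requested fraction of negative scores fall.
    threshold = np.quantile(np.asarray(neg_scores), specificity)
    return float(np.mean(np.asarray(pos_scores) > threshold))

# Example: the Recall@99.9% Specificity column corresponds to
# recall_at_specificity(pos_scores, neg_scores, 0.999)
```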
Method | ROC AUC | Recall@90% Specificity | Recall@99% Specificity | Recall@99.5% Specificity | Recall@99.9% Specificity | Recall@99.95% Specificity | Recall@99.98% Specificity |
---|---|---|---|---|---|---|---|
RWR | 96.75 | 90.00 | 65.07 | 57.10 | 5.51 | 1.7694 | 1.20 |
RWR+MLP | 98.84 | 97.83 | 81.53 | 73.22 | 29.13 | 16.08 | 3.11 |
AA | 88.47 | 86.06 | 61.67 | 55.87 | 35.85 | 27.00 | 18.60 |
SAGE | 99.32 | 99.06 | 85.58 | 77.60 | 50.81 | 38.50 | 22.19 |
MLP | 98.45 | 95.71 | 83.61 | 76.18 | 49.38 | 38.60 | 25.60 |
Table 2: Binary classification performance on the OGB DDI edge prediction benchmark, measured on a dataset of 133489 positives and 95599 negatives. All numbers are percentages. The Recall@99.98% Specificity metric is equivalent to the gold-standard Hits@20 metric used to compare models on the official OGB leaderboard.
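The correspondence between Hits@K and Recall@Specificity noted in both table captions follows directly from the number of negatives: admitting at most K false positives out of N negatives fixes the specificity at 1 - K/N.

```python
def hits_k_specificity(k: int, num_negatives: int) -> float:
    # Specificity implied by allowing at most k false positives among the negatives.
    return 1.0 - k / num_negatives

print(hits_k_specificity(20, 95599))  # ~0.9998, the Hits@20 column of Table 2
print(hits_k_specificity(36, 36425))  # ~0.9990, the Hits@36 example in Table 1's caption
```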
Install dependencies and set up the virtual environment with Poetry:
```bash
poetry env use 3.11
poetry install
cd gene_disease_pred
poetry shell
```
Create the dataset:

```bash
# Please manually download the required files into gene_disease_pred/input before running.
python create_dataset.py
```
Make predictions:
```bash
python main.py
```
Measure the performance of different models:
```bash
python run_tests.py
```
Measure performance on OGB data:
```bash
python ogb_test.py
```