Predicting gene-disease associations using heterogeneous graph neural networks

GlavitsBalazs/GeneDiseaseGNN

Predicting Gene-Disease Associations Using Graph Neural Networks

Knowledge of gene-disease associations (GDAs) greatly improves our understanding of the mechanisms underlying diseases and syndromes, and it plays a crucial role in developing new approaches to diagnosis, prevention, and treatment. However, our current knowledge of GDAs remains incomplete, and the majority of true associations have yet to be discovered. Experimentally verifying a hypothesized GDA is resource- and time-intensive, which slows the pace of discovery. Predicting which GDAs are likely to exist can help address this challenge by reducing the resources spent on unsuccessful experiments.

The volume of genomic measurement data, combined with the vast body of medical knowledge, exceeds what any individual can comprehend. Consequently, machine learning methods have become essential tools in biomedical research. Among them, graph neural networks have gained prominence: these models learn from examples and predict whether a meaningful relationship exists between a pair of entities.

Results

| Method | ROC AUC | Recall@90% Specificity | Recall@99% Specificity | Recall@99.5% Specificity | Recall@99.9% Specificity | Recall@99.95% Specificity | Recall@99.99% Specificity |
| --- | --- | --- | --- | --- | --- | --- | --- |
| NCN SAGE | 98.5±0.1 | 96.93±0.3 | 33.7±3.2 | 23.9±3.4 | 10.7±3.4 | 9.8±3.1 | 9.1±2.0 |
| NCN Het MLP | 98.908±0.07 | 98.02±0.2 | 32.2±3.3 | 24.2±3.0 | 13.5±1.9 | 13.1±1.4 | 12.53±1. |
| MF | 95.694±0.09 | 91.3±0.1 | 41.4±2.2 | 30.4±3.3 | 13.7±2.4 | 11.6±2.8 | 7.3±3.8 |
| RWR | 98.516±0.08 | 98.19±0.1 | 83.66±0.9 | 71.5±1.2 | 28.8±5.4 | 11.3±2.3 | 1.9±1.3 |
| AA | 86.68±0.1 | 100.0±0.0 | 16.2±2.8 | 10.4±3.4 | 3.5±1.9 | 3.5±1.9 | 3.5±1.9 |
| Het MLP nofeat | 97.54±0.2 | 94.24±0.7 | 72.5±1.0 | 60.6±1.1 | 32.4±1.8 | 23.1±1.8 | 9.6±2.1 |
| MLP | 98.10±0.2 | 95.85±0.5 | 76.7±1.3 | 65.6±2.4 | 37.4±2.2 | 28.2±2.9 | 12.5±3.8 |
| SAGE | 98.644±0.06 | 97.47±0.2 | 81.36±0.8 | 70.5±1.3 | 37.7±4.5 | 22.9±6.7 | 3.7±5.4 |
| Het SAGE | 98.51±0.1 | 97.13±0.3 | 79.0±1.3 | 67.7±2.1 | 39.7±3.6 | 27.1±2.9 | 12.4±4.0 |
| Het MLP DistMult | 98.993±0.04 | 98.286±0.09 | 81.71±0.9 | 70.7±1.2 | 40.4±2.6 | 29.6±3.5 | 13.8±3.1 |
| Het MLP | 98.85±0.2 | 97.90±0.5 | 82.3±1.5 | 72.±1.8 | 42.6±3.8 | 32.6±4.4 | 17.5±5.1 |
| Het MLP + RWR | 98.736±0.08 | 98.33±0.2 | 89.24±0.5 | 81.1±1.3 | 50.9±2.6 | 36.1±5.6 | 10.±4.9 |

Table 1: Binary classification performance on known positive gene-disease associations versus randomly sampled negative associations, measured on a dataset of 36425 positives and 36425 negatives. Values are the mean ± standard deviation over ten experiments with different cross-validation splits. All numbers are percentages. For example, the last row's Recall = 0.509±0.026 at Specificity = 0.999, also known as Hits@36, corresponds to the following confusion matrix: TP = 18540±947, FN = 17885±947, FP = 36, TN = 36389.
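The confusion matrix quoted in the Table 1 caption can be reproduced from the reported rates with a few lines of arithmetic. The following is a minimal sketch: the helper name `confusion_from_rates` is hypothetical and not part of this repository, and rounding to the nearest integer is an assumption.

```python
def confusion_from_rates(recall, specificity, n_pos, n_neg):
    """Convert recall and specificity into confusion-matrix counts.

    Hypothetical helper for illustration only. Rounding to the nearest
    integer is an assumption, not something taken from the repository.
    """
    fp = round((1.0 - specificity) * n_neg)  # negatives allowed above the threshold
    tn = n_neg - fp
    tp = round(recall * n_pos)               # positives recovered at that threshold
    fn = n_pos - tp
    return {"TP": tp, "FN": fn, "FP": fp, "TN": tn}


# Numbers from the Table 1 caption: 36425 positives and 36425 negatives.
counts = confusion_from_rates(recall=0.509, specificity=0.999,
                              n_pos=36425, n_neg=36425)
# The standard deviation of the recall, expressed as a count of positives.
recall_std_count = round(0.026 * 36425)
print(counts, recall_std_count)
```

With 36425 negatives, a specificity of 0.999 leaves room for 36 false positives, which is where the name Hits@36 comes from.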

| Method | ROC AUC | Recall@90% Specificity | Recall@99% Specificity | Recall@99.5% Specificity | Recall@99.9% Specificity | Recall@99.95% Specificity | Recall@99.98% Specificity |
| --- | --- | --- | --- | --- | --- | --- | --- |
| RWR | 96.75 | 90.00 | 65.07 | 57.10 | 5.51 | 1.7694 | 1.20 |
| RWR+MLP | 98.84 | 97.83 | 81.53 | 73.22 | 29.13 | 16.08 | 3.11 |
| AA | 88.47 | 86.06 | 61.67 | 55.87 | 35.85 | 27.00 | 18.60 |
| SAGE | 99.32 | 99.06 | 85.58 | 77.60 | 50.81 | 38.50 | 22.19 |
| MLP | 98.45 | 95.71 | 83.61 | 76.18 | 49.38 | 38.60 | 25.60 |

Table 2: Binary classification performance on the OGB DDI edge prediction benchmark, measured on a dataset of 133489 positives and 95599 negatives. All numbers are percentages. Recall at 99.98% specificity is equivalent to Hits@20, the standard metric used to compare models on the official OGB leaderboard.
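To make the link between Hits@K and recall at a fixed specificity concrete, here is a minimal sketch of a Hits@K computation. It assumes the common definition in which a positive counts as a hit when it scores strictly above the K-th highest-scoring negative. The function `hits_at_k` and the toy scores are hypothetical; this is not the repository's evaluation code.

```python
import numpy as np


def hits_at_k(pos_scores, neg_scores, k):
    """Fraction of positives scored strictly above the k-th best negative.

    Illustrative sketch under the assumed Hits@K definition stated above.
    """
    pos = np.asarray(pos_scores, dtype=float)
    neg = np.asarray(neg_scores, dtype=float)
    if k > len(neg):
        return 1.0  # fewer than k negatives: every positive counts as a hit
    kth_best_negative = np.sort(neg)[-k]  # k-th highest negative score
    return float(np.mean(pos > kth_best_negative))


# Toy scores (made up for illustration): the 2nd best negative scores 0.5,
# and two of the four positives score above it.
toy_hits = hits_at_k([0.9, 0.8, 0.3, 0.1], [0.85, 0.5, 0.2, 0.05], k=2)

# With the 95599 negatives of Table 2, K = 20 corresponds to this specificity:
specificity_percent = round(100 * (1 - 20 / 95599), 2)
print(toy_hits, specificity_percent)
```

Fixing K fixes the number of negatives allowed near the top of the ranking, so Hits@K is the recall at the corresponding specificity, here about 99.98%.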

Usage

Install dependencies and set up the virtual environment with Poetry:

poetry env use 3.11
poetry install
cd gene_disease_pred
poetry shell

Create the dataset:

# Manually download the required files into gene_disease_pred/input before running.
python create_dataset.py

Make predictions:

python main.py

Measure the performance of different models:

python run_tests.py

Measure performance on OGB data:

python ogb_test.py
