Knowledge of gene-disease associations (GDAs) can greatly enhance our understanding of the mechanisms underlying diseases and syndromes, and plays a crucial role in developing new approaches for diagnosis, prevention, and treatment. However, our current understanding of GDAs remains incomplete, with the majority of true associations yet to be discovered. Experimentally verifying new hypothetical GDAs is resource- and time-intensive, impeding the pace of discovery. To address this challenge, it may be beneficial to predict which GDAs are likely to exist, thereby reducing the resources expended on unsuccessful experiments.
The immense volume of genomic measurement data, combined with the vast body of medical knowledge, has surpassed the capacity for individual comprehension. Consequently, machine learning-based methods have emerged as essential tools in biomedical research. Among these methods, graph neural networks have gained prominence. These models can learn from examples and predict the presence or absence of meaningful relationships between pairs of entities.
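As a concrete illustration of how such models frame the task, here is a minimal sketch (assuming PyTorch Geometric; the class and tensor names are illustrative and this is not the exact architecture used in this repository): a GraphSAGE encoder embeds the nodes of the gene-disease graph, and a small MLP decoder scores whether a candidate pair is associated.

```python
# Minimal link-prediction sketch (illustrative, not this repository's exact model).
# Assumes PyTorch and PyTorch Geometric are installed.
import torch
import torch.nn as nn
from torch_geometric.nn import SAGEConv

class LinkPredictor(nn.Module):
    def __init__(self, in_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.conv1 = SAGEConv(in_dim, hidden_dim)
        self.conv2 = SAGEConv(hidden_dim, hidden_dim)
        self.decoder = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, x, edge_index, pairs):
        # Two rounds of neighbourhood aggregation over the association graph.
        h = torch.relu(self.conv1(x, edge_index))
        h = self.conv2(h, edge_index)
        # Score each (gene, disease) candidate pair from concatenated embeddings.
        src, dst = pairs
        return self.decoder(torch.cat([h[src], h[dst]], dim=-1)).squeeze(-1)
```

A model of this kind is typically trained with a binary cross-entropy loss on known associations and randomly sampled negative pairs, which matches the evaluation setup in Table 1 below.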
Method | ROC AUC | Recall@90% Specificity | Recall@99% Specificity | Recall@99.5% Specificity | Recall@99.9% Specificity | Recall@99.95% Specificity | Recall@99.99% Specificity |
---|---|---|---|---|---|---|---|
NCN SAGE | 98.5±0.1 | 96.93±0.3 | 33.7±3.2 | 23.9±3.4 | 10.7±3.4 | 9.8±3.1 | 9.1±2.0 |
NCN Het MLP | 98.908±0.07 | 98.02±0.2 | 32.2±3.3 | 24.2±3.0 | 13.5±1.9 | 13.1±1.4 | 12.53±1. |
MF | 95.694±0.09 | 91.3±0.1 | 41.4±2.2 | 30.4±3.3 | 13.7±2.4 | 11.6±2.8 | 7.3±3.8 |
RWR | 98.516±0.08 | 98.19±0.1 | 83.66±0.9 | 71.5±1.2 | 28.8±5.4 | 11.3±2.3 | 1.9±1.3 |
AA | 86.68±0.1 | 100.0±0.0 | 16.2±2.8 | 10.4±3.4 | 3.5±1.9 | 3.5±1.9 | 3.5±1.9 |
Het MLP nofeat | 97.54±0.2 | 94.24±0.7 | 72.5±1.0 | 60.6±1.1 | 32.4±1.8 | 23.1±1.8 | 9.6±2.1 |
MLP | 98.10±0.2 | 95.85±0.5 | 76.7±1.3 | 65.6±2.4 | 37.4±2.2 | 28.2±2.9 | 12.5±3.8 |
SAGE | 98.644±0.06 | 97.47±0.2 | 81.36±0.8 | 70.5±1.3 | 37.7±4.5 | 22.9±6.7 | 3.7±5.4 |
Het SAGE | 98.51±0.1 | 97.13±0.3 | 79.0±1.3 | 67.7±2.1 | 39.7±3.6 | 27.1±2.9 | 12.4±4.0 |
Het MLP DistMult | 98.993±0.04 | 98.286±0.09 | 81.71±0.9 | 70.7±1.2 | 40.4±2.6 | 29.6±3.5 | 13.8±3.1 |
Het MLP | 98.85±0.2 | 97.90±0.5 | 82.3±1.5 | 72.±1.8 | 42.6±3.8 | 32.6±4.4 | 17.5±5.1 |
Het MLP + RWR | 98.736±0.08 | 98.33±0.2 | 89.24±0.5 | 81.1±1.3 | 50.9±2.6 | 36.1±5.6 | 10.±4.9 |
Table 1: Binary classification performance on known positive gene-disease associations versus randomly sampled negative associations, measured on a dataset of 36425 positives and 36425 negatives. Mean±stddev over ten experiments with different cross-validation splits is shown. All numbers are percentages. For example, the Recall = 50.9±2.6 @ Specificity = 99.9 entry, also known as Hits@36, corresponds to the following confusion matrix: TP = 18540±947, FN = 17885±947, FP = 36, TN = 36389.
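The Recall@Specificity columns can be reproduced from raw prediction scores by thresholding at the appropriate quantile of the negative scores. A small sketch follows; the score arrays are placeholders, and this is not the repository's evaluation code.

```python
import numpy as np

def recall_at_specificity(pos_scores, neg_scores, specificity):
    """Recall on positives when the score threshold rejects `specificity` of negatives."""
    # Threshold below which the requested fraction of negative scores fall.
    threshold = np.quantile(np.asarray(neg_scores), specificity)
    return float(np.mean(np.asarray(pos_scores) > threshold))

# Example: the Recall@99.9% Specificity column corresponds to
# recall_at_specificity(pos_scores, neg_scores, 0.999)
```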
Method | ROC AUC | Recall@90% Specificity | Recall@99% Specificity | Recall@99.5% Specificity | Recall@99.9% Specificity | Recall@99.95% Specificity | Recall@99.98% Specificity |
---|---|---|---|---|---|---|---|
RWR | 96.75 | 90.00 | 65.07 | 57.10 | 5.51 | 1.7694 | 1.20 |
RWR+MLP | 98.84 | 97.83 | 81.53 | 73.22 | 29.13 | 16.08 | 3.11 |
AA | 88.47 | 86.06 | 61.67 | 55.87 | 35.85 | 27.00 | 18.60 |
SAGE | 99.32 | 99.06 | 85.58 | 77.60 | 50.81 | 38.50 | 22.19 |
MLP | 98.45 | 95.71 | 83.61 | 76.18 | 49.38 | 38.60 | 25.60 |
Table 2: Binary classification performance on the OGB DDI edge prediction benchmark, measured on a dataset of 133489 positives and 95599 negatives. All numbers are percentages. The Recall@99.98% Specificity metric is equivalent to the gold-standard Hits@20 metric used to compare models on the official OGB leaderboard.
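The correspondence between Hits@K and Recall@Specificity noted in both table captions follows directly from the number of negatives: admitting at most K false positives out of N negatives fixes the specificity at 1 - K/N.

```python
def hits_k_specificity(k: int, num_negatives: int) -> float:
    # Specificity implied by allowing at most k false positives among the negatives.
    return 1.0 - k / num_negatives

print(hits_k_specificity(20, 95599))  # ~0.9998, the Hits@20 column of Table 2
print(hits_k_specificity(36, 36425))  # ~0.9990, the Hits@36 example in Table 1's caption
```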
Install dependencies and set up the virtual environment with Poetry:
```bash
poetry env use 3.11
poetry install
cd gene_disease_pred
poetry shell
```
Create the dataset:

```bash
# Please manually download the required files into gene_disease_pred/input before running.
python create_dataset.py
```
Make predictions:
```bash
python main.py
```
Measure the performance of different models:
```bash
python run_tests.py
```
Measure performance on OGB data:
```bash
python ogb_test.py
```