Prioritizing genomic variants through neuro-symbolic, knowledge-enhanced learning.
We integrated annotations from the following sources:
- Gene Ontology (GO)
- Mammalian Phenotype Ontology (MP)
- Human Phenotype Ontology (HPO)
- Uber-anatomy Ontology (UBERON)
- The code was developed and tested using Python 3.9.6
Use a Python version newer than 3.9 and older than 3.10 (that is, a 3.9.x release).
- We used the mOWL library to process the input dataset and to generate embedding representations with different embedding-based methods.
You need Java (a JDK) installed on your machine.
- Download all the files from data and place the uncompressed files in the folder named /data.
- Download the required database using CADD and follow the instructions to generate the TSV file with CADD scores for the input VCF file.
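The prerequisites above can be checked before running the tool. The following is a minimal sketch, not part of EmbedPVP: the function name `check_prerequisites` is hypothetical, and it assumes that a working Java installation exposes a `java` executable on the PATH and that the data files were placed in a local folder called `data`.

```python
import shutil
import sys
from pathlib import Path


def check_prerequisites(data_root="data"):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []

    # The README asks for a Python release newer than 3.9 and older than 3.10.
    if sys.version_info[:2] != (3, 9):
        problems.append(
            "Python 3.9.x expected, found %d.%d" % sys.version_info[:2]
        )

    # Assumption: an installed Java/JDK provides a `java` executable on PATH.
    if shutil.which("java") is None:
        problems.append("no `java` executable found on PATH")

    # Assumption: the uncompressed data files live in this local folder.
    if not Path(data_root).is_dir():
        problems.append("data folder not found: %s" % data_root)

    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("WARNING:", problem)
```

The function only reports problems and never raises, so it can be run on any machine before installation.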
You can install the tool either from source or from PyPI as follows:
- From source:
python3 -m venv embedpvp_env
source ./embedpvp_env/bin/activate
git clone https://github.com/bio-ontology-research-group/EmbedPVP.git
cd EmbedPVP/
python setup.py install
mkdir output
embedpvp [args]
- From PyPI:
pip install embedpvp
mkdir output
embedpvp [args]
- Run the command
embedpvp --help
to display help and parameters:
Initializing the package
Usage: embedpvp [OPTIONS]

Options:
  -d, --data-root TEXT       Data root folder  [required]
  -i, --in_file TEXT         Annotated Input file  [required]
  -p, --pathogenicity TEXT   Path to the pathogenicity prediction file (CADD)
                             [required]
  -hpo, --hpo TEXT           List of phenotype codes separated by commas
                             [required]
  -m, --model_type TEXT      Ontology model, one of the following (go , mp ,
                             hp, uberon, union)
  -e, --embedding TEXT       Preferred embedding model (e.g. TransD, TransE,
                             TranR, ConvE ,DistMult, DL2vec, OWL2vc)
                             [required]
  -dir, --outdir TEXT        Path to the output directory
  -o, --outfile TEXT         Path to the results output file
  --help                     Show this message and exit.
- Example:
embedpvp -d data/ -i example_annotation.vcf.hg38_multianno.txt -p example_cadd.tsv.gz -hpo HP:0004791,HP:0002020,HP:0100580,HP:0001428,HP:0011459 -m hp -e TransE -dir output/ -o example_output1.tsv
The script outputs a ranking score for each variant in the list of candidate causative variants.
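The example invocation can also be assembled from a script, which makes it easier to catch malformed arguments before a long run. The sketch below is illustrative only: `build_command` is a hypothetical helper, not part of EmbedPVP. It uses only the flags listed in the help text, and it assumes that phenotype codes follow the standard HPO identifier form (`HP:` followed by seven digits).

```python
import re
import shlex

# HPO identifiers have the form "HP:" followed by seven digits.
HPO_ID = re.compile(r"^HP:\d{7}$")

# Ontology choices listed in the help text for -m/--model_type.
MODEL_TYPES = {"go", "mp", "hp", "uberon", "union"}


def build_command(data_root, in_file, cadd_file, hpo_terms, model_type,
                  embedding, outdir, outfile):
    """Validate the inputs, then return the embedpvp argument list."""
    bad = [term for term in hpo_terms if not HPO_ID.match(term)]
    if bad:
        raise ValueError("malformed HPO identifiers: %s" % ", ".join(bad))
    if model_type not in MODEL_TYPES:
        raise ValueError("unknown model type: %s" % model_type)
    return [
        "embedpvp",
        "-d", data_root,
        "-i", in_file,
        "-p", cadd_file,
        "-hpo", ",".join(hpo_terms),  # the CLI expects a comma-separated list
        "-m", model_type,
        "-e", embedding,
        "-dir", outdir,
        "-o", outfile,
    ]


if __name__ == "__main__":
    cmd = build_command(
        "data/",
        "example_annotation.vcf.hg38_multianno.txt",
        "example_cadd.tsv.gz",
        ["HP:0004791", "HP:0002020", "HP:0100580", "HP:0001428", "HP:0011459"],
        "hp",
        "TransE",
        "output/",
        "example_output1.tsv",
    )
    # Print the command; to execute it, pass `cmd` to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Returning an argument list rather than a single string avoids shell quoting issues when the command is later passed to `subprocess.run`.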
For further details, or if you use EmbedPVP in your work, please refer to this article:
@article{althagafi2023prioritizing,
title={Prioritizing genomic variants through neuro-symbolic, knowledge-enhanced learning},
author={Althagafi, Azza and Zhapa-Camacho, Fernando and Hoehndorf, Robert},
journal={bioRxiv},
pages={2023--11},
year={2023},
publisher={Cold Spring Harbor Laboratory}
}
For any questions or comments, please contact azza.althagafi@kaust.edu.sa