Loading Patient Genomic Information (VCFs) into Knowledge Graphs

This is a project that was started during the October 2023 Data Management for Transformer Models Hackathon (hybrid).

The overall goal of this project is to find a way to integrate or structure patient genomics information stored in VCFs as a cohort into a knowledge graph.

Team Members:
- Chiao-Feng Lin (team lead)
- Rachit Kumar
- Soham Shirolkar
- Aniket Naik

Justification

We have reached the stage of being able to produce massive datasets and communicate insights rapidly. However, current clinical practice and research often struggles to keep up with this information, especially in the context of particular patients, in large part due to the ever-continuing evolution of knowledge and the quantity of that knowledge. To better equip researchers and clinicians, it is necessary to begin building frameworks that integrate our existing knowledge with patient-specific information at a cohort and individual level.

Disclaimer: This project was developed for a hackathon and is NOT at a production stage and should not be used to replace or augment clinical decisions in its current form.

Visual Workflow

Dataset

As described in the flowchart above, we used the COAD-CPTAC dataset as the proof-of-concept dataset. In theory, the pipeline should broadly be applicable to other similarly-structured datasets, though components of it may need to be changed (particularly which concept the patients are linked to and potentially some formatting around the loading of the data). The rest of the framework should be broadly portable, though, including the querying of the databases for unifying identifiers.

Graph Construction

The underlying relationships within the graph are to be constructed based on cohort-specific relationships (for example, whether patients have colon adenocarcinoma or not) as well as information acquired from existing clinical knowledgebases.

Concepts contained within the graph include:

Concept	Name	Number
Node Type	Sample	93
Node Type	Gene	3302
Node Type	Cancer Type	1
Edge Type	isCancerTypeOf (connects Sample -> Cancer Type)	93
Edge Type	hasHugoSymbol (connects Sample -> Gene)	15285

External Resources Used

BioPortal
Mondo Disease Ontology
- through BioOntology and
- through their direct release
ClinVar (note - this is used indirectly as the COAD-CPTAC dataset already provides these annotations)
HUGO Gene Nomenclature Committee Annotations to provide unique identifiers for genes

Graph Data Format (RDF)

To store the data, we used the RDF (Resource Description Framework) data format that represents graphs as nodes and triples connecting nodes. Details on this format can be found here, but in summary, the individual node types above are defined using specific ontologies or from the cohort and then the triples are pulled from information stored within the cohort (for example, which samples have mutations in what Hugo gene identifiers).

RDFLib offers a very natural and accessible framework for constructing and manipulating graphs using the RDF specifications in Python.

Installation Instructions

The notebook that implements the graph and creates the graph RDF file can be found in tcga_rdf.ipynb

To install the dependencies required, you can run the following command (ideally within a virtual Python environment like Conda or virtualenv) in your terminal:

python -m pip install rdflib requests pandas

(Note that the current code available was developed using Python 3.10 with rdflib 7.0.0 - if the latest version you are using breaks these, you can pin your Python version and the rdflib version rather than installing the latest one).

You will then need to provide your API key for BioOntology in the script tcga_rdf.ipynb. To get such an API key, make an account and login to BioPortal. You will find the API key in your account settings.

Note: this project was executed using a dataset from TCGA which already provided a converted MAF file, rather than a VCF. However, it is straightforward to convert VCF to MAF files, and you can do so using vcf2maf. We have not included this as part of the pipeline yet, but future plans include doing so.

Name		Name	Last commit message	Last commit date
Latest commit History 21 Commits
LICENSE		LICENSE
Presentation_Hackathon.pdf		Presentation_Hackathon.pdf
README.md		README.md
tcga_rdf.ipynb		tcga_rdf.ipynb
vcfs2kgs_flowchart.pdf		vcfs2kgs_flowchart.pdf

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Loading Patient Genomic Information (VCFs) into Knowledge Graphs

Justification

Visual Workflow

Dataset

Graph Construction

External Resources Used

Graph Data Format (RDF)

Installation Instructions

About

Releases

Packages

Contributors 3

Languages

License

collaborativebioinformatics/vcfs2kgs

Folders and files

Latest commit

History

Repository files navigation

Loading Patient Genomic Information (VCFs) into Knowledge Graphs

Justification

Visual Workflow

Dataset

Graph Construction

External Resources Used

Graph Data Format (RDF)

Installation Instructions

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

Packages