.. _doc_dataflows:

=========
Dataflows
=========

This section describes the dataflows in the VarFish system.
We split the description into the following parts.

1. :ref:`doc_dataflows_downloader` describes the dataflow for preparing the bulk background data used by the *Backing Services* from :ref:`doc_architecture`.
2. :ref:`doc_dataflows_annotation` describes the annotation process that prepares variant VCF files for import into VarFish, as well as the import itself.
3. :ref:`doc_dataflows_query` describes how the VarFish Server handles queries.
4. :ref:`doc_dataflows_periodic` describes the dataflows of the periodic background tasks.
5. :ref:`doc_dataflows_interaction` describes the remaining dataflows resulting from user interaction.

.. _doc_dataflows_downloader:

---------------------
Bulk Data Preparation
---------------------

There are three parts to the bulk data preparation, depicted below.

.. mermaid::

    %%{init: {"flowchart": {"htmlLabels": false}} }%%
    flowchart LR

        publicData[Public\nData Sources]
        varfishDbDownloader[Varfish DB\nDownloader]
        s3Server[S3 Server]
        deployedServer[Deployed\nInstance]
        ncbiClinvar[NCBI ClinVar]
        clinvarDataJsonl[clinvar-data-jsonl]
        clinvarDataJsonlGithub[GitHub\nReleases]
        ensembl[ENSEMBL]
        refseq[RefSeq]
        cdot[CDOT]
        mehariDataTx[mehari-data-tx]
        mehariDataTxGithub[GitHub\nReleases]

        publicData --> varfishDbDownloader
        varfishDbDownloader --> s3Server
        s3Server --> deployedServer

        ncbiClinvar --> clinvarDataJsonl
        clinvarDataJsonl --> clinvarDataJsonlGithub
        clinvarDataJsonlGithub --> deployedServer

        ensembl --> cdot
        refseq --> cdot
        cdot --> mehariDataTx
        mehariDataTx --> mehariDataTxGithub
        mehariDataTxGithub --> deployedServer

First, for most of the data, we use a Snakemake workflow (called ``varfish-db-downloader``) that downloads the necessary public domain data from the internet.
The workflow then processes the data and creates the bulk data files that can be used by the Backing Services.

The workflow is executed manually by the VarFish team.
The results are uploaded to our public S3 servers.
On deployment, the files are downloaded by downloader/installer scripts that the team provides.

The workflow features a continuous integration test mode where file excerpts are used for smoke testing the functionality of the workflow.
Further, the continuous integration checks the availability of the upstream files.
Using a Snakemake workflow together with a conda environment for dependencies allows for reproducible data preparation.

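
Reproducible data preparation also depends on being able to verify the downloaded files.
A minimal sketch of checking a downloaded file against a pinned SHA-256 checksum (the function names are illustrative and not part of the actual workflow):

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in 64 KiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 16), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: Path, expected_sha256: str) -> bool:
    """Compare a downloaded bulk data file against its pinned checksum."""
    return sha256_of(path) == expected_sha256
```

Pinning checksums alongside the workflow definition is what makes a re-run of the download step verifiable.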
ClinVar data is prepared differently.
Here, we have a software package, ``clinvar-this``, that is capable of converting ClinVar XML files into JSON lines (JSONL) format.
These JSONL files can then be processed by the software packages also used in the Backing Services.
The GitHub repository ``clinvar-data-jsonl`` hosts continuous integration that downloads the weekly ClinVar releases, uses ``clinvar-this`` to transform the XML files to JSONL, and finally publishes them as GitHub software releases.
A third GitHub repository, ``annonars-data-clinvar``, uses the output of ``clinvar-data-jsonl`` to prepare the per-gene aggregations and per-variant ClinVar files used by the Annonars Backing Service.
These files are installed on deployment and can later be updated.

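
The XML-to-JSONL step can be illustrated with a minimal sketch; the record fields below are invented for illustration and do not reflect the actual ``clinvar-this`` data model, which is far richer:

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified ClinVar-like XML for illustration only.
XML = """
<ReleaseSet>
  <VariationArchive accession="VCV000001" name="c.100A>G"/>
  <VariationArchive accession="VCV000002" name="c.200C>T"/>
</ReleaseSet>
"""


def xml_to_jsonl(xml_text: str) -> str:
    """Convert each record element into one JSON object per line (JSONL)."""
    root = ET.fromstring(xml_text)
    lines = [
        json.dumps({"accession": el.get("accession"), "name": el.get("name")})
        for el in root.iter("VariationArchive")
    ]
    return "\n".join(lines)
```

The appeal of JSONL is that downstream tools can stream one record at a time instead of parsing a single multi-gigabyte XML document.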
Transcript data is also prepared differently.
We use the output of the third-party CDOT project that provides RefSeq and ENSEMBL transcripts.
The CI in the GitHub project ``mehari-data-tx`` downloads the transcripts from the CDOT releases and fetches the corresponding sequences from the NCBI and ENSEMBL servers.
It then prepares the transcript data files for the genome releases with the Mehari software.
The resulting files are then also published as GitHub software releases.
As with the ClinVar files, these files are installed on deployment and can later be updated.

.. _doc_dataflows_annotation:

---------------------------
Annotation Process / Import
---------------------------

Variant callers create variant call format (VCF) files that first must be annotated into tab-separated value (TSV) files before import into VarFish.
For this, we use the Mehari software.
Mehari uses population frequency and transcript data files generated by the :ref:`doc_dataflows_downloader` step; these must be downloaded once.

.. mermaid::

    %%{init: {"flowchart": {"htmlLabels": false}} }%%
    flowchart LR

        freqTx[Frequency /\nTranscript Data]
        vcf[Seqvar/Strucvar\nVCF Files]
        mehariAnnotate[Mehari Annotate]
        tsv[Annotated TSV File]
        varfishCli[VarFish CLI]
        varfishServer[VarFish Server]
        postgres[(Postgres)]
        importJob[ImportJob]

        freqTx --> mehariAnnotate
        vcf --> mehariAnnotate
        mehariAnnotate --> tsv
        tsv --> varfishCli
        varfishCli --> varfishServer
        varfishServer -- "(1) store data" --> postgres
        varfishServer -- "(2) create job" --> importJob
        postgres -- "(3) load data" --> importJob
        importJob -- "(4) write final" --> postgres

The VarFish operator then uses Mehari to annotate and aggregate the sequence and the structural variant VCF files into one TSV file per variant type (seqvar/strucvar).
These files are then uploaded via the VarFish Command Line Interface (CLI).

The VarFish Server stores the uploaded data in the Postgres database and creates a background job for importing the data.
When the import job runs, it performs additional data processing, such as computing quality control metrics and fingerprinting the variant data to allow checking for family relationships.
The resulting data is then stored in its final location in the Postgres database, where it is available to the user.

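
Steps (3) and (4) of the import flow can be sketched with an in-memory stand-in for the database; all names are hypothetical, and the real job computes far richer QC metrics than the toy call rate below:

```python
def run_import_job(staged_rows, final_table):
    """Sketch of the import background job: load the staged rows,
    derive a QC metric, then write the rows to their final location."""
    # (3) load the data that the server staged on upload
    rows = list(staged_rows)
    # toy QC metric: fraction of rows with a called genotype
    called = sum(1 for row in rows if row.get("genotype") != "./.")
    call_rate = called / len(rows) if rows else 0.0
    # (4) write the final rows, tagging each with the QC result
    for row in rows:
        final_table.append({**row, "qc_call_rate": call_rate})
    return call_rate
```

Running the heavy processing in a background job keeps the upload request itself fast; the client only waits for the staging write.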
.. _doc_dataflows_query:

----------------
Query Processing
----------------

Query processing is straightforward and the same for seqvar and strucvar queries.

.. mermaid::

    %%{init: {"flowchart": {"htmlLabels": false}} }%%
    flowchart LR

        frontend[Frontend]
        varfishServer[VarFish Server]
        queryJob[Query Job]
        postgres[(Postgres)]

        frontend -- "(1.1) launch query" --> varfishServer
        frontend -- "(1.2) poll for query state" --> varfishServer
        varfishServer -- "(3) fetch results" --> frontend
        varfishServer -- "create job" --> queryJob
        queryJob -- "(2) execute query" --> postgres
        postgres -- "(3) query results" --> queryJob
        queryJob -- "(4) store result table" --> postgres
        varfishServer -- "(1.2) check state" --> postgres
        postgres -- "(3) fetch results" --> varfishServer

The user creates a new query in the frontend provided by the VarFish Server.
The server creates a query background job with the query specification for execution in the background.

When the job is executed, it loads the query specification, generates a Postgres SQL query, and executes it.
The resulting rows are inserted into the query results table for use by the user.

The frontend polls the server for the state of the query.
When the query is complete, the data is loaded into the frontend for interaction by the user.

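
The poll-then-fetch behaviour of the frontend can be sketched as follows; the state names and the injected ``fetch_state`` callable are illustrative, not the actual frontend API:

```python
import time


def poll_until_terminal(fetch_state, interval=0.1, max_attempts=100):
    """Poll the server-reported query state until the background job
    reaches a terminal state ("done" or "failed"), then return it."""
    for _ in range(max_attempts):
        state = fetch_state()
        if state in ("done", "failed"):
            return state
        time.sleep(interval)
    raise TimeoutError("query did not reach a terminal state")
```

Only once the reported state is terminal and successful does the frontend fetch the stored result table from the server.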
.. _doc_dataflows_periodic:

-------------------------
Periodic Background Tasks
-------------------------

There are a number of background tasks that work on the database.
The most important maintenance task rebuilds the in-house background database.
This is currently done by re-creating a materialized view in the Postgres database.

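
The rebuild boils down to a single SQL statement.
A minimal sketch with the database access injected; the view name and the use of ``CONCURRENTLY`` are assumptions for illustration:

```python
def rebuild_inhouse_background(execute):
    """Rebuild the in-house background database by refreshing its
    materialized view; ``execute`` is a DB-API-style callable that
    takes an SQL string (e.g. a cursor's ``execute`` method)."""
    # CONCURRENTLY keeps the view readable while it is rebuilt
    execute("REFRESH MATERIALIZED VIEW CONCURRENTLY inhouse_background")
```

Injecting the ``execute`` callable keeps the task testable without a running Postgres instance.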
.. _doc_dataflows_interaction:

----------------
User Interaction
----------------

Besides query processing, the user can interact with the system in various ways.
This interactive work leads to transactional/atomic updates in the database, e.g., by editing properties of a case or annotating case members with HPO terms.
This is done with operations that appear blocking to the client, not in background tasks.
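
Such an edit can be sketched as a single atomic update; this is a generic sketch using Python's ``sqlite3`` standard library as a stand-in for Postgres, with invented table and column names:

```python
import sqlite3


def set_case_name(conn, case_id, new_name):
    """Atomically update a case property: the connection's transaction
    context manager commits on success and rolls back on exception."""
    with conn:
        conn.execute(
            "UPDATE cases SET name = ? WHERE id = ?", (new_name, case_id)
        )
```

Because the whole edit is one transaction, the client either sees the update applied in full or not at all.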
The document is registered in the toctree of the VarFish Development Docs index:

.. code-block:: rst

    .. toctree::
       :caption: Documents

       doc_architecture
       doc_dataflows