docs: initial dataflow description #3

Merged 1 commit on Feb 28, 2024
101 changes: 68 additions & 33 deletions src/doc_architecture.rst
@@ -22,50 +22,68 @@ Also, we use the Traefik reverse proxy for routing requests to the correct service.
%% end
%% key:::neutral

subgraph Services [Misc Services]
direction TB

annonars[Annonars]:::owned
mehari[Mehari]:::owned
viguno[Viguno]:::owned
nginx[NGINX]:::owned
fs[(File System)]:::neutral
cada-prio[CADA-Prio]:::owned
cadd-rest-api[CADD REST]:::thirdParty
exomiser[Exomiser]:::thirdParty
redis[Redis]:::thirdParty

annonars --> fs
mehari --> fs
viguno --> fs
nginx --> fs
cada-prio --> fs
cadd-rest-api --> fs
exomiser --> fs
subgraph X [.]
subgraph BackingServices [Backing Services]
direction LR

annonars[Annonars]:::owned
mehari[Mehari]:::owned
viguno[Viguno]:::owned
nginx[NGINX]:::owned
fs[(File System)]:::neutral
cadaPrio[CADA-Prio]:::owned
caddRestApi[CADD REST]:::thirdParty
exomiser[Exomiser]:::thirdParty
redis[Redis]:::thirdParty

annonars --> fs
mehari --> fs
viguno --> fs
nginx --> fs
cadaPrio --> fs
caddRestApi --> fs
exomiser --> fs
end

subgraph RemoteServices [Remote Services]
direction LR

pubtator[PubTator 3]:::thirdParty
variantValidator[VariantValidator]:::thirdParty
ga4ghBeacon[GA4GH\nBeacon Network]:::thirdParty
end
RemoteServices:::thirdParty
end
X:::transparent

subgraph Core [Server Core]
direction LR

varfish-server[VarFish Server]:::owned
varfish-celeryd[VarFish Celery]:::owned
varfishServer[VarFish Server]:::owned
varfishCeleryd[VarFish Celery]:::owned
postgres[(Postgres)]:::thirdParty
varfish-server --> varfish-celeryd
varfish-server --> postgres
varfish-celeryd --> postgres
varfishServer --> varfishCeleryd
varfishServer --> postgres
varfishCeleryd --> postgres
end

user([end user])

Core -- "use via HTTP APIs" --> Services
user --> varfish-server
Core -- "use via HTTP APIs" --> BackingServices
Core -- "use via HTTP APIs" --> RemoteServices
user --> varfishServer

classDef neutral fill:white
classDef owned fill:#c5effc
classDef thirdParty fill:#e9c5fc
classDef transparent fill:white,stroke:white,color:white

user
The user uses their web browser to connect to the Varfish Server and interacts with the system.
end user
The end user (data analyst) uses their web browser to connect to the VarFish Server and interacts with the system.

operator user
The user operating a VarFish instance also interfaces with the system via the web user interface.
Certain actions must be performed via REST APIs provided by the VarFish Server, in particular importing data for later analysis.

-----------
Core System
@@ -94,11 +112,11 @@ Postgres
Work is underway to move this to an internal object storage and run queries on this storage.
This will allow for more optimized queries and scaling as the Postgres system will not be the single bottleneck anymore.

-----------------------
Misceallaneous Services
-----------------------
----------------
Backing Services
----------------

There is a list of services that run in the background that the user does not interact with directly.
There is a list of services that run in the background within the VarFish instance that the user does not interact with directly.
They provide HTTP-based APIs to the core system and are stateless.
There is no interaction between these services.
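
As a sketch of this pattern, the core system could call such a service over plain HTTP; in the minimal Python example below, the service host, port, and endpoint are illustrative assumptions only, not the actual Annonars API.

.. code-block:: python

    import json
    import urllib.request

    # Hypothetical host, port, and route; the actual Annonars API differs.
    ANNONARS_URL = "http://annonars:8080/annos/variant"

    def fetch_variant_annotation(chrom: str, pos: int, ref: str, alt: str) -> dict:
        """Fetch the annotation record for a single sequence variant."""
        query = f"?chromosome={chrom}&pos={pos}&reference={ref}&alternative={alt}"
        with urllib.request.urlopen(ANNONARS_URL + query) as response:
            return json.load(response)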

@@ -155,3 +173,20 @@ Redis
.. note::

With recent versions of the HPO, information content is not very useful for variant prioritization.

---------------
Remote Services
---------------

VarFish also provides access to certain remote services run by third parties.
This reduces the complexity of hosting locally and keeping data up to date, and is even necessary for some kinds of services.
On the other hand, it makes the instance rely on the availability of these remote services.

PubTator 3
VarFish uses the PubTator 3 API for providing relevant literature information for genes (see the sketch after this list).

VariantValidator
The VariantValidator.org service is used for providing gold standard HGVS descriptions for seqvars.

GA4GH Beacon Network
The embeddable GA4GH Beacon Network IFRAME is used to allow querying the GA4GH Beacon Network for variant information.
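
As an illustration of the PubTator 3 integration mentioned above, literature annotations can be fetched from the public export API; the endpoint path below is an assumption based on the public API documentation and may differ from what VarFish actually calls.

.. code-block:: python

    import urllib.request

    # Assumed PubTator 3 export endpoint; consult the NCBI documentation
    # for the authoritative routes and usage limits.
    PUBTATOR_URL = (
        "https://www.ncbi.nlm.nih.gov/research/pubtator3-api/"
        "publications/export/biocjson"
    )

    def fetch_literature(pmids: list[str]) -> str:
        """Fetch BioC-JSON literature annotations for the given PubMed IDs."""
        url = f"{PUBTATOR_URL}?pmids={','.join(pmids)}"
        with urllib.request.urlopen(url) as response:
            return response.read().decode()
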
179 changes: 179 additions & 0 deletions src/doc_dataflows.rst
@@ -0,0 +1,179 @@
.. _doc_dataflows:

=========
Dataflows
=========

This section describes the dataflows in the VarFish system.
We split the description into the following parts.

1. :ref:`doc_dataflows_downloader` describes the dataflow for preparing the bulk background data used by the *Backing Services* from :ref:`doc_architecture`.
2. :ref:`doc_dataflows_annotation` describes the annotation process that prepares variant VCF files for import into VarFish and the import itself.
3. :ref:`doc_dataflows_query` describes how the VarFish Server handles queries.
4. :ref:`doc_dataflows_periodic` describes the dataflows by the periodic background tasks.
5. :ref:`doc_dataflows_interaction` describes the remaining dataflows driven by user interaction.

.. _doc_dataflows_downloader:

---------------------
Bulk Data Preparation
---------------------

There are three parts to the bulk data preparation, depicted below.

.. mermaid::

%%{init: {"flowchart": {"htmlLabels": false}} }%%
flowchart LR

publicData[Public\nData Sources]
varfishDbDownloader[VarFish DB\nDownloader]
s3Server[S3 Server]
deployedServer[Deployed\nInstance]
ncbiClinvar[NCBI ClinVar]
clinvarDataJsonl[clinvar-data-jsonl]
clinvarDataJsonlGithub[GitHub\nReleases]
ensembl[ENSEMBL]
refseq[RefSeq]
cdot[CDOT]
mehariDataTx[mehari-data-tx]
mehariDataTxGithub[GitHub\nReleases]

publicData --> varfishDbDownloader
varfishDbDownloader --> s3Server
s3Server --> deployedServer

ncbiClinvar --> clinvarDataJsonl
clinvarDataJsonl --> clinvarDataJsonlGithub
clinvarDataJsonlGithub --> deployedServer

ensembl --> cdot
refseq --> cdot
cdot --> mehariDataTx
mehariDataTx --> mehariDataTxGithub
mehariDataTxGithub --> deployedServer

First, we use a Snakemake workflow (called ``varfish-db-downloader``) that downloads the necessary public domain data from the internet for most of the data sources.
The workflow then processes the data and creates the bulk data files that can be used by the Backing Services.

The workflow is executed manually by the VarFish team.
The results are uploaded to our public S3 servers.
On deployment, the files are downloaded by downloader/installer scripts that the team provides.

The workflow features a continuous integration test mode where file excerpts are used for smoke testing the functionality of the workflow.
Further, the continuous integration checks availability of the upstream files.
Using a Snakemake workflow together with a conda environment for dependencies allows for reproducible data preparation.
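
A minimal sketch of such an availability check follows; the URLs below are placeholders, and the real workflow tracks many more sources.

.. code-block:: python

    import urllib.request

    # Placeholder upstream URLs; the real workflow tracks many more sources.
    UPSTREAM_URLS = [
        "https://ftp.ncbi.nlm.nih.gov/pub/clinvar/README.txt",
        "https://ftp.ensembl.org/robots.txt",
    ]

    def check_upstream(urls: list[str]) -> list[str]:
        """Return the subset of URLs that do not answer an HTTP HEAD request."""
        failed = []
        for url in urls:
            request = urllib.request.Request(url, method="HEAD")
            try:
                urllib.request.urlopen(request, timeout=30)
            except OSError:
                failed.append(url)
        return failed

    if __name__ == "__main__":
        missing = check_upstream(UPSTREAM_URLS)
        if missing:
            raise SystemExit(f"upstream files unavailable: {missing}")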

ClinVar data is prepared differently.
Here, we have a software package ``clinvar-this`` that converts ClinVar XML files into JSON lines (JSONL) format.
These JSONL files can then be processed by the software packages also used in the Backing Services.
The GitHub repository ``clinvar-data-jsonl`` hosts continuous integration that downloads the weekly ClinVar releases, uses ``clinvar-this`` to transform the XML files to JSONL, and finally publishes them as GitHub software releases.
A third GitHub repository ``annonars-data-clinvar`` uses the output of ``clinvar-data-jsonl`` to prepare the per-gene aggregations and per-variant ClinVar files to be used by the Annonars Backing Service.
These files are installed on deployment and can later be updated.

Transcript data is also prepared differently.
We use the output of the third-party CDOT project that provides RefSeq and ENSEMBL transcripts.
The CI in the GitHub project ``mehari-data-tx`` downloads the transcripts from the CDOT releases and fetches the corresponding sequences from the NCBI and ENSEMBL servers.
It then prepares the transcript data files for the genome releases with the Mehari software.
The resulting files are then also published as GitHub software releases.
As for the ClinVar files, these files are installed on deployment and can later be updated.
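
To illustrate the sequence fetching step, a single RefSeq transcript can be retrieved through the NCBI E-utilities; the accession below is just an example, and the real CI fetches sequences in bulk.

.. code-block:: python

    import urllib.request

    # NCBI E-utilities efetch endpoint; the accession below is an example only.
    EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

    def fetch_transcript_fasta(accession: str) -> str:
        """Download the FASTA sequence for a single RefSeq transcript."""
        url = f"{EFETCH_URL}?db=nuccore&id={accession}&rettype=fasta&retmode=text"
        with urllib.request.urlopen(url) as response:
            return response.read().decode()

    print(fetch_transcript_fasta("NM_007294.4")[:80])  # BRCA1 transcript header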

.. _doc_dataflows_annotation:

---------------------------
Annotation Process / Import
---------------------------

Variant callers create variant call format (VCF) files that must first be annotated into tab-separated value (TSV) files before import into VarFish.
For this, we use the Mehari software.
Mehari uses population frequency and transcript data files generated by the :ref:`doc_dataflows_downloader` step that must be downloaded once.


.. mermaid::

%%{init: {"flowchart": {"htmlLabels": false}} }%%
flowchart LR

freqTx[Frequency /\nTranscript Data]
vcf[Seqvar/Strucvar\nVCF Files]
mehariAnnotate[Mehari Annotate]
tsv[Annotated TSV File]
varfishCli[VarFish CLI]
varfishServer[VarFish Server]
postgres[(Postgres)]
importJob[ImportJob]

freqTx --> mehariAnnotate
vcf --> mehariAnnotate
mehariAnnotate --> tsv
tsv --> varfishCli
varfishCli --> varfishServer
varfishServer -- "(1) store data" --> postgres
varfishServer -- "(2) create job" --> importJob
postgres -- "(3) load data" --> importJob
importJob -- "(4) write final" --> postgres


The VarFish operator user then uses Mehari to annotate and aggregate the sequence and structural variant VCF files into one TSV file per variant type (seqvar/strucvar).
These files are then uploaded via the VarFish Command Line Interface (CLI).

The VarFish Server stores the uploaded data in the Postgres database and creates a background job for importing the data.
When the import job is run, it will perform certain data processing such as computing quality control metrics and performing fingerprinting of the variant data to allow checking for family relationships.
The resulting data is then stored in the final location in the Postgres database where it is available to the user.
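
The following self-contained sketch mirrors this flow; every name in it (the ``ImportJob`` class, its fields, and the computed metrics) is a hypothetical stand-in, and the actual implementation lives in the VarFish Server codebase.

.. code-block:: python

    from dataclasses import dataclass, field

    @dataclass
    class ImportJob:
        staged_rows: list[dict]              # (1) data stored by the server
        results: dict = field(default_factory=dict)

        def run(self) -> None:
            rows = self.staged_rows          # (3) load the staged data
            # compute quality control metrics over the uploaded variants
            self.results["qc"] = {"num_variants": len(rows)}
            # fingerprint the variant data for later relationship checks
            self.results["fingerprint"] = hash(tuple(sorted(r["pos"] for r in rows)))
            # (4) write to the final location (represented here by a state flag)
            self.results["state"] = "done"

    job = ImportJob(staged_rows=[{"pos": 12345}, {"pos": 67890}])
    job.run()
    assert job.results["state"] == "done"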

.. _doc_dataflows_query:

----------------
Query Processing
----------------

Query processing is straightforward and the same for seqvar and strucvar queries.

.. mermaid::

%%{init: {"flowchart": {"htmlLabels": false}} }%%
flowchart LR

frontend[Frontend]
varfishServer[VarFish Server]
queryJob[Query Job]
postgres[(Postgres)]

frontend -- "(1.1) launch query" --> varfishServer
frontend -- "(1.2) poll for query state" --> varfishServer
varfishServer -- "(3) fetch results" --> frontend
varfishServer -- "create job" --> queryJob
queryJob -- "(2) execute query" --> postgres
postgres -- "(3) query results" --> queryJob
queryJob -- "(4) store result table" --> postgres
varfishServer -- "(1.2) check state" --> postgres
postgres -- "(3) fetch results" --> varfishServer

The user creates a new query in the frontend provided by the VarFish Server.
The server creates a query background job with the query specification for execution in the background.

When the job is executed, it loads the query, generates a Postgres SQL query and executes it.
The resulting rows are inserted into the query results table for use by the user.

The frontend polls the server for the state of the query.
When the query is complete, the data is loaded into the frontend for interaction by the user.
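
A minimal sketch of this polling protocol from the client side follows, with hypothetical API routes; the real VarFish Server exposes different endpoints.

.. code-block:: python

    import json
    import time
    import urllib.request

    # Hypothetical API routes for illustration only.
    BASE_URL = "https://varfish.example.com/api/queries"

    def _get_json(url: str) -> dict:
        with urllib.request.urlopen(url) as response:
            return json.load(response)

    def wait_for_results(query_id: str, poll_seconds: float = 2.0) -> dict:
        """(1.2) Poll the query state, then (3) fetch the result rows."""
        while True:
            state = _get_json(f"{BASE_URL}/{query_id}/state")["state"]
            if state == "done":
                return _get_json(f"{BASE_URL}/{query_id}/results")
            if state == "failed":
                raise RuntimeError(f"query {query_id} failed")
            time.sleep(poll_seconds)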

.. _doc_dataflows_periodic:

-------------------------
Periodic Background Tasks
-------------------------

There are a number of background tasks that work on the database.
The most important maintenance task rebuilds the in-house background database.
This is currently done by re-creating a materialized view in the Postgres database.
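
A minimal sketch of such a task, assuming a Celery worker setup within the Django project and a hypothetical materialized view name:

.. code-block:: python

    from celery import shared_task
    from django.db import connection

    @shared_task
    def rebuild_inhouse_background_db() -> None:
        """Re-create the in-house background database materialized view."""
        with connection.cursor() as cursor:
            # ``inhouse_background`` is a hypothetical view name.
            cursor.execute("REFRESH MATERIALIZED VIEW inhouse_background")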

.. _doc_dataflows_interaction:

----------------
User Interaction
----------------

Besides query processing, the user can interact in various ways.
This interactive work leads to transactional/atomic updates in the database, e.g., when editing properties of a case or annotating case members with HPO terms.
These operations appear blocking to the client and are not run as background tasks.
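
As a sketch of such a blocking, transactional update in Django (the ``cases`` app, the ``Case`` model, and its fields are hypothetical stand-ins):

.. code-block:: python

    from django.db import transaction

    def update_case_notes(case_pk: int, notes: str) -> None:
        """Atomically update a case; the request blocks until the commit."""
        from cases.models import Case  # hypothetical app and model

        with transaction.atomic():
            # lock the row so concurrent edits serialize cleanly
            case = Case.objects.select_for_update().get(pk=case_pk)
            case.notes = notes
            case.save()
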
1 change: 1 addition & 0 deletions src/index.rst
@@ -8,3 +8,4 @@ VarFish Development Docs
:caption: Documents

doc_architecture
doc_dataflows