From 7a0bd95acd6288c8ef08756e5f950955c1675881 Mon Sep 17 00:00:00 2001 From: Manuel Holtgrewe Date: Wed, 28 Feb 2024 10:54:24 +0100 Subject: [PATCH] docs: initial dataflow description --- src/doc_architecture.rst | 101 ++++++++++++++-------- src/doc_dataflows.rst | 179 +++++++++++++++++++++++++++++++++++++++ src/index.rst | 1 + 3 files changed, 248 insertions(+), 33 deletions(-) create mode 100644 src/doc_dataflows.rst diff --git a/src/doc_architecture.rst b/src/doc_architecture.rst index 1038c81..c604867 100644 --- a/src/doc_architecture.rst +++ b/src/doc_architecture.rst @@ -22,50 +22,68 @@ Also, we use the Traefik reverse proxy for routing requests to the correct servi %% end %% key:::neutral - subgraph Services [Misc Services] - direction TB - - annonars[Annonars]:::owned - mehari[Mehari]:::owned - viguno[Viguno]:::owned - nginx[NGINX]:::owned - fs[(File System)]:::neutral - cada-prio[CADA-Prio]:::owned - cadd-rest-api[CADD REST]:::thirdParty - exomiser[Exomiser]:::thirdParty - redis[Redis]:::thirdParty - - annonars --> fs - mehari --> fs - viguno --> fs - nginx --> fs - cada-prio --> fs - cadd-rest-api --> fs - exomiser --> fs + subgraph X [.] 
+        subgraph BackingServices [Backing Services]
+            direction LR
+
+            annonars[Annonars]:::owned
+            mehari[Mehari]:::owned
+            viguno[Viguno]:::owned
+            nginx[NGINX]:::owned
+            fs[(File System)]:::neutral
+            cadaPrio[CADA-Prio]:::owned
+            caddRestApi[CADD REST]:::thirdParty
+            exomiser[Exomiser]:::thirdParty
+            redis[Redis]:::thirdParty
+
+            annonars --> fs
+            mehari --> fs
+            viguno --> fs
+            nginx --> fs
+            cadaPrio --> fs
+            caddRestApi --> fs
+            exomiser --> fs
+        end
+
+        subgraph RemoteServices [Remote Services]
+            direction LR
+
+            pubtator[PubTator 3]:::thirdParty
+            variantValidator[VariantValidator]:::thirdParty
+            ga4ghBeacon[GA4GH\nBeacon Network]:::thirdParty
+        end
+        RemoteServices:::thirdParty
     end
+    X:::transparent
 
     subgraph Core [Server Core]
         direction LR
 
-        varfish-server[VarFish Server]:::owned
-        varfish-celeryd[VarFish Celery]:::owned
+        varfishServer[VarFish Server]:::owned
+        varfishCeleryd[VarFish Celery]:::owned
         postgres[(Postgres)]:::thirdParty
 
-        varfish-server --> varfish-celeryd
-        varfish-server --> postgres
-        varfish-celeryd --> postgres
+        varfishServer --> varfishCeleryd
+        varfishServer --> postgres
+        varfishCeleryd --> postgres
     end
 
     user([end user])
 
-    Core -- "use via HTTP APIs" --> Services
-    user --> varfish-server
+    Core -- "use via HTTP APIs" --> BackingServices
+    Core -- "use via HTTP APIs" --> RemoteServices
+    user --> varfishServer
 
     classDef neutral fill:white
     classDef owned fill:#c5effc
     classDef thirdParty fill:#e9c5fc
+    classDef transparent fill:white,stroke:white,color:white
 
-user
-    The user uses their web browser to connect to the Varfish Server and interacts with the system.
+end user
+    The end user (data analyst) uses their web browser to connect to the VarFish Server and interacts with the system.
+
+operator user
+    The user operating a VarFish instance also interfaces with the system via the web user interface.
+    Certain actions, in particular importing data for later analysis, must be performed via REST APIs provided by the VarFish Server.
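The "use via HTTP APIs" edges above amount to a simple pattern: the server core fans out stateless HTTP requests to the independent backing services and merges the per-service responses. A minimal Python sketch of this fan-out pattern, with stub callables standing in for HTTP clients (the service names and response shapes are illustrative only, not the actual service APIs):

```python
from typing import Callable, Dict


def fan_out(services: Dict[str, Callable[[str], dict]], query: str) -> dict:
    """Send the same query to each stateless backing service and
    collect the per-service responses under the service's name."""
    results = {}
    for name, fetch in services.items():
        # Services are independent of each other; a real implementation
        # would issue these HTTP requests concurrently with timeouts.
        results[name] = fetch(query)
    return results


# Stub "services" standing in for HTTP clients (hypothetical responses).
services = {
    "annonars": lambda q: {"gene": q, "source": "annotation DB"},
    "viguno": lambda q: {"gene": q, "hpo_terms": []},
}
merged = fan_out(services, "BRCA1")
```

Because the services keep no session state, the core can retry or re-route any single request without coordination between services.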
-----------
Core System
@@ -94,11 +112,11 @@ Postgres
     Work is underway to move this to an internal object storage and run queries on this storage.
     This will allow for more optimized queries and scaling as the Postgres system will not be the single bottleneck anymore.
 
------------------------
-Misceallaneous Services
------------------------
+----------------
+Backing Services
+----------------
 
-There is a list of services that run in the background that the user does not interact with directly.
+There is a list of services that run in the background within the VarFish instance that the user does not interact with directly.
 They provide HTTP-based APIs to the core system and are stateless.
 There is no interaction between these services.
@@ -155,3 +173,20 @@ Redis
 .. note::
 
     With recent versions of the HPO, information content is not very useful for variant prioritization.
+
+---------------
+Remote Services
+---------------
+
+VarFish also provides access to certain remote services run by third parties.
+This reduces the complexity of hosting data locally and keeping it up to date, and is even necessary for some kinds of services.
+On the other hand, it makes the instance rely on the availability of these remote services.
+
+PubTator 3
+    VarFish uses the PubTator 3 API for providing relevant literature information for genes.
+
+VariantValidator
+    The VariantValidator.org service is used for providing gold-standard HGVS descriptions for seqvars.
+
+GA4GH Beacon Network
+    The GA4GH Beacon Network embeddable IFRAME is used to allow querying the GA4GH Beacon Network for variant information.
diff --git a/src/doc_dataflows.rst b/src/doc_dataflows.rst
new file mode 100644
index 0000000..63e5f46
--- /dev/null
+++ b/src/doc_dataflows.rst
@@ -0,0 +1,179 @@
+.. _doc_dataflows:
+
+=========
+Dataflows
+=========
+
+This section describes the dataflows in the VarFish system.
+We split the description into the following parts.
+
+1. 
:ref:`doc_dataflows_downloader` describes the dataflow for preparing the bulk background data used by the *Backing Services* from :ref:`doc_architecture`.
+2. :ref:`doc_dataflows_annotation` describes the annotation process that prepares variant VCF files for import into VarFish and the import itself.
+3. :ref:`doc_dataflows_query` describes how the VarFish Server handles queries.
+4. :ref:`doc_dataflows_periodic` describes the dataflows of the periodic background tasks.
+5. :ref:`doc_dataflows_interaction` describes the remaining dataflows driven by user interaction.
+
+.. _doc_dataflows_downloader:
+
+---------------------
+Bulk Data Preparation
+---------------------
+
+There are three parts to the bulk data preparation, depicted below.
+
+.. mermaid::
+
+    %%{init: {"flowchart": {"htmlLabels": false}} }%%
+    flowchart LR
+
+        publicData[Public\nData Sources]
+        varfishDbDownloader[Varfish DB\nDownloader]
+        s3Server[S3 Server]
+        deployedServer[Deployed\nInstance]
+        ncbiClinvar[NCBI ClinVar]
+        clinvarDataJsonl[clinvar-data-jsonl]
+        clinvarDataJsonlGithub[GitHub\nReleases]
+        ensembl[ENSEMBL]
+        refseq[RefSeq]
+        cdot[CDOT]
+        mehariDataTx[mehari-data-tx]
+        mehariDataTxGithub[GitHub\nReleases]
+
+        publicData --> varfishDbDownloader
+        varfishDbDownloader --> s3Server
+        s3Server --> deployedServer
+
+        ncbiClinvar --> clinvarDataJsonl
+        clinvarDataJsonl --> clinvarDataJsonlGithub
+        clinvarDataJsonlGithub --> deployedServer
+
+        ensembl --> cdot
+        refseq --> cdot
+        cdot --> mehariDataTx
+        mehariDataTx --> mehariDataTxGithub
+        mehariDataTxGithub --> deployedServer
+
+First, for most of the data, we use a Snakemake workflow (called ``varfish-db-downloader``) that downloads the necessary public-domain data from the internet.
+The workflow then processes the data and creates the bulk data files that can be used by the Backing Services.
+
+The workflow is executed manually by the VarFish team.
+The results are uploaded to our public S3 servers.
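+The deployment-side download step can be sketched as a small dry-run plan; the base URL and file names below are hypothetical placeholders, not the actual S3 layout used by the project:

```python
# Hypothetical downloader sketch; base URL and file names are
# placeholders, not the actual S3 layout.
BASE_URL = "https://example-s3.example.com/varfish-data"
FILES = ["annonars/annos.db", "mehari/freqs.db", "viguno/hpo.bin"]


def download_plan(base_url: str, files: list) -> list:
    """Return the URLs a real installer script would fetch
    (e.g., with urllib or curl), as a dry-run plan."""
    return ["{}/{}".format(base_url, f) for f in files]


plan = download_plan(BASE_URL, FILES)
for url in plan:
    print("would download:", url)
```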
+On deployment, the files are downloaded by downloader/installer scripts that the team provides.
+
+The workflow features a continuous integration test mode where file excerpts are used for smoke testing the functionality of the workflow.
+Further, the continuous integration checks the availability of the upstream files.
+Using a Snakemake workflow together with a conda environment for dependencies allows for reproducible data preparation.
+
+ClinVar data is prepared differently.
+Here, we have the software ``clinvar-this``, which is capable of converting ClinVar XML files into JSON lines (JSONL) format.
+These JSONL files can then be processed by the software packages also used in the Backing Services.
+The GitHub repository ``clinvar-data-jsonl`` hosts continuous integration that downloads the weekly ClinVar releases, uses ``clinvar-this`` to transform the XML files to JSONL, and finally publishes them as GitHub software releases.
+A third GitHub repository ``annonars-data-clinvar`` uses the output of ``clinvar-data-jsonl`` to prepare the per-gene aggregations and per-variant ClinVar files to be used by the Annonars Backing Service.
+These files are installed on deployment and can later be updated.
+
+Transcript data is also prepared differently.
+We use the output of the third-party CDOT project that provides RefSeq and ENSEMBL transcripts.
+The CI in the GitHub project ``mehari-data-tx`` downloads the transcripts from the CDOT releases and fetches the corresponding sequences from the NCBI and ENSEMBL servers.
+It then prepares the transcript data files for the genome releases with the Mehari software.
+The resulting files are then also published as GitHub software releases.
+As for the ClinVar files, these files are installed on deployment and can later be updated.
+
+.. 
_doc_dataflows_annotation:
+
+---------------------------
+Annotation Process / Import
+---------------------------
+
+Variant callers create variant call format (VCF) files that first must be annotated into tab-separated value (TSV) files before import into VarFish.
+For this, we use the Mehari software.
+Mehari uses population frequency and transcript data files generated by the :ref:`doc_dataflows_downloader` step that must be downloaded once.
+
+.. mermaid::
+
+    %%{init: {"flowchart": {"htmlLabels": false}} }%%
+    flowchart LR
+
+        freqTx[Frequency /\nTranscript Data]
+        vcf[Seqvar/Strucvar\nVCF Files]
+        mehariAnnotate[Mehari Annotate]
+        tsv[Annotated TSV File]
+        varfishCli[VarFish CLI]
+        varfishServer[VarFish Server]
+        postgres[(Postgres)]
+        importJob[ImportJob]
+
+        freqTx --> mehariAnnotate
+        vcf --> mehariAnnotate
+        mehariAnnotate --> tsv
+        tsv --> varfishCli
+        varfishCli --> varfishServer
+        varfishServer -- "(1) store data" --> postgres
+        varfishServer -- "(2) create job" --> importJob
+        postgres -- "(3) load data" --> importJob
+        importJob -- "(4) write final" --> postgres
+
+The VarFish operator user then uses Mehari to annotate and aggregate the sequence and the structural variant VCF files into one TSV file per variant type (seqvar/strucvar).
+These files are then uploaded via the VarFish Command Line Interface (CLI).
+
+The VarFish Server stores the uploaded data in the Postgres database and creates a background job for importing the data.
+When the import job is run, it performs certain data processing such as computing quality control metrics and fingerprinting the variant data to allow checking for family relationships.
+The resulting data is then stored in its final location in the Postgres database where it is available to the user.
+
+.. _doc_dataflows_query:
+
+----------------
+Query Processing
+----------------
+
+Query processing is straightforward and the same for seqvar and strucvar queries.
+
+.. 
mermaid::
+
+    %%{init: {"flowchart": {"htmlLabels": false}} }%%
+    flowchart LR
+
+        frontend[Frontend]
+        varfishServer[VarFish Server]
+        queryJob[Query Job]
+        postgres[(Postgres)]
+
+        frontend -- "(1.1) launch query" --> varfishServer
+        frontend -- "(1.2) poll for query state" --> varfishServer
+        varfishServer -- "(3) fetch results" --> frontend
+        varfishServer -- "create job" --> queryJob
+        queryJob -- "(2) execute query" --> postgres
+        postgres -- "(3) query results" --> queryJob
+        queryJob -- "(4) store result table" --> postgres
+        varfishServer -- "(1.2) check state" --> postgres
+        postgres -- "(3) fetch results" --> varfishServer
+
+The user creates a new query in the frontend provided by the VarFish Server.
+The server creates a query background job with the query specification for execution in the background.
+
+When the job is executed, it loads the query, generates a Postgres SQL query, and executes it.
+The resulting rows are inserted into the query results table for use by the user.
+
+The frontend polls the server for the state of the query.
+When the query is complete, the data is loaded into the frontend for interaction by the user.
+
+.. _doc_dataflows_periodic:
+
+-------------------------
+Periodic Background Tasks
+-------------------------
+
+There are a number of background tasks that work on the database.
+The most important maintenance task rebuilds the in-house background database.
+This is currently done by re-creating a materialized view in the Postgres database.
+
+.. _doc_dataflows_interaction:
+
+----------------
+User Interaction
+----------------
+
+Besides query processing, the user can interact in various ways.
+This interactive work leads to transactional/atomic updates in the database, e.g., by editing properties of a case or annotating case members with HPO terms.
+This is done with operations that appear blocking to the client and not in background tasks.
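+The blocking, transactional character of these updates can be sketched as follows; SQLite stands in for Postgres here, and the table and column names are hypothetical, not the real VarFish schema:

```python
import sqlite3

# Illustrative schema: a single table standing in for the VarFish
# case model (names are hypothetical, not the real schema).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE case_ (id INTEGER PRIMARY KEY, name TEXT)")
con.execute("INSERT INTO case_ (id, name) VALUES (1, 'family-1')")
con.commit()


def rename_case(con: sqlite3.Connection, case_id: int, new_name: str) -> None:
    """Blocking, atomic update as seen by the client: the rename
    either commits as a whole or not at all."""
    with con:  # opens a transaction; commits on success, rolls back on error
        con.execute(
            "UPDATE case_ SET name = ? WHERE id = ?", (new_name, case_id)
        )


rename_case(con, 1, "family-1-renamed")
```

The design choice is that such small edits run synchronously within the request, unlike queries and imports, which are handed off to background jobs.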
diff --git a/src/index.rst b/src/index.rst index 9d7aa44..7152f78 100644 --- a/src/index.rst +++ b/src/index.rst @@ -8,3 +8,4 @@ VarFish Development Docs :caption: Documents doc_architecture + doc_dataflows