From 171efc20dbb6ba5c8f21414dcbbb72343ac94eb0 Mon Sep 17 00:00:00 2001
From: Manuel Holtgrewe <manuel.holtgrewe@bih-charite.de>
Date: Wed, 28 Feb 2024 14:47:55 +0100
Subject: [PATCH] docs: adding future plans (#4)

---
 src/doc_architecture.rst |   3 +
 src/doc_futureplans.rst  | 390 +++++++++++++++++++++++++++++++++++++++
 src/index.rst            |   1 +
 3 files changed, 394 insertions(+)
 create mode 100644 src/doc_futureplans.rst

diff --git a/src/doc_architecture.rst b/src/doc_architecture.rst
index c604867..ed7de69 100644
--- a/src/doc_architecture.rst
+++ b/src/doc_architecture.rst
@@ -190,3 +190,6 @@ VariantValidator
 
 GA4GH Beacon Network
     The GA4GH Beacon Network embeddable IFRAME is used for alllowing to query the GA4GH Beacon Network for variant information.
+
+Genomics England PanelApp
+    We use the GE PanelApp API for fetching up-to-date gene panel information.
diff --git a/src/doc_futureplans.rst b/src/doc_futureplans.rst
new file mode 100644
index 0000000..20d3727
--- /dev/null
+++ b/src/doc_futureplans.rst
@@ -0,0 +1,390 @@
+.. _doc_futureplans:
+
+============
+Future Plans
+============
+
+This section contains a description of upcoming high-level changes to the VarFish software.
+The sections below have been extrapolated from the current issue list.
+
+
+.. _doc_futureplans_technical_debt:
+
+--------------
+Technical Debt
+--------------
+
+There is some technical debt; the most notable items are listed below.
+
+Automated Tests
+    We need more automated tests in various areas throughout the codebase.
+
+Python Type Annotations
+    We should modernize the codebase through comprehensive use of Python type annotations.
+    Notably, this will require type annotations for sodar-core.
+
+Backing Server Protobuf Migration
+    The backing services should expose JSON-serialized protobufs on their APIs.
+    This will allow for code generation of API clients.
+
+
+.. _doc_futureplans_grch38_migration:
+
+----------------
+GRCh38 Migration
+----------------
+
+We support instances with variants in GRCh37, GRCh38, or both genome build coordinates.
+However, there is no good upgrade path implemented yet.
+The plan here is to provide semi-automated ways to lift over user annotations from GRCh37 to GRCh38 and merely notify people about the cases where this fails.
+Variants will need reprocessing and reimport of the original data as we don't consider lift-over of variants to be reliable in the general case.
+
+
+.. _doc_futureplans_vuetify_migration:
+
+-----------------
+Vuetify Migration
+-----------------
+
+We need to finalize the migration to the Vuetify framework in the frontend.
+Further, we should have the VarFish-specific part of the site (outside of what is provided by sodar-core) become a true SPA without embedding into the HTML + Bootstrap CSS.
+
+
+.. _doc_futureplans_custom_query_presets:
+
+--------------------
+Custom Query Presets
+--------------------
+
+We already have some support for custom query presets.
+The seqvar query presets need fixing and extension.
+The strucvar query presets are missing large parts of their implementation (all of the editing functionality).
+
+
+.. _doc_futureplans_case_management:
+
+---------------
+Case Management
+---------------
+
+We can attach states such as "closed as solved", notes, and comments to cases.
+However, assigning responsible persons to cases and implementing real "workflows" on the case level, e.g., with approval of supervising physicians, is missing.
+
+
+.. _doc_clinlvar_uploads:
+
+---------------
+ClinVar Uploads
+---------------
+
+ClinVar uploads are currently missing.
+We already have the library to perform this in place, and a working implementation of the UI can be found in REEV.
+This needs careful planning and integration with :ref:`doc_futureplans_case_management`.
+
+
+.. _doc_futureplans_useability_improvements:
+
+-----------------------
+Useability Improvements
+-----------------------
+
+The VarFish user interface is useable.
+However, various improvements are needed for a better user experience.
+
+Faster Flagging
+    The aim here is to provide users with faster "time to flag as X" for variants.
+    This is particularly important for visual artifacts in IGV or variants that do not have a connected phenotype.
+    This includes both the decision making and access to setting flags.
+
+Better Details Access
+    The aim here is to provide users with the information that they need for assessing a variant faster.
+    This includes both gene and variant details.
+    Improvements can be made by choosing in a smarter way which information is displayed where.
+    Further improvements can be of technical nature (fewer clicks, faster load times).
+
+Smart Information Access
+    This applies to all relevant places such as the query result view, variant details, and gene details.
+    Again, the aim is to provide users with the information that they need for assessing a variant faster.
+
+Blinded Case Analysis
+    The aim here is to provide a blinded four-eyes principle for case analysis.
+
+Other Carriers
+    The aim here is to provide a quick way to see other cases of the same/similar variants (same genomic position, same amino acid, same gene).
+    Also, a visualization of the variation landscape in the gene in the database vs. ClinVar vs. gnomAD would be useful.
+
+Second Hit
+    In the case of recessive disorders, it should be faster to find a second hit.
+    The idea is that based on a suspicious pathogenic variant, the second hit can be easily found.
+    This could be a strucvar overlapping the same gene or another seqvar in the same gene that is harder to interpret.
+    E.g., a splicing, deep intronic, and/or UTR variant.
+
+Phased Variants
+    Haplotype-based variant callers provide phasing information, at least if one read can cover two variants.
+    We currently don't expose this information.
+
+
+.. _doc_futureplans_integrated_variants:
+
+---------------------------
+Integrated Variant Analysis
+---------------------------
+
+There currently is a strong separation between seqvar and strucvar analysis.
+We should implement several strategies for an integrated analysis, taking the case phenotype into consideration.
+
+
+.. _doc_futureplans_report_generation:
+
+-----------------
+Report Generation
+-----------------
+
+There are multiple aspects to report generation.
+This could consist of providing detailed pages for variants matching selected criteria (e.g., all flagged/considered variants).
+Alternatively, this could consist of automatically filling letter templates with variant information.
+
+
+.. _doc_futureplans_cohort_filter:
+
+----------------
+Cohort Filtering
+----------------
+
+A much-requested feature is performing queries on a cohort level.
+An earlier version of this worked but had performance problems.
+This needs to be re-tackled after the migration to the next-gen dataflows is complete.
+
+
+.. _doc_futureplans_ng_dataflows:
+
+------------------
+Next-Gen Dataflows
+------------------
+
+The classic location for variants in VarFish is the PostgreSQL database.
+This works quite well on fast bare-metal NVMe disk arrays but makes the database the single bottleneck.
+It is thus desirable to reengineer this part, and work on this is already underway.
+
+Instead, we will work with data in object storage via the S3 protocol.
+By default, VarFish instances will come with an embedded MinIO server for this purpose, but external servers can also be used.
+Users upload their case files to a location VarFish can access (e.g., S3, HTTPS, local file system) and VarFish is told the location and possibly the necessary credentials.
+For import, users only upload a Phenopackets YAML file with the case manifest.
+VarFish then imports the case in a background job.
+Only the essential files such as variant data (VCF) and QC files are actually read.
+Other files such as BAM files, coverage ``.wig`` files, etc. are registered in the database (this allows proxying to them and redisplaying as also mentioned in :ref:`doc_futureplans_genome_browsers`).
+
+VarFish then runs an ingest step that processes the raw caller VCF files and potentially merges VCF files from the same caller.
+The resulting ingested VCF files are then stored in the internal object storage.
+Further preprocessing can take place, e.g., prefiltering to certain variants such as near-exonic ones.
+QC data is imported into the database and potentially additional QC is computed.
+Filtration is also done directly on the VCF files from the internal S3 object storage.
+
+The data import is partially done in the server.
+We already have fast Rust-based executables for the variant ingest and query execution.
+There are unit tests for these components but no integration or system tests yet.
+Further, the integration in the server/frontend has not been started yet.
+
+The best way forward is to keep this "next-gen dataflow" in addition to the classic version.
+Cases imported in the new way get a tag "version=2", and the new code paths (yet to be implemented in some parts) will be used for them while the legacy code paths remain in place.
+
+
+
+.. _doc_futureplans_acmg_criteria:
+
+----------------
+ACMG Criteria UI
+----------------
+
+We currently have a working version of Richards et al. 2015 implemented.
+We need to bring this to the latest ACMG version, ideally both score- and rule-based with certain rule sets (e.g., ACGS, AMP).
+Further, we are completely lacking this for strucvars.
+For the latter, this strongly depends on :ref:`doc_futureplans_acmg_automation` as the rules are highly complex here.
+
+
+.. _doc_futureplans_acmg_automation:
+
+---------------
+ACMG Automation
+---------------
+
+We need to implement ACMG automation.
+We have a working implementation (not widely tested) for strucvars that is only missing PVS1 automation.
+Automation for seqvars is completely missing.
+
+
+.. _doc_futureplans_clingen_vcep:
+
+------------
+ClinGen VCEP
+------------
+
+There are a number of genes for which experts have developed complex rule sets.
+It would be very useful to have a "rule engine" (could just be some per-gene Python code maintained and deployed with VarFish server) that supports users in these well-known genes with complex rules.
+
+
+.. _doc_futureplans_additional_variant_types:
+
+------------------------
+Additional Variant Types
+------------------------
+
+We currently only support seqvars and strucvars.
+The following variant types are commonly called from NGS (short and long-read) data.
+
+Repeat Expansion
+    E.g., with ExpansionHunter from short-read data or directly from long-read data.
+
+ROH (Run of Homozygosity) / LOH (Loss of Heterozygosity)
+    Useful for computing scores such as autozygosity which provides insights into relationships and is useful for quality control.
+    ROH data is also often used for the identification of candidate regions.
+    It will be easy to implement a graphical tool for homozygosity mapping.
+
+SMA (Spinal Muscular Atrophy) Calling
+    Calling SMA mutations from NGS data is challenging; there are specialized callers for this, and such calls are included in DRAGEN output.
+    However, it is questionable how useful this is in a clinical setting as there are cheaper standard tests.
+
+CYP2D6 Caller
+    Similar to SMA calling, there are callers and one is included in DRAGEN output.
+    However, it is questionable how important this is.
+
+HLA Calling
+    HLA calling can be important in certain aspects and by now there are decent callers for NGS available.
+    Again, it is questionable how much demand there is for it.
+
+Methylation Calling
+    ONT sequencing provides methylation information.
+    Such information could also come from a matched methylation array.
+
+
+.. _doc_futureplans_long_reads:
+
+----------
+Long Reads
+----------
+
+We already have "long-read support" in the sense that we can import variants from such data.
+However, we will need to adjust rule sets and extend the builtin presets.
+As outlined in :ref:`doc_futureplans_additional_variant_types`, long-read data also provides methylation information.
+
+
+.. _doc_futureplans_rnaseq:
+
+-------
+RNA-Seq
+-------
+
+The integration of DNA variant data and RNA-seq expression data can be useful.
+However, there are not many proven cases for *ab initio* RNA-seq for gene prioritization.
+Maybe this is primarily useful for integrated analysis where RNA-seq is used for follow-up.
+
+
+.. _doc_futureplans_genome_browsers:
+
+---------------
+Genome Browsers
+---------------
+
+After implementing :ref:`doc_futureplans_ng_dataflows`, we also have information about the BAM files in external locations linked to from VarFish.
+We can then proxy HTTP requests to them via VarFish and generate IGV sessions or display them in integrated genome browsers such as IGV.js or alternatives.
+
+
+.. _doc_futureplans_local_pubtator:
+
+--------------
+Local PubTator
+--------------
+
+PubTator is very useful for semantic search of literature connected to a gene.
+The public API has a rate limit.
+It is open source and all data is available in monthly dumps.
+It might make sense to create a local mirror but this would increase the gap between publication and availability in VarFish to up to a month.
+An alternative would be to roll our own engine based on a full text search engine such as QuickWit and open source named entity recognition libraries and ingest the sub-daily releases of PubMed abstracts.
+
+
+.. _doc_futureplans_facial_gestalt:
+
+--------------------------
+Facial Gestalt Integration
+--------------------------
+
+Facial gestalt matching is a useful technique for variant prioritization.
+There is a prototype integration with GestaltMatcher from Bonn.
+This integration needs work to reach a production-ready state, but it could also be the starting point for plugin extension points in VarFish for the deep integration of further external tools.
+
+
+.. _doc_futureplans_somatic_variant_analysis:
+
+------------------------
+Somatic Variant Analysis
+------------------------
+
+Alternative tools such as cBioPortal are well suited for the analysis of cancer variant data, in particular in a cohort fashion.
+However, in certain cases, the analysis of cancer cases with VarFish could be useful.
+
+
+.. _doc_futureplans_beacon_networks:
+
+---------------
+Beacon Networks
+---------------
+
+There is some implementation of connecting two VarFish instances via the Beacon API.
+This could be explored further or removed.
+
+
+.. _doc_futureplans_reev_community:
+
+--------------
+REEV Community
+--------------
+
+We have implemented a public single-variant interpretation tool called REEV.
+VarFish instances could be connected together by registering variant annotations and comments there and thus sharing knowledge and connecting to other users.
+More features could be implemented to create "groups" in REEV, such that consortia could use it as a connecting component for their local VarFish instances.
+
+
+.. _doc_futureplans_pipeline_integrations:
+
+---------------------
+Pipeline Integrations
+---------------------
+
+We could implement a feature that allows for integrating data processing pipelines with VarFish.
+Users could register metadata together with their FASTQ files or even flow cell raw data.
+The pipelines could then be started to run mapping, variant calling, QC, etc.
+The results could then be imported into VarFish.
+VarFish would orchestrate the pipeline runs through existing external software.
+
+Potential existing pipelines include DRAGEN, ParaBricks, or custom Nextflow / Snakemake pipelines.
+
+
+.. _doc_futureplans_plugin_extension_points:
+
+-----------------------
+Plugin Extension Points
+-----------------------
+
+VarFish could serve as a platform for the integration of external tools.
+A working example is the Exomiser integration for variant prioritization, and an emerging one is the GestaltMatcher integration in :ref:`doc_futureplans_facial_gestalt`.
+Allowing further integration with other prediction tools or LIMS systems (Gepardo?) would give the vendors of such tools a way to integrate well with VarFish.
+
+
+.. _doc_futureplans_comprehensive_apis:
+
+-------------------
+Comprehensive APIs
+-------------------
+
+Current API support focuses on what the frontend needs and we don't have comprehensive APIs yet.
+Having such APIs would be very useful though, and enable using VarFish as a backend for other tools and platforms.
+
+
+.. _doc_futureplans_scriptable:
+
+------------------
+Scriptable VarFish
+------------------
+
+As the inverse of :ref:`doc_futureplans_comprehensive_apis`, we could offer scripting of the query engine.
+This would allow advanced users to implement comprehensive analysis directly in VarFish.
diff --git a/src/index.rst b/src/index.rst
index 7152f78..5547d3a 100644
--- a/src/index.rst
+++ b/src/index.rst
@@ -9,3 +9,4 @@ VarFish Development Docs
 
     doc_architecture
     doc_dataflows
+    doc_futureplans