The code here annotates a batch of variants using the annotation resources available at a particular point in time. (By comparison, interactive annotation annotates variants on the fly with new annotation resources as they become available.)
This code uses Ensembl's Variant Effect Predictor (VEP; McLaren et al. 2016, doi:10.1186/s13059-016-0974-4) to annotate variants in a BigQuery table.
It is horizontally scalable due to the use of dsub. A separate instance of VEP is run by dsub for each shard of each of the files passed on the command line. VEP is also configured to run with as many threads as the number of cores on the virtual machine instantiated by dsub.
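The fan-out described above can be pictured with a minimal Python sketch. It is not the code in this repository: the function names `plan_vep_tasks` and `vep_thread_count`, the dictionary layout, and the example file names are all hypothetical. It only illustrates the shape of the work, with one task per shard of each input file and one VEP thread per core on the machine running the task.

```python
import os
from typing import Dict, List


def plan_vep_tasks(shards_by_file: Dict[str, int]) -> List[dict]:
    """Return one task description per shard of each input file.

    `shards_by_file` maps an input path to its number of shards.
    Both the mapping and the task layout are illustrative assumptions.
    """
    tasks = []
    for path, shard_count in shards_by_file.items():
        for shard_index in range(shard_count):
            tasks.append(
                {
                    "input": path,
                    "shard_index": shard_index,
                    "shard_count": shard_count,
                }
            )
    return tasks


def vep_thread_count() -> int:
    """Use every core on the current machine; fall back to 1 if unknown."""
    return os.cpu_count() or 1


if __name__ == "__main__":
    # Hypothetical inputs: two files split into 3 and 2 shards.
    tasks = plan_vep_tasks({"variants_a.vcf": 3, "variants_b.vcf": 2})
    print(f"{len(tasks)} VEP instances, {vep_thread_count()} threads each")
```

With the hypothetical inputs above, five independent VEP instances would be launched. The thread count would typically be handed to VEP through its `--fork` option, though whether this repository uses that flag is an assumption here.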
VEP can be configured in many ways and accepts a wide variety of annotation sources as input. This code illustrates one possible configuration and can be modified to accommodate others.
All steps are run in the cloud, but each individual step is launched manually.
The first step involves building the Docker container holding VEP and cached annotations for the desired build of the human genome reference.
A second container is built to curate and cache dbNSFP in Cloud Storage. dbNSFP is a large annotation resource, so it is kept out of the Docker container that includes VEP.
Follow the tutorial to build the tools needed to annotate variants aligned to build GRCh37 or GRCh38 of the human genome reference.
After the annotator has been built for the desired build of the human reference genome, it can be used to annotate variants from a single genome or a cohort of genomes in a BigQuery variant table.
Follow the tutorial to annotate Platinum Genomes variants called by DeepVariant and aligned to build GRCh38.