Querying Variant Data

Jacobo Coll Moragón edited this page Jun 9, 2016 · 33 revisions

Overview

The main goal of indexing variant data into OpenCGA Storage is to be able to query and extract this data in an efficient way. There are several ways to access the data (via CLI, RESTful, Java API, Python API, ...) and multiple implementations of the VariantStorageManager (OpenCGA Storage MongoDB, OpenCGA Storage Hadoop, ...).

All these layers and implementations use the same specification, defined in this document.

A small set of READ-ONLY methods is defined to provide all the required functionality.

  • Query: Returns all variants that match a given query.
  • Count: Counts the results of a given query.
  • GroupBy & Rank: Groups variants by some field and, optionally, creates a rank by number of variants.
  • Frequency: Groups variants by region and counts them. Useful to plot histograms.
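
As a rough illustration of what the Frequency method computes, the sketch below counts variant positions in fixed-width regions. The function name `frequency` and the `interval` parameter are hypothetical, chosen for this example only; this page does not define the actual parameters.

```python
from collections import Counter


def frequency(positions, interval):
    """Count positions per fixed-width region (hypothetical helper).

    Returns a sorted list of (region_start, count) pairs.
    """
    counts = Counter((pos // interval) * interval for pos in positions)
    return sorted(counts.items())


# Start positions of five made-up variants on one chromosome.
histogram = frequency([5, 120, 130, 999, 1000], interval=100)
print(histogram)  # [(0, 1), (100, 2), (900, 1), (1000, 1)]
```

Each pair can be plotted directly as one bar of a histogram.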

Query filters

A filter is a <key>, <value> pair, where the keys are predefined and the values are defined by the user, using a specific format. The next sections enumerate all these keys, explaining their effect and the required format of the value.

There are some general rules that are applied for every case:

  1. Returned variants must match every filter, except for positional filters: a variant only needs to match at least one positional filter (if any are given).

  2. When a filter accepts a list of values, they can be separated with:

  • Comma (,): defines an OR operation between the separated elements
  • Semicolon (;): defines an AND operation between the separated elements

So, the query chromosome: 2,3 will return all variants in chromosome 2 or chromosome 3, but chromosome: 2;3 will return an empty result, because no variant is in chromosomes 2 and 3 at the same time.
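
The separator rule can be sketched as follows. This is a minimal illustration and not the OpenCGA implementation: `parse_list_value` and `matches` are hypothetical helpers, and the sketch deliberately rejects values that mix both separators (as the genotype filter does), since that case needs key-specific handling.

```python
def parse_list_value(value):
    """Split a list-valued filter into (operator, elements)."""
    has_or = "," in value
    has_and = ";" in value
    if has_or and has_and:
        # Simplification of this sketch, not a rule from the specification.
        raise ValueError("mixing ',' and ';' is not handled in this sketch")
    if has_and:
        return "AND", [v.strip() for v in value.split(";")]
    return "OR", [v.strip() for v in value.split(",")]


def matches(field_values, value):
    """Check the set of values a variant has for one field against a filter."""
    operator, elements = parse_list_value(value)
    hits = [element in field_values for element in elements]
    return all(hits) if operator == "AND" else any(hits)


# A variant is located in exactly one chromosome.
print(matches({"2"}, "2,3"))  # True: chromosome 2 OR 3
print(matches({"2"}, "2;3"))  # False: chromosome 2 AND 3 is impossible
```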

General filters

These general filters match fields from the VCF input files.

| Key | Description | Format |
|---|---|---|
| ids | Matches the ID field | List of values |
| region | Matches the chromosome and start position | List of <chromosome>:<start>-<end> |
| chromosome | Matches the chromosome | List of values |
| type | Matches the type of the variant | List of values. Accepted values: [SNV, MNV, INDEL, SV, CNV] |
| reference | Matches the reference | List of values |
| alternate | Matches the alternate | List of values |
| studies | Matches variants that are in the specified studies | List of values. Accepts negations. |
| files | | |
| genotype | | {samp_1}:{gt_1}(,{gt_n}); e.g. HG0097:0/0;HG0098:0/1,1/1 |
| qual | [PENDING] | |
| filter | [PENDING] | |
| info | [PENDING] | |
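
To make the region and genotype value formats concrete, here is a small parsing sketch. The helper names (`Region`, `parse_regions`, `parse_genotype_filter`) are hypothetical and only illustrate the formats listed in the table; they are not part of any OpenCGA API.

```python
from typing import NamedTuple


class Region(NamedTuple):
    chromosome: str
    start: int
    end: int


def parse_region(text):
    """Parse one '<chromosome>:<start>-<end>' element."""
    chromosome, _, span = text.strip().partition(":")
    start, _, end = span.partition("-")
    return Region(chromosome, int(start), int(end))


def parse_regions(value):
    """Parse a comma-separated list of regions."""
    return [parse_region(part) for part in value.split(",")]


def parse_genotype_filter(value):
    """Map each sample to the list of genotypes accepted for it."""
    accepted = {}
    for part in value.split(";"):
        sample, _, genotypes = part.strip().partition(":")
        accepted[sample] = genotypes.split(",")
    return accepted


print(parse_regions("1:1000-2000,X:5-10"))
print(parse_genotype_filter("HG0097:0/0;HG0098:0/1,1/1"))
```

In the genotype example, the semicolon separates samples and the comma separates the genotypes accepted for a single sample.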

Modifiers:

| Key | Description | Format |
|---|---|---|
| limit | | |
| skip | | |
| sort | | |
| include | | |
| exclude | | |
| returnedStudies | | |
| returnedFiles | | |
| returnedSamples | | |
| unknownGenotype | | |

Statistics filters

Apart from the data provided in the files, there are some statistics calculated from the genotypes, or parsed from the INFO column if the input was an aggregated file.

These filters refer to the statistics of a specific study and cohort. Therefore, the format is the same for every filter: <study>:<cohort><comparator><value>, where the available comparators are: <, <=, >, >=, = and !=.
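
A value in this format can be split into its four parts with a regular expression. The sketch below is an assumption about how such parsing could look, not the actual OpenCGA parser; the study and cohort names (`study1`, `ALL`) and the helper names are made up for the example.

```python
import operator
import re

COMPARATORS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "=": operator.eq,
    "!=": operator.ne,
}

# Two-character comparators come first so '<=' is not read as '<'.
_PATTERN = re.compile(
    r"^(?P<study>[^:]+):(?P<cohort>[^<>=!]+)"
    r"(?P<comparator><=|>=|!=|<|>|=)(?P<value>.+)$"
)


def parse_stats_filter(text):
    """Return (study, cohort, comparator, value) for one statistics filter."""
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a valid statistics filter: {text!r}")
    return (match["study"], match["cohort"],
            match["comparator"], float(match["value"]))


def passes(stat_value, text):
    """Apply the filter's comparator to an already computed statistic."""
    *_, comparator, threshold = parse_stats_filter(text)
    return COMPARATORS[comparator](stat_value, threshold)


print(parse_stats_filter("study1:ALL<=0.05"))
print(passes(0.01, "study1:ALL<0.05"))
```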

| Key | Description |
|---|---|
| maf | Minor Allele Frequency |
| mgf | Minor Genotype Frequency |
| missingAlleles | Number of missing alleles |
| missingGenotypes | Number of missing genotypes |

Annotation filters

| Key | Description | Format | Example |
|---|---|---|---|
| gene | List of genes | List of values. Accepts negations. | |
| annotationExists | | true/false | |
| annot-ct | Consequence type | SO term list. | SO:0000045,SO:0000046 |
| annot-xref | External references | | |
| annot-biotype | List of biotypes | | |
| polyphen | PolyPhen, protein substitution score. | `[< > | |
| sift | SIFT, protein substitution score. | `[< > | |
| protein_substitution | Protein substitution score | `{protein_score}[< > | |
| conservation | Conservation score. PhyloP, phastCons or GERP. | `{conservation_score}[< > | |
| alternate_frequency | Alternate Population Frequency | `{study}:{population}[< > | |
| reference_frequency | Reference Population Frequency | `{study}:{population}[< > | |
| annot-population-maf | Population minor allele frequency | `{study}:{population}[< > | |
| annot-transcription-flags | List of transcript annotation flags | | CCDS, basic, cds_end_NF, mRNA_end_NF, cds_start_NF, mRNA_start_NF, seleno |
| annot-gene-trait-id | List of gene trait association ids | | umls:C0007222 , OMIM:269600 |
| annot-gene-trait-name | List of gene trait association names | | Cardiovascular Diseases |
| annot-hpo | List of HPO terms. | | HP:0000545 |
| annot-go | List of GO (Gene Ontology) terms. | | GO:0002020,GO:0006508 |
| annot-protein-keywords | List of protein variant annotation keywords | | |
| annot-drug | List of drug names | | |
| annot-functional-score | Functional score, like CADD | `{functional_score}[< > | |
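
Since every filter is a <key>, <value> pair, a query can be represented as a plain mapping. The sketch below shows one possible way to serialize such a mapping as URL query parameters using Python's standard library. It assumes nothing about the real REST endpoint, which is not described on this page, and the gene name is a made-up example value.

```python
from urllib.parse import urlencode

# Keys come from the tables above; all filters are combined with AND.
query = {
    "gene": "BRCA2",                      # hypothetical example value
    "annot-ct": "SO:0000045,SO:0000046",  # comma: either SO term matches
    "limit": 10,                          # modifier, not a filter
}

# Keep ':' and ',' readable, since the value formats rely on them.
query_string = urlencode(query, safe=":,")
print(query_string)  # gene=BRCA2&annot-ct=SO:0000045,SO:0000046&limit=10
```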

GroupBy and Rank

Histogram
