
Define data structures that would accommodate general purpose genetic workflows #15

Closed
eric-czech opened this issue Mar 18, 2020 · 11 comments

Comments

@eric-czech
Collaborator

eric-czech commented Mar 18, 2020

It would be useful to document the capabilities of underlying data structures for existing genetic data file formats and processing platforms. This issue should serve as a place to collect findings on what existing data structures look like and what dimensions they support, without an emphasis on how they are implemented. At a minimum, this should cover:

  • Hail
  • scikit-allel
  • PLINK
  • BGEN

Some questions worth asking for these are:

  • What ploidy does it support, if any beyond 2?
  • How does it accommodate probabilities instead of hard calls?
  • Is phased data supported?
  • Are variants x samples always the primary dimensions? Do any formats/structures appeal to more generic notions than these that are still more specific than rows/cols?
  • Does it include time as a dimension, or anything else unusual beyond variants, samples, allele, and haplotype/genotype?
  • Are genomic annotations included or used to enable some more useful analysis?
  • What other WES/WGS considerations are made?
  • How are copy number variants and gene dosage accounted for?
  • Are other structural variants supported (beyond indels)?
@eric-czech
Collaborator Author

eric-czech commented Mar 18, 2020

BGEN

  • File Layout:
    • Offset from header block to data blocks
    • Header block
      • Number of variants
      • Number of samples
      • Free data (e.g. custom identifying file info)
      • Flags
        • Are sample IDs stored in the file (in a block that follows the header block)?
        • Is SNP data gzip compressed?
        • Which spec version is in use?
    • Sample ID block (optional)
    • Variant ID block
      • rsid, chromosome, position
      • number of alleles (k) (16-bit)
      • individual alleles (k of them)
    • Genotype data block (Layout 2)
      • Layout 1 is deprecated and limited to diploid calls
      • Supports up to 2^16 - 1 (65,535) alleles and ploidy up to 63 (phased or unphased)
      • All samples are compressed for a single variant when compression is enabled
        • This implies format is not splittable in sample dimension, only variant dimension
      • Ploidy can differ for each sample (why would this ever be necessary?)
        • This is represented by N bytes (N = num samples) where 6 bits indicate ploidy and one bit indicates missingness
      • All probabilities are represented as unsigned integers X, where the probability is X / max_int_value with max_int_value = 2^B - 1 for some bit depth B (see the decoding sketch after this list)
      • Phased
        • Probabilities are stored consecutively in the order of haplotypes (i.e. chromosome) then alleles
        • Only k-1 probabilities are stored for k alleles (kth allele proba = 1 - sum of others)
      • Unphased
        • Because haplotype order does not matter, probabilities are stored for each possible unordered genotype (i.e. each combination of alleles given the ploidy) rather than per haplotype
        • Again, one of these genotype probabilities can be omitted and recovered as 1 minus the sum of the others
  • Summary:
    • Data dimensions are variant, sample, chromosome and allele; limits:
      • variants and samples: up to 2^32 - 1 each (32-bit counts in the header)
      • chromosome (i.e. ploidy): up to 63 (6-bit field)
      • alleles: up to 2^16 - 1 (65,535; 16-bit count)
    • Phased calls and arbitrary ploidy are supported
    • Only probabilities are supported, but you could just code hard calls as one genotype/haplotype w/ probability 1
    • BGEN is only splittable along the variant dimension (all samples expected to fit in memory for one variant)
    • Ploidy can actually change per-sample
      • This may be something of an escape hatch for copy number variants (haven't found an example of it being used that way yet though)
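
To make the probability encoding above concrete, here is a minimal decoding sketch (not from the spec itself; the function name and example values are illustrative):

def decode_probabilities(stored_ints, bit_depth):
    # Each stored unsigned integer X encodes the probability X / (2^B - 1)
    max_int = 2 ** bit_depth - 1
    probs = [x / max_int for x in stored_ints]
    # The final probability is omitted in the file and recovered as 1 minus the sum
    probs.append(1.0 - sum(probs))
    return probs

# e.g. two stored 8-bit values for an unphased bi-allelic diploid call (AA, AB); BB is implied
print(decode_probabilities([229, 25], bit_depth=8))  # ~[0.898, 0.098, 0.004]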

@eric-czech
Collaborator Author

PLINK

  • Only bi-allelic variants and diploid organisms are supported
  • Does not support allele probabilities

    Since the PLINK 1 binary format cannot represent genotype probabilities, calls with uncertainty greater than 0.1 are currently treated as missing, and the rest are treated as hard calls. (This behavior can be changed with --hard-call-threshold.) We plan to remove this limitation in the future.

    • Probabilities are either used as sampling distributions for hard calls, or a call is treated as missing when its highest genotype probability is below 0.9 (i.e. uncertainty > 0.1, for bi-allelic data); see the sketch after this list
  • Dosages (i.e. expected allele counts) are also not supported
  • Copy number polymorphisms are supported but they are represented in separate gvar files
    • This can be used as part of regression models that combine SNP and CNV data with special terms for SNPs within copy number segments (see here -- not sure what the state of this is in newer PLINK versions but it seems mostly incomplete)
  • While haplotype phasing is possible with PLINK, it does not appear that .bed files actually carry this information (as Hail does)
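
A minimal sketch of the hard-call behaviour described above, assuming (variants, samples, 3) genotype probabilities for bi-allelic diploid data and the default 0.1 uncertainty threshold (the function is illustrative, not part of PLINK):

import numpy as np

def to_hard_calls(gp, threshold=0.1):
    # gp: (variants, samples, 3) genotype probabilities; returns hard calls, -1 for missing
    calls = gp.argmax(axis=-1)           # most probable genotype index
    uncertainty = 1.0 - gp.max(axis=-1)  # PLINK-style uncertainty
    return np.where(uncertainty > threshold, -1, calls)

gp = np.array([[[0.95, 0.04, 0.01], [0.50, 0.30, 0.20]]])
print(to_hard_calls(gp))  # [[0, -1]]: the second call is too uncertain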

@eric-czech
Collaborator Author

scikit-allel

  • Supports polyploid organisms, multi-allelic sites, phased calls, CNVs
  • Does not support genotype probabilities; no explicit mention of time or other structural variants is made in the docs
  • GenotypeArray
    • Calls are represented as 3D integer arrays with shape (variant, sample, ploidy), and values are the index of the allele present (see the usage sketch after this list)
    • "0 is the reference allele, 1 is the first alternate allele, 2 is the second alternate allele, … and -1 (or any other negative integer) is a missing allele call"
    • Note that the number of alleles per-site does not need to be the same
    • An is_phased array with shape (variant, sample) maintains boolean flags for each call indicating whether or not it is phased
    • Phased calls are not represented differently from unphased ones; the tacit assumption is that when is_phased == True the order of the alleles in the 3rd dimension has meaning
  • GenotypeAlleleCountsArray
    • Alternative representation of GenotypeArray that allows for variable effective ploidy (i.e. copy number variants)
    • Phasing information is lost in this representation
    • The shape is (variant, sample, allele) and each value corresponds to the number of occurrences of the corresponding allele
    • The number of alleles for each variant + sample can differ so the last dimension is as large as the maximum number of alleles across all variants
      • There is a max_allele function that returns the number of alleles as a 2D array w/ shape (variant, sample)
  • HaplotypeArray
    • These are a single slice along the 3rd dimension of a GenotypeArray (with a reduced api)
    • This is what is required for functions that operate on phased data where individual haploids are required (viz. Population Selection methods)
  • VariantTable
    • This is a table (np.recarray) that carries variant info like chromosome, position, read depths, quality scores, etc.
  • FeatureTable
    • Genomic features table (very generic, provides little beyond using np.recarrays directly)
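
A short usage sketch of the structures above, assuming scikit-allel is installed (the values are made up):

import allel

# (variant, sample, ploidy); -1 marks a missing allele call
gt = allel.GenotypeArray([[[0, 1], [1, 1]],
                          [[0, 0], [-1, -1]]])
print(gt.n_variants, gt.n_samples, gt.ploidy)  # 2 2 2

# Allele-count view with shape (variant, sample, allele); phasing is lost here
gac = gt.to_allele_counts()
print(gac.shape)  # (2, 2, 2)

# Haplotypes are slices along the ploidy dimension
hap = gt.to_haplotypes()
print(hap.shape)  # (2, 4)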

@eric-czech
Collaborator Author

eric-czech commented Mar 22, 2020

Hail

  • Supports polyploid organisms, multi-allelic sites, phased calls
  • Genotype probabilities are supported, though only for diploid datasets when imported from bgen (import_bgen)
    • Gene dosage can be calculated from this (gp_dosage)
    • Probabilities are stored as an array of genotype probabilities in the MatrixTable.GP property (this does not support phased probabilities)
  • Does not support large structural/nested variants
  • CNV data isn't mentioned in the docs and wouldn't be supported in the MatrixTable.GT entry field (as call expressions), but it may be possible to add custom entry fields that represent this
  • MatrixTable
    • Is conceptually a 2D call array w/ shape (variant, sample)
    • Variant and sample information is stored as 2D tables along both axes
    • Calls can be attached as properties, typically GT, and are represented as a CallExpression
      • Stores the allele index for each chromosome copy, making it identical in structure to the scikit-allel GenotypeArray when embedded in a MatrixTable
      • CallExpression also maintains a phased flag and a ploidy count (see the sketch after this list)
        • This makes it possible to mix calls w/ different phasing, ploidy, and allele count (it would make little sense to do so, but it is possible)
    • Entry fields containing data for any variant + sample combination (e.g. GT) allow the user to extend the data model to arbitrary new data types (e.g. custom structural variants)
  • Genomic annotations can be attached to variant info table along row axis
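
A hedged sketch of how the pieces above fit together (the bgen path is a placeholder, and import_bgen needs an index created with hl.index_bgen first):

import hail as hl

hl.init()
hl.index_bgen('data.bgen')  # placeholder path
mt = hl.import_bgen('data.bgen', entry_fields=['GT', 'GP'])

# MatrixTable is conceptually (variant, sample); GT and GP are entry fields
mt = mt.annotate_entries(
    dosage=hl.gp_dosage(mt.GP),  # expected alt allele count from the probabilities
    ploidy=mt.GT.ploidy,         # CallExpression carries ploidy...
    phased=mt.GT.phased,         # ...and a per-call phased flag
)
mt.entries().show(5)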

@hammer

hammer commented Mar 23, 2020

variable effective ploidy (i.e. copy number variants)

I'm a bit confused as to how copy number variants and ploidy are related. I would imagine coding the copy number as a distinct allele. For example, if at a certain locus across the population there are alleles with 1, 2, or 3 copies, then support for multiple alleles seems sufficient to capture this information. If these genomes are phased, then you can put a copy number on each copy of the chromosome, but the ploidy should not change, unless my understanding of the molecular biology of CNVs is off?

Also, on the support for multiple alleles: are there any explicit carve-outs for the MHC locus? If we want to use this software for infectious disease susceptibility analyses, for example, that locus becomes critical, and is obviously highly polymorphic so likely stresses whatever support is available for multiple alleles at a locus.

@eric-czech
Collaborator Author

eric-czech commented Mar 24, 2020

Good point. "Effective ploidy" is a term I first saw in the scikit-allel docs, and it does seem to be used elsewhere as synonymous with copy number variation (1 2). I agree that thinking of a variant on a single chromosome as having some kind of variable ploidy is weird though.

If I'm understanding the scikit-allel GenotypeAlleleCountsArray model correctly, I believe it implies that, for example, a diploid organism could have two different alleles that are each copied some different number of times. In your example, are you saying that for 2 alleles and each of 1, 2 or 3 copies there would be 6 possible alleles (i.e. each allele + copy number combination)?

My guess would be that PLINK and scikit-allel use entirely separate data structures for CNVs, since the analysis they run on them requires the count numbers in some matrix form anyhow, so they simply never need to find a way to represent them in a multi-allelic encoding like you mentioned.

Separately, this is probably worth a read at some point if it's not already on the paper queue: Whole-genome sequencing reveals high complexity of copy number variation at insecticide resistance loci in malaria mosquitoes

@eric-czech
Collaborator Author

I am imagining that the best starting point for a data structure is an Nd array with shape (variants, samples, alleles, ploidy) and with values equal to either dosages, probabilities, or copy numbers (all equal to 1 in most cases) -- whatever is appropriate for the analysis. Let me know if my understanding of copy number representation in the last comment doesn't match with yours.
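
For concreteness, a hedged sketch of that shape in Xarray (dimension names and values are placeholders rather than a settled convention):

import numpy as np
import xarray as xr

# (variants, samples, alleles, ploidy) with all values equal to 1, i.e. plain hard calls
calls = xr.DataArray(
    np.ones((3, 2, 2, 2), dtype='int8'),
    dims=('variants', 'samples', 'alleles', 'ploidy'),
    name='call',
)
print(calls.sizes)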

I'm also planning to start on a prototype that defines this and some appropriate degenerate cases, e.g. haplotypes for diploid, bi-allelic hard calls, as Xarray subclasses. My plan is for the prototype to work like this, when it comes to data structures:

# Load genotype data, possibly from PLINK/bgen
arr = xr.DataArray(...)
# Build subclass object (this is where all preconditions will be applied)
gt = api.GeneticDataArray(arr)

# Do something within the API, which will not operate on raw Xarray classes
gt = api.ld_prune(gt)

# Do something outside the API
def my_custom_analysis(gt: xr.DataArray) -> xr.DataArray:
    ...  # Do some stuff that only operates on xarray structures
    return gt

gt = my_custom_analysis(gt)

# Convert back to API structures if necessary by invoking the appropriate constructor
gt = api.GeneticDataArray(gt)

Any thoughts/concerns?

@eric-czech
Collaborator Author

eric-czech commented Mar 29, 2020

I've got what I think is a pretty solid start on this up now in this notebook.

The internals for it are in this folder.

@eric-czech
Collaborator Author

eric-czech commented Mar 30, 2020

I refactored this nearly from scratch into data_structures.ipynb (internals). This version is much better, with some notable differences being:

  • All data structures are now Dataset subclasses rather than DataArray
  • This includes models for phasing and missing data, since Xarray does not support masking (see the sketch after this list)
  • Dispatching was added to masked array backends (this is apparently not within the purview of __array_function__ protocol implementations)
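
As a minimal sketch of the missing-data model (the variable names are illustrative, not necessarily those used in the notebook), missingness and phasing can be carried as separate boolean variables in the Dataset since Xarray has no masked-array support:

import numpy as np
import xarray as xr

ds = xr.Dataset({
    'call': (('variants', 'samples', 'ploidy'),
             np.array([[[0, 1], [1, 1]]], dtype='int8')),
    'is_masked': (('variants', 'samples', 'ploidy'),
                  np.array([[[False, False], [True, True]]])),
    'is_phased': (('variants', 'samples'),
                  np.array([[True, False]])),
})
print(ds)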

@GianDevi

GianDevi commented Apr 6, 2020

I am working on the genome; my background is in computer science. Please advise which dataset, or type of dataset, to use with any tool like CRISPR or similar. Thanks in advance.

@eric-czech
Collaborator Author

Closing and adding any more updates related to xarray prototyping in #5
