
Define data structures that would accommodate general purpose genetic workflows #15

Closed
eric-czech opened this issue Mar 18, 2020 · 11 comments

Comments

@eric-czech
Collaborator

eric-czech commented Mar 18, 2020

It would be useful to document the capabilities of underlying data structures for existing genetic data file formats and processing platforms. This issue should serve as a place to collect findings on what existing data structures look like and what dimensions they support, without an emphasis on how they are implemented. At a minimum, this should cover:

  • Hail
  • scikit-allel
  • PLINK
  • BGEN

Some questions worth asking for these are:

  • What ploidy does it support, if any beyond 2?
  • How does it accommodate probabilities instead of hard calls?
  • Is phased data supported?
  • Are variants x samples always the primary dimensions? Do any formats/structures appeal to more generic notions than these that are still more specific than rows/cols?
  • Does it include time as a dimension, or anything else unusual beyond variants, samples, allele, and haplotype/genotype?
  • Are genomic annotations included or used to enable some more useful analysis?
  • What other WES/WGS considerations are made?
  • How are copy number variants and gene dosage accounted for?
  • Are other structural variants supported (beyond indels)?
@eric-czech
Collaborator Author

eric-czech commented Mar 18, 2020

BGEN

  • File Layout:
    • Offset from header block to data blocks
    • Header block
      • Number of variants
      • Number of samples
      • Free data (e.g. custom identifying file info)
      • Flags
        • Are sample IDs stored in the file (in a block that follows the header block)?
        • Is SNP data gzip compressed?
        • Which spec version is in use?
    • Sample ID block (optional)
    • Variant ID block
      • rsid, chromosome, position
      • number of alleles (k) (16-bit)
      • individual alleles (k of them)
    • Genotype data block (Layout 2)
      • Layout 1 is deprecated and limited to diploid calls
      • Supports up to 2^16 - 1 (65,535) alleles and ploidy up to 63 (phased or unphased)
      • All samples are compressed for a single variant when compression is enabled
        • This implies format is not splittable in sample dimension, only variant dimension
      • Ploidy can differ for each sample (why would this ever be necessary?)
        • This is represented by N bytes (N = num samples) where 6 bits indicate ploidy and one bit indicates missingness
      • All probabilities are represented as unsigned integers X, where the probability is X / max_int_value with max_int_value = 2^B - 1 for some bit depth B (see the decoding sketch after this list)
      • Phased
        • Probabilities are stored consecutively in the order of haplotypes (i.e. chromosome) then alleles
        • Only k-1 probabilities are stored for k alleles (kth allele proba = 1 - sum of others)
      • Unphased
        • Because haplotype order does not matter, probabilities are stored for each possible unordered genotype (i.e. each combination of alleles given the ploidy) rather than per haplotype
        • Again, one of these genotype probabilities can be omitted and recovered as 1 minus the sum of the others
  • Summary:
    • Data dimensions are variant, sample, chromosome and allele; limits:
      • variants and samples: up to 2^32 - 1 each (32-bit counts in the header)
      • chromosome (i.e. ploidy): up to 63 (6-bit field)
      • alleles: up to 2^16 - 1 (65,535; 16-bit count)
    • Phased calls and arbitrary ploidy are supported
    • Only probabilities are supported, but you could just code hard calls as one genotype/haplotype w/ probability 1
    • BGEN is only splittable along the variant dimension (all samples expected to fit in memory for one variant)
    • Ploidy can actually change per-sample
      • This may be something of an escape hatch for copy number variants (haven't found an example of it being used that way yet though)
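
To make the probability encoding above concrete, here is a minimal decoding sketch (not from the spec itself; the function name and example values are illustrative):

def decode_probabilities(stored_ints, bit_depth):
    # Each stored unsigned integer X encodes the probability X / (2^B - 1)
    max_int = 2 ** bit_depth - 1
    probs = [x / max_int for x in stored_ints]
    # The final probability is omitted in the file and recovered as 1 minus the sum
    probs.append(1.0 - sum(probs))
    return probs

# e.g. two stored 8-bit values for an unphased bi-allelic diploid call (AA, AB); BB is implied
print(decode_probabilities([229, 25], bit_depth=8))  # ~[0.898, 0.098, 0.004]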

@eric-czech
Collaborator Author

PLINK

  • Only bi-allelic variants and diploid organisms are supported
  • Does not support allele probabilities

    Since the PLINK 1 binary format cannot represent genotype probabilities, calls with uncertainty greater than 0.1 are currently treated as missing, and the rest are treated as hard calls. (This behavior can be changed with --hard-call-threshold.) We plan to remove this limitation in the future.

    • Probabilities are either used as sampling distributions for hard calls, or a call is treated as missing when its highest genotype probability is below 0.9 (i.e. uncertainty > 0.1, for bi-allelic data); see the sketch after this list
  • Dosages (i.e. expected allele counts) are also not supported
  • Copy number polymorphisms are supported but they are represented in separate gvar files
    • This can be used as part of regression models that combine SNP and CNV data with special terms for SNPs within copy number segments (see here -- not sure what the state of this is in newer PLINK versions but it seems mostly incomplete)
  • While haplotype phasing is possible with PLINK, it does not appear that .bed files actually carry this information (as Hail does)
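
A minimal sketch of the hard-call behaviour described above, assuming (variants, samples, 3) genotype probabilities for bi-allelic diploid data and the default 0.1 uncertainty threshold (the function is illustrative, not part of PLINK):

import numpy as np

def to_hard_calls(gp, threshold=0.1):
    # gp: (variants, samples, 3) genotype probabilities; returns hard calls, -1 for missing
    calls = gp.argmax(axis=-1)           # most probable genotype index
    uncertainty = 1.0 - gp.max(axis=-1)  # PLINK-style uncertainty
    return np.where(uncertainty > threshold, -1, calls)

gp = np.array([[[0.95, 0.04, 0.01], [0.50, 0.30, 0.20]]])
print(to_hard_calls(gp))  # [[0, -1]]: the second call is too uncertain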

@eric-czech
Collaborator Author

scikit-allel

  • Supports polyploid organisms, multi-allelic sites, phased calls, CNVs
  • Does not support genotype probabilities; no explicit mention of time or other structural variants is made in the docs
  • GenotypeArray
    • Calls are represented as 3D integer arrays with shape (variant, sample, ploidy), and values are the index of the allele present (see the usage sketch after this list)
    • "0 is the reference allele, 1 is the first alternate allele, 2 is the second alternate allele, … and -1 (or any other negative integer) is a missing allele call"
    • Note that the number of alleles per-site does not need to be the same
    • An is_phased array with shape (variant, sample) maintains boolean flags for each call indicating whether or not it is phased
    • Phased calls are not represented differently from unphased ones; the tacit assumption is that when is_phased == True the order of the alleles in the 3rd dimension has meaning
  • GenotypeAlleleCountsArray
    • Alternative representation of GenotypeArray that allows for variable effective ploidy (i.e. copy number variants)
    • Phasing information is lost in this representation
    • The shape is (variant, sample, allele) and each value corresponds to the number of occurrences of the corresponding allele
    • The number of alleles for each variant + sample can differ so the last dimension is as large as the maximum number of alleles across all variants
      • There is a max_allele function that returns the number of alleles as a 2D array w/ shape (variant, sample)
  • HaplotypeArray
    • These are a single slice along the 3rd dimension of a GenotypeArray (with a reduced api)
    • This is what is required for functions that operate on phased data where individual haploids are required (viz. Population Selection methods)
  • VariantTable
    • This is a table (np.recarray) that carries variant info like chromosome, position, read depths, quality scores, etc.
  • FeatureTable
    • Genomic features table (very generic, provides little beyond using np.recarrays directly)
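
A short usage sketch of the structures above, assuming scikit-allel is installed (the values are made up):

import allel

# (variant, sample, ploidy); -1 marks a missing allele call
gt = allel.GenotypeArray([[[0, 1], [1, 1]],
                          [[0, 0], [-1, -1]]])
print(gt.n_variants, gt.n_samples, gt.ploidy)  # 2 2 2

# Allele-count view with shape (variant, sample, allele); phasing is lost here
gac = gt.to_allele_counts()
print(gac.shape)  # (2, 2, 2)

# Haplotypes are slices along the ploidy dimension
hap = gt.to_haplotypes()
print(hap.shape)  # (2, 4)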

@eric-czech
Collaborator Author

eric-czech commented Mar 22, 2020

Hail

  • Supports polyploid organisms, multi-allelic sites, phased calls
  • Genotype probabilities are supported, though only for diploid datasets when imported from bgen (import_bgen)
    • Gene dosage can be calculated from this (gp_dosage)
    • Probabilities are stored as an array of genotype probabilities in the MatrixTable.GP property (this does not support phased probabilities)
  • Does not support large structural/nested variants
  • CNV data isn't mentioned in the docs and wouldn't be supported in the MatrixTable.GT entry field (as call expressions), but it may be possible to add custom entry fields that represent this
  • MatrixTable
    • Is conceptually a 2D call array w/ shape (variant, sample)
    • Variant and sample information is stored as 2D tables along both axes
    • Calls can be attached as properties, typically GT, and are represented as a CallExpression
      • Stores the allele index for each chromosome copy, making it identical in structure to the scikit-allel GenotypeArray when embedded in a MatrixTable
      • CallExpression also maintains a phased flag and a ploidy count (see the sketch after this list)
        • This makes it possible to mix calls w/ different phasing, ploidy, and allele count (it would make little sense to do so, but it is possible)
    • Entry fields containing data for any variant + sample combination (e.g. GT) allow the user to extend the data model to arbitrary new data types (e.g. custom structural variants)
  • Genomic annotations can be attached to variant info table along row axis
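
A hedged sketch of how the pieces above fit together (the bgen path is a placeholder, and import_bgen needs an index created with hl.index_bgen first):

import hail as hl

hl.init()
hl.index_bgen('data.bgen')  # placeholder path
mt = hl.import_bgen('data.bgen', entry_fields=['GT', 'GP'])

# MatrixTable is conceptually (variant, sample); GT and GP are entry fields
mt = mt.annotate_entries(
    dosage=hl.gp_dosage(mt.GP),  # expected alt allele count from the probabilities
    ploidy=mt.GT.ploidy,         # CallExpression carries ploidy...
    phased=mt.GT.phased,         # ...and a per-call phased flag
)
mt.entries().show(5)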

@hammer

hammer commented Mar 23, 2020

variable effective ploidy (i.e. copy number variants)

I'm a bit confused as to how copy number variants and ploidy are related. I would imagine coding the copy number as a distinct allele. For example, if at a certain locus across the population there are alleles with 1, 2, or 3 copies, then support for multiple alleles seems sufficient to capture this information. If these genomes are phased, then you can put a copy number on each copy of the chromosome, but the ploidy should not change, unless my understanding of the molecular biology of CNVs is off?

Also, on the support for multiple alleles: are there any explicit carve-outs for the MHC locus? If we want to use this software for infectious disease susceptibility analyses, for example, that locus becomes critical, and is obviously highly polymorphic so likely stresses whatever support is available for multiple alleles at a locus.

@eric-czech
Collaborator Author

eric-czech commented Mar 24, 2020

Good point. "Effective ploidy" is a term I first saw in the scikit-allel docs, and it does seem to be used elsewhere as synonymous with copy number variation (1 2). I agree that thinking of a variant on a single chromosome as having some kind of variable ploidy is weird though.

If I'm understanding the scikit-allel GenotypeAlleleCountsArray model correctly, I believe it implies that, for example, a diploid organism could have two different alleles that are each copied some different number of times. In your example, are you saying that for 2 alleles and each of 1, 2 or 3 copies there would be 6 possible alleles (i.e. each allele + copy number combination)?

My guess would be that PLINK and scikit-allel use entirely separate data structures for CNVs, since the analysis they run on them requires the count numbers in some matrix form anyhow, so they simply never need to find a way to represent them in a multi-allelic encoding like you mentioned.

Separately, this is probably worth a read at some point if it's not already on the paper queue: Whole-genome sequencing reveals high complexity of copy number variation at insecticide resistance loci in malaria mosquitoes

@eric-czech
Collaborator Author

I am imagining that the best starting point for a data structure is an Nd array with shape (variants, samples, alleles, ploidy) and with values equal to either dosages, probabilities, or copy numbers (all equal to 1 in most cases) -- whatever is appropriate for the analysis. Let me know if my understanding of copy number representation in the last comment doesn't match with yours.
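
For concreteness, a hedged sketch of that shape in Xarray (dimension names and values are placeholders rather than a settled convention):

import numpy as np
import xarray as xr

# (variants, samples, alleles, ploidy) with all values equal to 1, i.e. plain hard calls
calls = xr.DataArray(
    np.ones((3, 2, 2, 2), dtype='int8'),
    dims=('variants', 'samples', 'alleles', 'ploidy'),
    name='call',
)
print(calls.sizes)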

I'm also planning to start on a prototype that defines this and some appropriate degenerate cases, e.g. haplotypes for diploid, bi-allelic hard calls, as Xarray subclasses. My plan is for the prototype to work like this, when it comes to data structures:

# Load genotype data, possibly from PLINK/bgen
arr = xr.DataArray(...)
# Build subclass object (this is where all preconditions will be applied)
gt = api.GeneticDataArray(arr)

# Do something within the API, which will not operate on raw Xarray classes
gt = api.ld_prune(gt)

# Do something outside the API
def my_custom_analysis(gt: xr.DataArray) -> xr.DataArray:
    ...  # Do some stuff that only operates on xarray structures
    return gt

gt = my_custom_analysis(gt)

# Convert back to API structures if necessary by invoking the appropriate constructor
gt = api.GeneticDataArray(gt)

Any thoughts/concerns?

@eric-czech
Collaborator Author

eric-czech commented Mar 29, 2020

I've got what I think is a pretty solid start on this up now in this notebook.

The internals for it are in this folder.

@eric-czech
Collaborator Author

eric-czech commented Mar 30, 2020

I refactored this nearly from scratch into data_structures.ipynb (internals). This version is much better, with some notable differences being:

  • All data structures are now Dataset subclasses rather than DataArray
  • This includes models for phasing and missing data, since Xarray does not support masking (see the sketch after this list)
  • Dispatching was added to masked array backends (this is apparently not within the purview of __array_function__ protocol implementations)
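
As a minimal sketch of the missing-data model (the variable names are illustrative, not necessarily those used in the notebook), missingness and phasing can be carried as separate boolean variables in the Dataset since Xarray has no masked-array support:

import numpy as np
import xarray as xr

ds = xr.Dataset({
    'call': (('variants', 'samples', 'ploidy'),
             np.array([[[0, 1], [1, 1]]], dtype='int8')),
    'is_masked': (('variants', 'samples', 'ploidy'),
                  np.array([[[False, False], [True, True]]])),
    'is_phased': (('variants', 'samples'),
                  np.array([[True, False]])),
})
print(ds)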

@GianDevi

GianDevi commented Apr 6, 2020

I am working on the genome; my background is in computer science. Please advise which dataset, or type of dataset, to use with any tool like CRISPR or similar. Thanks in advance.

@eric-czech
Collaborator Author

Closing and adding any more updates related to xarray prototyping in #5
