Explore Xarray as the basis for a genetic toolkit API #5

eric-czech · 2020-02-07T17:26:43Z

I'm not sure what advantages having labeled axes has for big call matrices, but xarray may make sense as a way to carry along the variant and sample metadata (as opposed to dask dfs). It will be worth a shot to see how useful that interface is.

eric-czech · 2020-04-12T11:26:05Z

Notable Xarray limitations found so far are:

- There is no support for SQL joins
  - It is possible to do a Hail-style "join", but this is really just index alignment (Self joins with non-unique indexes pydata/xarray#3791)
  - I'll have to check, but I think index alignment may only work for a univariate index and not a multiindex (Explicit indexes in xarray's data-model (Future of MultiIndex) pydata/xarray#1603)
    - This would mean we'd have to combine fields like chromosome and position before indexing+joining
- Coordinates need to fit in memory (Low memory/out-of-core index? pydata/xarray#1650)
  - Even if data variables represent things like, say, contig, position, and alleles (a common primary key for variants), running an index alignment on external data for, say, rsID will require conversion to coordinates
- Masking is not supported (Use masked arrays while preserving int pydata/xarray#1194). It is supported in Dask (dask/masked-arrays) but not in CuPy (Masked array analogue for CuPy cupy/cupy#2225)
- No map_overlap support (Implementing map_overlap pydata/xarray#3147)
- Chunk sizes for underlying dask arrays need to be uniform (Zarr Backend: check for non-uniform chunks is too strict pydata/xarray#2225)
  - This could be important for having chunks broken on contig, which would simplify methods like LD pruning
- Dataset/DataArray polymorphism is not supported (Subclassing Dataset and DataArray pydata/xarray#706)
  - The recommendation is to use accessors but these are attached globally to every single Dataset instance, which is awkward for trying to support multiple subclasses
- Neither nominal nor structural subtype safety (via static analysis) is possible (Extending Xarray for domain-specific toolkits pydata/xarray#3959)

Note: some of these were aggregated from https://github.com/related-sciences/rs-platform/issues/19#issuecomment-594211481
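
For the accessor recommendation mentioned in the list above, a minimal sketch of why the global attachment is awkward. The accessor name `genetics`, the `call` variable, and the use of `-1` as a missing-call sentinel are all assumptions for illustration only.

```python
import numpy as np
import xarray as xr


@xr.register_dataset_accessor("genetics")  # hypothetical accessor name
class GeneticsAccessor:
    def __init__(self, ds):
        self._ds = ds

    def call_rate(self):
        # Fraction of non-missing calls per variant; -1 is an assumed sentinel.
        return (self._ds["call"] >= 0).mean(dim="sample")


calls = np.array([[0, 1, -1], [2, 2, 1]])
ds = xr.Dataset({"call": (("variant", "sample"), calls)})
rate = ds.genetics.call_rate()

# The accessor is registered on the Dataset class itself, so it also shows up
# on datasets that have nothing to do with genetics.
unrelated = xr.Dataset({"temperature": ("time", np.arange(3.0))})
has_accessor = hasattr(unrelated, "genetics")
```

Here `has_accessor` is true even though `unrelated` has no `call` variable, so any validation has to happen at call time rather than through the type.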

eric-czech · 2020-06-02T16:22:05Z

There isn't much remaining doubt about Xarray. Updates on some of the limitations above:

- masked data: Whether we use masking or NA sentinels, this is a numpy problem so it needs to be solved at that level (so any PyData solution will have this issue)
- nominal/structural type safety: This also needs to be solved at the numpy level
- Dataset/DataArray subclassing: I couldn't make Xarray accessors useful in the way I wanted without monkey-patching, but they work well enough and there may be a cleaner way to do what I did
- map_overlap support: This is in Dask now at least, so it wouldn't be that hard to introduce in Xarray or otherwise work around
- uniform chunk sizes: While this is true when a dataset needs to be written as zarr, rechunking before write may not be as big of a performance issue as I thought (I've found that its effects are generally marginal compared to everything else)

The others are still legitimate limitations but don't affect core GWAS operations.
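
As one possible way to work around the missing map_overlap support noted above, the Dask array can be pulled out of a DataArray, processed with Dask's own `map_overlap`, and wrapped back up. This is only a sketch: the `window_sum` function and the `variant` dimension name are illustrative assumptions.

```python
import numpy as np
import dask.array as da
import xarray as xr


def window_sum(block):
    # Sum each element with its two neighbours along axis 0. np.roll wraps
    # around, but only the halo rows are affected and those get trimmed.
    return block + np.roll(block, 1, axis=0) + np.roll(block, -1, axis=0)


values = xr.DataArray(da.arange(6, chunks=3), dims="variant", name="values")

# Run the windowed function on the underlying Dask array with a halo of one
# element per chunk boundary, padding the outer edges with zeros.
overlapped = values.data.map_overlap(
    window_sum, depth=1, boundary=0, dtype=values.dtype
)

# Re-attach the dims and coords by copying the DataArray with the new data.
summed = values.copy(data=overlapped)
result = summed.values
```

The result has the same shape and chunking as the input, so the labels carried by the original DataArray still line up.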

eric-czech changed the title ~~Try xarray over Dask in large dataset benchmarks~~ Explore Xarray as the basis for a genetic toolkit API Apr 12, 2020

eric-czech mentioned this issue Apr 12, 2020

Define data structures that would accommodate general purpose genetic workflows #15

Closed

ihnorton mentioned this issue Apr 12, 2020

Implement axes labels and allow queries using these labels TileDB-Inc/TileDB#201

Closed

eric-czech mentioned this issue Apr 13, 2020

Build PyData prototype for GWAS analysis #20

Open

12 tasks

eric-czech added the pydata prototype label Apr 13, 2020

eric-czech closed this as completed Jun 2, 2020
