Explore Xarray as the basis for a genetic toolkit API #5

Closed
eric-czech opened this issue Feb 7, 2020 · 2 comments

Comments

@eric-czech
Collaborator

I'm not sure what advantages labeled axes have for big call matrices, but Xarray may make sense as a way to carry the variant and sample metadata along with the calls (as opposed to keeping it in Dask DataFrames). It's worth a shot to see how useful that interface is.
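As a rough sketch of what "carrying along the metadata" could look like, here is a toy Dataset with the call matrix as a data variable and the variant/sample metadata as coordinates on the same dimensions. The variable names (`call_genotype`, `variant_contig`, `variant_position`, `sample_id`) and the data are made up for illustration; they are not a proposed schema.

```python
import numpy as np
import xarray as xr

# Toy call matrix: 4 variants x 3 samples, values are alternate-allele counts.
calls = np.array(
    [[0, 1, 2],
     [1, 1, 0],
     [2, 0, 0],
     [0, 0, 1]],
    dtype=np.int8,
)

# Hypothetical layout: metadata lives in coordinates that share the
# "variants" / "samples" dimensions with the call matrix.
ds = xr.Dataset(
    data_vars={"call_genotype": (("variants", "samples"), calls)},
    coords={
        "variant_contig": ("variants", np.array(["1", "1", "2", "2"])),
        "variant_position": ("variants", np.array([100, 250, 40, 900])),
        "sample_id": ("samples", np.array(["S0", "S1", "S2"])),
    },
)

# Subsetting by variant metadata keeps calls and metadata aligned.
on_contig_2 = (ds.variant_contig == "2").values
chr2 = ds.isel(variants=on_contig_2)

# Reductions are expressed by dimension name rather than axis number.
alt_allele_freq = ds.call_genotype.mean(dim="samples") / 2
```

The main thing this buys over a separate metadata table is that one `isel` call subsets the calls and every coordinate on that dimension together.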

@eric-czech changed the title from "Try xarray over Dask in large dataset benchmarks" to "Explore Xarray as the basis for a genetic toolkit API" on Apr 12, 2020
@eric-czech
Collaborator Author

Notable Xarray limitations found so far are:

Note: some of these were aggregated from https://github.com/related-sciences/rs-platform/issues/19#issuecomment-594211481

@eric-czech
Collaborator Author

There isn't much remaining doubt about Xarray. Updates on some of the limitations above:

  • masked data: Whether we use masking or NA sentinels, this is a NumPy problem and needs to be solved at that level (so any PyData solution will have this issue)
  • nominal/structural type safety: This also needs to be solved at the NumPy level
  • Dataset/DataArray subclassing: I couldn't make Xarray accessors useful in the way I wanted without monkey-patching, but they work well enough and there may be a cleaner way to do what I did
  • map_overlap support: This is in Dask now, so it wouldn't be that hard to introduce in Xarray or otherwise work around
  • uniform chunk sizes: While this still holds when a dataset needs to be written as Zarr, rechunking before the write may not be as big a performance issue as I thought (I've found that its effects are generally marginal compared to everything else)
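To make the first and third points a bit more concrete, here is a minimal sketch of a registered Dataset accessor that handles missing calls with a sentinel value. The accessor name `genetics`, the `-1` sentinel, the `call_genotype` variable, and both methods are assumptions for illustration; this is not the code I actually used, and it does not involve the monkey-patching mentioned above.

```python
import numpy as np
import xarray as xr

MISSING = -1  # assumed sentinel for a missing call (integer arrays have no NaN)


@xr.register_dataset_accessor("genetics")  # hypothetical accessor name
class GeneticsAccessor:
    def __init__(self, ds):
        self._ds = ds

    def call_rate(self, dim="samples"):
        """Fraction of non-missing calls along `dim`."""
        present = self._ds["call_genotype"] != MISSING
        return present.mean(dim=dim)

    def mean_dosage(self, dim="samples"):
        """Mean allele count along `dim`, ignoring missing calls."""
        calls = self._ds["call_genotype"]
        # where() promotes to float and fills masked entries with NaN,
        # which mean() skips by default.
        return calls.where(calls != MISSING).mean(dim=dim)


ds = xr.Dataset(
    {
        "call_genotype": (
            ("variants", "samples"),
            np.array([[0, 1, -1], [2, -1, -1]], dtype=np.int8),
        )
    }
)

rate = ds.genetics.call_rate()
dosage = ds.genetics.mean_dosage()
```

Note that the sentinel has to be handled explicitly in every method, which is the NumPy-level problem described in the first bullet: nothing in the array type itself knows that `-1` means "missing".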

The others are still legitimate limitations but don't affect core GWAS operations.
