Build PyData prototype for GWAS analysis #20

Open · 12 tasks
eric-czech opened this issue Apr 13, 2020 · 3 comments

@eric-czech (Collaborator) commented Apr 13, 2020

This is a tracking issue for the more specific issues involved in working toward a usable prototype.

Some things we should tackle for this are:

  • IO (PyData prototype IO #23)
    • Select IO libs to integrate and determine plugin system around them
  • Frontend (Explore Xarray as the basis for a genetic toolkit API #5)
    • Is Xarray really the right choice? Lack of support for out-of-core coords, uneven chunk sizes, and overlapping blockwise computations may become a huge hurdle.
  • Backend Dispatch (PyData prototype backend dispatching #24)
    • How do we dispatch to duck array backends and IO plugins?
  • Data Structures (Define Xarray data structures for PyData prototype #22)
    • We'll need to survey a reasonable portion of the space of possible input structures (a strawman layout is sketched after this list)
  • Methods
  • Simulation tools (PyData prototype simulation methods #31)
  • Testing (PyData prototype testing #21)
    • How can we build a framework for validation against external software, namely Hail? This will be very tedious without some abstraction
  • Indexing
    • Should users define indexes uniquely identifying variants/phenotypes, or should we manage this internally? (See the indexing sketch after this list.)
    • Supporting PheWAS, HLA association studies, and alignment-free GWAS are examples where it would be good to leave this up to the user
    • For comparison, internal Hail implementations hard-code checks on indexes being equal to ['locus', 'alleles'] -- I don't think we want this
  • Configuration
    • We should probably pin down a configuration framework early (it may be overkill, but configuration is always difficult to work in later)
    • Personally, I like the idea of making configuration objects live attributes with documentation, as Pandas does -- this makes inline lookups convenient -- though integrating that with a file-backed configuration will require some legwork (a minimal sketch follows this list)
  • Dask DevOps
  • Sub-byte Representations
    • It might not be too ridiculous to support some simpler (ideally early) QC operations on bitpacked int arrays
    • Doing the packing at the dask/numpy level would look like this (an example from Matt); a rough sketch of the same idea appears after this list
    • Alistair has some related thoughts in this post
  • Enrichment
    • How do we add and represent data along axes (e.g. variants/samples)? The approach taken in Hail/Glow is to attach method results as new fields along the axes, and this is a good fit for new Dataset variables (see the enrichment sketch after this list). But how will this work with multi-indexing? What happens if there are non-unique values? Is relying on Pandas indexing going to cause excessive memory overhead?
  • Limitations
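
As a strawman for the data-structures item, here is one hypothetical shape the core Dataset could take. All dimension and variable names (`variants`, `samples`, `call_genotype`, etc.) are illustrative, not a settled schema:

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)
n_variants, n_samples, ploidy = 1_000, 250, 2

# One candidate layout: genotype calls as 0/1 allele indexes, with
# variant/sample metadata attached as coordinates along each axis.
ds = xr.Dataset(
    {"call_genotype": (("variants", "samples", "ploidy"),
                       rng.integers(0, 2, (n_variants, n_samples, ploidy)))},
    coords={
        "variant_contig": ("variants", np.repeat("chr1", n_variants)),
        "variant_position": ("variants", np.arange(n_variants)),
        "sample_id": ("samples", [f"S{i}" for i in range(n_samples)]),
    },
)
```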
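
On indexing, a sketch of leaving the variant index up to the user rather than hard-coding it, reusing the hypothetical `ds` above:

```python
# The user decides what uniquely identifies a variant -- here a
# (contig, position) pair rather than a fixed ['locus', 'alleles'].
ds_indexed = ds.set_index(variants=["variant_contig", "variant_position"])

# Lookups now route through a pandas MultiIndex, which is convenient
# but materialized in memory (the overhead concern under Enrichment).
subset = ds_indexed.sel(variants=("chr1", 42))
```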
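
For configuration, a minimal sketch of pandas-style live, documented option attributes; the class names and the `backend` option are hypothetical:

```python
class Option:
    """A single configuration value with attached documentation."""
    def __init__(self, default, doc):
        self.value = default
        self.__doc__ = doc

class Config:
    """Namespace whose attributes are mutable, documented options."""
    def __init__(self):
        # Bypass the overridden __setattr__ for the backing dict itself
        super().__setattr__("_options", {
            "backend": Option("dask", "Default duck array backend."),
        })

    def __getattr__(self, name):
        return self._options[name].value

    def __setattr__(self, name, value):
        self._options[name].value = value

config = Config()
config.backend            # 'dask' -- convenient inline lookup
config.backend = "numpy"  # live mutation, pandas-options style
```

Wiring this up to a file-backed source would mean loading defaults into `_options` at import time, which is the legwork mentioned above.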
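
For sub-byte representations, a rough sketch of what packing at the dask/numpy level could look like. This is not the example from Matt linked above, just the same idea expressed with `np.packbits`/`np.unpackbits`:

```python
import numpy as np
import dask.array as da

# 0/1 biallelic calls; keep the samples axis in one chunk so each
# row packs independently.
calls = da.random.randint(0, 2, size=(1_000, 800), chunks=(500, 800)).astype("uint8")

# np.packbits collapses 8 entries along the packed axis into one byte,
# so each 800-call row shrinks to 100 bytes (an 8x memory saving).
packed = calls.map_blocks(np.packbits, axis=1, chunks=(500, 100), dtype="uint8")

# Unpack lazily, block by block, only when a method needs raw calls.
unpacked = packed.map_blocks(np.unpackbits, axis=1, chunks=(500, 800), dtype="uint8")
```

Simple QC like per-variant call counts could plausibly run on the packed bytes directly (e.g. via a popcount lookup table) without ever unpacking.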
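
And for enrichment, the Hail/Glow pattern of attaching a method result as a new field maps naturally onto Dataset assignment, again reusing the hypothetical `ds` from the first sketch:

```python
# Attach a per-variant allele frequency as a new variable along the
# variants axis, rather than returning it as a separate table.
allele_freq = ds["call_genotype"].mean(dim=("samples", "ploidy"))
ds = ds.assign(variant_allele_freq=allele_freq)
```

The open questions above (multi-indexing, non-unique values, index memory) only start to bite once `variants` carries the MultiIndex from the indexing sketch.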

@mrocklin

I saw this issue pointed to from dask/dask-blog#38. Some small comments:

> We need to know how to use Dask at scale

FYI I'm making a company around this question. Let me know if you want to chat or be beta testers for cloud deployment products.

> Figuring out what is going on with it in https://h2oai.github.io/db-benchmark/ would be a good exercise

They first load the entire dataset in RAM. Pandas doesn't store string data efficiently. As a result Dask is often spilling to disk during those benchmarks, which is why it's slow. We encouraged them to just include the time to read data from disk rather than starting from memory, but the maintainers of the benchmark said that that would be unfair.

Benchmarks are hard to do honestly.

@eric-czech (Collaborator, Author)

Hey Matt,

> Let me know if you want to chat or be beta testers for cloud deployment products.

Will do, but deployment isn't a big concern quite yet.

> They first load the entire dataset in RAM. Pandas doesn't store string data efficiently. As a result Dask is often spilling to disk during those benchmarks, which is why it's slow. We encouraged them to just include the time to read data from disk rather than starting from memory, but the maintainers of the benchmark said that that would be unfair.

Good to know! It will definitely be helpful to see how we could reach that conclusion with task stream monitoring. Performance with .persist() (I assume that's what they're doing, based on your description) isn't particularly interesting for us, so I'm not worried about the actual times so much as being a better user. Do you happen to know if there is a Dask performance report for what they did somewhere?
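
For anyone following along, a report like the one I mean can be generated with `dask.distributed.performance_report`. The dataset and query below are stand-ins, not the actual benchmark workload:

```python
import dask.datasets
from dask.distributed import Client, performance_report

if __name__ == "__main__":
    client = Client()  # local cluster; spilling shows up in the task stream
    df = dask.datasets.timeseries()  # stand-in for the benchmark data
    with performance_report(filename="dask-report.html"):
        df.groupby("name").x.mean().compute()
```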

@mrocklin commented Apr 28, 2020 via email
