PyData prototype simulation methods #31
Comments
Along these lines, G2P (from https://pubmed.ncbi.nlm.nih.gov/30848784/) is probably the most comprehensive tool I've seen so far for generating both genotypes and phenotypes with a good bit of configurability. PhenotypeSimulator (Mayer & Birney 2018) is another decent one for layering in relationships between given genotypes and generated phenotypes. What it supports is a good outline for what a simulator for association testing should do:
The examples are pretty good too, e.g.:
A few other phenotype-only tools:
Another one, bnpsd, combines the Balding-Nichols (BN) model with the Pritchard-Stephens-Donnelly (PSD) model to simulate admixture across populations, which produces more realistic kinship matrices.
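For reference, the BN/PSD idea can be sketched in a few lines of NumPy. This is only a minimal sketch and not the bnpsd implementation: the function name `simulate_bn_psd`, its parameters, the uniform range for ancestral allele frequencies, and the Dirichlet draw for admixture proportions are all assumptions made for illustration. The layout is variants by samples.

```python
import numpy as np


def simulate_bn_psd(n_variants, n_samples, fst, seed=0):
    """Hypothetical helper: draw genotypes under a simple BN/PSD model.

    fst holds one F_ST value per intermediate subpopulation, each in (0, 1).
    Returns (genotypes, admixture) with shapes
    (n_variants, n_samples) and (n_samples, n_subpops).
    """
    rng = np.random.default_rng(seed)
    fst = np.asarray(fst, dtype=float)
    n_subpops = fst.shape[0]

    # Ancestral allele frequencies (the range is an assumption).
    p_anc = rng.uniform(0.05, 0.5, size=n_variants)

    # Balding-Nichols: subpopulation frequencies are Beta distributed
    # around the ancestral frequency, with spread controlled by F_ST.
    scale = (1.0 - fst) / fst
    a = p_anc[:, None] * scale[None, :]
    b = (1.0 - p_anc[:, None]) * scale[None, :]
    p_sub = rng.beta(a, b)  # (n_variants, n_subpops)

    # PSD: each individual is a mixture of the subpopulations.
    # A flat Dirichlet is an assumption made here for brevity.
    admixture = rng.dirichlet(np.ones(n_subpops), size=n_samples)

    # Individual-specific allele frequencies, clipped for numerical safety.
    p_ind = np.clip(p_sub @ admixture.T, 0.0, 1.0)

    # Diploid genotypes: two Bernoulli draws per individual per variant.
    genotypes = rng.binomial(2, p_ind)
    return genotypes, admixture


genotypes, admixture = simulate_bn_psd(100, 20, fst=[0.1, 0.2, 0.3])
```

Raising the F_ST values increases differentiation between subpopulations, which is the kind of tunable complexity that would be useful for tests.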
Note: If we do include simulation functions, they should definitely support synthetic missingness. That was a big hang-up I had in using simulated data to better understand how Hail works.
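As a rough illustration of what synthetic missingness could look like, here is a minimal sketch that masks calls completely at random. The function name `add_missingness`, the `-1` sentinel for missing calls, and the uniform per-call missing rate are assumptions, not an existing API.

```python
import numpy as np


def add_missingness(genotypes, missing_rate, seed=0, missing_value=-1):
    """Hypothetical helper: mask genotype calls completely at random.

    Returns a masked copy of the input and the boolean mask that was applied.
    """
    if not 0.0 <= missing_rate <= 1.0:
        raise ValueError("missing_rate must be between 0 and 1")
    rng = np.random.default_rng(seed)

    # Work on a copy so the caller's array is left untouched.
    out = np.array(genotypes, copy=True)

    # Each call is dropped independently with probability missing_rate.
    mask = rng.random(out.shape) < missing_rate
    out[mask] = missing_value
    return out, mask


calls = np.array([[0, 1, 2], [2, 1, 0]])
masked, mask = add_missingness(calls, missing_rate=0.3)
```

A fuller version would probably also allow per-sample or per-variant rates, since real missingness is rarely uniform.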
Add this implementation of BN/PSD in dask at some point: dask/dask#6227 (comment)
We should start thinking about how to simulate data as part of a public API. PLINK and Hail support this, and I think we should consider it now because it will be an important part of improving unit testing. I was chatting with @ravwojdyla and we're both at about the same place in our testing: our test cases are fairly simplistic right now, and we would both benefit from synthetic data representing a single dimension of genetic structure, likely with some tunable level of complexity. Essentially we need a better version of Hypothesis, and while we're at it, why not make it part of the API?
Some examples:
make_betas and simulate_phenotypes would be useful for validating the LMM and multi-trait models we work on.
It may be that most users don't care about simulators that aren't representative of comprehensive genetic structure (e.g. hapgen), but I think being explicit about our simulations would improve understanding of the methods and that this should be something we coordinate on regardless, rather than making private versions for test cases on a per-method basis. This would also make it much easier to demonstrate what a method does without always having to appeal to real datasets.
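To make the make_betas and simulate_phenotypes idea concrete, here is one possible shape for them. Only the two function names come from this issue. The signatures, the sparse normal effect sizes, and the heritability-based noise scaling are assumptions for illustration. Genotypes are laid out as variants by samples.

```python
import numpy as np


def make_betas(n_variants, n_causal, effect_sd=1.0, seed=0):
    """Hypothetical signature: sparse effect sizes with n_causal non-zero entries."""
    rng = np.random.default_rng(seed)
    betas = np.zeros(n_variants)
    causal = rng.choice(n_variants, size=n_causal, replace=False)
    betas[causal] = rng.normal(0.0, effect_sd, size=n_causal)
    return betas


def simulate_phenotypes(genotypes, betas, heritability=0.5, seed=0):
    """Hypothetical signature: additive genetic value plus Gaussian noise.

    genotypes has shape (n_variants, n_samples). The noise variance is
    chosen so that the genetic component explains roughly `heritability`
    of the total phenotypic variance.
    """
    if not 0.0 < heritability <= 1.0:
        raise ValueError("heritability must be in (0, 1]")
    rng = np.random.default_rng(seed)

    # Additive genetic value per sample.
    genetic = betas @ np.asarray(genotypes, dtype=float)

    # Scale the noise relative to the genetic variance; fall back to unit
    # variance if there is no genetic signal at all.
    var_g = genetic.var()
    if var_g > 0:
        noise_var = var_g * (1.0 - heritability) / heritability
    else:
        noise_var = 1.0
    noise = rng.normal(0.0, np.sqrt(noise_var), size=genetic.shape[0])
    return genetic + noise


rng = np.random.default_rng(42)
genotypes = rng.binomial(2, 0.3, size=(200, 50))  # toy unstructured genotypes
betas = make_betas(n_variants=200, n_causal=10)
phenotypes = simulate_phenotypes(genotypes, betas, heritability=0.5)
```

Because the true betas are known, a test can check that an association method recovers the causal variants instead of only checking output shapes.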