PyData prototype simulation methods #31

Open
eric-czech opened this issue May 19, 2020 · 5 comments

@eric-czech

We should start thinking about how to simulate data as part of a public API. PLINK and Hail both support this, and I think it's worth addressing now because it will be an important part of improving unit testing. I was chatting with @ravwojdyla and we're both at about the same place in our testing: our current test cases are simplistic, and we would both benefit from synthetic data representing a single dimension of genetic structure, likely with some tunable level of complexity. Essentially we need a better version of Hypothesis, and while we're at it, why not make it part of the API?

Some examples:

  • LD estimation/pruning: A useful simulator would generate a provided number of variants with LD that is either 0, 1, or some specific value in between.
  • Kinship estimation: Simulating near-perfect recombination within a provided pedigree would make our tests more realistic, provided that the kinship coefficients fall into cleanly separable modes.
  • PCA: Balding-Nichols would make this easy, and any PCA test could input high Fst values to get easily separated populations.
  • Association Testing: Something like Hail's experimental ldscsim make_betas and simulate_phenotypes would be useful for validating the LMM and multi-trait models we work on.
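
As a rough sketch of the LD case above, here is one way to generate a pair of diploid variants with a tunable correlation. This assumes NumPy; `simulate_ld_pair` and its parameters are hypothetical names for illustration, not an existing API in PLINK, Hail, or this project. Each haplotype of the second variant is copied from the first with probability `r` and drawn independently otherwise, so the expected genotype correlation is `r` (restricted here to 0 <= r <= 1).

```python
import numpy as np


def simulate_ld_pair(n_samples, r, freq=0.3, seed=0):
    """Hypothetical helper: two diploid variants with expected correlation r."""
    rng = np.random.default_rng(seed)
    # Two haplotypes per sample for the first variant.
    hap_a = rng.random((n_samples, 2)) < freq
    # Independent alleles, used wherever the second variant is not copied.
    hap_indep = rng.random((n_samples, 2)) < freq
    # Copy each haplotype allele from the first variant with probability r.
    copy = rng.random((n_samples, 2)) < r
    hap_b = np.where(copy, hap_a, hap_indep)
    # Genotypes are alternate allele counts in {0, 1, 2}.
    return hap_a.sum(axis=1), hap_b.sum(axis=1)


g_a, g_b = simulate_ld_pair(20_000, r=0.5)
print(np.corrcoef(g_a, g_b)[0, 1])  # close to 0.5, up to sampling noise
```

Setting `r=0` gives independent variants and `r=1` gives identical ones, which are the two extremes an LD pruning test would want to check first.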

It may be that most users don't care about simulators that aren't representative of comprehensive genetic structure (e.g. hapgen), but I think being explicit about our simulations would improve understanding of the methods. This is something we should coordinate on regardless, rather than making private versions for test cases on a per-method basis. It would also make it much easier to demonstrate what a method does without always having to appeal to real datasets.

@eric-czech

Along these lines, G2P (from https://pubmed.ncbi.nlm.nih.gov/30848784/) is probably the most comprehensive tool I've seen so far for generating both genotypes and phenotypes with a good bit of configurability.

PhenotypeSimulator (Mayer & Birney 2018) is another decent one for layering in relationships between given genotypes and generated phenotypes. What it supports is a good outline for what a simulator for association testing should do:

  • multiple phenotypes
  • independent and infinitesimal SNP effects (important for LMM testing)
  • non-genetic effects
  • effects from population structure and relatedness
  • pleiotropy
  • it can simulate genotypes to an extent but these are generally imported from elsewhere (resampling methods seem like a promising way to do this)
  • No binary/categorical phenotypes though unfortunately
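
To make a couple of the items in that list concrete, here is a minimal sketch of multiple quantitative phenotypes with infinitesimal SNP effects, given an existing genotype matrix. It assumes NumPy; `simulate_phenotypes`, `h2`, and `n_traits` are hypothetical names and this is not how PhenotypeSimulator or Hail's ldscsim are implemented. Non-genetic covariates, population structure effects, and correlated effects across traits are left out.

```python
import numpy as np


def simulate_phenotypes(genotypes, h2=0.5, n_traits=1, seed=0):
    """Hypothetical helper: additive traits with infinitesimal effects."""
    rng = np.random.default_rng(seed)
    n_samples, n_variants = genotypes.shape
    # Standardize each variant; guard against monomorphic columns.
    g = genotypes - genotypes.mean(axis=0)
    sd = g.std(axis=0)
    g = g / np.where(sd > 0, sd, 1.0)
    # Every variant gets a small effect; total genetic variance is about h2.
    beta = rng.normal(0.0, np.sqrt(h2 / n_variants), size=(n_variants, n_traits))
    # Independent noise brings the total phenotypic variance to about 1.
    noise = rng.normal(0.0, np.sqrt(1.0 - h2), size=(n_samples, n_traits))
    return g @ beta + noise, beta


rng = np.random.default_rng(1)
genotypes = rng.binomial(2, 0.3, size=(2000, 500))
y, beta = simulate_phenotypes(genotypes, h2=0.5, n_traits=3)
print(y.shape, beta.shape)
```

Returning `beta` alongside the phenotypes lets an association test compare its estimates against the known effects.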

The examples are pretty good too, e.g.:

A few other phenotype-only tools:

@eric-czech

Another one that combines Balding-Nichols and the Pritchard-Stephens-Donnelly (PSD) model to simulate admixture across populations and produce more realistic kinship matrices: bnpsd
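
For reference, the combination can be sketched in a few lines of NumPy. This is a simplified illustration under my own assumptions, not the bnpsd package's API: `simulate_bn_psd` and its parameters are hypothetical, a single `fst` is shared by all populations, and admixture proportions come from a symmetric Dirichlet.

```python
import numpy as np


def simulate_bn_psd(n_samples, n_variants, n_pops=3, fst=0.1, alpha=0.5, seed=0):
    """Hypothetical helper: Balding-Nichols frequencies with PSD admixture."""
    rng = np.random.default_rng(seed)
    # Ancestral allele frequencies, kept away from 0 and 1.
    p_anc = rng.uniform(0.1, 0.9, size=n_variants)
    # Balding-Nichols: per-population frequencies are Beta distributed with
    # mean p_anc and variance fst * p_anc * (1 - p_anc); requires 0 < fst < 1.
    scale = (1.0 - fst) / fst
    p_pop = rng.beta(p_anc * scale, (1.0 - p_anc) * scale, size=(n_pops, n_variants))
    # PSD: each individual is a mixture of the populations.
    q = rng.dirichlet(np.full(n_pops, alpha), size=n_samples)
    # Individual-specific allele frequencies; clip to absorb rounding error.
    p_ind = np.clip(q @ p_pop, 0.0, 1.0)
    genotypes = rng.binomial(2, p_ind)
    return genotypes, q, p_pop


genotypes, q, p_pop = simulate_bn_psd(n_samples=100, n_variants=1000)
print(genotypes.shape, q.shape, p_pop.shape)
```

Higher `fst` values separate the populations more strongly, and a smaller `alpha` pushes individuals toward a single ancestry, which is the easy case for a PCA test.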

@eric-czech

Note: If we do include simulation functions, they should definitely support synthetic missingness. That was a big hang-up I had in using simulated data to better understand how Hail works.
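
A minimal sketch of what that could look like, assuming NumPy, calls missing completely at random, and `-1` as the missing sentinel (`add_missingness` and its parameters are hypothetical, not an existing function):

```python
import numpy as np


def add_missingness(genotypes, rate=0.05, missing_value=-1, seed=0):
    """Hypothetical helper: mask genotype calls completely at random."""
    rng = np.random.default_rng(seed)
    # Each call is masked independently with probability `rate`.
    mask = rng.random(genotypes.shape) < rate
    return np.where(mask, missing_value, genotypes), mask


rng = np.random.default_rng(1)
genotypes = rng.binomial(2, 0.3, size=(500, 200))
masked, mask = add_missingness(genotypes, rate=0.1)
print(mask.mean())  # close to 0.1, up to sampling noise
```

Returning the mask separately keeps the original calls recoverable, so a test can compare results with and without missing data.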

@eric-czech

Add this implementation of BN/PSD in dask at some point: dask/dask#6227 (comment)
