PyData prototype simulation methods #31

eric-czech · 2020-05-19T17:49:37Z

We should start thinking about how to simulate data as part of a public API. PLINK and Hail both support this, and it is worth addressing now because it will be an important part of improving unit testing. I was chatting with @ravwojdyla and we're both at about the same place in our testing: our current test cases are simplistic, and we would both benefit from synthetic data representing a single dimension of genetic structure, likely with some tunable level of complexity. Essentially we need a better version of Hypothesis and while we're at it, why not make it part of the API?

Some examples:

LD estimation/pruning: A useful simulator would generate a provided number of variants with LD that is either 0, 1, or some specific value in between.
Kinship estimation: Simulating near-perfect recombination within a provided pedigree would make our tests more realistic, provided that the kinship coefficients fall into cleanly separable modes.
PCA: Balding-Nichols would make this easy, and any PCA test could input high Fst values to get easily separated populations.
Association Testing: Something like make_betas and simulate_phenotypes from Hail's experimental ldscsim module would be useful for validating the LMM and multi-trait models we work on.
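As a rough illustration of the LD case above, here is a minimal NumPy sketch that generates a pair of variants with a tunable expected correlation. The function name `simulate_ld_pair` and its parameters are hypothetical, not part of any existing API. It assumes random mating and equal allele frequencies at both variants: each haplotype of the second variant copies the first with probability `r` and is otherwise drawn independently, which gives an expected genotype correlation of `r` (so `r=0` and `r=1` give unlinked and perfectly linked variants).

```python
import numpy as np


def simulate_ld_pair(n_samples, freq, r, seed=0):
    """Return two 0/1/2 genotype vectors with expected correlation r (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Two haplotypes per sample for the first variant, each carrying the allele with prob. freq
    hap_a = rng.random((n_samples, 2)) < freq
    # Independent haplotypes at the same allele frequency
    hap_indep = rng.random((n_samples, 2)) < freq
    # With probability r the second variant copies the first; otherwise it is independent
    copy = rng.random((n_samples, 2)) < r
    hap_b = np.where(copy, hap_a, hap_indep)
    # Genotype = number of alternate alleles across the two haplotypes
    return hap_a.sum(axis=1), hap_b.sum(axis=1)
```

A test for an LD estimator could then compare its output against `r**2` for a few values of `r`, within a tolerance that depends on `n_samples`.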

It may be that most users don't care about simulators that aren't representative of comprehensive genetic structure (e.g. hapgen), but I think being explicit about our simulations would improve understanding of the methods. This should be something we coordinate on regardless, rather than making private versions for test cases on a per-method basis. It would also make it much easier to demonstrate what a method does without always having to appeal to real datasets.


eric-czech · 2020-05-19T19:25:33Z

Along these lines, G2P (from https://pubmed.ncbi.nlm.nih.gov/30848784/) is probably the most comprehensive tool I've seen so far for generating both genotypes and phenotypes with a good bit of configurability.

PhenotypeSimulator (Meyer & Birney 2018) is another decent one for layering in relationships between given genotypes and generated phenotypes. What it supports is a good outline for what a simulator for association testing should do:

multiple phenotypes
independent and infinitesimal SNP effects (important for LMM testing)
non-genetic effects
effects from population structure and relatedness
pleiotropy
genotype simulation to an extent, though genotypes are generally imported from elsewhere (resampling methods seem like a promising way to do this)
no binary/categorical phenotypes though, unfortunately
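To make the "infinitesimal SNP effects" item concrete, here is a minimal sketch of a single-trait simulator under an additive model. This is not PhenotypeSimulator's or Hail's API; `simulate_phenotype` and its arguments are hypothetical. It assumes every variant has a small effect drawn from N(0, h2/m) on standardized genotypes, plus independent Gaussian noise with variance 1 - h2, so the phenotype has variance close to 1 when variants are roughly uncorrelated.

```python
import numpy as np


def simulate_phenotype(genotypes, h2, seed=0):
    """Simulate one continuous trait from an (n_samples, n_variants) genotype matrix.

    Hypothetical helper: infinitesimal additive effects with target heritability h2.
    """
    rng = np.random.default_rng(seed)
    n, m = genotypes.shape
    # Standardize each variant to mean 0 and unit variance
    g = genotypes - genotypes.mean(axis=0)
    sd = g.std(axis=0)
    sd[sd == 0] = 1.0  # leave monomorphic variants at zero instead of dividing by zero
    g = g / sd
    # Every variant gets a small effect; expected total genetic variance is h2
    beta = rng.normal(0.0, np.sqrt(h2 / m), size=m)
    # Non-genetic (noise) component carries the remaining variance
    noise = rng.normal(0.0, np.sqrt(1.0 - h2), size=n)
    return g @ beta + noise, beta
```

Returning `beta` alongside the phenotype lets an association test check its estimates against the known effects. Multiple traits, population-structure effects, and pleiotropy would need additional components on top of this.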

The examples are pretty good too, e.g.:

A few other phenotype-only tools:

https://github.com/bcm-uga/NaturalGWAS
https://github.com/spiros/tofu
https://github.com/chr1swallace/simGWAS - Directly simulates summary stats for phenotypes rather than ever needing genotypes

eric-czech · 2020-05-20T14:20:56Z

Another one: bnpsd, which combines the Balding-Nichols model with the Pritchard-Stephens-Donnelly (PSD) admixture model to simulate admixture across populations and produce more realistic kinship matrices.
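For reference, a minimal NumPy sketch of the general BN/PSD idea follows. It is not the bnpsd package's API; `simulate_bn_psd` and its parameters are hypothetical, and the choices of Uniform(0.1, 0.9) ancestral frequencies and Dirichlet-distributed admixture proportions are assumptions made for illustration. Subpopulation allele frequencies are drawn from the Balding-Nichols Beta distribution, each individual's allele frequency is the admixture-weighted mix of those, and genotypes are Binomial(2, p) draws.

```python
import numpy as np


def simulate_bn_psd(n_samples, n_variants, fst, alpha=0.5, seed=0):
    """Hypothetical BN/PSD sketch: returns (genotypes, admixture proportions)."""
    rng = np.random.default_rng(seed)
    fst = np.asarray(fst, dtype=float)  # one Fst per subpopulation, each in (0, 1)
    k = fst.shape[0]
    # Ancestral allele frequencies (assumed uniform, away from 0 and 1)
    p_anc = rng.uniform(0.1, 0.9, size=n_variants)
    # Balding-Nichols: p_sub ~ Beta(p(1-F)/F, (1-p)(1-F)/F) per subpopulation
    scale = (1.0 - fst) / fst
    a = p_anc[None, :] * scale[:, None]
    b = (1.0 - p_anc)[None, :] * scale[:, None]
    p_sub = rng.beta(a, b)  # shape (k, n_variants)
    # PSD: each individual is a mixture of the k subpopulations
    q = rng.dirichlet(np.full(k, alpha), size=n_samples)  # shape (n_samples, k)
    # Individual-specific allele frequencies, clipped to guard against rounding
    p_ind = np.clip(q @ p_sub, 0.0, 1.0)
    genotypes = rng.binomial(2, p_ind)  # shape (n_samples, n_variants)
    return genotypes, q
```

Small `alpha` values give mostly unadmixed individuals, and high Fst values push the subpopulations apart, which is the regime where PCA tests should separate them easily.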

eric-czech · 2020-05-21T12:25:41Z

Note: If we do include simulation functions, they should definitely support synthetic missingness. That was a big hang-up I had in using simulated data to better understand how Hail works.
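A minimal sketch of what synthetic missingness could look like, assuming calls are dropped completely at random and that missing calls are encoded with a sentinel of -1. Both are assumptions for illustration, and `add_missingness` is a hypothetical name.

```python
import numpy as np


def add_missingness(genotypes, rate, missing_value=-1, seed=0):
    """Mask a random fraction of genotype calls (hypothetical helper).

    Returns the masked copy and the boolean mask of entries that were dropped.
    """
    rng = np.random.default_rng(seed)
    # Each call is dropped independently with probability `rate`
    mask = rng.random(genotypes.shape) < rate
    return np.where(mask, missing_value, genotypes), mask
```

Returning the mask makes it easy to check that a method treats exactly the masked calls as missing. Missingness that depends on the variant, the sample, or the genotype itself would need a non-uniform `rate`.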

eric-czech · 2020-05-22T21:07:24Z

see also: https://discourse.smadstatgen.org/t/common-patterns-in-human-population-simulation/48

eric-czech · 2020-05-23T17:17:11Z

Add this implementation of BN/PSD in dask at some point: dask/dask#6227 (comment)

eric-czech mentioned this issue May 19, 2020

Build PyData prototype for GWAS analysis #20

Open

12 tasks

eric-czech mentioned this issue May 20, 2020

PyData prototype LD prune implementation #26

Closed
