Simulate genotypes for unit tests #23

Closed
eric-czech opened this issue Jul 8, 2020 · 13 comments
Labels
core operations Issues related to domain-specific functionality such as LD pruning, PCA, association testing, etc. process + tools

Comments

@eric-czech
Collaborator

eric-czech commented Jul 8, 2020

I'm trying to simulate some genotype calls for unit tests using a simple model from msprime. At @jeromekelleher's suggestion, I tried this:

# Using 1.x code, not yet released
!pip install git+https://github.com/tskit-dev/msprime.git#egg=msprime

import msprime
model = msprime.Demography.island_model(2, migration_rate=1e-3, Ne=10**4)
samples = model.sample(10, 10) # 10 haploids from each pop
ts = msprime.simulate(
    samples=samples,
    demography=model,  # what should this be?
    mutation_rate=1e-8,
    recombination_rate=1e-8
)
list(ts.variants())
[]

What is demography supposed to be? I saw a "FIXME" for documenting it in simulate but I wasn't sure what to set it to if not the Demography instance that comes back from island_model.

Also, is this why I'm not getting any variants back out?

I am ultimately trying to simulate Hardy-Weinberg equilibrium. I'd like to have some fraction of the variants be in or near perfect equilibrium and the remainder way out of it. If you get a chance, could I get an assist @jeromekelleher?

@ravwojdyla
Collaborator

+1 @eric-czech when we figure out the right toolset for (test) data generation it could be valuable to expose them as strategies in hypothesis (this could be a separate test module within this repo). wdyt?

@jeromekelleher
Collaborator

I'm impressed you've gotten this far with the new undocumented APIs @eric-czech! You're just missing a "length" parameter to tell it how much sequence to simulate:

model = msprime.Demography.island_model(2, migration_rate=1e-3, Ne=10**4)
samples = model.sample(10, 10) # 10 haploids from each pop
ts = msprime.simulate(
    samples=samples,
    demography=model,
    length=10**6,
    recombination_rate=1e-8
)
ts = msprime.mutate(ts, model=msprime.JukesCantor(), rate=1e-8, discrete=True)
print(ts.num_sites)
for variant in ts.variants():
    print(int(variant.site.position), variant.alleles, variant.genotypes, sep="\t")

This gives ~3000 sites, which look like:

641872  ('G', 'A')      [0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0]
641885  ('C', 'G')      [0 1 0 0 1 0 0 0 0 1 0 1 0 0 1 0 1 0 0 1]
641901  ('C', 'A')      [0 1 0 0 1 0 0 0 0 1 0 1 0 0 1 0 1 0 1 1]
641965  ('C', 'G')      [0 1 0 0 1 0 0 0 0 1 0 1 0 0 1 0 1 0 1 1]
642507  ('A', 'C', 'G') [2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2]
642584  ('G', 'T')      [0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
642696  ('C', 'T')      [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0]
642714  ('G', 'T')      [0 1 0 1 1 1 0 1 0 1 0 1 1 0 1 0 1 1 0 1]

Using the new APIs we can simulate mutations at discrete sites, with lots of different mutation models. We're using Jukes-Cantor here, as it's likely fine for the vast majority of stuff. We can easily bias towards different bases, etc, if this ever becomes something we want. These discrete mutation models mean that we can have multiple mutations per site. This has happened at site 642507 above, which has three alleles, but only two of them are actually inherited by the samples. This is something to watch out for, I think.
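If a test needs strictly biallelic data, one option is to drop any site where the samples carry more than two distinct alleles. A rough numpy sketch (the toy matrix below is made up for illustration, not msprime output):

```python
import numpy as np

# Toy genotype matrix: rows are sites, columns are haploid samples.
# Values are allele indexes, as in the msprime output above.
G = np.array([
    [0, 0, 1, 0],  # biallelic among the samples: {0, 1}
    [0, 1, 0, 1],  # biallelic: {0, 1}
    [2, 2, 1, 2],  # three alleles recorded, but samples only carry {1, 2}
    [0, 1, 2, 2],  # truly multiallelic among the samples
])

# Keep sites where the samples carry at most two distinct alleles.
n_alleles = np.array([len(np.unique(row)) for row in G])
biallelic = n_alleles <= 2
G_filtered = G[biallelic]
```

Note that this keeps the third site, where only two of the three recorded alleles are actually inherited by the samples; whether that's the right call depends on the test.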

Re HWE, I'll have a bit of a think about that.

@eric-czech
Collaborator Author

Aha! Thanks, works for me.

These discrete mutation models mean that we can have multiple mutations per site

Good to know, will definitely watch out for that.

Re HWE, I'll have a bit of a think about that.

Most of the tests I've been thinking of (not just HWE) would ideally work by knowing, going into the simulation, which variants or samples will have certain properties coming out of it.

For example, I'd like to say that variants with indexes 1000 through 2000 end up with genotypes across samples that are in HWE so that I can check that HWE tests for those same variants come back not significant. Is that kind of a priori knowledge about samples/variants at odds with how coalescent simulations work? I haven't had a chance to learn much about them yet so forgive my ignorance.

@eric-czech
Collaborator Author

+1 @eric-czech when we figure out the right toolset for (test) data generation it could be valuable to expose them as strategies in hypothesis (this could be a separate test module within this repo). wdyt?

Yea, I don't see why not

@jeromekelleher
Collaborator

For example, I'd like to say that variants with indexes 1000 through 2000 end up with genotypes across samples that are in HWE so that I can check that HWE tests for those same variants come back not significant. Is that kind of a priori knowledge about samples/variants at odds with how coalescent simulations work? I haven't had a chance to learn much about them yet so forgive my ignorance.

You can't really say anything about particular sites if they've been thrown randomly on the trees; each site is drawn from the underlying mutation process.

For HWE, I guess if you pair up the haploid samples within a population which has been isolated for a long time, then they should come out in HWE. Whereas, if you pair up samples between populations, they should not. Something like this:

import msprime
import numpy as np

model = msprime.Demography.island_model(2, migration_rate=0, Ne=10**4)
# The populations are isolated until 100k generations ago
model.events.append(
    msprime.MassMigration(time=100000, source=1, dest=0))
samples = model.sample(10, 10) # 10 haploids from each pop
ts = msprime.simulate(
    samples=samples,
    demography=model,
    length=10**6,
    recombination_rate=1e-8
)

ts = msprime.mutate(ts, model=msprime.JukesCantor(), rate=1e-8, discrete=True)
G = ts.genotype_matrix()
# Within-pop diploids:
W = [(G[:, 2 * j], G[:, 2 * j + 1]) for j in range(10)]
# Between-pop diploids:
B = [(G[:, j], G[:, 10 + j]) for j in range(10)]

for j in range(10):
    print("B", np.sum(B[j][0] != B[j][1]))
    print("W", np.sum(W[j][0] != W[j][1]))

Here I'm just printing out the number of differences between the two haploids in our pseudo-diploids. Unsurprisingly, we get way more differences in the between-pop diploids. I think something like this should be helpful?

It's these sorts of scenarios we want to create anyway, with old splits between the populations ensuring that there's lots of divergence, but still with a fair bit of shared diversity because of deep coalescence.

For creating sites that are in HWE and not, you could mix and match the bits of the genotype matrix afterwards to create the patterns that you want, I think.
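One way to do that mixing (a rough numpy sketch over made-up data, not code from this thread): shuffle all 2n allele copies within each site that should be in HWE, which is equivalent to random mating at that site. Starting from an extreme excess of heterozygotes:

```python
import numpy as np

rng = np.random.default_rng(0)
n_sites, n_indiv = 200, 100

# Start far out of HWE: every individual is heterozygous (0|1) at every
# site, so the observed het fraction is 1.0 while p = 0.5 predicts 2pq = 0.5.
g = np.zeros((n_sites, n_indiv, 2), dtype=int)
g[:, :, 1] = 1
het_before = np.mean(g[:, :, 0] != g[:, :, 1])

# Shuffle the 2n allele copies within each site (random mating per site).
flat = g.reshape(n_sites, -1)
for row in flat:
    rng.shuffle(row)
g = flat.reshape(n_sites, n_indiv, 2)
het_after = np.mean(g[:, :, 0] != g[:, :, 1])
```

After shuffling, the heterozygote fraction drops to roughly 2pq = 0.5, while the allele frequency at each site is untouched.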

Does this help?

@hammer hammer added core operations Issues related to domain-specific functionality such as LD pruning, PCA, association testing, etc. process + tools labels Jul 9, 2020
@eric-czech
Collaborator Author

Thanks @jeromekelleher! Very helpful example.

For HWE, I guess if you pair up the haploid samples within a population which has been isolated for a long time, then they should come out in HWE. Whereas, if you pair up samples between populations, they should not

Makes sense, I think it would certainly be interesting to relate exact HWE test statistic distributions to much more intuitive changes in an underlying simulation model like that.

model.events.append(msprime.MassMigration(time=100000, source=1, dest=0))

I'm assuming this determines when the two populations split, right? What happens if you don't specify this for a 2 population island model?

For creating sites that are in HWE and not, you could mix and match the bits of the genotype matrix afterwards to create the patterns that you want, I think.

I think there are several different characteristics of an allele that HWE filters are attempting to eliminate, and I would say that example simulates an extreme form of something like assortative mating (which is one of those characteristics). I can imagine how I might emulate some of the others by mixing up the genotype matrix, but the two more common ones I'm aware of are selection and inbreeding. It would be useful to see how HWE tests behave as those are increased/decreased along some continuous spectrum. Would you happen to know off-hand of good examples of how I could parameterize those things in a model?

This is getting away from the simple unit testing case a bit, but I can see how this could shape up into some deeper tests. Or it could make for great examples of how something like an HWE statistic works and what it can do (for our docs).
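For the inbreeding axis, genotypes can also be drawn directly, outside any coalescent machinery, using the standard parameterisation in which the heterozygote probability is 2pq(1 - F). A hedged sketch (the function name and setup here are illustrative, not msprime API):

```python
import numpy as np

rng = np.random.default_rng(1)

def draw_genotypes(p, F, n):
    """Draw n diploid genotypes (0/1/2 alt-allele counts) at a site with
    alt-allele frequency p and inbreeding coefficient F, using
    P(hom ref) = q^2 + Fpq, P(het) = 2pq(1 - F), P(hom alt) = p^2 + Fpq."""
    q = 1 - p
    probs = [q * q + F * p * q, 2 * p * q * (1 - F), p * p + F * p * q]
    return rng.choice([0, 1, 2], size=n, p=probs)

# As F goes from 0 (HWE) to 1 (complete inbreeding), heterozygosity
# shrinks towards zero while the allele frequency stays fixed at p.
for F in [0.0, 0.5, 1.0]:
    g = draw_genotypes(p=0.5, F=F, n=100_000)
    print(F, np.mean(g == 1))
```

This only covers the inbreeding axis; it says nothing about selection.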

@jeromekelleher
Collaborator

Sorry @eric-czech this one sunk to the bottom for a while.

I'm assuming this determines when the two populations split right? What happens if you don't specify this for a 2 population island model?

That's right, yeah. In an island model the samples will still coalesce eventually: lineages migrate between the two populations, and two lineages can only coalesce once they are in the same population. Basically, the branches of the trees will be longer, and the signals of geographic structure will be much less obvious.

... but the two more common ones I'm aware of are selection and inbreeding. It would be useful to see how HWE tests behave as those things are increased/decreased along some continuous spectrum. Would you happen to know off-hand of good examples of how I could parameterize those things in a model?

Neither of these are things we can do directly with a simple coalescent model, but approximations are available. The easiest thing to do is to use a forwards-time simulator like SLiM or fwdpy11. I wouldn't imagine picking out signals with this sort of subtlety is something that we could do in unit tests though - this seems more like something we'd do as part of a validation suite. For unit tests, I think we would be happy if we can get fairly crude expected results when we have extreme structure/very strong signals. For a validation suite, we'd want to have a more nuanced set of tests, where results might be plots of various forms rather than simple yes-no assertions.

Do we have an issue talking about a validation suite (as opposed to performance benchmarks)?

@eric-czech
Collaborator Author

Do we have an issue talking about a validation suite (as opposed to performance benchmarks)?

Not that I know of. Maybe it would make sense to colocate benchmarking, validation, and public dataset documentation/scripts in a separate repo (making them all related to https://github.com/pystatgen/sgkit/pull/9)?

@hammer
Contributor

hammer commented Aug 17, 2020

Maybe it would make sense to colocate benchmarking, validation, and public dataset documentation/scripts in a separate repo

That’s my preferred outcome, though I am not a developer, so if there are good reasons to keep it in a folder in this repo that’s fine too.

@tomwhite
Collaborator

Maybe it would make sense to colocate benchmarking, validation, and public dataset documentation/scripts in a separate repo

I initially suggested putting them in this repo, but I'm not opposed to moving them out. I can see that if there were large datasets in the validation repo it would be good to keep it separate so new developers to the main repo can submit a bugfix PR without having to check out lots of unrelated stuff. But that may not be a very compelling reason.

We have yet to build Sphinx documentation from multiple repos, so it would be good to figure out how to do that as a part of setting up a separate repo.

@hammer
Contributor

hammer commented Aug 20, 2020

We have yet to build Sphinx documentation from multiple repos, so it would be good to figure out how to do that as a part of setting up a separate repo.

I agree that's critical for the IO repos, but I'm not sure we need to include the validation repo documentation on the main documentation website. It may end up making sense to do so, though.

@eric-czech eric-czech mentioned this issue Sep 29, 2020
@hammer
Contributor

hammer commented Oct 15, 2020

We have simulate_genotype_call_dataset now. Should we close this issue out and file more specific issues for improvements to simulated data as they arise?

@jeromekelleher
Collaborator

I think we can - closing.
