Simulate genotypes for unit tests #23

Closed
eric-czech opened this issue Jul 8, 2020 · 13 comments
Labels
core operations Issues related to domain-specific functionality such as LD pruning, PCA, association testing, etc. process + tools

Comments

@eric-czech
Collaborator

eric-czech commented Jul 8, 2020

I'm trying to simulate some genotype calls for unit tests using a simple model from msprime. At @jeromekelleher's suggestion, I tried this:

# Using 1.x code, not yet released
!pip install git+https://github.com/tskit-dev/msprime.git#egg=msprime

import msprime
model = msprime.Demography.island_model(2, migration_rate=1e-3, Ne=10**4)
samples = model.sample(10, 10) # 10 haploids from each pop
ts = msprime.simulate(
    samples=samples,
    demography=model,  # what should this be?
    mutation_rate=1e-8,
    recombination_rate=1e-8
)
list(ts.variants())
[]

What is demography supposed to be? I saw a "FIXME" for documenting it in simulate but I wasn't sure what to set it to if not the Demography instance that comes back from island_model.

Also, is this why I'm not getting any variants back out?

I am ultimately trying to simulate Hardy-Weinberg equilibrium. I'd like to have some fraction of the variants be in or near perfect equilibrium and the remainder way out of it. If you get a chance, could I get an assist @jeromekelleher?

@ravwojdyla
Collaborator

+1 @eric-czech when we figure out the right toolset for (test) data generation it could be valuable to expose them as strategies in hypothesis (this could be a separate test module within this repo). wdyt?

@jeromekelleher
Collaborator

I'm impressed you've gotten this far with the new undocumented APIs @eric-czech! You're just missing a "length" parameter to tell it how much sequence to simulate:

model = msprime.Demography.island_model(2, migration_rate=1e-3, Ne=10**4)
samples = model.sample(10, 10) # 10 haploids from each pop
ts = msprime.simulate(
    samples=samples,
    demography=model,
    length=10**6,
    recombination_rate=1e-8
)
ts = msprime.mutate(ts, model=msprime.JukesCantor(), rate=1e-8, discrete=True)
print(ts.num_sites)
for variant in ts.variants():
    print(int(variant.site.position), variant.alleles, variant.genotypes, sep="\t")

This gives ~3000 sites, which look like:

641872  ('G', 'A')      [0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0]
641885  ('C', 'G')      [0 1 0 0 1 0 0 0 0 1 0 1 0 0 1 0 1 0 0 1]
641901  ('C', 'A')      [0 1 0 0 1 0 0 0 0 1 0 1 0 0 1 0 1 0 1 1]
641965  ('C', 'G')      [0 1 0 0 1 0 0 0 0 1 0 1 0 0 1 0 1 0 1 1]
642507  ('A', 'C', 'G') [2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2]
642584  ('G', 'T')      [0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
642696  ('C', 'T')      [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0]
642714  ('G', 'T')      [0 1 0 1 1 1 0 1 0 1 0 1 1 0 1 0 1 1 0 1]

Using the new APIs we can simulate mutations at discrete sites, with lots of different mutation models. We're using Jukes-Cantor here, as it's likely fine for the vast majority of stuff. We can easily bias towards different bases, etc, if this ever becomes something we want. These discrete mutation models mean that we can have multiple mutations per site. This has happened at site 642507 above, which has three alleles, but only two of them are actually inherited by the samples. This is something to watch out for, I think.
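If a test needs strictly biallelic data, one option is to drop any site where the samples carry more than two distinct alleles. A rough numpy sketch (the toy matrix below is made up for illustration, not msprime output):

```python
import numpy as np

# Toy genotype matrix: rows are sites, columns are haploid samples.
# Values are allele indexes, as in the msprime output above.
G = np.array([
    [0, 0, 1, 0],  # biallelic among the samples: {0, 1}
    [0, 1, 0, 1],  # biallelic: {0, 1}
    [2, 2, 1, 2],  # three alleles recorded, but samples only carry {1, 2}
    [0, 1, 2, 2],  # truly multiallelic among the samples
])

# Keep sites where the samples carry at most two distinct alleles.
n_alleles = np.array([len(np.unique(row)) for row in G])
biallelic = n_alleles <= 2
G_filtered = G[biallelic]
```

Note that this keeps the third site, where only two of the three recorded alleles are actually inherited by the samples; whether that's the right call depends on the test.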

Re HWE, I'll have a bit of a think about that.

@eric-czech
Collaborator Author

Aha! Thanks, works for me.

These discrete mutation models mean that we can have multiple mutations per site

Good to know, will definitely watch out for that.

Re HWE, I'll have a bit of a think about that.

Most of the tests I've been thinking of (not just HWE) would ideally work by knowing, going into the simulation, which variants or samples will have certain properties coming out of it.

For example, I'd like to say that variants with indexes 1000 through 2000 end up with genotypes across samples that are in HWE so that I can check that HWE tests for those same variants come back not significant. Is that kind of a priori knowledge about samples/variants at odds with how coalescent simulations work? I haven't had a chance to learn much about them yet so forgive my ignorance.

@eric-czech
Collaborator Author

+1 @eric-czech when we figure out the right toolset for (test) data generation it could be valuable to expose them as strategies in hypothesis (this could be a separate test module within this repo). wdyt?

Yea, I don't see why not

@jeromekelleher
Collaborator

For example, I'd like to say that variants with indexes 1000 through 2000 end up with genotypes across samples that are in HWE so that I can check that HWE tests for those same variants come back not significant. Is that kind of a priori knowledge about samples/variants at odds with how coalescent simulations work? I haven't had a chance to learn much about them yet so forgive my ignorance.

You can't really say anything about particular sites if they've been thrown randomly on the trees; each site is drawn from the underlying mutation process.

For HWE, I guess if you pair up the haploid samples within a population which has been isolated for a long time, then they should come out in HWE. Whereas, if you pair up samples between populations, they should not. Something like this:

import msprime
import numpy as np

model = msprime.Demography.island_model(2, migration_rate=0, Ne=10**4)
# The populations are isolated until 100k generations ago
model.events.append(
    msprime.MassMigration(time=100000, source=1, dest=0))
samples = model.sample(10, 10) # 10 haploids from each pop
ts = msprime.simulate(
    samples=samples,
    demography=model,
    length=10**6,
    recombination_rate=1e-8
)

ts = msprime.mutate(ts, model=msprime.JukesCantor(), rate=1e-8, discrete=True)
G = ts.genotype_matrix()
# Within-pop diploids:
W = [(G[:, 2 * j], G[:, 2 * j + 1]) for j in range(10)]
# Between-pop diploids:
B = [(G[:, j], G[:, 10 + j]) for j in range(10)]

for j in range(10):
    print("B", np.sum(B[j][0] != B[j][1]))
    print("W", np.sum(W[j][0] != W[j][1]))

Here I'm just printing out the number of differences between the two haploids in our pseudo-diploids. Unsurprisingly, we get way more differences in the between-pop diploids. I think something like this should be helpful?

It's these sorts of scenarios we want to create anyway, with old splits between the populations ensuring that there's lots of divergence, but still with a fair bit of shared diversity because of deep coalescence.

For creating sites that are in HWE and not, you could mix and match the bits of the genotype matrix afterwards to create the patterns that you want, I think.
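One way to do that mixing (a rough numpy sketch over made-up data, not code from this thread): shuffle all 2n allele copies within each site that should be in HWE, which is equivalent to random mating at that site. Starting from an extreme excess of heterozygotes:

```python
import numpy as np

rng = np.random.default_rng(0)
n_sites, n_indiv = 200, 100

# Start far out of HWE: every individual is heterozygous (0|1) at every
# site, so the observed het fraction is 1.0 while p = 0.5 predicts 2pq = 0.5.
g = np.zeros((n_sites, n_indiv, 2), dtype=int)
g[:, :, 1] = 1
het_before = np.mean(g[:, :, 0] != g[:, :, 1])

# Shuffle the 2n allele copies within each site (random mating per site).
flat = g.reshape(n_sites, -1)
for row in flat:
    rng.shuffle(row)
g = flat.reshape(n_sites, n_indiv, 2)
het_after = np.mean(g[:, :, 0] != g[:, :, 1])
```

After shuffling, the heterozygote fraction drops to roughly 2pq = 0.5, while the allele frequency at each site is untouched.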

Does this help?

@hammer hammer added core operations Issues related to domain-specific functionality such as LD pruning, PCA, association testing, etc. process + tools labels Jul 9, 2020
@eric-czech
Collaborator Author

Thanks @jeromekelleher! Very helpful example.

For HWE, I guess if you pair up the haploid samples within a population which has been isolated for a long time, then they should come out in HWE. Whereas, if you pair up samples between populations, they should not

Makes sense, I think it would certainly be interesting to relate exact HWE test statistic distributions to much more intuitive changes in an underlying simulation model like that.

model.events.append(msprime.MassMigration(time=100000, source=1, dest=0))

I'm assuming this determines when the two populations split, right? What happens if you don't specify this for a 2 population island model?

For creating sites that are in HWE and not, you could mix and match the bits of the genotype matrix afterwards to create the patterns that you want, I think.

I think there are several different characteristics of an allele that HWE filters are attempting to eliminate, and I would say that example simulates an extreme form of something like assortative mating (which is one of those characteristics). I can imagine how I might emulate some of the others by mixing up the genotype matrix, but the two more common ones I'm aware of are selection and inbreeding. It would be useful to see how HWE tests behave as those are increased/decreased along some continuous spectrum. Would you happen to know off-hand of good examples of how I could parameterize those things in a model?

This is getting away from the simple unit testing case a bit, but I can see how this could shape up into some deeper tests. Or it could make for great examples of how something like an HWE statistic works and what it can do (for our docs).
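For the inbreeding axis, genotypes can also be drawn directly, outside any coalescent machinery, using the standard parameterisation in which the heterozygote probability is 2pq(1 - F). A hedged sketch (the function name and setup here are illustrative, not msprime API):

```python
import numpy as np

rng = np.random.default_rng(1)

def draw_genotypes(p, F, n):
    """Draw n diploid genotypes (0/1/2 alt-allele counts) at a site with
    alt-allele frequency p and inbreeding coefficient F, using
    P(hom ref) = q^2 + Fpq, P(het) = 2pq(1 - F), P(hom alt) = p^2 + Fpq."""
    q = 1 - p
    probs = [q * q + F * p * q, 2 * p * q * (1 - F), p * p + F * p * q]
    return rng.choice([0, 1, 2], size=n, p=probs)

# As F goes from 0 (HWE) to 1 (complete inbreeding), heterozygosity
# shrinks towards zero while the allele frequency stays fixed at p.
for F in [0.0, 0.5, 1.0]:
    g = draw_genotypes(p=0.5, F=F, n=100_000)
    print(F, np.mean(g == 1))
```

This only covers the inbreeding axis; it says nothing about selection.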

@jeromekelleher
Collaborator

Sorry @eric-czech this one sunk to the bottom for a while.

I'm assuming this determines when the two populations split right? What happens if you don't specify this for a 2 population island model?

That's right, yeah. In an island model the samples will still coalesce eventually: lineages migrate between the two populations, and two lineages can only coalesce once they are in the same population. Basically, the branches of the trees will be longer, and the signals of geographic structure will be much less obvious.

... but the two more common ones I'm aware of are selection and inbreeding. It would be useful to see how HWE tests behave as those things are increased/decreased along some continuous spectrum. Would you happen to know off-hand of good examples of how I could parameterize those things in a model?

Neither of these are things we can do directly with a simple coalescent model, but approximations are available. The easiest thing to do is to use a forwards-time simulator like SLiM or fwdpy11. I wouldn't imagine picking out signals with this sort of subtlety is something that we could do in unit tests though - this seems more like something we'd do as part of a validation suite. For unit tests, I think we would be happy if we can get fairly crude expected results when we have extreme structure/very strong signals. For a validation suite, we'd want to have a more nuanced set of tests, where results might be plots of various forms rather than simple yes-no assertions.

Do we have an issue talking about a validation suite (as opposed to performance benchmarks)?

@eric-czech
Collaborator Author

Do we have an issue talking about a validation suite (as opposed to performance benchmarks)?

Not that I know of. Maybe it would make sense to colocate benchmarking, validation, and public dataset documentation/scripts in a separate repo (making them all related to https://github.com/pystatgen/sgkit/pull/9)?

@hammer
Contributor

hammer commented Aug 17, 2020

Maybe it would make sense to colocate benchmarking, validation, and public dataset documentation/scripts in a separate repo

That’s my preferred outcome, though I am not a developer, so if there are good reasons to keep it in a folder in this repo that’s fine too.

@tomwhite
Collaborator

Maybe it would make sense to colocate benchmarking, validation, and public dataset documentation/scripts in a separate repo

I initially suggested putting them in this repo, but I'm not opposed to moving them out. I can see that if there were large datasets in the validation repo it would be good to keep it separate so new developers to the main repo can submit a bugfix PR without having to check out lots of unrelated stuff. But that may not be a very compelling reason.

We have yet to build Sphinx documentation from multiple repos, so it would be good to figure out how to do that as a part of setting up a separate repo.

@hammer
Contributor

hammer commented Aug 20, 2020

We have yet to build Sphinx documentation from multiple repos, so it would be good to figure out how to do that as a part of setting up a separate repo.

I agree that's critical for the IO repos, but I'm not sure we need to include the validation repo documentation on the main documentation website. It may end up making sense to do so, though.

@eric-czech eric-czech mentioned this issue Sep 29, 2020
@hammer
Contributor

hammer commented Oct 15, 2020

We have simulate_genotype_call_dataset now. Should we close this issue out and file more specific issues for improvements to simulated data as they arise?

@jeromekelleher
Collaborator

I think we can - closing.
