Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HWE Test Implementation #76

Merged
merged 13 commits into from
Aug 17, 2020
Merged

HWE Test Implementation #76

merged 13 commits into from
Aug 17, 2020

Conversation

eric-czech
Copy link
Collaborator

@eric-czech eric-czech commented Jul 29, 2020

This PR contains a new implementation for https://github.com/pystatgen/sgkit/issues/28.

Summary of changes:

  • Introduces numba as a dependency
  • Adds a sgkit/stats/hwe.py module
  • Adds test data exported from a reference implementation
    • I will add the code that generates the export in a separate PR
  • Wraps jit-compiled functions as separate variables and tests them where appropriate, which makes it possible to handle these scenarios:
    • A part of the unit tests is ensuring that results don't overflow with large numbers, but the speed of the test is dependent on the size of those numbers so the no-jit function should never be run with these same inputs
    • Codecov can't follow execution of the compiled functions and simply thinks they're not covered, so a limited set of tests can be run with the no-jit function to hit coverage thresholds

There may be better ways to organize jit-compiled function tests, see https://github.com/pystatgen/sgkit/issues/77.

Copy link
Collaborator

@tomwhite tomwhite left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks great, +1. The testing is especially thorough. Just a couple of minor changes then I think this can be merged.

(The numba coverage questions can be addressed in #77.)

Additionally, both covariate and trait arrays will be rechunked to have blocks
along the sample (row) dimension but not the column dimension (i.e.
they must be tall and skinny).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this warning removed for a reason?

Copy link
Collaborator Author

@eric-czech eric-czech Aug 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was duplicated.

hardy_weinberg_p_value_vec_jit = njit(hardy_weinberg_p_value_vec, fastmath=True)


def hardy_weinberg_test(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function should be added to the top-level __init__.py so it's a part of the public API.

Also, +1 on the name (even though it's not a verb!). I think using the abbreviation hwe internally is fine too (as you have done).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

n = len(obs_hets)
p = np.empty(n, dtype=np.float64)
for i in range(n):
p[i] = hardy_weinberg_p_value_jit(obs_hets[i], obs_hom1[i], obs_hom2[i])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Observation: we're using Dask for parallelization here (across blocks) which is fine, but we may want to consider numba's parallel=True in the future for this loop as another dispatch option.

# Otherwise compute genotype counts from calls
else:
# TODO: Use API genotype counting function instead, e.g.
# https://github.com/pystatgen/sgkit/issues/29#issuecomment-656691069
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you open an issue to track this please.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


def test_hwep__raise_on_negative():
args = [[-1, 0, 0], [0, -1, 0], [0, 0, -1]]
for arg in args:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: might be slighter neater to parameterise the test?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

# TODO: Use API genotype counting function instead, e.g.
# https://github.com/pystatgen/sgkit/issues/29#issuecomment-656691069
mask = ds["call_genotype_mask"].any(dim="ploidy")
gtc = xr.where(mask, -1, ds["call_genotype"].sum(dim="ploidy")) # type: ignore[no-untyped-call]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style nit/question: we are being a bit inconsistent on case for these variables. E.g. in count_alleles the variables are G, CTS, etc. Also, in the summary stats PR I used G for genotypes, but n_het for heterogeneous counts.

Are there some rules we can use? E.g. use lowercase except when we are replicating a paper which uses X, etc?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pystatgen/sgkit@466859e#diff-2ec013d1f086d9558934c38ac90411d1R173

I like the capital convention so array names don't conflict with scalars when there are a lot of both. When there aren't a lot of either, switching to lower case name sounds good and is what I would prefer too. I don't know how to make a threshold for that clear, so I've been trying to err on the side of sticking with the capital letter convention for consistency (thanks for flagging this one).

I filed https://github.com/pystatgen/sgkit/issues/117 to make a record of this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants