HWE Test Implementation #76

eric-czech · 2020-07-29T13:19:35Z

This PR contains a new implementation for https://github.com/pystatgen/sgkit/issues/28.

Summary of changes:

- Introduces numba as a dependency
- Adds a sgkit/stats/hwe.py module
- Adds test data exported from a reference implementation
  - I will add the code that generates the export in a separate PR
- Wraps jit-compiled functions as separate variables and tests them where appropriate, which makes it possible to handle these scenarios:
  - Some unit tests check that results don't overflow with large numbers. Test runtime grows with the size of those numbers, so the no-jit function should never be run with the same inputs
  - Codecov can't follow execution of the compiled functions and reports them as uncovered, so a limited set of tests can be run against the no-jit function to meet coverage thresholds

There may be better ways to organize jit-compiled function tests, see https://github.com/pystatgen/sgkit/issues/77.
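
For readers unfamiliar with the pattern, here is a minimal sketch of keeping a plain-Python function and its compiled variant under separate names. `scaled_sum` and both test names are hypothetical and not part of this PR; the `ImportError` fallback is only an assumption so that the sketch runs where numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:  # assumption: no-op fallback so the sketch runs without numba
    def njit(func=None, **kwargs):
        return func if func is not None else (lambda f: f)


def scaled_sum(x, scale):
    """Hypothetical numeric kernel; plain Python, so coverage tools can trace it."""
    total = 0.0
    for i in range(x.shape[0]):
        total += x[i] * scale
    return total


# Compiled variant kept under its own name, mirroring the `*_jit` naming in this PR.
scaled_sum_jit = njit(scaled_sum, fastmath=True)


def test_scaled_sum__small_inputs_no_jit():
    # Cheap inputs only: the interpreted function is slow but counts toward coverage.
    assert scaled_sum(np.array([1.0, 2.0, 3.0]), 2.0) == 12.0


def test_scaled_sum__large_inputs_jit():
    # Large inputs are exercised through the compiled function only.
    x = np.ones(100_000)
    assert scaled_sum_jit(x, 0.5) == 50_000.0
```

The trade-off is that each kernel gets two names to maintain, which is part of what #77 is meant to settle.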

tomwhite

Overall looks great, +1. The testing is especially thorough. Just a couple of minor changes then I think this can be merged.

(The numba coverage questions can be addressed in #77.)

tomwhite · 2020-08-11T08:09:12Z

sgkit/stats/association.py

-    Additionally, both covariate and trait arrays will be rechunked to have blocks
-    along the sample (row) dimension but not the column dimension (i.e.
-    they must be tall and skinny).
-


Was this warning removed for a reason?

It was duplicated.

tomwhite · 2020-08-11T08:10:47Z

sgkit/stats/hwe.py

+hardy_weinberg_p_value_vec_jit = njit(hardy_weinberg_p_value_vec, fastmath=True)
+
+
+def hardy_weinberg_test(


This function should be added to the top-level __init__.py so it's a part of the public API.

Also, +1 on the name (even though it's not a verb!). I think using the abbreviation hwe internally is fine too (as you have done).

pystatgen/sgkit@466859e#diff-b7dd01040721465bb47249dc10bb65f2R28

tomwhite · 2020-08-11T08:19:21Z

sgkit/stats/hwe.py

+    n = len(obs_hets)
+    p = np.empty(n, dtype=np.float64)
+    for i in range(n):
+        p[i] = hardy_weinberg_p_value_jit(obs_hets[i], obs_hom1[i], obs_hom2[i])


Observation: we're using Dask for parallelization here (across blocks) which is fine, but we may want to consider numba's parallel=True in the future for this loop as another dispatch option.
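
A rough sketch of what that option could look like, using numba's `prange` with `parallel=True`. This is not code from the PR: `scalar_stat_stub` is a hypothetical placeholder for the scalar p-value function (it returns a simple heterozygote fraction, not a p-value), and the `ImportError` fallback is an assumption so the sketch runs without numba.

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:  # assumption: degrade to plain Python so the sketch still runs
    prange = range

    def njit(*args, **kwargs):
        # Supports both `@njit` and `@njit(parallel=True)` usage.
        if args and callable(args[0]):
            return args[0]
        return lambda func: func


@njit
def scalar_stat_stub(obs_hets, obs_hom1, obs_hom2):
    """Hypothetical placeholder for the scalar function; NOT the HWE statistic."""
    n = obs_hets + obs_hom1 + obs_hom2
    return obs_hets / n if n > 0 else np.nan


@njit(parallel=True)
def stat_vec_parallel(obs_hets, obs_hom1, obs_hom2):
    n = len(obs_hets)
    p = np.empty(n, dtype=np.float64)
    # Iterations are independent, so prange can split them across threads.
    for i in prange(n):
        p[i] = scalar_stat_stub(obs_hets[i], obs_hom1[i], obs_hom2[i])
    return p
```

One thing to watch if this is ever combined with Dask: both layers would start their own threads, so the thread counts would probably need to be coordinated to avoid oversubscribing cores.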

tomwhite · 2020-08-11T08:20:08Z

sgkit/stats/hwe.py

+    # Otherwise compute genotype counts from calls
+    else:
+        # TODO: Use API genotype counting function instead, e.g.
+        # https://github.com/pystatgen/sgkit/issues/29#issuecomment-656691069


Can you open an issue to track this, please?

https://github.com/pystatgen/sgkit/issues/115

tomwhite · 2020-08-11T08:23:38Z

sgkit/tests/test_hwe.py

+
+def test_hwep__raise_on_negative():
+    args = [[-1, 0, 0], [0, -1, 0], [0, 0, -1]]
+    for arg in args:


Nit: might be slightly neater to parameterise the test?

pystatgen/sgkit@466859e#diff-8778a759770b6a59711bea6328271b0bR66
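
For illustration, the parameterised form suggested above could look roughly like this with `pytest.mark.parametrize`. `p_value_stub` is a hypothetical stand-in for the function under test, and both its error message and the choice of `ValueError` are assumptions rather than the PR's actual behaviour.

```python
import pytest


def p_value_stub(obs_hets, obs_hom1, obs_hom2):
    """Hypothetical stand-in for the function under test; only validation matters here."""
    if obs_hets < 0 or obs_hom1 < 0 or obs_hom2 < 0:
        raise ValueError("Observed genotype counts must be non-negative")
    return 1.0


# Each list becomes its own test case, so a failure names the offending input.
@pytest.mark.parametrize("args", [[-1, 0, 0], [0, -1, 0], [0, 0, -1]])
def test_hwep__raise_on_negative(args):
    with pytest.raises(ValueError):
        p_value_stub(*args)
```

Compared with the loop, each input is reported as a separate test, and one failing case does not hide the others.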

tomwhite · 2020-08-11T08:34:40Z

sgkit/stats/hwe.py

+        # TODO: Use API genotype counting function instead, e.g.
+        # https://github.com/pystatgen/sgkit/issues/29#issuecomment-656691069
+        mask = ds["call_genotype_mask"].any(dim="ploidy")
+        gtc = xr.where(mask, -1, ds["call_genotype"].sum(dim="ploidy"))  # type: ignore[no-untyped-call]


Style nit/question: we are being a bit inconsistent on case for these variables. E.g. in count_alleles the variables are G, CTS, etc. Also, in the summary stats PR I used G for genotypes, but n_het for heterozygous counts.

Are there some rules we can use? E.g. use lowercase except when we are replicating a paper which uses X, etc?

pystatgen/sgkit@466859e#diff-2ec013d1f086d9558934c38ac90411d1R173

I like the capital convention so array names don't conflict with scalars when there are a lot of both. When there aren't a lot of either, switching to lowercase names sounds good and is what I would prefer too. I don't know how to make a threshold for that clear, so I've been trying to err on the side of sticking with the capital letter convention for consistency (thanks for flagging this one).

I filed https://github.com/pystatgen/sgkit/issues/117 to make a record of this.
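
As a side note on what the masked expression quoted in this thread computes, here is a NumPy-only sketch of the same idea on toy data. The array values and the `n_hom_ref` / `n_het` / `n_hom_alt` names are hypothetical, it assumes biallelic diploid calls, and the PR itself uses `xr.where` on the dataset variables rather than this code.

```python
import numpy as np

# Hypothetical toy data: 2 variants x 3 samples x ploidy 2; -1 marks a missing allele.
call_genotype = np.array([
    [[0, 0], [0, 1], [1, 1]],
    [[0, 1], [-1, -1], [1, 0]],
])
call_genotype_mask = call_genotype < 0

# A call is missing if any of its alleles is masked; otherwise summing the alleles
# gives the alternate-allele count (0, 1, or 2 for biallelic diploid calls).
mask = call_genotype_mask.any(axis=-1)
gtc = np.where(mask, -1, call_genotype.sum(axis=-1))

# Per-variant genotype counts; missing calls (-1) fall into none of the classes.
n_hom_ref = (gtc == 0).sum(axis=1)
n_het = (gtc == 1).sum(axis=1)
n_hom_alt = (gtc == 2).sum(axis=1)
```

The three count vectors are the kind of per-variant inputs the vectorised p-value loop above takes as `obs_hets`, `obs_hom1`, and `obs_hom2`.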

eric-czech added 9 commits July 7, 2020 18:40

HWE exact test implementation for scalar genotype counts

a8857d9

Formatting

d04b586

Formatting

59290e7

Adding more tests

9d9de8a

Adding tests for full coverage

fcdf6b5

Fixing conflicts

5ff3f2c

Refactoring tests to match new conventions

d2ec9a5

Cleaning up docs

eaec6f3

Fixing conflicts

1cdb802

eric-czech mentioned this pull request Jul 29, 2020

Decide how to test numba functions #77

Closed

eric-czech added 3 commits August 7, 2020 12:09

Fixing conflicts

b61906f

Fix typo in test name

1ef1072

Update variable names for new convention

835e745

tomwhite approved these changes Aug 11, 2020

timothymillar mentioned this pull request Aug 16, 2020

[WIP] count_allele_calls #114

Merged

Small changes

466859e

This was referenced Aug 17, 2020

Have HWE function use genotype counting function #115

Closed

Add array/scalar variable naming convention notes to dev docs #117

Closed

eric-czech merged commit 305ce19 into sgkit-dev:master Aug 17, 2020

eric-czech deleted the hwe branch August 17, 2020 09:18

This was referenced Aug 17, 2020

Add HWE validation code #118

Closed

Add numba dep to other locations #120

Closed

Add HWE test implementation #28

Closed

Document data generation for unit test sgkit-dev/sgkit-plink#29

Open

daletovar mentioned this pull request Aug 26, 2020

Add variant/sample summary statistic methods #29

Closed

7 tasks

This was referenced Aug 27, 2020

PR 114: [WIP] count_allele_calls #145

Closed

PR 114: [WIP] count_allele_calls #155

Closed

PR 114: [WIP] count_allele_calls #165

Closed

PR 114: [WIP] count_allele_calls #175

Closed

PR 114: [WIP] count_allele_calls #185

Closed

This was referenced Aug 27, 2020

PR 114: [WIP] count_allele_calls #195

Closed

PR 114: [WIP] count_allele_calls #205

Closed

eric-czech mentioned this pull request Apr 20, 2021

PyData prototype genetics method implementations related-sciences/gwas-analysis#30

Open

9 tasks
