
Generic cohort_statistic function #776

Closed
wants to merge 3 commits

Conversation

timothymillar
Collaborator

Fixes #730
This is inspired by the window_statistic function and assumes that it is reasonable to iterate over the number of cohorts in plain Python.
Existing methods like count_cohort_alleles could also be simplified using cohort_statistic, though there would probably be some reduction in performance.
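
A minimal plain-numpy sketch of the idea (not the dask-backed code in this PR; the function name and argument layout here are illustrative only): iterate over the cohorts in Python and apply an arbitrary reduction to each slice of samples.

```python
import numpy as np

def cohort_statistic(values, cohorts, statistic=np.mean, axis=1):
    """Apply ``statistic`` to the samples of each cohort.

    values : array of shape (variants, samples, ...)
    cohorts : integer array of shape (samples,); negative values mean "no cohort"
    """
    cohort_ids = np.unique(cohorts[cohorts >= 0])
    # One Python-level iteration per cohort: slice out its samples, reduce, stack.
    out = [
        statistic(np.compress(cohorts == c, values, axis=axis), axis=axis)
        for c in cohort_ids
    ]
    return np.stack(out, axis=axis)

# e.g. mean over 4 samples in 2 cohorts, for 3 variants -> result of shape (3, 2)
values = np.arange(12.0).reshape(3, 4)
cohorts = np.array([0, 0, 1, 1])
print(cohort_statistic(values, cohorts, np.mean))
```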

@codecov-commenter

codecov-commenter commented Dec 7, 2021

Codecov Report

Merging #776 (01d29f7) into main (31dc606) will not change coverage.
The diff coverage is 100.00%.


@@            Coverage Diff            @@
##              main      #776   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           36        36           
  Lines         3024      3030    +6     
=========================================
+ Hits          3024      3030    +6     
Impacted Files Coverage Δ
sgkit/stats/popgen.py 100.00% <100.00%> (ø)
sgkit/utils.py 100.00% <100.00%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@tomwhite
Collaborator

tomwhite commented Dec 7, 2021

LGTM. Maybe put cohort_statistic in cohorts.py?

@jeromekelleher
Collaborator

LGTM also - nice that we can get rid of some guvectorised functions. (I assume there's no major loss of performance here?)

@timothymillar
Collaborator Author

Maybe put cohort_statistic in cohorts.py?

Yes ... that would be the obvious place for it 😄

no major loss of performance here?

Actually, it's worse than I initially estimated. It's quite variable depending on chunking and the number of cohorts. With chunking that favors cohort_statistic (i.e. samples pre-chunked by cohort) the performance hit can be as low as 10% (in part due to the chunking slowing the numba version). Without chunking in the samples dimension, I'm usually seeing a 300-400% slowdown. The slowdown is primarily due to slicing the variables array into an array per cohort, which is unavoidable if we want to use generic reduction functions.

I'm happy to take the alternate approach of implementing cohort-aware reductions in numba if we wish to maximize performance. I imagine that cohort_sum, cohort_nansum, cohort_mean and cohort_nanmean would cover most use cases, and cohort_statistic could be left to cover edge cases.
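
For reference, a hypothetical guvectorised cohort_sum along those lines might look like the sketch below (not code from this PR; it uses a dummy argument to fix the number of cohorts in the output, in the spirit of sgkit's existing gufuncs).

```python
import numpy as np
from numba import guvectorize

@guvectorize(
    ["void(float64[:], int64[:], uint8[:], float64[:])"],
    "(n),(n),(c)->(c)",
    nopython=True,
)
def cohort_sum(values, cohorts, _, out):
    # ``_`` is a dummy array whose length sets the number of cohorts in ``out``.
    out[:] = 0
    for i in range(values.shape[0]):
        c = cohorts[i]
        if c >= 0:  # negative cohort ids mean "not in any cohort"
            out[c] += values[i]

# Applied row-wise over a (variants, samples) array -> (variants, n_cohorts)
values = np.random.rand(5, 10)
cohorts = np.array([0] * 5 + [1] * 5, dtype=np.int64)
print(cohort_sum(values, cohorts, np.empty(2, dtype=np.uint8)))
```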

@jeromekelleher
Collaborator

I think it's OK to have a perf reduction here @timothymillar; we should just make a note in the source that the current version could be improved with numba-fication. I don't have a strong opinion either way, I just wanted to raise the issue and keep a note of the perf implications of the change.

@timothymillar
Collaborator Author

Agreed @jeromekelleher, we don't need to optimize this unless it becomes an issue. Though I think we need to keep an eye out for functions with similar reductions, to avoid writing equivalent optimized methods in multiple places.

@timothymillar
Collaborator Author

Looks like CI on Linux is failing for an unrelated reason (test_vcf_to_zarr__empty_region); it's the same issue as in #777.

drop_axis=1,
new_axis=1,
dtype=np.float64,
# NOTE: Performance of cohort_statistic is substantially slower than a numba
Collaborator

Chatting about this in the meeting today @timothymillar, we thought it'd be good if we could put a git hash in here to allow future travellers to easily find this fast implementation.

Collaborator Author

Good idea, done

@timothymillar
Collaborator Author

Thanks @jeromekelleher.
Just for clarity, this PR was intended to add a simple method for applying summary statistics to cohorts. I then updated observed_heterozygosity because it seemed redundant for it to have its own 'cohort_nanmean' function, but I don't have strong feelings about it if others might be concerned about performance. I guess I'm not quite sure where the sgkit team sits with regard to performant vs simple code for small things like this.

For reference, I also played around with a decorator to simplify wrapping gufunctions for cohort reductions here.
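
The linked decorator isn't reproduced here; purely as illustration, one hypothetical flavour of the idea (wrapping a plain numpy reduction rather than a gufunc) could be:

```python
import functools
import numpy as np

def cohort_reduction(func):
    """Turn a reduction over the samples axis into a per-cohort reduction."""
    @functools.wraps(func)
    def wrapper(values, cohorts, n_cohorts):
        out = np.empty((values.shape[0], n_cohorts) + values.shape[2:])
        for c in range(n_cohorts):
            out[:, c] = func(np.compress(cohorts == c, values, axis=1), axis=1)
        return out
    return wrapper

@cohort_reduction
def cohort_nanmean(x, axis):
    return np.nanmean(x, axis=axis)
```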

@tomwhite
Collaborator

Thanks @timothymillar.

I guess I'm not quite sure where the sgkit team sit with regard to performant vs simple code for small things like this.

We discussed it a bit yesterday (#781), and I think the general conclusion was to go with the simpler implementation until there's a need for better performance. Regarding observed_heterozygosity in particular, since you are probably the main user of this function we'll be led by whichever implementation you'd prefer.

@timothymillar
Collaborator Author

Thanks for the clarification @tomwhite.

Well, this has been a lesson in benchmarking carefully with dask, especially with implementations that produce different numbers of chunks! The plot below shows performance in relation to the number of cohorts (the main factor). Cohorts were randomly assigned to each sample, without chunking in the sample dimension. This is the worst-case scenario for the new "numpy" implementation, but probably quite realistic. Note that the new method actually performs worse with multi-threading (chunks = ((10000,) * 16, (100,)) on 8 cores).
[benchmark plot: performance vs. number of cohorts]
I was a bit naive in benchmarking with only a few cohorts previously. I don't imagine many users would have more than a few cohorts, but it's probably best not to make that assumption.
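
For context, a hypothetical setup matching that description might be (the sizes are made up; only the chunking pattern, a single chunk spanning all samples, mirrors the comment above):

```python
import numpy as np
import dask.array as da

# Illustrative sizes only: 16 chunks along variants, all samples in one chunk,
# cohorts assigned at random; runtime grows with the number of distinct cohorts.
n_variants, n_samples, n_cohorts = 160_000, 100, 50
values = da.random.random(
    (n_variants, n_samples), chunks=((10_000,) * 16, (n_samples,))
)
cohorts = np.random.randint(0, n_cohorts, size=n_samples)
```

Timing both implementations over arrays like this (e.g. with %timeit on a multi-core machine) is the kind of comparison shown in the plot.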

I think we should keep the existing version for now and I'll revisit #730 another time. Sorry for taking up your time with this!

@tomwhite
Collaborator

Thanks for the investigation on this one @timothymillar!


Successfully merging this pull request may close these issues.

Generic functions for cohort means and sums
4 participants