
Generic cohort_statistic function #776

Closed
wants to merge 3 commits

Conversation

timothymillar
Collaborator

Fixes #730
This is inspired by the window_statistic function and assumes that it is reasonable to iterate over the number of cohorts in plain Python.
Existing methods like count_cohort_alleles could also be simplified using cohort_statistic, though there would probably be some reduction in performance.
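
A minimal plain-numpy sketch of the idea (not the dask-backed code in this PR; the function name and argument layout here are illustrative only): iterate over the cohorts in Python and apply an arbitrary reduction to each slice of samples.

```python
import numpy as np

def cohort_statistic(values, cohorts, statistic=np.mean, axis=1):
    """Apply ``statistic`` to the samples of each cohort.

    values : array of shape (variants, samples, ...)
    cohorts : integer array of shape (samples,); negative values mean "no cohort"
    """
    cohort_ids = np.unique(cohorts[cohorts >= 0])
    # One Python-level iteration per cohort: slice out its samples, reduce, stack.
    out = [
        statistic(np.compress(cohorts == c, values, axis=axis), axis=axis)
        for c in cohort_ids
    ]
    return np.stack(out, axis=axis)

# e.g. mean over 4 samples in 2 cohorts, for 3 variants -> result of shape (3, 2)
values = np.arange(12.0).reshape(3, 4)
cohorts = np.array([0, 0, 1, 1])
print(cohort_statistic(values, cohorts, np.mean))
```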

@codecov-commenter

codecov-commenter commented Dec 7, 2021

Codecov Report

Merging #776 (01d29f7) into main (31dc606) will not change coverage.
The diff coverage is 100.00%.


@@            Coverage Diff            @@
##              main      #776   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           36        36           
  Lines         3024      3030    +6     
=========================================
+ Hits          3024      3030    +6     
Impacted Files Coverage Δ
sgkit/stats/popgen.py 100.00% <100.00%> (ø)
sgkit/utils.py 100.00% <100.00%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@tomwhite
Collaborator

tomwhite commented Dec 7, 2021

LGTM. Maybe put cohort_statistic in cohorts.py?

@jeromekelleher
Collaborator

LGTM also - nice that we can get rid of some guvectorised functions. (I assume there's no major loss of performance here?)

@timothymillar
Collaborator Author

Maybe put cohort_statistic in cohorts.py?

Yes ... that would be the obvious place for it 😄

no major loss of performance here?

Actually, it's worse than I initially estimated. It's quite variable depending on chunking and the number of cohorts. With chunking that favors cohort_statistic (i.e. samples pre-chunked by cohort) the performance hit can be as low as 10% (in part due to the chunking slowing the numba version). Without chunking in the samples dimension, I'm usually seeing a 300-400% slowdown. The slowdown is primarily due to slicing the variables array into an array per cohort, which is unavoidable if we want to use generic reduction functions.

I'm happy to take the alternate approach of implementing cohort-aware reductions in numba if we wish to maximize performance. I imagine that cohort_sum, cohort_nansum, cohort_mean and cohort_nanmean would cover most use cases, and cohort_statistic could be left to cover edge cases.
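
For reference, a hypothetical guvectorised cohort_sum along those lines might look like the sketch below (not code from this PR; it uses a dummy argument to fix the number of cohorts in the output, in the spirit of sgkit's existing gufuncs).

```python
import numpy as np
from numba import guvectorize

@guvectorize(
    ["void(float64[:], int64[:], uint8[:], float64[:])"],
    "(n),(n),(c)->(c)",
    nopython=True,
)
def cohort_sum(values, cohorts, _, out):
    # ``_`` is a dummy array whose length sets the number of cohorts in ``out``.
    out[:] = 0
    for i in range(values.shape[0]):
        c = cohorts[i]
        if c >= 0:  # negative cohort ids mean "not in any cohort"
            out[c] += values[i]

# Applied row-wise over a (variants, samples) array -> (variants, n_cohorts)
values = np.random.rand(5, 10)
cohorts = np.array([0] * 5 + [1] * 5, dtype=np.int64)
print(cohort_sum(values, cohorts, np.empty(2, dtype=np.uint8)))
```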

@jeromekelleher
Collaborator

I think it's OK to have a perf reduction here @timothymillar; we should just make a note in the source that the current version could be improved with numba-fication. I don't have a strong opinion either way, I just wanted to raise the issue and keep a note of the perf implications of the change.

@timothymillar
Collaborator Author

Agreed @jeromekelleher, we don't need to optimize this unless it becomes an issue. Though I think we need to keep an eye out for functions with similar reductions, to avoid writing equivalent optimized methods in multiple places.

@timothymillar
Collaborator Author

Looks like CI on Linux is failing for an unrelated reason (test_vcf_to_zarr__empty_region); it's the same issue as in #777.

drop_axis=1,
new_axis=1,
dtype=np.float64,
# NOTE: Performance of cohort_statistic is substantially slower than a numba
Collaborator

Chatting about this in the meeting today @timothymillar, we thought it'd be good if we could put a git hash in here to allow future travellers to easily find this fast implementation.

Collaborator Author

Good idea, done

@timothymillar
Collaborator Author

Thanks @jeromekelleher.
Just for clarity, this PR was intended to add a simple method for applying summary statistics to cohorts. I then updated observed_heterozygosity because it seemed redundant for it to have its own 'cohort_nanmean' function, but I don't have strong feelings about it if others might be concerned about performance. I guess I'm not quite sure where the sgkit team sits with regard to performant vs simple code for small things like this.

For reference, I also played around with a decorator to simplify wrapping gufunctions for cohort reductions here.
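
The linked decorator isn't reproduced here; purely as illustration, one hypothetical flavour of the idea (wrapping a plain numpy reduction rather than a gufunc) could be:

```python
import functools
import numpy as np

def cohort_reduction(func):
    """Turn a reduction over the samples axis into a per-cohort reduction."""
    @functools.wraps(func)
    def wrapper(values, cohorts, n_cohorts):
        out = np.empty((values.shape[0], n_cohorts) + values.shape[2:])
        for c in range(n_cohorts):
            out[:, c] = func(np.compress(cohorts == c, values, axis=1), axis=1)
        return out
    return wrapper

@cohort_reduction
def cohort_nanmean(x, axis):
    return np.nanmean(x, axis=axis)
```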

@tomwhite
Collaborator

Thanks @timothymillar.

I guess I'm not quite sure where the sgkit team sit with regard to performant vs simple code for small things like this.

We discussed it a bit yesterday (#781), and I think the general conclusion was to go with the simpler implementation until there's a need for better performance. Regarding observed_heterozygosity in particular, since you are probably the main user of this function we'll be led by whichever implementation you'd prefer.

@timothymillar
Collaborator Author

Thanks for the clarification @tomwhite.

Well, this has been a lesson in benchmarking carefully with dask, especially with implementations that produce different numbers of chunks! The plot below shows performance in relation to the number of cohorts (the main factor). Cohorts were randomly assigned to each sample, without chunking in the sample dimension. This is the worst-case scenario for the new "numpy" implementation, but probably quite realistic. Note that the new method actually performs worse with multi-threading (chunks = ((10000,) * 16, (100,)) on 8 cores).
[benchmark plot: performance vs. number of cohorts]
I was a bit naive in benchmarking with only a few cohorts previously. I don't imagine many users would have more than a few cohorts, but it's probably best not to make that assumption.
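
For context, a hypothetical setup matching that description might be (the sizes are made up; only the chunking pattern, a single chunk spanning all samples, mirrors the comment above):

```python
import numpy as np
import dask.array as da

# Illustrative sizes only: 16 chunks along variants, all samples in one chunk,
# cohorts assigned at random; runtime grows with the number of distinct cohorts.
n_variants, n_samples, n_cohorts = 160_000, 100, 50
values = da.random.random(
    (n_variants, n_samples), chunks=((10_000,) * 16, (n_samples,))
)
cohorts = np.random.randint(0, n_cohorts, size=n_samples)
```

Timing both implementations over arrays like this (e.g. with %timeit on a multi-core machine) is the kind of comparison shown in the plot.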

I think we should keep the existing version for now and I'll revisit #730 another time. Sorry for taking up your time with this!

@tomwhite
Collaborator

Thanks for the investigation on this one @timothymillar!


Successfully merging this pull request may close these issues.

Generic functions for cohort means and sums
4 participants