Generic cohort_statistic function #776
Conversation
Codecov Report
```
@@            Coverage Diff            @@
##              main     #776    +/-  ##
=========================================
  Coverage   100.00%  100.00%
=========================================
  Files           36       36
  Lines         3024     3030     +6
=========================================
+ Hits          3024     3030     +6
```
LGTM. Maybe put …
LGTM too - nice that we can also get rid of some guvectorised functions. (I assume there's no major loss of performance here?)
(Force-pushed from 01d29f7 to 8461b22.)
Yes ... that would be the obvious place for it 😄
Actually, it's worse than I initially estimated. It's quite variable depending on chunking and the number of cohorts. With chunking that favors … I'm happy to take the alternate approach of implementing cohort-aware reductions in numba if we wish to maximize performance? I imagine that …
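For readers following along, here is a minimal sketch of the kind of cohort-aware numba reduction being discussed. The name, signature, and the trailing dummy argument used to fix the output shape are assumptions for illustration, not sgkit's actual code:

```python
import numpy as np
from numba import guvectorize

@guvectorize(
    ["void(float64[:], int64[:], float64[:], float64[:])"],
    "(n),(n),(c)->(c)",
    nopython=True,
)
def cohort_nanmean(values, cohorts, _, out):
    # Single pass over samples, accumulating a sum and count per cohort.
    counts = np.zeros(out.shape[0])
    out[:] = 0.0
    for i in range(values.shape[0]):
        c = cohorts[i]
        v = values[i]
        if c >= 0 and not np.isnan(v):
            out[c] += v
            counts[c] += 1
    for c in range(out.shape[0]):
        out[c] = out[c] / counts[c] if counts[c] > 0 else np.nan
```

It would be called as `cohort_nanmean(values, cohorts, np.empty(n_cohorts))`, with the third argument present only so numba can infer the output length.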
I think it's OK to have a perf reduction here @timothymillar; we should just make a note in the source that the current version could be improved with numba-fication. I don't have a strong opinion either way - I just wanted to raise the issue and keep a note of the perf implications of the change.
Agreed @jeromekelleher, we don't need to optimize this unless it becomes an issue. Though I think we need to keep an eye out for functions with similar reductions, to avoid writing equivalent optimized methods in multiple places.
Looks like CI on Linux is failing for an unrelated reason.
```
drop_axis=1,
new_axis=1,
dtype=np.float64,
```
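For context, `drop_axis`, `new_axis`, and `dtype` in the excerpt above are keyword arguments to `dask.array.map_blocks`. A minimal, self-contained sketch of the pattern follows; the toy reduction, shapes, and chunking are assumptions, not the PR's actual code:

```python
import dask.array as da
import numpy as np

n_cohorts = 3

def block_cohort_stat(block):
    # Toy per-block reduction: collapse the samples axis (axis 1) and
    # produce one column per cohort in its place.
    return np.stack([block.mean(axis=1)] * n_cohorts, axis=1)

x = da.ones((10, 8), chunks=(5, 8))  # samples axis held in a single chunk
y = x.map_blocks(
    block_cohort_stat,
    drop_axis=1,            # the samples axis is consumed by the reduction
    new_axis=1,             # a cohorts axis appears in its place
    chunks=(5, n_cohorts),  # shape of each output block
    dtype=np.float64,
)
assert y.compute().shape == (10, n_cohorts)
```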
```
# NOTE: Performance of cohort_statistic is substantially slower than a numba implementation
```
Chatting about this in the meeting today @timothymillar, we thought it'd be good if we could put a git hash in here to allow future travellers to easily find this fast implementation.
Good idea, done
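A hypothetical example of the kind of breadcrumb comment agreed on here (the hash is a placeholder, not the real commit):

```python
# NOTE: cohort_statistic is substantially slower than the numba-based
# implementation it replaced; see commit <git-hash> for the fast version.
```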
(Force-pushed from 9b1324b to 078a31d.)
Thanks @jeromekelleher. For reference, I also played around with a decorator to simplify wrapping gufunctions for cohort reductions here.
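The decorator linked above isn't reproduced in this thread; a hedged sketch of what such a wrapper might look like, building on the `cohort_nanmean` sketch earlier (purely illustrative, not the linked code):

```python
import functools
import numpy as np

def cohort_reduction(gufunc):
    """Adapt a (n),(n),(c)->(c) gufunc so callers pass n_cohorts
    rather than the output-shaped dummy argument."""
    @functools.wraps(gufunc)
    def wrapper(values, cohorts, n_cohorts):
        # The dummy array only communicates the output length to numba.
        return gufunc(values, cohorts, np.empty(n_cohorts))
    return wrapper

# e.g. mean_by_cohort = cohort_reduction(cohort_nanmean)
```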
Thanks @timothymillar. We discussed it a bit yesterday (#781), and I think the general conclusion was to go with the simpler implementation until there's a need for better performance.
Thanks for the clarification @tomwhite. Well, this has been a lesson in benchmarking carefully with dask, especially with implementations that produce different numbers of chunks! The plot below shows performance in relation to the number of cohorts (the main factor). Cohorts were randomly assigned to each sample, without chunking in the sample dimension. This is the worst-case scenario for the new "numpy" implementation, but probably quite realistic. Note that the new method actually performs worse with multi-threading (chunks = …).

I think we should keep the existing version for now and I'll revisit #730 another time. Sorry for taking up your time with this!
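A hedged sketch of the kind of benchmark described above (array sizes, chunking, and cohort assignment are assumptions, not the setup behind the plot):

```python
import time
import dask
import dask.array as da
import numpy as np

n_variants, n_samples = 10_000, 1_000
# Samples axis kept in a single chunk, matching the scenario described above.
values = da.random.random((n_variants, n_samples), chunks=(1_000, n_samples))

for n_cohorts in (2, 8, 32, 128):
    # Cohorts randomly assigned to each sample.
    cohorts = np.random.randint(0, n_cohorts, n_samples)
    stat = da.stack(
        [values[:, cohorts == c].mean(axis=1) for c in range(n_cohorts)],
        axis=1,
    )
    # Pin the scheduler so implementations with different chunk
    # structures are compared fairly.
    with dask.config.set(scheduler="single-threaded"):
        t0 = time.perf_counter()
        stat.compute()
        print(n_cohorts, round(time.perf_counter() - t0, 3))
```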
Thanks for the investigation on this one @timothymillar!
Fixes #730

This is inspired by the `window_statistic` function and assumes that it is reasonable to iterate over the number of cohorts in plain Python. Existing methods like `count_cohort_alleles` could also be simplified using `cohort_statistic`, though there would probably be some reduction in performance.
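A minimal sketch of the idea described here (the signature and docstring are assumptions for illustration, not sgkit's actual API): apply a statistic over the samples of each cohort by looping over cohorts in plain Python.

```python
import numpy as np

def cohort_statistic(values, cohorts, statistic=np.mean):
    """Apply ``statistic`` over the samples axis for each cohort.

    values : (variants, samples) array
    cohorts : (samples,) integer cohort id per sample
    Returns a (variants, n_cohorts) array.
    """
    n_cohorts = cohorts.max() + 1
    # The plain-Python loop is assumed cheap because the number of
    # cohorts is small relative to the number of samples.
    return np.stack(
        [statistic(values[:, cohorts == c], axis=1) for c in range(n_cohorts)],
        axis=1,
    )
```

Note that `statistic` is called with an `axis` argument, so any numpy-style reduction (`np.mean`, `np.nanmean`, `np.sum`, ...) would slot in directly.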