perf(clump): refactored window based clumping #492
Conversation
I don't see anything that would block the merge; however, I have a few minor comments and questions.
@@ -27,7 +27,6 @@ def __init__(
     clumped_study_locus_path: str,
     study_index_path: Optional[str] = None,
     ld_index_path: Optional[str] = None,
-    locus_collect_distance: Optional[int] = None,
If these changes are accepted, this step can no longer be configured in a way that allows for locus collection; instead, a separate step would be needed. Is this intentional?
Also, I don't think this step is used at all... I have split the logic into two different steps to resolve ambiguity. We should remove this file altogether.
You were right. The step has been removed.
All the modifications have been implemented as discussed with @DSuveges during the review. I think it's very modular and it should be a good template for what's about to come. @Daniel-Considine you might want to look at the new function.
After several failed attempts to implement window-based clumping prioritising windows over joins, I had a longer discussion with @DSuveges about rescuing the current implementation based on joins. This seems to be the best solution, and it is very conservative with respect to the current implementation. My personal conclusion is that clumping is not well suited for windows, because we don't really care about the vast majority of the summary statistics, and windows are good for computing over all rows, not a sub-selection of them (see the sketch below).
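To make that contrast concrete, here is a rough PySpark sketch, assuming hypothetical paths, column names and a 500 kb window (this is not the actual gentropy code): the window has to be evaluated on every summary-statistics row, whereas the join only keeps rows near a lead variant.

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as f

spark = SparkSession.builder.getOrCreate()
sumstats = spark.read.parquet("gs://example/summary_statistics")  # very large
leads = spark.read.parquet("gs://example/lead_variants")          # small

# Window-based: every row is ranked within its partition, including the vast
# majority of rows that will never belong to any clump.
w = Window.partitionBy("studyId", "chromosome").orderBy("pValue")
windowed = sumstats.withColumn("rank", f.row_number().over(w))

# Join-based: only rows within (say) 500 kb of a lead variant survive.
joined = sumstats.join(
    leads.select(
        f.col("studyId"),
        f.col("chromosome"),
        f.col("position").alias("leadPosition"),
    ),
    on=["studyId", "chromosome"],
    how="inner",
).filter(f.abs(f.col("position") - f.col("leadPosition")) <= 500_000)
```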
In the end, the only problem with the current implementation was that the broadcast does not work in a "right" join. When the workload increased, the join had to deal with a lot of shuffling and eventually failed due to lack of resources. This PR minimally modifies this part of the window-based clumping to resolve that join.
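As a minimal sketch of the join problem, reusing the same hypothetical tables as above (not the exact code touched by this PR): Spark cannot broadcast the preserved side of a right outer join, so the broadcast hint is dropped and the large table gets shuffled; an inner join, with any unmatched lead variants handled separately, keeps the broadcast.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as f

spark = SparkSession.builder.getOrCreate()
sumstats = spark.read.parquet("gs://example/summary_statistics")  # very large
leads = spark.read.parquet("gs://example/lead_variants")          # small

# Right outer join keeping all lead variants: Spark must build the hash table
# from the left (non-preserved) side, so a broadcast hint on the small
# right-hand table cannot be honoured and the plan falls back to a sort-merge
# join that shuffles the full summary-statistics table.
shuffled = sumstats.join(
    f.broadcast(leads), on=["studyId", "chromosome"], how="right"
)

# With an inner join (unmatched leads recovered elsewhere), the broadcast hint
# on the small table is honoured and the large table is never shuffled.
broadcasted = sumstats.join(
    f.broadcast(leads), on=["studyId", "chromosome"], how="inner"
)
```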
Using the logic in this PR, clumps on all NFE GWAS Catalog summary statistics were calculated. The job took 18 min to compute and resulted in clumps with collected locus at gs://ot-team/dochoa/sl_11_3_24.parquet (~25 GB). spark.sql.shuffle.partitions was set to 3200 for this run.
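For reference, one way this shuffle-partition setting could be applied in a PySpark session (the actual run may have configured it at the cluster or job level instead):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Split the remaining shuffles of the clumping job into more, smaller tasks.
spark.conf.set("spark.sql.shuffle.partitions", 3200)
```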
On top of the business logic, I also slightly modified the function's interface and the required (default) arguments. I think this makes the function a bit easier to understand. These changes are a lot more subjective, so I'm happy to revert them if we don't reach an agreement.