Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf(clump): refactored window based clumping #492

Merged
merged 24 commits into from
Mar 20, 2024
Merged

perf(clump): refactored window based clumping #492

merged 24 commits into from
Mar 20, 2024

Conversation

d0choa
Copy link
Collaborator

@d0choa d0choa commented Feb 18, 2024

After different failing attempts to implement the window-based clumping prioritising windows over joins. I had a longer discussion with @DSuveges on rescueing the current implementation based on joins. This seems to be the best solution and it's very conservative with the current implementation. My personal conclusion is that the task of clumping is not well suited for windows, because we don't really care about the vast majority of the summary statistics and windows are good for computing on all rows not a sub-selection of them.

At the end, the only problem with the current implementation was that the broadcast does not work in a "right" join. When the workload increased, the join had to deal with lots of shuffling and eventually failed due to the lack of resources. This PR minimally modifies this part of the window-based clumping to resolve that join.

Using the logic in this PR, clumps on all NFE GWAS Catalog summary statistics were calculated. The job took 18 min to compute and resulted on clumps with collected locus gs://ot-team/dochoa/sl_11_3_24.parquet (~25Gb). spark.sql.shuffle was set to 3200 for this run.

On top of the business logic, I also modified slightly the interface with the function and the arguments required (default). I think it makes a bit easier to understand the function. These changes are a lot more subjective, so happy to revert if we don't find an agreement.

@github-actions github-actions bot added documentation Improvements or additions to documentation size-L labels Mar 4, 2024
@d0choa d0choa self-assigned this Mar 11, 2024
@d0choa d0choa removed the Refactor label Mar 13, 2024
@d0choa d0choa marked this pull request as ready for review March 13, 2024 10:54
@d0choa d0choa requested a review from DSuveges March 13, 2024 10:54
Copy link
Contributor

@DSuveges DSuveges left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see anything that would block the merge, however I have a few minor comments, questions.

@@ -27,7 +27,6 @@ def __init__(
clumped_study_locus_path: str,
study_index_path: Optional[str] = None,
ld_index_path: Optional[str] = None,
locus_collect_distance: Optional[int] = None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If these changes are accepted, this step cannot be configured in a way that would allow for locus collection, instead a separate step would be needed. Is this intentional?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I don't think this step is used at all... I have split the logic into two different steps to resolve ambiguity. We should remove this file altoghether.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You were right. The step has been removed.

src/gentropy/dataset/summary_statistics.py Outdated Show resolved Hide resolved
src/gentropy/method/window_based_clumping.py Outdated Show resolved Hide resolved
src/gentropy/dataset/summary_statistics.py Show resolved Hide resolved
src/gentropy/window_based_clumping.py Show resolved Hide resolved
@github-actions github-actions bot removed the size-L label Mar 13, 2024
@d0choa
Copy link
Collaborator Author

d0choa commented Mar 14, 2024

All the modifications have been implemented as discussed with @DSuveges during the review. I think it's very modular and it should be a good template for what's about to come. @Daniel-Considine you might want to look at the new function annotate_locus_statistics in StudyLocus.

@d0choa d0choa assigned DSuveges and unassigned d0choa Mar 14, 2024
@d0choa d0choa merged commit ad50c15 into dev Mar 20, 2024
4 checks passed
@d0choa d0choa deleted the do_clump_perf branch March 20, 2024 10:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants