perf(clump): refactored window based clumping #492
Conversation
I don't see anything that would block the merge; however, I have a few minor comments and questions.
@@ -27,7 +27,6 @@ def __init__(
     clumped_study_locus_path: str,
     study_index_path: Optional[str] = None,
     ld_index_path: Optional[str] = None,
-    locus_collect_distance: Optional[int] = None,
If these changes are accepted, this step can no longer be configured in a way that allows for locus collection; instead, a separate step would be needed. Is this intentional?
Also, I don't think this step is used at all... I have split the logic into two different steps to resolve ambiguity. We should remove this file altogether.
You were right. The step has been removed.
All the modifications have been implemented as discussed with @DSuveges during the review. I think it's very modular and it should be a good template for what's about to come. @Daniel-Considine you might want to look at the new function.
After several failed attempts to implement window-based clumping prioritising windows over joins, I had a longer discussion with @DSuveges about rescuing the current implementation based on joins. This seems to be the best solution, and it is very conservative with respect to the current implementation. My personal conclusion is that clumping is not well suited for windows, because we don't really care about the vast majority of the summary statistics, and windows are good for computing over all rows, not a sub-selection of them (see the sketch below).
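To make that contrast concrete, here is a rough PySpark sketch, assuming hypothetical paths, column names and a 500 kb window (this is not the actual gentropy code): the window has to be evaluated on every summary-statistics row, whereas the join only keeps rows near a lead variant.

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as f

spark = SparkSession.builder.getOrCreate()
sumstats = spark.read.parquet("gs://example/summary_statistics")  # very large
leads = spark.read.parquet("gs://example/lead_variants")          # small

# Window-based: every row is ranked within its partition, including the vast
# majority of rows that will never belong to any clump.
w = Window.partitionBy("studyId", "chromosome").orderBy("pValue")
windowed = sumstats.withColumn("rank", f.row_number().over(w))

# Join-based: only rows within (say) 500 kb of a lead variant survive.
joined = sumstats.join(
    leads.select(
        f.col("studyId"),
        f.col("chromosome"),
        f.col("position").alias("leadPosition"),
    ),
    on=["studyId", "chromosome"],
    how="inner",
).filter(f.abs(f.col("position") - f.col("leadPosition")) <= 500_000)
```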
In the end, the only problem with the current implementation was that the broadcast does not work in a "right" join. When the workload increased, the join had to deal with a lot of shuffling and eventually failed due to lack of resources. This PR minimally modifies this part of the window-based clumping to resolve that join.
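As a minimal sketch of the join problem, reusing the same hypothetical tables as above (not the exact code touched by this PR): Spark cannot broadcast the preserved side of a right outer join, so the broadcast hint is dropped and the large table gets shuffled; an inner join, with any unmatched lead variants handled separately, keeps the broadcast.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as f

spark = SparkSession.builder.getOrCreate()
sumstats = spark.read.parquet("gs://example/summary_statistics")  # very large
leads = spark.read.parquet("gs://example/lead_variants")          # small

# Right outer join keeping all lead variants: Spark must build the hash table
# from the left (non-preserved) side, so a broadcast hint on the small
# right-hand table cannot be honoured and the plan falls back to a sort-merge
# join that shuffles the full summary-statistics table.
shuffled = sumstats.join(
    f.broadcast(leads), on=["studyId", "chromosome"], how="right"
)

# With an inner join (unmatched leads recovered elsewhere), the broadcast hint
# on the small table is honoured and the large table is never shuffled.
broadcasted = sumstats.join(
    f.broadcast(leads), on=["studyId", "chromosome"], how="inner"
)
```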
Using the logic in this PR, clumps on all NFE GWAS Catalog summary statistics were calculated. The job took 18 min to compute and resulted in clumps with collected locus at gs://ot-team/dochoa/sl_11_3_24.parquet (~25 GB). spark.sql.shuffle.partitions was set to 3200 for this run.
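For reference, one way this shuffle-partition setting could be applied in a PySpark session (the actual run may have configured it at the cluster or job level instead):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Split the remaining shuffles of the clumping job into more, smaller tasks.
spark.conf.set("spark.sql.shuffle.partitions", 3200)
```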
On top of the business logic, I also slightly modified the function's interface and the required (default) arguments. I think this makes the function a bit easier to understand. These changes are a lot more subjective, so I'm happy to revert them if we don't reach an agreement.