perf(l2g): streamline feature generation #544

ireneisdoomed · 2024-03-15T11:21:50Z

✨ Context

The increase of V2G relationships (#543) caused L2G to crash.
The partitioning strategy was not optimised to extract that many feature annotations, which meant that executors had to deal with big chunks of data and would eventually die.

This job is heavy in the number of aggregation tasks it has to do, so having the data properly partitioned is important.

🛠 What does this PR implement

- Lighter joins in the extraction of VEP features (c31786b). In credible_set_w_variant_consequences, I now select only the data that needs to be shuffled in the join.
- Lock the default number of partitions returned after a join to 800 (6340e89). In the current solution, I saw that the eCAVIAR feature set had 980 partitions, the distance features 200, and the VEP features 17.
- Other minor refinements.
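
As a rough illustration of the two changes listed above, the sketch below fixes Spark's `spark.sql.shuffle.partitions` option (a standard Spark SQL setting whose default is 200) at 800, and prunes a DataFrame to the columns it needs before joining. The helper names, the app name, and the way the session is built are assumptions made for this example and are not taken from the gentropy code base. `pyspark` is imported lazily, so it is only required when a session is actually created.

```python
from typing import Dict, Optional, Sequence

# Assumption: this is the session-level setting behind "partition number for
# shuffling". spark.sql.shuffle.partitions controls how many partitions joins
# and aggregations produce; 800 is the value quoted in this PR.
L2G_SPARK_CONF: Dict[str, str] = {"spark.sql.shuffle.partitions": "800"}


def merged_conf(extra_conf: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return the base settings with any extra settings layered on top."""
    return {**L2G_SPARK_CONF, **(extra_conf or {})}


def build_session(app_name: str = "l2g-features",
                  extra_conf: Optional[Dict[str, str]] = None):
    """Create (or reuse) a SparkSession with the settings above applied."""
    from pyspark.sql import SparkSession  # lazy import: only needed here

    builder = SparkSession.builder.appName(app_name)
    for key, value in merged_conf(extra_conf).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


def join_pruned(left, right, keys: Sequence[str], right_columns: Sequence[str]):
    """Left-join two DataFrames, keeping only the keys and needed columns of `right`.

    Selecting before the join means fewer columns travel through the shuffle.
    """
    slim_right = right.select(*keys, *right_columns)
    return left.join(slim_right, on=list(keys), how="left")
```

With a fixed shuffle partition count, every feature set produced by a join or aggregation should come out with the same number of partitions, rather than the uneven 980 / 200 / 17 split reported above.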

🙈 Missing

In an upcoming PR I'll address the biggest gain I've identified:

Minimise for loops:
- If we look at _get_max_coloc_per_credible_set, we run this function twice: once to extract COLOC and once to extract eCAVIAR. Instead, we could apply the logic using the method as a grouping field, so that we don't split the logic into two separate runs. The complication is that the feature name is currently derived from the method, so we need to take that into account.
- Similarly, we iterate over each type of QTL to perform the aggregation.
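
A minimal sketch of the grouping idea described above, in plain Python rather than Spark so that it stays self-contained. The row fields (`studyLocusId`, `geneId`, `method`, `qtlType`, `score`) and the feature naming scheme are hypothetical and only illustrate how the method and QTL type can become part of the grouping key instead of driving separate loops.

```python
from typing import Any, Dict, Iterable, Tuple

Row = Dict[str, Any]


def feature_name(qtl_type: str, method: str) -> str:
    """Hypothetical naming scheme: one feature per (QTL type, method) pair."""
    return f"{qtl_type}_{method}_max"


def max_score_features(rows: Iterable[Row]) -> Dict[Tuple[Any, Any], Dict[str, float]]:
    """Single pass: keep the maximum score per locus, gene, QTL type and method."""
    features: Dict[Tuple[Any, Any], Dict[str, float]] = {}
    for row in rows:
        key = (row["studyLocusId"], row["geneId"])
        name = feature_name(str(row["qtlType"]), str(row["method"]))
        score = float(row["score"])
        per_pair = features.setdefault(key, {})
        # Method and QTL type enter the grouping through the feature name,
        # so no separate loop per method or per QTL type is needed.
        if name not in per_pair or score > per_pair[name]:
            per_pair[name] = score
    return features


rows = [
    {"studyLocusId": "sl1", "geneId": "g1", "method": "coloc", "qtlType": "eqtl", "score": 0.2},
    {"studyLocusId": "sl1", "geneId": "g1", "method": "coloc", "qtlType": "eqtl", "score": 0.9},
    {"studyLocusId": "sl1", "geneId": "g1", "method": "ecaviar", "qtlType": "pqtl", "score": 0.5},
]
features = max_score_features(rows)
```

In Spark terms this would roughly correspond to adding the method and QTL type columns to the grouping, then deriving the feature name from them, rather than filtering the data once per method.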

🚦 Before submitting

Do these changes cover one single feature (one change at a time)?
Did you read the contributor guideline?
Did you make sure to update the documentation with your changes?
Did you make sure there is no commented out code in this PR?
Did you follow conventional commits standards in PR title and commit messages?
Did you make sure the branch is up-to-date with the dev branch?
Did you write any new necessary tests?
Did you make sure the changes pass local tests (make test)?
Did you make sure the changes pass pre-commit rules (e.g. poetry run pre-commit run --all-files)?

DSuveges

To me it looks absolutely reasonable.

ireneisdoomed added 7 commits March 12, 2024 12:31

- refactor(l2g): streamline coloc feature factory (7c114d0)
- Merge branch 'dev' of https://github.com/opentargets/gentropy into il-fm-improvements (7723453)
- perf(l2g): make joins in _get_vep_features lighter (c31786b)
- fix(l2g): use weighted scores for _get_vep_features (c0db573)
- refactor(l2g): minor improvements (ebae33f)
- perf(l2g): adapt session to set partition number for shuffling to 800 (6340e89)
- Merge branch 'dev' of https://github.com/opentargets/gentropy into il-fm-improvements (f368731)

github-actions bot added size-M Method Performance Dataset Step labels Mar 15, 2024

ireneisdoomed marked this pull request as ready for review March 15, 2024 12:39

ireneisdoomed requested review from d0choa and DSuveges March 15, 2024 12:39

This was referenced Mar 15, 2024

Optimise feature matrix management to accelerate L2G Training and Prediction opentargets/issues#3252

Closed

feat(l2g): distance features based on weighted score #545

Merged

DSuveges reviewed Mar 18, 2024

DSuveges approved these changes Mar 18, 2024

Merge branch 'dev' into il-fm-improvements

b79dcdb

DSuveges merged commit 77976b5 into dev Mar 18, 2024
4 checks passed

This was referenced Mar 20, 2024

perf(l2g): optimise extraction of features from colocalisation results #553

Merged

L2G model evaluation is very slow opentargets/issues#3263

Closed

ireneisdoomed deleted the il-fm-improvements branch July 15, 2024 14:46
