Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf(l2g): streamline feature generation #544

Merged
merged 8 commits into from
Mar 18, 2024
Merged

perf(l2g): streamline feature generation #544

merged 8 commits into from
Mar 18, 2024

Conversation

ireneisdoomed
Copy link
Contributor

@ireneisdoomed ireneisdoomed commented Mar 15, 2024

✨ Context

The increase of V2G relationships (#543) caused L2G to crash.
The partitioning strategy was not optimised to extract that many feature annotations, which meant that executors had to deal with big chunks of data and would eventually die.

This job is heavy in the number of aggregation tasks it has to do, so having the data properly partitioned is important.

🛠 What does this PR implement

  • Lighter joins in the extraction of VEP features (c31786b). In credible_set_w_variant_consequences, I am selecting the data to be shuffled in the join.
  • Lock the default number of returned partitions after a join to 800 (6340e89). In the current solution, I saw that the eCAVIAr features set had 980 partitions, the distance features 200 and the VEP features 17.
  • Other minor refinements

🙈 Missing

In an upcoming PR I'll address the biggest gain i've identified:

  • Minimise for loops. If look at _get_max_coloc_per_credible_set, we run this function twice:
    • Once to extract COLOC, another time to extract eCAVIAR. However, we could apply the logic using the method as a grouping field so that we don't split the logic into 2 threads. The complexity here is that this is implemented this way because the name of the feature is based on the method, so we need to take that into account.
    • Very very similarly, we iterate over each type of QTL to perform the aggregation.

🚦 Before submitting

  • Do these changes cover one single feature (one change at a time)?
  • Did you read the contributor guideline?
  • Did you make sure to update the documentation with your changes?
  • Did you make sure there is no commented out code in this PR?
  • Did you follow conventional commits standards in PR title and commit messages?
  • Did you make sure the branch is up-to-date with the dev branch?
  • Did you write any new necessary tests?
  • Did you make sure the changes pass local tests (make test)?
  • Did you make sure the changes pass pre-commit rules (e.g poetry run pre-commit run --all-files)?

Copy link
Contributor

@DSuveges DSuveges left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me it looks absolutely reasonable.

tests/gentropy/conftest.py Show resolved Hide resolved
@DSuveges DSuveges merged commit 77976b5 into dev Mar 18, 2024
4 checks passed
@ireneisdoomed ireneisdoomed deleted the il-fm-improvements branch July 15, 2024 14:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants