job patterns for partitioning lists and mapping onto them #2297
Can one of the admins verify this patch?
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2297   +/-   ##
=======================================
  Coverage   99.07%   99.07%
=======================================
  Files          82       83    +1
  Lines        3445     3464   +19
=======================================
+ Hits         3413     3432   +19
  Misses         32       32
Thanks, @zulissimeta! This is an interesting PR.
If I understand correctly, it seems like this is logic that would need to be put within the @flows themselves, right? So, the ideal use case here is someone is importing a pre-made @job (e.g. quacc.recipes.mlp.core import relax_job) and making a custom @flow for themselves from that, right? It would perhaps be a bit difficult to justify when/where to add this logic within pre-made @flows in quacc itself (i.e. in quacc.recipes).
Some (non-substantial) comments below while I await your reply.
Yes, exactly! If you want a flow that runs many jobs in parallel (for example, a bulk_to_adsorbates_flow that generates hundreds of possible adsorbate configurations and uses ML potentials to make relaxations fast), this would be helpful. Partitioning and batching would also help when doing lots of inference: you could partition a list and apply a function that takes batches and does ML inference quickly, rather than running each MLP relaxation as a separate job.
Got it, thanks! Since it is pretty independent from existing recipe logic, I don't have much reservation about this. We will just want to add a brief section to the documentation somewhere highlighting how it can be used since it's somewhat of an "advanced" (but useful!) feature. I am happy to take care of the docs though.
Thanks! Setting this to auto-merge now.
Summary of Changes
This draft PR adds job patterns that are common in high-throughput workflows. When running many jobs on the same flow (say result = map(my_job, range(1000))), many workflow managers will get sad (network/db/load issues) with too many jobs. This PR is inspired by the dask bag partitions and map operator. For example:
should yield:
but run in only two jobs (instead of 5).
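As a rough illustration of the idea (not the code from this PR), the sketch below splits five inputs into two partitions and applies a function once per partition, so only two "jobs" run instead of five. It is plain Python with no workflow engine; the names split_into_partitions, run_on_partition, flatten, and square are hypothetical and chosen only for this example.

```python
from typing import Any, Callable


def split_into_partitions(items: list[Any], num_partitions: int) -> list[list[Any]]:
    """Split items into num_partitions contiguous chunks of near-equal size."""
    base, extra = divmod(len(items), num_partitions)
    chunks, start = [], 0
    for i in range(num_partitions):
        # The first `extra` chunks each take one additional element.
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


def run_on_partition(func: Callable[[Any], Any], chunk: list[Any]) -> list[Any]:
    """Apply func to every element of one chunk; stands in for a single job."""
    return [func(x) for x in chunk]


def flatten(chunks: list[list[Any]]) -> list[Any]:
    """Undo the partitioning, preserving the original order."""
    return [x for chunk in chunks for x in chunk]


def square(x: int) -> int:
    # Hypothetical per-item work; in practice this would be an expensive task.
    return x * x


chunks = split_into_partitions([1, 2, 3, 4, 5], num_partitions=2)
per_partition = [run_on_partition(square, chunk) for chunk in chunks]
results = flatten(per_partition)
```

Here the list comprehension over chunks runs twice, once per partition, which is what would correspond to two submitted jobs if run_on_partition were wrapped as a job in a workflow manager.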
In addition, by using a specified number of partitions, logic flows can quickly move from one step to the next without waiting for any intermediate results.
while
is fine since it is clear there is a first job, a second partition job that yields 10 tasks, and a final mapping job that has 10 jobs (one per partition).
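The three-stage shape described above can be sketched in plain Python as follows. This is only an assumption-laden illustration, not the PR's implementation: first_job, partition_job, process_partition, and NUM_PARTITIONS are hypothetical names. The point it tries to show is that the mapping stage loops over a fixed partition count rather than over the length of an earlier result, so the number of downstream jobs is known before anything has run.

```python
NUM_PARTITIONS = 10  # fixed up front, independent of how many items appear


def first_job() -> list[int]:
    # Hypothetical first step; the caller does not need to know its length.
    return list(range(95))


def partition_job(items: list[int], num_partitions: int) -> list[list[int]]:
    # Round-robin split: always returns exactly num_partitions chunks,
    # even if some of them end up shorter than others (or empty).
    return [items[i::num_partitions] for i in range(num_partitions)]


def process_partition(chunk: list[int]) -> list[int]:
    # Hypothetical per-partition work, one job per partition.
    return [x + 1 for x in chunk]


items = first_job()
partitions = partition_job(items, NUM_PARTITIONS)

# Loop over range(NUM_PARTITIONS), not len(items): the job count is static.
mapped = [process_partition(partitions[i]) for i in range(NUM_PARTITIONS)]
flat = sorted(x for chunk in mapped for x in chunk)
```

In a real workflow manager, partitions[i] would typically be a reference to a not-yet-computed output rather than a concrete list, which is why indexing by a known partition count matters.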