Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement repartitioning #250

Closed
lgray opened this issue Apr 27, 2023 · 3 comments
Closed

Implement repartitioning #250

lgray opened this issue Apr 27, 2023 · 3 comments

Comments

@lgray
Copy link
Collaborator

lgray commented Apr 27, 2023

HEP analysis workflows often place highly-reducing cuts on their data before proceeding with the rest of an analysis.
It is useful to both be able to save this data (which is already achieved by dask.persist and dak.to_parquet for varying lifecycles), and it be able to re-partition the data so that later data ingestion is balanced and array-based operations are efficient. Skims are often used since they reduce total data volume and total number of files opened, reducing job failure rates / analysis turn-around robustness.

Especially for searches, partitions can end up entirely empty or with very few events. This can have significant impact on cluster performance in its logical (and practical) extrema.

I would rate this in priority equal to improving optimization speed for large analysis graphs, but there's certainly already a desire for this in the CMS community.

@martindurant
Copy link
Collaborator

Are you asking for tightly scoped functionality in an IO layer (parquet) to be able to merge several files/chunks of incoming data into a partition, or a more general repartition function that can be applied at any point in the analysis?

@lgray
Copy link
Collaborator Author

lgray commented Apr 27, 2023

Eventually both, histogram fills get inefficient quickly if there is little data (ditto for ML inference). However, an implementation specific to dak.to_parquet would be much higher priority.

@douglasdavis
Copy link
Collaborator

this is in with #253

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants