Implement repartitioning #250
Are you asking for tightly scoped functionality in an IO layer (parquet) to be able to merge several files/chunks of incoming data into a partition, or for a more general repartition function that can be applied at any point in the analysis?
Eventually both; histogram fills get inefficient quickly if there is little data per partition (ditto for ML inference). However, an implementation specific to
This is in with #253.
HEP analysis workflows often place highly-reducing cuts on their data before proceeding with the rest of an analysis. It is useful both to be able to save this data (already achieved by dask.persist and dak.to_parquet for varying lifecycles) and to be able to re-partition it so that later data ingestion is balanced and array-based operations are efficient. Skims are often used since they reduce total data volume and the total number of files opened, lowering job failure rates and improving analysis turn-around robustness. Especially for searches, partitions can end up entirely empty or with very few events, which can have a significant impact on cluster performance in its logical (and practical) extrema.
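To illustrate the kind of rebalancing being requested, here is a minimal stdlib sketch of a merge-style repartition plan: given the per-partition row counts left after a highly-reducing cut, adjacent partitions are greedily grouped until each output partition holds roughly a target number of rows. The function name, signature, and greedy strategy are hypothetical illustrations, not dask-awkward API.

```python
def plan_repartition(counts, target):
    """Return (start, stop) ranges over input partitions to merge.

    Adjacent partitions are greedily accumulated until the running row
    count reaches `target`; empty partitions are absorbed into their
    neighbors. Hypothetical helper, not part of any library.
    """
    plan = []
    start = 0
    acc = 0
    for i, n in enumerate(counts):
        acc += n
        if acc >= target:
            plan.append((start, i + 1))
            start = i + 1
            acc = 0
    if start < len(counts):  # leftover tail becomes a final partition
        plan.append((start, len(counts)))
    return plan

# After a tight cut, most partitions are tiny or empty:
counts = [3, 0, 0, 250, 1, 0, 2, 500, 0, 4]
print(plan_repartition(counts, target=250))
# → [(0, 4), (4, 8), (8, 10)]
```

An actual implementation would additionally need to split oversized partitions and rewrite the task graph to concatenate the merged chunks, but the planning step above captures why empty partitions are cheap to absorb once per-partition counts are known.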
I would rate this in priority equal to improving optimization speed for large analysis graphs, but there's certainly already a desire for this in the CMS community.