Implement repartitioning #250
Are you asking for tightly scoped functionality in an IO layer (parquet) to be able to merge several files/chunks of incoming data into a partition, or for a more general repartition function that can be applied at any point in the analysis?
Eventually both; histogram fills get inefficient quickly if there is little data per partition (ditto for ML inference). However, an implementation specific to
This is in with #253.
HEP analysis workflows often place highly-reducing cuts on their data before proceeding with the rest of an analysis. It is useful both to be able to save this data (already achieved by dask.persist and dak.to_parquet for varying lifecycles) and to be able to re-partition it so that later data ingestion is balanced and array-based operations are efficient. Skims are often used since they reduce total data volume and the total number of files opened, lowering job failure rates and improving analysis turn-around robustness. Especially for searches, partitions can end up entirely empty or with very few events, which can have a significant impact on cluster performance in its logical (and practical) extrema.
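To illustrate the kind of rebalancing being requested, here is a minimal stdlib sketch of a merge-style repartition plan: given the per-partition row counts left after a highly-reducing cut, adjacent partitions are greedily grouped until each output partition holds roughly a target number of rows. The function name, signature, and greedy strategy are hypothetical illustrations, not dask-awkward API.

```python
def plan_repartition(counts, target):
    """Return (start, stop) ranges over input partitions to merge.

    Adjacent partitions are greedily accumulated until the running row
    count reaches `target`; empty partitions are absorbed into their
    neighbors. Hypothetical helper, not part of any library.
    """
    plan = []
    start = 0
    acc = 0
    for i, n in enumerate(counts):
        acc += n
        if acc >= target:
            plan.append((start, i + 1))
            start = i + 1
            acc = 0
    if start < len(counts):  # leftover tail becomes a final partition
        plan.append((start, len(counts)))
    return plan

# After a tight cut, most partitions are tiny or empty:
counts = [3, 0, 0, 250, 1, 0, 2, 500, 0, 4]
print(plan_repartition(counts, target=250))
# → [(0, 4), (4, 8), (8, 10)]
```

An actual implementation would additionally need to split oversized partitions and rewrite the task graph to concatenate the merged chunks, but the planning step above captures why empty partitions are cheap to absorb once per-partition counts are known.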
I would rate this in priority equal to improving optimization speed for large analysis graphs, but there's certainly already a desire for this in the CMS community.