Performance issue with repartition
#509
Comments
@martindurant when you have time - some thought here would be nice. If we do this in two steps, i.e. write out the files with no repartitioning and then repartition those files, the memory issues vanish. This is odd, since the repartition shouldn't care about data that we've cut out, but somehow it's acting like it needs all the data from before the cuts in the tasks doing the partition aggregation. This really sounds like some data lifecycle problem.
It seems that the in-memory size of each (unfiltered) partition really is at least a few GB. This must be completely loaded; in fact, each output partition will need inputs from multiple files. Only then do you do the filtering. As far as I know, it's not possible to filter during the process of streaming data from a file. A read-and-filter operation would be great! But this basic process is the reason that parquet partitions (an example I am more familiar with) are usually <<100MB. In addition, this whole workflow is entirely disk bound, since filtering is very fast. That means that any parallel tasks are trying to read various parts of various files, and write too, all over the same bus. I don't really expect dask to be able to do anything for this case. (I realise that this is not the real workflow you want to run.) Nevertheless, this reminds me that I think we considered at some point a type of repartition that is exactly every N inputs -> 1 output; that at least simplifies each task.
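For illustration, the "every N inputs -> 1 output" idea boils down to a fixed grouping of partition indices. This is a hypothetical helper, not dask-awkward's actual code, just to show the bookkeeping such a repartition would need:

```python
def group_partitions(n_inputs: int, n_to_one: int) -> list[tuple[int, ...]]:
    """Map every `n_to_one` consecutive input partitions onto one output partition."""
    return [
        tuple(range(start, min(start + n_to_one, n_inputs)))
        for start in range(0, n_inputs, n_to_one)
    ]

# 10 input partitions merged 4-to-1 -> [(0, 1, 2, 3), (4, 5, 6, 7), (8, 9)]
print(group_partitions(10, 4))
```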
I might be wrong - the dask profile shows decompression is taking 93% of the time.
btw: for me, the output only has 2805 rows, so it's a single partition.
I wasn't able to replicate this example [1] with just dask/uproot, but it looks like if we expose the dask_awkward structure before performing a repartition, this does help with the memory issue. I also tried to use the
[1] scikit-hep/coffea#1100
I guess I'm not entirely clear as to why the repartition on the filtered data needs to have any knowledge at all about the unfiltered data... and why the unfiltered data is hanging around that long in the dask worker. At the point it isn't needed any more it should just be dropped, and there's nothing in this workflow that needs to know about the original partitions for the repartition step, iiuc.
When you say `filtered.repartition(num_rows=...)`, dask-awkward needs to know the number of rows per input partition in order to know which of them belong in a given output partition. This means loading the whole of the filter predicate, partition by partition. It then loads the (whole of) each input partition and filters them. In the given example, all of the input partitions end up in the same output partition.
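A rough sketch of why those counts are needed, using a hypothetical helper rather than dask-awkward's internals: the output boundaries can only be drawn once every input partition's surviving row count is known.

```python
def output_row_ranges(rows_per_partition: list[int], num_rows: int) -> list[tuple[int, int]]:
    """Split the concatenated, filtered rows into output partitions of at most `num_rows` rows."""
    total = sum(rows_per_partition)  # requires materializing the filter for every input first
    return [(start, min(start + num_rows, total)) for start in range(0, total, num_rows)]

# Three filtered inputs with 1200, 900 and 705 surviving rows, packed into
# output partitions of at most 2805 rows -> a single output partition.
print(output_row_ranges([1200, 900, 705], 2805))  # [(0, 2805)]
```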
I don't know what `step_size="100MB"` does.
I see... after running the same process with a parquet file, I'm seeing this behavior as well, so this is a limit of dask, not some strange interaction with uproot. So if I'm understanding this restriction correctly, this is mainly because the repartition must be evaluated during the generation of the task graph, so it needs to have all inputs available. Would it be possible to have a different method of obtaining the repartition scheme, where we aggregate filtered results as they arrive? Or is this fundamentally at odds with the paradigm of dask? The `step_size="100MB"` attempts to limit each partition extracted by uproot to be no larger than 100MB.
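For reference, `step_size` is the option passed to `uproot.dask` when building the input partitions; a hedged usage example, with a placeholder file and tree name:

```python
import uproot

# Ask uproot to build dask-awkward partitions from at most ~100 MB of TTree data each.
# "unskimmed_0.root" and its "Event" tree are placeholders for the real inputs.
events = uproot.dask({"unskimmed_0.root": "Event"}, step_size="100MB")
print(events.npartitions, events.fields)
```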
Right, but once it is done loading the input partitions and filtering them, it doesn't need to keep the original input data in memory. What's really weird is that if I put a
Given the new options in #250, can we try again?
Hi @martindurant, just tried it out on the latest master branch. The current implementation does not play nicely with uproot's `dask_write`; this, I'm guessing, can be a "simple" fix. If I add this additional new check to uproot, we do get much more reasonable memory usage (a few hundred MB rather than a few GB).
If the assignment of
[1] https://github.com/scikit-hep/uproot5/blob/main/src/uproot/writing/_dask_write.py#L36
Since parquet does not have that line, what is the problem there?
(remembering that in your previous version, the number of output partitions was actually 1)
OK, so the actual problem is that the output of the repartition shows as having 0 partitions?
Yes, arrays after a repartition show up as having 0 partitions.
I just had time to test a bit more. In my test, the original array will be separated into 52 partitions.
#517 should fix those issues for you (tests included). In your case, you wanted `n_to_one=52`, which is the same as `npartitions=1`.
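A small usage example of that equivalence on a toy array; the exact partition counts reported assume the #517 fix is in place:

```python
import dask_awkward as dak

# 52 tiny partitions, one sub-list per partition
array = dak.from_lists([[[1, 2, 3], [], [4, 5]]] * 52)

merged_a = array.repartition(n_to_one=52)    # merge every 52 inputs into one output
merged_b = array.repartition(npartitions=1)  # ask for exactly one output partition
print(merged_a.npartitions, merged_b.npartitions)  # both should report 1
```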
This new implementation seems to have problems when loading from files:

```python
import dask_awkward as dak

array = dak.from_lists([[[1, 2, 3], [], [4, 5]]] * 100)
array.to_parquet("test.parquet")
array2 = dak.from_parquet("test.parquet")
array2.repartition(n_to_one=10)  # Fails on this line
```

The full error message is:
Can you try with #518?
This now seems to be working! Without any modifications to uproot required. I think the only last piece that I wanted to confirm is that there seems to be a massive discrepancy in memory usage depending on whether the conversion is placed before or after the repartition call. Is this a behavior we should expect? I wanted to have a record to help future users avoid some hidden gotchas, in case things change in the future (I've attached the calculated task graphs and recorded memory footprints with the conversion placed before and after the repartition call in this thread).
re2conv_opt_new.pdf
@yimuchen You may be able to just drop it?
I tried without the
I'm fine with tracking it down a bit more here so we understand. Memory problems are pretty crucial for us so we should make sure it isn't somehow related.
Pardon me, was caught up with other items, but I managed to get a script that better illustrates the issue. This script can be run with just the current master branch:

```python
import dask_awkward as dak
import numpy as np
import uproot
from dask.distributed import Client
import awkward as ak


def make_events(rng: np.random.Generator, n_events: int):
    # Event-level variables
    events = ak.zip({f"prop_{idx}": rng.random(size=n_events) for idx in range(10)})
    # Large collection with many entries per event
    n_entries = rng.poisson(lam=300, size=n_events)
    large_col = ak.zip(
        {f"prop_{idx}": rng.random(size=ak.sum(n_entries)) for idx in range(10)}
    )
    events["large_col"] = ak.unflatten(large_col, n_entries)
    # Small collection with a handful of entries per event
    n_entries = rng.poisson(lam=10, size=n_events)
    small_col = ak.zip(
        {f"prop_{idx}": rng.random(size=ak.sum(n_entries)) for idx in range(10)}
    )
    events["small_col"] = ak.unflatten(small_col, n_entries)
    return events


def make_skimmed_events(events):
    return events[events.prop_0 < 0.1]  # Random 10% file reduction


if __name__ == "__main__":
    # Creating a single-threaded client to better monitor performance
    client = Client(processes=False, n_workers=1, threads_per_worker=1)

    rng = np.random.default_rng(seed=123456)
    for file_idx in range(10):
        print("Making file", file_idx, "...")
        events = make_events(rng, 10_000)  # Each file will take ~200MB
        ak.to_parquet(events, f"unskimmed_{file_idx}.parquet")
        with uproot.recreate(f"unskimmed_{file_idx}.root") as f:
            f["Event"] = {k: events[k] for k in events.fields}

    print("Skimming with parquet file inputs")
    events = dak.from_parquet("unskimmed_*.parquet")
    events = make_skimmed_events(events)
    events = events.repartition(n_to_one=20)  # Given our basic estimate, everything should fit in one file
    dak.to_parquet(events, "skimmed.parquet")

    print("Skimming with root file inputs")
    events = uproot.dask("unskimmed_*.root")
    events = events.repartition(n_to_one=20)
    events = make_skimmed_events(events)
    uproot.dask_write(events, "skimmed.root")
```

Basically, we still get a very large spike in memory usage, despite attempting to merge the results. This is true both for parquet and uproot file writing. These memory-usage peaks appear regardless of the filtering efficiency. If I disable the repartition line, I get much more reasonable memory usage, but with much more fragmented outputs. Let me know if any more testing could provide more insight.
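One optional way to quantify the spike described above (not part of the original script) is distributed's `MemorySampler`, which records cluster memory while a block runs; it assumes the `Client`, `dak` and `events` names from the script are in scope:

```python
from distributed.diagnostics import MemorySampler

ms = MemorySampler()
with ms.sample("skim with repartition"):
    dak.to_parquet(events, "skimmed.parquet")

# Peak sampled cluster memory during the block
print(ms.to_pandas().max())
```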
Another discovery that might help pin down what is causing this memory consumption: if we attempt to strip the events down before repartitioning:

```python
events = events[["small_col"]]  # New line to strip down saved content
events = events.repartition(n_to_one=20)
```

This reduces the memory usage for
[1] https://github.com/dask-contrib/dask-awkward/blob/main/src/dask_awkward/lib/io/parquet.py#L511
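A related variant is to project the columns at read time rather than after the fact, so the large collection is never materialized at all. This is a sketch under the assumption that the glob `"small_col.*"` matches the nested fields of the toy schema above; the exact column spelling may need to be the individual leaf names (e.g. `"small_col.prop_0"`):

```python
import dask_awkward as dak

# Read only the small collection from the unskimmed parquet files
events = dak.from_parquet("unskimmed_*.parquet", columns=["small_col.*"])
events = events.repartition(n_to_one=20)
dak.to_parquet(events, "skimmed_small_only.parquet")
```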
I was testing a workflow of file skimming, and to account for the possibility that the rate of events of interest is very low in the skimming scheme, I attempted to use `array.repartition` to reduce the number of files that would be generated, as all file-writing methods that I know of create 1 file per partition.
I've provided code to generate a set of dummy data that roughly matches the data schema (jagged arrays with very mismatched collection sizes) and to perform a simple skim operation. What is observed is that when the repartition is specified, the memory is pinned at ~5-7GB regardless of the partitioning scheme defined by `uproot`. A suggestion to use `dask.array.persist` makes the computation of the `array.repartition` step take a very long time and just as much memory. This is how I am attempting to skim the files in question:
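As a hedged sketch of that skim (the file pattern, field names, selection and `step_size` below are placeholder assumptions rather than the exact code used):

```python
import uproot

# Placeholder inputs and selection; step_size as discussed above.
events = uproot.dask("unskimmed_*.root", step_size="100MB")
skimmed = events[events.prop_0 < 0.1]       # keep only the events of interest
skimmed = skimmed.repartition(n_to_one=20)  # merge many sparse partitions per output file
uproot.dask_write(skimmed, "skimmed.root")
```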
The data in question can be generated using this script (each file will be about 2.5GB in size)