feat: add `mapfilter` decorator #551
base: main
Codecov Report
Attention: Patch coverage is
@@            Coverage Diff             @@
##             main     #551      +/-   ##
==========================================
- Coverage   93.06%   91.81%   -1.26%
==========================================
  Files          23       23
  Lines        3290     3557     +267
==========================================
+ Hits         3062     3266     +204
- Misses        228      291      +63
…me-like return types
Hi @martindurant,
@@ -483,6 +483,7 @@ def __init__(
        npartitions: int,
        prefix: str | None = None,
        storage_options: dict | None = None,
        write_metadata: bool = False,
This change (and lines below) leaked from another PR
Quick question on usage: why are there two returns for the first one above, but only one for the second?
The first one is a
I'd like to follow up on this PR after some work I'm doing related to #559. I found some synergy with `mapfilter` that would be nice to implement.
This PR adds a new decorator, called `mapfilter`, that behaves similarly to `dask_awkward.map_partitions` but extends it with some useful features, e.g. for HEP analyses:
- The `needs` argument can be used to touch additional columns.
- The `meta` argument allows mocking the output return values. Combined with `needs`, this essentially allows skipping the tracing step.
`dak.mapfilter`
A decorated function will be a single node in the compute graph for the single value return case. For multiple return values it will be 2 nodes in the compute graph (one for the decorated function, and one to select the return value). For multiple nested return values the number of nodes corresponds to the nesting depth + 1.
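The node-count rule stated above can be written out as plain arithmetic. The following is a minimal pure-Python sketch, not dask-awkward code: `nesting_depth` and `expected_graph_nodes` are hypothetical helper names, and the sketch assumes that "nesting depth" means how deeply tuples or lists are nested in the return value (0 for a single value, 1 for a flat tuple).

```python
def nesting_depth(value):
    """Depth of tuple/list nesting: 0 for a single value, 1 for a flat tuple, and so on."""
    if isinstance(value, (tuple, list)):
        return 1 + max((nesting_depth(item) for item in value), default=0)
    return 0


def expected_graph_nodes(return_value):
    """Node count according to the rule above: nesting depth + 1 (hypothetical helper)."""
    return nesting_depth(return_value) + 1


# Placeholder return values; the strings only stand in for real outputs.
single = "a"
flat = ("a", "b")
nested = ("a", ("b", "c"))

for label, value in [("single", single), ("flat", flat), ("nested", nested)]:
    print(label, expected_graph_nodes(value))
```

Under that assumption, a single return value gives 1 node, a flat tuple gives 2, and one extra level of nesting gives 3, matching the counts described above.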
An example of the decorator's usefulness is shown in the following:
Untraceable functions
In a complex HEP analysis it may happen that some computation is not traceable (i.e. a user leaves the "awkward-array world"). For this, `needs` and `meta` exist.
There are 3 cases that need to be considered:
1. […] `if` conditions are present: `dak.mapfilter` works in the same way as `dak.map_partitions`.
2. […] `if` conditions): `dak.mapfilter` can be used with `needs` to touch additional columns needed in the `if` conditions.
3. […] `needs` with all needed columns, and `meta` with the expected outputs of the function.
`dak.prerun`
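The three cases above, and the `needs` that `dak.prerun` reports, both come down to which columns are touched while a function is traced. The following is a toy, pure-Python model of that idea and not dask-awkward's type tracer: `RecordingRow`, `traced_columns`, the `select` function, and the column names `pt`, `eta`, `phi` are all hypothetical. It shows how a branch that is not taken during tracing hides a column, and how an explicit `needs`-style list fills the gap.

```python
class RecordingRow:
    """Toy stand-in for a traced array: records every field that is accessed."""

    def __init__(self, data, touched):
        self._data = data
        self._touched = touched

    def __getitem__(self, name):
        self._touched.add(name)
        return self._data[name]


def traced_columns(fn, sample, needs=()):
    """Run fn once on placeholder data and return the set of touched columns.

    Columns listed in `needs` are marked as touched up front, mimicking the
    role the `needs` argument plays in the description above.
    """
    touched = set(needs)
    fn(RecordingRow(sample, touched))
    return touched


def select(row):
    # Only one branch runs during tracing, so the other column stays hidden.
    if row["pt"] > 30:
        return row["eta"]
    return row["phi"]


# Placeholder values: a tracer has no real data, so the branch taken is arbitrary.
sample = {"pt": 0, "eta": 0, "phi": 0}

print(sorted(traced_columns(select, sample)))
print(sorted(traced_columns(select, sample, needs=["eta"])))
```

In this toy, tracing alone finds `phi` and `pt` but misses `eta`; passing `eta` explicitly yields all three columns. How the real tracer behaves on `if` conditions may differ, so treat this only as a mental model.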
For complex untraceable functions, `needs` in particular may be cumbersome to provide. For this, `dak.prerun` exists.
`dak.prerun` performs a typetracing step on a given function and tries to infer `needs` and `meta` from it. If the function is untraceable (like in this example), it reports all `needs` recorded up to the point where the tracing failed. This can be useful for providing `needs` by hand to an untraceable function.
In addition, providing `meta` skips running the type tracer through the computation of `untraceable_fun` entirely (similar to how `map_partitions` works), which can be beneficial if `untraceable_fun` is a computationally expensive operation (e.g. evaluation of a neural network).
A useful trick here is to run `meta, needs = dak.prerun(fun, *args, **kwargs)` once, store `meta` and `needs`, and provide them to `dak.mapfilter` in consecutive runs in order to avoid multiple unnecessary and costly tracings.
Other notes
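A note on the "run once, store, reuse" trick for `dak.prerun` described above: the pattern itself is ordinary caching. The sketch below is pure Python and makes no dask-awkward calls. `fake_prerun` is a hypothetical stand-in that only mimics the `meta, needs` return order quoted above, and `prerun_once`, the cache layout, and the placeholder values are assumptions for illustration.

```python
_prerun_cache = {}
trace_calls = {"count": 0}


def fake_prerun(fun, *args, **kwargs):
    """Hypothetical stand-in for dak.prerun; counts how often 'tracing' runs."""
    trace_calls["count"] += 1
    meta = "meta-placeholder"            # stands in for the mocked output
    needs = {"events": ["pt", "eta"]}    # stands in for the recorded columns
    return meta, needs


def prerun_once(fun, *args, **kwargs):
    """Trace a function the first time only; later calls reuse the stored result."""
    key = fun.__qualname__
    if key not in _prerun_cache:
        _prerun_cache[key] = fake_prerun(fun, *args, **kwargs)
    return _prerun_cache[key]


def untraceable_fun(events):
    # Body is irrelevant here; only the function identity is used as cache key.
    return events


meta, needs = prerun_once(untraceable_fun, None)
meta_again, needs_again = prerun_once(untraceable_fun, None)
print(trace_calls["count"])
```

Keying the cache on the function name is a simplification; a real setup would likely also need to account for changed inputs or function bodies before reusing a stored `meta` and `needs`.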
Currently, only 2 types of dask collections can be returned: a `dask_awkward.Array` or a `dask.bag.Bag`. It would be nice if arrays were correctly wrapped into `dask.Arrays` and dataframe-likes into `dask.DataFrames`, but this is currently not supported.
Instead, it is recommended to wrap them into Python collections (which will be wrapped into `dask Bags`) or into awkward arrays (which will be wrapped into `dak.Array`).