feat: add `mapfilter` decorator #551

pfackeldey · 2024-11-19T13:15:45Z

This PR adds a new decorator, called mapfilter, that behaves similarly to dask_awkward.map_partitions, but extends it with some useful features for e.g. HEP analyses:

- It can return multiple values and wraps them into dask collections, all with the same partitioning.
- It makes sure that all input dask collections have the same partitioning.
- The new `needs` argument can be used to touch additional columns.
- The `meta` argument allows mocking the return values. Combined with `needs`, this essentially allows skipping the tracing step.

`dak.mapfilter`

A decorated function will be a single node in the compute graph for the single value return case. For multiple return values it will be 2 nodes in the compute graph (one for the decorated function, and one to select the return value). For multiple nested return values the number of nodes corresponds to the nesting depth + 1.

The following example shows its usefulness:

import dask_awkward as dak
import awkward as ak
import numpy as np


ak_array = ak.zip({"foo": [1, 2, 3, 4], "bar": [1, 1, 1, 1]})
dak_array = dak.from_awkward(ak_array, 2)

class some: ...

@dak.mapfilter
def fun(x):
  y = x.foo + 1
  return y, (np.sum(y),), some(), ak.Array(np.ones(4))
  

# this is not possible with `dask_awkward.map_partitions`  
y, y_sum, something, static = fun(dak_array)

# print the graph (HLG)
# We're seeing 3 nodes:
# 0. IO-layer
# 1. the decorated function `fun`
# 2. a "pick" layer that selects the correct value from the output tuple, here `y` is the 0-th element of all return values of `fun`
print(y.dask)
# >> HighLevelGraph with 3 layers.
# >> <dask.highlevelgraph.HighLevelGraph object at 0x106c881c0>
# >> 0. from-awkward-2669279954392e1535b365de1bfdef38
# >> 1. <dask-awkward.lib.core.ArgsKwargsPackedFunction ob-5f5b871945e30263d4972530f5679e79
# >> 2. <dask-awkward.lib.core.ArgsKwargsPackedFunction ob-5f5b871945e30263d4972530f5679e79-pick-0th

print(y_sum.compute())
# >> [array(5), array(9)]

# we can also track metadata per partition, e.g.:
print(something.compute())
# >> (<__main__.some at 0x10a7fa680>, <__main__.some at 0x10a7fa650>)

print(static.compute())
# >> <Array [1, 1, 1, 1, 1, 1, 1, 1] type='8 * float64'>

Untraceable functions

In a complex HEP analysis it may happen that some computation is not traceable (i.e. a user leaves the "awkward-array world"). For this, needs and meta exist:

import dask_awkward as dak
import awkward as ak
import numpy as np


ak_array = ak.zip({"pt": [10, 20, 30, 40], "eta": [1, 1, 1, 1]})
dak_array = dak.from_awkward(ak_array, 2)

def untraceable_fun(muons):
  # a non-traceable computation for ak.typetracer, because we're switching to NumPy (non-awkward)
  # which needs "pt" column from muons and returns a 1-element array (per partition)
  pt = ak.to_numpy(muons.pt)
  return ak.Array([np.sum(pt)])
  
dak.map_partitions(untraceable_fun, dak_array)
# >> TypeError: Converting from an nplike without known data to an nplike with known data is not supported


# This can be circumvented by mocking the output and specifying explicitly the columns that need to be read:
from functools import partial

@partial(
  dak.mapfilter,
  needs={"muons": ["pt"]},
  meta=ak.Array([0, 0]),
)
def untraceable_fun(muons):
  # a non-traceable computation for ak.typetracer, because we're switching to NumPy (non-awkward)
  # which needs "pt" column from muons and returns a 1-element array (per partition)
  pt = ak.to_numpy(muons.pt)
  return ak.Array([np.sum(pt)])
  
out = untraceable_fun(dak_array)
print(out.compute())
# >> <Array [30, 70] type='2 * int64'>

# check what needs to be read:
cols = next(iter(dak.report_necessary_columns(out).values()))
print(cols)
# >> frozenset({'pt'})

There are 3 cases that need to be considered:

1. Typetracing is fine and no `if` conditions are present: `dak.mapfilter` works in the same way as `dak.map_partitions`.
2. Typetracing is fine, but there is branched code (`if` conditions): `dak.mapfilter` can be used with `needs` to touch the additional columns needed in the `if` conditions.
3. Typetracing fails: there is not much one can do about it except skip the tracing step. This can be done by providing `needs` with all needed columns, and `meta` with the expected outputs of the function.
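
The second case can be illustrated with a toy model. The sketch below is not awkward's typetracer; `ColumnRecorder` and `select` are hypothetical names that only mimic the idea of recording which fields a function touches during a single traced run. Because a trace follows only one branch, columns used in the other branch go unnoticed, which is the gap `needs` is meant to close.

```python
class ColumnRecorder:
    """Toy stand-in for a traced array: records every field that is accessed."""

    def __init__(self):
        self.touched = set()

    def __getattr__(self, name):
        # Only called for attributes that are not found normally, i.e. "columns".
        self.touched.add(name)
        return 0.0  # dummy value, the toy model carries no data


def select(muons, use_eta):
    # Branched code: which column is read depends on a plain Python flag.
    if use_eta:
        return muons.eta
    return muons.pt


# A single trace follows only one branch ...
tracer = ColumnRecorder()
select(tracer, use_eta=False)
traced_columns = set(tracer.touched)

# ... so the column used in the other branch has to be listed by hand,
# analogous to passing `needs={"muons": ["eta"]}` to `dak.mapfilter`.
needs = {"muons": ["eta"]}
columns_to_read = traced_columns | set(needs["muons"])

print(sorted(traced_columns))   # ['pt']
print(sorted(columns_to_read))  # ['eta', 'pt']
```

In this toy picture, `needs` is simply merged with whatever the trace found, so over-specifying a column costs some extra reading but does not change the result.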

`dak.prerun`

For complex untraceable functions, `needs` in particular may be cumbersome to provide. `dak.prerun` exists for this purpose:

ak_array = ak.zip({"pt": [10, 20, 30, 40], "eta": [1, 1, 1, 1]})
dak_array = dak.from_awkward(ak_array, 2)

def untraceable_fun(muons):
  # a non-traceable computation for ak.typetracer, because we're switching to NumPy (non-awkward)
  # which needs "pt" column from muons and returns a 1-element array (per partition)
  pt = ak.to_numpy(muons.pt)
  return ak.Array([np.sum(pt)])

meta, needs = dak.prerun(untraceable_fun, muons=dak_array)
# >> UntraceableFunctionError: '<function untraceable_fun at 0x1056117e0>' is not traceable, an error occurred at line 7. 'dak.mapfilter' can circumvent this by providing 'needs' and 'meta' arguments to it.
#
# - 'needs': mapping where the keys point to input argument dask_awkward arrays and the values to columns that should be touched explicitly. The typetracing step could determine the following necessary columns until the exception occurred:
#
# needs={'muons': [('pt',)]}
#
# - 'meta': value(s) of what the wrapped function would return. For arrays, only the shape and type matter.

dak.prerun does a typetracing step of a given function and tries to infer needs and meta from it. If the function is untraceable (like in this example) it will report all recorded needs up to the point where the tracing failed. This can be useful for providing needs by hand to an untraceable function.

In addition, providing `meta` skips running the type tracer through `untraceable_fun` entirely (similar to how `map_partitions` works), which can be beneficial if `untraceable_fun` is a computationally expensive operation (e.g. evaluation of a neural network).
A useful trick here is to run `meta, needs = dak.prerun(fun, *args, **kwargs)` once, store `meta` and `needs`, and pass them to `dak.mapfilter` in subsequent runs to avoid repeated, costly tracing.
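
The store-and-reuse trick could look roughly like the sketch below. It uses only the standard library; `cached_prerun`, `fake_prerun` and the cache file name are hypothetical, and `fake_prerun` merely stands in for `dak.prerun` so that the block runs without dask-awkward. Whether a real `meta` object pickles cleanly is an assumption to check for your own case.

```python
import pickle
import tempfile
from pathlib import Path


def cached_prerun(cache_file, prerun, fun, *args, **kwargs):
    """Return (meta, needs), running `prerun` only if no cached result exists."""
    cache_file = Path(cache_file)
    if cache_file.exists():
        with cache_file.open("rb") as f:
            return pickle.load(f)
    result = prerun(fun, *args, **kwargs)
    with cache_file.open("wb") as f:
        pickle.dump(result, f)
    return result


calls = []


def fake_prerun(fun, **kwargs):
    # Stand-in for `dak.prerun`: pretends to trace `fun` and counts invocations.
    calls.append(fun.__name__)
    return [0, 0], {"muons": [("pt",)]}


def untraceable_fun(muons):
    return muons


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "prerun_cache.pkl"
    first = cached_prerun(cache, fake_prerun, untraceable_fun, muons=None)
    second = cached_prerun(cache, fake_prerun, untraceable_fun, muons=None)

print(len(calls))       # 1, the second call was served from the cache
print(first == second)  # True
```

The cache has to be invalidated by hand (for example by deleting the file) whenever the function or the input schema changes, since the sketch does not detect that.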

Other notes

Currently, only 2 types of dask collections can be returned: a dask_awkward.Array or a dask.bag.Bag. It would be nice if non-awkward arrays were correctly wrapped into dask.Arrays and dataframe-likes into dask.DataFrames, but this is currently not supported.
Instead, it is recommended to wrap them into Python collections (which will be wrapped into dask Bags) or awkward arrays (which will be wrapped into dak.Array).

codecov-commenter · 2024-11-19T13:17:38Z

Codecov Report

Attention: Patch coverage is 70.67308% with 61 lines in your changes missing coverage. Please review.

Project coverage is 91.81%. Comparing base (8cb8994) to head (afb628c).
Report is 157 commits behind head on main.

Files with missing lines	Patch %	Lines
src/dask_awkward/lib/mapfilter.py	62.91%	56 Missing ⚠️
src/dask_awkward/lib/io/parquet.py	75.00%	3 Missing ⚠️
src/dask_awkward/lib/core.py	95.34%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #551      +/-   ##
==========================================
- Coverage   93.06%   91.81%   -1.26%     
==========================================
  Files          23       23              
  Lines        3290     3557     +267     
==========================================
+ Hits         3062     3266     +204     
- Misses        228      291      +63


pfackeldey · 2024-12-02T18:20:14Z

Hi @martindurant,
This PR is ready to be reviewed now.
I'd only need to increase the awkward version once it is released 👍 (that's why I'm leaving it as draft for now).

martindurant · 2024-12-12T16:10:18Z

src/dask_awkward/lib/io/parquet.py

@@ -483,6 +483,7 @@ def __init__(
        npartitions: int,
        prefix: str | None = None,
        storage_options: dict | None = None,
+        write_metadata: bool = False,


This change (and lines below) leaked from another PR

martindurant · 2024-12-12T16:23:24Z

# we can also track metadata per partition, e.g.:
print(something.compute())
# >> (<__main__.some at 0x10a7fa680>, <__main__.some at 0x10a7fa650>)

print(static.compute())
# >> <Array [1, 1, 1, 1, 1, 1, 1, 1] type='8 * float64'>

Quick question on usage: why are there two returns for the first one above, but only one for the second?

pfackeldey · 2024-12-12T17:57:38Z

> # we can also track metadata per partition, e.g.:
> print(something.compute())
> # >> (<__main__.some at 0x10a7fa680>, <__main__.some at 0x10a7fa650>)
>
> print(static.compute())
> # >> <Array [1, 1, 1, 1, 1, 1, 1, 1] type='8 * float64'>
>
> Quick question on usage: why are there two returns for the first one above, but only one for the second?

The first one is a dask.Bag, while the second one is a dak.Array. In the dak.Array case the partitions are concatenated along the first axis, while a dask.Bag returns a list/tuple with 1 element per partition.
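
A plain-Python analogy of this difference, with lists standing in for partitions (the helper names are made up for illustration and are not part of dask or dask-awkward):

```python
from itertools import chain

# Two partitions, mirroring `dak.from_awkward(ak_array, 2)` in the example.
array_partitions = [[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
object_partitions = ["some() from partition 0", "some() from partition 1"]


def combine_like_array(partitions):
    # Array-like results are concatenated along the first axis.
    return list(chain.from_iterable(partitions))


def combine_like_bag(partitions):
    # Bag-like results keep one element per partition.
    return tuple(partitions)


print(len(combine_like_array(array_partitions)))  # 8
print(len(combine_like_bag(object_partitions)))   # 2
```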

pfackeldey · 2024-12-12T17:58:51Z

I'd like to follow up on this PR after some work I'm doing related to #559. I found some synergy with mapfilter that would be nice to implement.
I'll update this PR afterwards.

pfackeldey added 2 commits November 19, 2024 13:35

add mapfilter decorator

4213659

mapfilter: fix args passing

1026471

pfackeldey requested a review from martindurant November 19, 2024 13:18

pfackeldey marked this pull request as draft November 19, 2024 18:36

mapfilter: raise NotImplementedError for non-akward arrays or datafra…

836b24b

…me-like return types

pfackeldey mentioned this pull request Nov 21, 2024

feat: Content.form_with_key_path() scikit-hep/awkward#3311

Merged

pfackeldey and others added 11 commits November 26, 2024 13:11

wip

3634179

properly refactor mapfilter and prerun functions

2f3ef20

polishing + doc strings

cfa79f5

remove obsolete code & improve error message

bae0915

Only return parquet metadata if intending to write

330c5f6

Add fire&forget experimental option

1d68731

tree option

18493f6

Tree becomes only way when not making parquet meta file

6fe45fd

add a little documentation

9a4f09a

allow older awkward-cpp in ci

bc3cc1d

add docs for and

afb628c

pfackeldey mentioned this pull request Dec 2, 2024

wrong token generation with dak.map_partitions #553

Open

martindurant reviewed Dec 12, 2024
