HEP analyses will increasingly need to join small columns of privately produced data with a larger, centrally produced dataset.
This is not yet possible in dask_awkward, but it is reasonably well trodden in dask-array and dask-dataframe.
The goal would be to take two datasets with a shared key definition (like run number, luminosity block, and event number) and merge those tables into a single unified dataset that a dask_awkward analysis can handle cleanly (accounting for missing keys, file skew, etc. - i.e. the usual general distributed table-join operations a la SQL).
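For comparison, here is a minimal sketch of the analogous operation in dask-dataframe, which already supports distributed joins on arbitrary key columns; the column names (`run`, `lumi`, `event`, `weight`, `pt`) are made up for illustration:

```python
import dask.dataframe as dd
import pandas as pd

# Hypothetical "central" dataset: one row per event, keyed by (run, lumi, event).
central = dd.from_pandas(
    pd.DataFrame(
        {"run": [1, 1, 2], "lumi": [10, 11, 10], "event": [100, 101, 100], "pt": [25.0, 40.0, 31.5]}
    ),
    npartitions=2,
)

# Hypothetical small, privately produced column keyed the same way.
private = dd.from_pandas(
    pd.DataFrame({"run": [1, 2], "lumi": [11, 10], "event": [101, 100], "weight": [0.9, 1.1]}),
    npartitions=1,
)

# Left join on the composite key; central rows with no match get NaN weights.
joined = central.merge(private, on=["run", "lumi", "event"], how="left")
print(joined.compute())
```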
Joins on a single partition of concrete, in-memory arrays are trivial to implement.
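A minimal sketch of such a single-partition, in-memory left join on plain awkward arrays (this is not dask_awkward API; the field names are hypothetical):

```python
import awkward as ak
import numpy as np

# Hypothetical record arrays keyed by an event number.
central = ak.Array({"event": [100, 101, 102, 103], "pt": [[25.0], [40.0, 12.0], [], [31.5]]})
private = ak.Array({"event": [103, 101], "weight": [1.1, 0.9]})

# Map each key on the small side to its row index, then look up the keys of `central`.
lookup = {int(k): i for i, k in enumerate(private["event"])}
idx = np.array([lookup.get(int(k), -1) for k in central["event"]])
found = idx >= 0

# Gather the matching rows and mask out the misses, so unmatched rows become None.
weight = ak.mask(private["weight"][np.where(found, idx, 0)], found)
joined = ak.with_field(central, weight, "weight")
print(joined.to_list())
```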
> is reasonably well trodden in dask-array and dask-dataframe.
Only the latter? It depends on a well-understood concept of an index, which doesn't really apply to our arrays. Is such a join always "row-wise", meaning by top-level array item rather than at some deeper nesting level in the schema?
It applies to records, at least in HEP.
Indeed, it's row-wise: there would be some key that we specify (like run number, luminosity block, and event number) on which we would join the two record arrays.
I think if we could get the row-wise implementation in first, support for more arbitrary keys could come later; row-wise should be by far the most common and useful case. The distributed join would be the most important thing to address, since the in-memory part is easy, as you say.
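A conceptual sketch of that distributed piece (no dask_awkward API implied): hash-partition both sides on the composite key so that matching rows land in the same partition, then apply the in-memory join partition by partition. All names and data here are made up:

```python
# Toy shuffle-then-join, illustrating the standard distributed-join strategy.
def composite_key(runs, lumis, events):
    # One hashable key per row from (run, lumi, event).
    return list(zip(runs, lumis, events))

def shuffle_by_key(keys, payloads, npartitions):
    # Route each (key, payload) row to the partition chosen by hashing its key.
    parts = [[] for _ in range(npartitions)]
    for key, payload in zip(keys, payloads):
        parts[hash(key) % npartitions].append((key, payload))
    return parts

NPART = 4
central_parts = shuffle_by_key(composite_key([1, 1, 2], [10, 11, 10], [100, 101, 100]),
                               ["jets0", "jets1", "jets2"], NPART)
private_parts = shuffle_by_key(composite_key([2, 1], [10, 11], [100, 101]),
                               ["w2", "w1"], NPART)

# After the shuffle, each key only needs to be matched within its own partition.
for cpart, ppart in zip(central_parts, private_parts):
    lookup = dict(ppart)
    for key, payload in cpart:
        print(key, payload, lookup.get(key))  # None where the private side has no match
```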
Related to / likely depends on #250