Add handling for nested dicts in dask-cudf groupby #9054

charlesbluca · 2021-08-17T21:26:01Z

Adds handling for nested dict (renamed) aggregations supplied to dask-cudf's groupby, by storing the new aggregation names when standardizing the aggs input and applying them in _finalize_gb_agg().
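To make the description concrete, here is a minimal sketch of the normalization step, not the code from this PR. The helper name `normalize_nested_aggs`, the `{new_name: aggregation}` layout of the nested dict, and the `(column, aggregation)` tuple keys of the rename map are all assumptions made for illustration.

```python
def normalize_nested_aggs(aggs):
    """Split renamed (nested dict) aggregations into plain lists plus a rename map.

    Hypothetical helper for illustration only; not the dask-cudf implementation.
    """
    normalized = {}
    renames = {}
    for col, spec in aggs.items():
        if isinstance(spec, dict):
            # Assumed layout: {new_name: aggregation}
            normalized[col] = list(spec.values())
            for new_name, agg in spec.items():
                # Remember the requested output name for this (column, agg) pair
                renames[(col, agg)] = new_name
        elif isinstance(spec, list):
            normalized[col] = list(spec)
        else:
            # A single aggregation becomes a one-element list
            normalized[col] = [spec]
    return normalized, renames


aggs = {"x": {"total": "sum", "average": "mean"}, "y": ["max"], "z": "min"}
normalized, renames = normalize_nested_aggs(aggs)
print(normalized)
print(renames)
```

Under these assumptions, the rest of the groupby only ever sees plain lists of aggregations, and the stored rename map is consulted once at the end when the output columns are labelled.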

charlesbluca · 2021-08-17T21:34:23Z

python/dask_cudf/dask_cudf/groupby.py

+            agg_array.append(
+                aggs_renames.get(_make_name(col, agg, sep=sep), agg)
+            )
+        _meta.columns = pd.MultiIndex.from_arrays([col_array, agg_array])


I don't like that we have to do the aggregation renames for both _meta and the groupby result, but this is required so that we have the correct final_columns for the last step of _finalize_gb_agg(). It would be nice if we also supported nested dict aggregations in cuDF's groupby so that _meta would have the correct index without any additional steps in dask-cuDF.

quasiben · Aug 18, 2021

I think @shwina said this could be done, but it would require some effort. Since pandas does not support nested dicts, it seemed like cuDF did not have to go down this path. We could be wrong, and if you feel strongly you should speak up.

rjzamora · Aug 18, 2021

> Since pandas does not support nested dicts it seemed like cuDF did not have to go down this path

If pandas doesn't support something ugly, I'd lean away from doing it in cudf for the sake of dask-cudf logic :)

charlesbluca · Aug 18, 2021

Yeah, I generally agree. There could be larger motivations to want nested renaming support for groupby in cuDF, but I don't think this case alone is a good enough reason to work on it.
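The duplication discussed in this thread can be pictured with a small sketch. It assumes the renames are stored under `(column, aggregation)` tuple keys and that the columns arrive as such pairs; `build_column_levels` is a hypothetical name, and the real code in the diff keys the lookup differently.

```python
def build_column_levels(columns, renames):
    """Return (col_array, agg_array) for a two-level column index.

    `columns` is a list of (column, aggregation) pairs and `renames` maps such
    a pair to the user-supplied output name. Both layouts are assumptions.
    """
    col_array = []
    agg_array = []
    for col, agg in columns:
        col_array.append(col)
        # Fall back to the aggregation name when no rename was requested
        agg_array.append(renames.get((col, agg), agg))
    return col_array, agg_array


renames = {("x", "sum"): "total"}
columns = [("x", "sum"), ("x", "mean"), ("y", "max")]
col_array, agg_array = build_column_levels(columns, renames)
print(col_array)
print(agg_array)
```

The two lists are the kind of input that `pd.MultiIndex.from_arrays` takes in the diff excerpt above. In this sketch the same helper would have to be called once for `_meta` and once for the computed result, which is the repetition the first comment objects to.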

codecov · 2021-08-17T23:06:32Z

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.10@04b7027).
The diff coverage is n/a.

❗ Current head 479c3da differs from pull request most recent head 0954d23. Consider uploading reports for the commit 0954d23 to get more accurate results.

@@               Coverage Diff               @@
##             branch-21.10    #9054   +/-   ##
===============================================
  Coverage                ?   10.78%           
===============================================
  Files                   ?      114           
  Lines                   ?    18716           
  Branches                ?        0           
===============================================
  Hits                    ?     2018           
  Misses                  ?    16698           
  Partials                ?        0

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

quasiben · 2021-08-18T13:19:58Z

This looks cleaner than #9033, and I would be in favor of this PR. @rjzamora, if you have a few minutes, your thoughts/reviews would be appreciated.

marlenezw · 2021-08-19T10:44:34Z

python/dask_cudf/dask_cudf/groupby.py

@@ -367,6 +382,8 @@ def _is_supported(arg, supported: set):
            for col in arg:
                if isinstance(arg[col], list):
                    _global_set = _global_set.union(set(arg[col]))
+                elif isinstance(arg[col], dict):


Does order matter for _global_set? If it does, using set can sometimes change the order and give unexpected results.

charlesbluca · Aug 19, 2021

Shouldn't matter here, since we only need _global_set to check whether our aggs are a subset of supported. Ordering is more of a concern with _redirect_aggs(), since that function returns a copy of the aggs that's used for the remainder of the groupby. AFAIK that should be good.

marlenezw · Aug 19, 2021

Great, sounds good! For the most part the code looks pretty good to me!
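The point about ordering can be checked with a short sketch of a subset test. The body of the new `dict` branch is not shown in the excerpt above, so this version assumes it collects the nested dict's values (the aggregations, not the new names). The function name `is_supported` and the `SUPPORTED` set are illustrative.

```python
def is_supported(aggs, supported):
    """Check that every aggregation in `aggs` is a member of `supported`."""
    collected = set()
    if isinstance(aggs, dict):
        for spec in aggs.values():
            if isinstance(spec, list):
                collected.update(spec)
            elif isinstance(spec, dict):
                # Assumed layout {new_name: aggregation}: only values matter
                collected.update(spec.values())
            else:
                collected.add(spec)
    elif isinstance(aggs, list):
        collected.update(aggs)
    else:
        return False
    # A subset test ignores element order, so using a set is safe here
    return collected.issubset(supported)


SUPPORTED = {"count", "mean", "sum", "min", "max"}  # illustrative, not the real list
print(is_supported({"x": {"total": "sum"}, "y": ["max", "min"]}, SUPPORTED))  # True
print(is_supported({"x": {"mid": "median"}}, SUPPORTED))  # False
```

Because the set is only used for a membership check and is never iterated to produce output, any reordering it introduces cannot affect the result.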

charlesbluca · 2021-08-20T13:44:44Z

rerun tests

quasiben · 2021-08-26T22:19:01Z

@gpucibot merge

Add handling for nested dicts in dask-cudf groupby

1964f53

charlesbluca added the feature request, 3 - Ready for Review, dask, and non-breaking labels Aug 17, 2021

charlesbluca requested review from a team as code owners August 17, 2021 21:26

charlesbluca requested review from marlenezw and skirui-source August 17, 2021 21:26

github-actions bot added the Python label Aug 17, 2021

charlesbluca added 4 commits August 18, 2021 07:26

Normalize nested dict aggs in one pass

2805907

Fix dict handling in _redirect_aggs, add test case to reflect this

cc82256

Merge remote-tracking branch 'upstream/branch-21.10' into fix-9017

05afea7

Avoid using _make_name for agg renames setting/getting

0954d23

marlenezw reviewed Aug 19, 2021

quasiben mentioned this pull request Aug 21, 2021

[WIP] Handle nested dicts in groupby operations #9033

Closed

marlenezw approved these changes Aug 24, 2021

quasiben approved these changes Aug 26, 2021

rapids-bot bot merged commit 4e0584b into rapidsai:branch-21.10 Aug 26, 2021

sarahyurick mentioned this pull request Aug 27, 2021

Update aggregate.py dask-contrib/dask-sql#207

Closed

charlesbluca deleted the fix-9017 branch July 19, 2022 14:26
