DEPR: deprecate relableling dicts in groupby.agg #15931

jreback · 2017-04-07T02:44:53Z

precursor to #14668

This is basically in the whatsnew, but:

In [1]:     df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
   ...:                        'B': range(5),
   ...:                        'C':range(5)})
   ...:     df
   ...: 
Out[1]: 
   A  B  C
0  1  0  0
1  1  1  1
2  1  2  2
3  2  3  3
4  2  4  4

This is good; multiple aggregations on a dataframe with a dict-of-lists

In [2]:    df.groupby('A').agg({'B': ['sum', 'max'],
   ...:                         'C': ['count', 'min']})
   ...: 
Out[2]: 
    B         C    
  sum max count min
A                  
1   3   2     3   0
2   7   4     2   3
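The dict-of-lists form that stays supported can be reproduced as a standalone script. This is a minimal sketch using the same frame as above; the variable names `result`, `column_pairs` and `b_sum` are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
                   'B': range(5),
                   'C': range(5)})

# Keys select columns; each list holds the aggregations for that column.
result = df.groupby('A').agg({'B': ['sum', 'max'],
                              'C': ['count', 'min']})

# The columns form a two-level MultiIndex of (column, aggregation) pairs.
column_pairs = result.columns.tolist()
b_sum = result[('B', 'sum')].tolist()
```

No renaming happens here: the outer level is always the original column name and the inner level is the name of the aggregation.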

This is a dict on a grouped Series -> deprecated

In [3]: df.groupby('A').B.agg({'foo': 'count'})
FutureWarning: using a dictionary on a Series for aggregation
is deprecated and will be removed in a future version
Out[3]: 
   foo
A     
1    3
2    2

Further this has to go as well, a nested dict that does renaming.
Note once we fix #4160 (renaming with a level); the following becomes almost trivial to rename in-line.

In [4]: df.groupby('A').agg({'B': {'foo': ['sum', 'max']}, 
                             'C': {'bar': ['count', 'min']}})
FutureWarning: using a dictionary on a Series for aggregation
is deprecated and will be removed in a future version
Out[4]: 
  foo       bar    
  sum max count min
A                  
1   3   2     3   0
2   7   4     2   3

Note: I will fix this message (as it doesn't actually apply here)

jreback · 2017-04-07T02:56:16Z

cc @jorisvandenbossche @shoyer @wesm @TomAugspurger

codecov · 2017-04-07T03:25:10Z

Codecov Report

Merging #15931 into master will decrease coverage by 0.02%.
The diff coverage is 73.33%.

@@            Coverage Diff             @@
##           master   #15931      +/-   ##
==========================================
- Coverage   91.03%      91%   -0.03%     
==========================================
  Files         145      145              
  Lines       49587    49636      +49     
==========================================
+ Hits        45141    45171      +30     
- Misses       4446     4465      +19

Flag	Coverage Δ
#multiple	`88.77% <73.33%> (-0.03%)`	⬇️
#single	`40.53% <5.33%> (-0.04%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/groupby.py	`95.54% <100%> (ø)`	⬆️
pandas/types/cast.py	`85.11% <20%> (-0.63%)`	⬇️
pandas/core/base.py	`92.32% <71.42%> (-3.19%)`	⬇️
pandas/core/common.py	`91.03% <0%> (+0.34%)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7b8a6b1...ff1a5f6.

jreback · 2017-04-09T13:48:07Z

@jorisvandenbossche @chris-b1 @shoyer thoughts? (Chris, this was discussed at the meeting as a way to reduce the number of cases that we would handle.) The idea is to go ahead with this deprecation, then merge .agg (which will carry the same deprecation; or maybe I can just raise there, since that is new functionality, but that comes afterwards anyhow).

chris-b1 · 2017-04-09T14:18:39Z

This seems like a reasonable deprecation, the current behavior is probably too overloaded and hard to think about.

Might put the recommended way in the deprecation message? Would also be nice to have #4160 in 0.20, so the DataFrame case is more consistent.

jreback · 2017-04-09T14:35:45Z

yes going to try to fix #4160 before the release as well.

jreback · 2017-04-12T10:22:39Z

@jorisvandenbossche @TomAugspurger @chris-b1 @shoyer if you'd have a look. going to merge later today.

TomAugspurger

+1 on a quick skim. A few other places in the docs that may need updating:

https://github.com/pandas-dev/pandas/blame/1751628adef96b913d0083a48e51658a70dac8c4/doc/source/computation.rst#L613
remove second sentence here
replace this example with the new syntax: https://github.com/jreback/pandas/blob/d87e564be48a3db2a4c55d78cbf23d4177cbac2d/pandas/core/groupby.py#L2787

TomAugspurger · 2017-04-12T11:42:39Z

doc/source/whatsnew/v0.20.0.txt

+1) We are deprecating passing a dict to a grouped/rolled/resampled ``Series``. This allowed
+one to ``rename`` the resulting aggregation, but this had a completely different
+meaning than passing a dictionary to a grouped ``DataFrame``, which accepts column-to-aggregations.
+2) We are deprecating passing a dict-of-dict to a grouped/rolled/resampled ``DataFrame`` in a similar manner.


dict-of-dicts

TomAugspurger · 2017-04-12T11:44:04Z

doc/source/whatsnew/v0.20.0.txt

+.. code-block:: ipython
+
+   In [6]: df.groupby('A').B.agg({'foo': 'count'})
+   FutureWarning: using a dictionary on a Series for aggregation


Reminder to update this with the new FutureWarning if we change the message

TomAugspurger · 2017-04-12T11:48:07Z

doc/source/whatsnew/v0.20.0.txt

+.. ipython:: python
+
+   r = df.groupby('A').agg({'B': ['sum', 'max'], 'C': ['count', 'min']})
+   r.columns = r.columns.set_levels(['foo', 'bar'], level=0)


I think .rename works here as well

In [11]: r.rename(columns={"B": "foo", "C": "bar"})
Out[11]: 
  foo       bar    
  sum max count min
A                  
1   3   2     3   0
2   7   4     2   3

though I didn't realize .rename worked like that on MI. I thought you'd get tuples.

Yes, for the first level this indeed works.

hmm that's nice actually.
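The `.rename` route discussed above can be checked in isolation. This is a small sketch, not the whatsnew text; passing `level=0` is an addition here that restricts the mapping to the outer level, so an aggregation that happened to be called `B` or `C` would not be touched.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
                   'B': range(5),
                   'C': range(5)})

r = df.groupby('A').agg({'B': ['sum', 'max'], 'C': ['count', 'min']})

# Relabel only the outer level of the column MultiIndex.
renamed = r.rename(columns={'B': 'foo', 'C': 'bar'}, level=0)
renamed_pairs = renamed.columns.tolist()
```

The inner level keeps the aggregation names, so the result still has MultiIndex columns rather than flat tuples.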

jorisvandenbossche

Nice whatsnew docs again!

There are some places in the docs that need updating as well. Eg here: http://pandas-docs.github.io/pandas-docs-travis/groupby.html#applying-multiple-functions-at-once and here: http://pandas-docs.github.io/pandas-docs-travis/timeseries.html#aggregation

jorisvandenbossche · 2017-04-12T12:39:59Z

doc/source/whatsnew/v0.20.0.txt

+(potentially different) aggregations.
+
+However, ``.agg(..)`` can *also* accept a dict that allows 'renaming' of the result columns. This is a complicated and confusing syntax, as well as not consistent
+between ``Series`` and ``DataFrame``. We are deprecating this 'renaming' functionarility.


typo in functionarility

jorisvandenbossche · 2017-04-12T12:43:33Z

doc/source/whatsnew/v0.20.0.txt

+.. ipython:: python
+
+   df.groupby('A').agg({'B': ['sum', 'max'],
+                        'C': ['count', 'min']})


You might do the simpler thing of a dict of scalars instead of list of lists (to not complicate the example).

Eg

In [29]: df.groupby('A').agg({'B': 'sum','C': 'min'})
Out[29]: 
   C  B
A      
1  0  3
2  3  7

Then it even contrasts more with the series one, where the example is also a dict of scalar.

jorisvandenbossche · 2017-04-12T12:56:44Z

doc/source/whatsnew/v0.20.0.txt

+
+.. ipython:: python
+
+   df.groupby('A').B.agg(['count']).rename({'count': 'foo'})


df.groupby('A').B.agg('count').rename('foo') would actually be simpler .. (but is not exactly equivalent to the dict case)

yes, there are a myriad of options!
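The two replacements mentioned in this exchange differ in the type they return, which is why they are "not exactly equivalent". A minimal sketch of both, assuming the `df` from the PR description:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
                   'B': range(5),
                   'C': range(5)})

grouped = df.groupby('A')['B']

# List of aggregations -> DataFrame; rename its single column.
# This matches the shape the deprecated dict form returned.
as_frame = grouped.agg(['count']).rename(columns={'count': 'foo'})

# Single aggregation -> Series; rename the Series itself.
as_series = grouped.agg('count').rename('foo')
```

Note that the DataFrame variant needs `columns=` in the call to `rename`; without it the mapping is applied to the row index.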

jorisvandenbossche · 2017-04-12T12:57:47Z

doc/source/whatsnew/v0.20.0.txt

+.. ipython:: python
+
+   r = df.groupby('A').agg({'B': ['sum', 'max'], 'C': ['count', 'min']})
+   r.columns = r.columns.set_levels(['foo', 'bar'], level=0)


Yes, for the first level this indeed works.

jorisvandenbossche · 2017-04-12T13:23:04Z

pandas/tests/groupby/test_aggregate.py

+                           'C': range(5)})
+
+        with tm.assert_produces_warning(FutureWarning,
+                                        check_stacklevel=False) as w:


It is giving an error without the check_stacklevel=False ?

yeah I never check the stacklevel, too hard to get it exactly right.

jorisvandenbossche · 2017-04-12T13:29:04Z

pandas/tests/groupby/test_aggregate.py

+            df.groupby('A').agg({'B': {'foo': ['sum', 'max']},
+                                 'C': {'bar': ['count', 'min']}})
+            assert "using a dict with renaming" in str(w[0].message)
+


It is tested in other cases as well, but since this is a test specifically for the deprs, maybe also add the case of df.groupby('A')[['B', 'C']].agg({'ma': 'max'}) (then you have the different 'cases' that raise deprecation here)

jorisvandenbossche · 2017-04-12T13:37:36Z

pandas/core/base.py

-    def name(self):
+    def _selection_name(self):
+        """ return a name for myself; this would ideally be the 'name' property, but
+        we cannot conflict with the Series.name property which can be set """


This explanation is not fully clear to me (but maybe I am not familiar enough with the groupby codebase)

yeah this is purely internal.

jorisvandenbossche · 2017-04-12T13:46:40Z

pandas/core/groupby.py

+                    ("using a dict on a Series for aggregation\n"
+                     "is deprecated and will be removed in a future "
+                     "version"),
+                    FutureWarning, stacklevel=7)


Using python 3.5, this needed to be 3 instead of 7 for the example of the whatsnew docs (but may depend on the code path taken up to this function)

yeah I have no idea what these should be. I think I was playing with these when testing.

the real issue is that the agg evaluation is recursive (somewhat), so I could figure it out I suppose but...

I see you changed them all to 4, that is indeed good for the others, but for this one you need 3 for the case in those tests.

(patch based on the previous state of this PR, the first two are already OK):

diff --git a/pandas/core/base.py b/pandas/core/base.py
index 25ebb1d..451c773 100644
--- a/pandas/core/base.py
+++ b/pandas/core/base.py
@@ -502,7 +502,7 @@ pandas.DataFrame.%(name)s
                         ("using a dict with renaming "
                          "is deprecated and will be removed in a future "
                          "version"),
-                        FutureWarning, stacklevel=3)
+                        FutureWarning, stacklevel=4)
 
                 arg = new_arg
 
@@ -516,7 +516,7 @@ pandas.DataFrame.%(name)s
                     ("using a dict with renaming "
                      "is deprecated and will be removed in a future "
                      "version"),
-                    FutureWarning, stacklevel=3)
+                    FutureWarning, stacklevel=4)
 
             from pandas.tools.concat import concat
 
diff --git a/pandas/core/groupby.py b/pandas/core/groupby.py
index 978f444..5591ce4 100644
--- a/pandas/core/groupby.py
+++ b/pandas/core/groupby.py
@@ -2843,7 +2843,7 @@ class SeriesGroupBy(GroupBy):
                     ("using a dict on a Series for aggregation\n"
                      "is deprecated and will be removed in a future "
                      "version"),
-                    FutureWarning, stacklevel=7)
+                    FutureWarning, stacklevel=3)
 
             columns = list(arg.keys())
             arg = list(arg.items())

The above gives correct warnings for the three cases in this explicit test for deprecation warnings.

So I would try to remove check_stacklevel=False for those three (and leave it for all other, as indeed with some other code paths, it might be different ..)
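Why the right `stacklevel` depends on the code path can be seen without pandas. The sketch below uses only the standard `warnings` module; `deprecated_inner`, `public_api` and `capture` are hypothetical names standing in for the internal helper, the public method and the user's call site.

```python
import warnings


def deprecated_inner(stacklevel):
    # Stand-in for the internal function that emits the deprecation.
    warnings.warn("using a dict with renaming is deprecated",
                  FutureWarning, stacklevel=stacklevel)


def public_api(stacklevel):
    # Stand-in for the public method the user actually calls.
    return deprecated_inner(stacklevel)


def capture(stacklevel):
    # Stand-in for user code: call the API and record the warning.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        public_api(stacklevel)
    return caught[0]


w1 = capture(1)  # attributed to the warn() call inside deprecated_inner
w2 = capture(2)  # attributed to the call inside public_api
w3 = capture(3)  # attributed to the call inside capture (the "user" line)
```

Each extra frame between the user's line and the `warnings.warn` call raises the required `stacklevel` by one, so a partly recursive evaluation reaches the same `warn` at different depths and no single value is right for every path.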

jorisvandenbossche · 2017-04-12T13:47:37Z

pandas/core/base.py

+
+                    raise ValueError("cannot perform both aggregation "
+                                     "and transformation operations "
+                                     "simultaneously")


Is this a new error message? (and in case so, just checking if there is a test added for it?)

these are used in the .agg changes (which are on top of this). They aren't used in this PR, but were in the same file so I left them.

jreback · 2017-04-12T22:45:03Z

ok docs are updated. note that in the post-PR #14668 I already rewrote a lot of the docs (and links) for agg/transform.

jreback · 2017-04-13T10:18:07Z

merging, will fix up any additional comments in #14668

zertrin · 2017-06-13T07:27:33Z

Hi, sorry for digging this up, but even if I understand the rationale for the deprecation, and after reading the What's New and the documentation, I still don't see how to replace the following use case.

(The documentation only covers the simple cases where one either applies exactly one aggregator per column, or the same set of aggregators to all columns, but not the case where different sets of aggregators are applied to different columns):

Input Dataframe:

mydf = pd.DataFrame(
    {
        'cat': ['A', 'A', 'A', 'B', 'B', 'C'],
        'energy': [1.8, 1.95, 2.04, 1.25, 1.6, 1.01],
        'distance': [1.2, 1.5, 1.74, 0.82, 1.01, 0.6]
    },
    index=range(6))

  cat  distance  energy
0   A      1.20    1.80
1   A      1.50    1.95
2   A      1.74    2.04
3   B      0.82    1.25
4   B      1.01    1.60
5   C      0.60    1.01

Cool aggregation and rename in one step (but DEPRECATED):

mydf_agg = mydf.groupby('cat').agg({
    'energy': {'energy_sum': 'sum'},
    'distance': {
        'distance_sum': 'sum',
        'distance_mean': 'mean',
    },
})

Resulting in a MultiIndex columns

        energy     distance              
    energy_sum distance_sum distance_mean
cat                                      
A         5.79         4.44         1.480
B         2.85         1.83         0.915
C         1.01         0.60         0.600

Just have to drop the upper level to get to my resulting dataframe with the renamed columns:

mydf_agg.columns = mydf_agg.columns.droplevel(level=0)

     energy_sum  distance_sum  distance_mean
cat                                         
A          5.79          4.44          1.480
B          2.85          1.83          0.915
C          1.01          0.60          0.600

Of course this is a toy example, in a typical usecase there can be many more columns/aggregator functions.

So my question is: could you provide an example of the currently recommended way to achieve the exact same result (last Dataframe) in the case where different sets of aggregator are applied to different columns.

zertrin · 2017-06-13T07:42:20Z

Oh, I see that this was originally documented, but subsequently simplified:
ff1a5f6#diff-52364fb643114f3349390ad6bcf24d8fL521

However by trying this approach, I'm still blocked:

mydf_agg2 = mydf.groupby('cat').agg({
    'energy': 'sum',
    'distance': ['sum', 'mean'],
})

    energy distance       
       sum      sum   mean
cat                       
A     5.79     4.44  1.480
B     2.85     1.83  0.915
C     1.01     0.60  0.600

But then, how can I rename with a mapping of (level0 + level1 --> final_name) like this:

{
    'energy.sum': 'energy_sum',
    'distance.sum': 'distance_sum',
    'distance.mean': 'distance_mean',
}

Or even better, by using some kind of callable like this:

def rename_mapping(level0, level1):
    return level0 + '_' + level1

zertrin · 2017-06-13T08:03:11Z

Sorry for the spam (this is the last one) but I just found an interesting discussion and solutions here: https://stackoverflow.com/questions/19078325/naming-returned-columns-in-pandas-aggregate-function (don't look at the accepted answer)

In particular, the missing piece of information for me was the existence of the df.columns.ravel() method.

newidx = []
for (n1,n2) in mydf_agg.columns.ravel():
    newidx.append("%s_%s" % (n1,n2))
mydf_agg.columns=newidx
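The same flattening can be written without `ravel()`, since iterating over MultiIndex columns already yields `(level0, level1)` tuples. The sketch below covers both forms asked about earlier, a callable and an explicit mapping; `explicit_names` is a hypothetical mapping chosen for illustration.

```python
import pandas as pd

mydf = pd.DataFrame({
    'cat': ['A', 'A', 'A', 'B', 'B', 'C'],
    'energy': [1.8, 1.95, 2.04, 1.25, 1.6, 1.01],
    'distance': [1.2, 1.5, 1.74, 0.82, 1.01, 0.6],
})

mydf_agg2 = mydf.groupby('cat').agg({
    'energy': 'sum',
    'distance': ['sum', 'mean'],
})


def rename_mapping(level0, level1):
    return level0 + '_' + level1


# Variant 1: derive each flat name from the two levels with a callable.
flat = mydf_agg2.copy()
flat.columns = [rename_mapping(*pair) for pair in mydf_agg2.columns]

# Variant 2: look each (level0, level1) pair up in an explicit mapping.
explicit_names = {
    ('energy', 'sum'): 'total_energy',
    ('distance', 'sum'): 'total_distance',
    ('distance', 'mean'): 'average_distance',
}
relabeled = mydf_agg2.copy()
relabeled.columns = [explicit_names[pair] for pair in mydf_agg2.columns]
```

Keying the mapping on the tuple rather than on position means the names do not silently shift if the order of aggregations changes; a missing key raises `KeyError` instead.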

More generally, I think it is good to leave a link to this Stack Overflow thread here, since after seeing the deprecation message, this GitHub pull request is one of the first places to look for solutions (after the docs and the What's New).

Maybe some of Joel Ostblom's answer and/or of Gadi Oron's answer could make their way into the docs as an example for all of us that relied previously on this relabeling functionality with .agg() ?

In particular, with this deprecation, the use of lambda functions in .agg() is directly impacted (cf Joel Ostblom's answer above) and could warrant a notice in the docs.

jreback · 2017-06-13T10:28:54Z

@zertrin if you want to show a more extended / complex example in the docs that would be great. push up a PR and will comment.

garfieldthecat · 2017-10-11T17:16:20Z

Please, please, pretty please, do NOT deprecate this. Not only is removing backward compatibility always an issue, and one of the key obstacles in the adoption of Python for data science - it makes it way more cumbersome to run what should be extremely banal, i.e. a groupby where different aggregate functions are applied to different columns (sum of x, avg of x, min of y, etc), and where you have the explicit need to rename the resulting field (e.g. sum_x won't do). The way you are going, you are forcing people to rename fields manually after the groupby - surely this is as non-pythonic as it gets?

I do not understand in what way removing this feature would possibly clean anything up, or make anything clearer. How would you answer this question now? https://stackoverflow.com/questions/32374620/python-pandas-applying-different-aggregate-functions-to-different-columns

How would you recommend rewriting this very simple and IMHO pythonically elegant line of code?

df.groupby('qtr').agg({"realgdp": {"mean_gdp": "Mean GDP", "std_gdp": "STD of GDP"},
                                "unemp": {"mean_unemp": "Mean unemployment"}})

TomAugspurger · 2017-10-11T17:31:18Z

I assume you meant

In [21]: df.groupby('qtr').agg({"realgdp": {"mean_gdp": "mean", "std_gdp": "std"},
    ...:                        "unemp": {"mean_unemp": "mean"}})
    ...:
/Users/taugspurger/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/groupby.py:4139: FutureWarning: using a dict with renaming is deprecated and will be removed in a future version
  return super(DataFrameGroupBy, self).aggregate(arg, *args, **kwargs)
Out[21]:
     realgdp                  unemp
    mean_gdp     std_gdp mean_unemp
qtr
1     1692.0  115.258405       4.95
2     1658.8         NaN       5.60
3     1723.0         NaN       4.60
4     1753.9         NaN       4.20

The recommendation in the whatsnew gets you most of the way there:

In [30]: r = df.groupby("qtr").agg({"realgdp": ['mean', 'std'], "unemp": ['mean']})

In [32]: r
Out[32]:
    realgdp             unemp
       mean         std  mean
qtr
1    1692.0  115.258405  4.95
2    1658.8         NaN  5.60
3    1723.0         NaN  4.60
4    1753.9         NaN  4.20

Does that work for you? At this point I typically get rid of the MI in the columns, since I find them awkward to work with.

garfieldthecat · 2017-10-11T17:59:41Z

How would you recommend renaming the columns?
If I just do columns.droplevel(0), I end up with multiple columns sharing the same name, as the same aggregate function applies to multiple columns.
I could do something like
r.columns = [' '.join(col).strip() for col in r.columns.values]

so that the fields become: [ "x sum", "x min", "y sum"] etc. (or whatever the aggregate functions were)
and take it from here, but it is still longer and more cumbersome than my previous approach.

**Can someone please, please, please remind me why this is being deprecated?

I see the downsides, I do not see any upside!**

Removing backward compatibility should always be a last resort. Doing so when the new approach becomes way longer and more convoluted, well, it just beggars belief!

zertrin · 2017-10-12T03:13:51Z

> At this point I typically get rid of the MI in the columns, since I find them awkward to work with.

Yeah, this is one issue with this change: it makes something that was trivially simple before much harder to do now. Especially when applying the same aggregate over many columns, you can't just drop the first level of the MI.

In summary, this is what this change results in:

Before:

straightforward, quite easy to understand, flexible (I have the choice for the name of the columns)

mydf_agg = mydf.groupby('cat').agg({
    'energy': {'total_energy': 'sum'},
    'distance': {
        'total_distance': 'sum',
        'average_distance': 'mean',
    },
})
# get rid of the first MultiIndex level in a pretty straightforward way
mydf_agg.columns = mydf_agg.columns.droplevel(level=0)

Result:

    total_energy total_distance average_distance
cat
A           5.79           4.44            1.480
B           2.85           1.83            0.915
C           1.01           0.60            0.600

After:

No way of really customizing the column names after aggregation, the best we can get is some combination of original column name and aggregate function's name:

mydf_agg2 = mydf.groupby('cat').agg({
    'energy': 'sum',
    'distance': ['sum', 'mean'],
})
mydf_agg2.columns = ['_'.join(col) for col in mydf_agg2.columns]

Result:

     energy_sum  distance_sum  distance_mean
cat
A          5.79          4.44          1.480
B          2.85          1.83          0.915
C          1.01          0.60          0.600

Note that I couldn't really choose the name of the resulting columns.... If I want to, I need to find another way of replacing the name. Like a mapping like this (which is annoying to write):

mydf_agg2.rename(columns={"energy_sum": "total_energy", "distance_sum": "total_distance", "distance_mean": "average_distance"}, inplace=True)

Now we finally get the same result as before, just in a longer and more complicated way...

And another annoying issue with this change: when using custom aggregation callables:

Before there's no issue since I could specify the destination's column name myself.
Now I can't do it that easily since the destination column name is based on the aggregate callable's name and I need to make sure that my custom aggregation callable has a __name__ attribute... Which isn't necessarily the case with partial or lambda functions for example.

TomAugspurger · 2017-10-12T10:47:35Z

> Can someone please, please, please remind me why this is being deprecated?
>
> I see the downsides, I do not see any upside!

I'm assuming you saw the release note with the deprecation? A nested dictionary meant we had two behaviors for the renaming, either selecting columns, or assigning names.

Thanks for the thoughtful writeup @zertrin. It sounds like the main difficulty is with the renaming. Would something like

mydf.groupby('cat').agg({
    'energy': 'sum',
    'distance': ['sum', 'mean'],
}).collapse_levels(columns="_")  # ['_'.join(col) for col in df.columns]

Work for you? That's when the "default" names are OK. For non-default names, maybe something like

mydf.groupby('cat').agg({
    ...
}).relabel(columns=['c1', 'c2', 'c3'])
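Neither `collapse_levels` nor `relabel` is an existing pandas method; both are names floated in the comment above. For readers on pandas 0.25 or later, the library's named aggregation (keyword arguments of the form `new_name=(column, aggfunc)`) covers the non-default-name case. A minimal sketch, assuming pandas 0.25+ and the `mydf` frame from earlier in the thread:

```python
import pandas as pd

mydf = pd.DataFrame({
    'cat': ['A', 'A', 'A', 'B', 'B', 'C'],
    'energy': [1.8, 1.95, 2.04, 1.25, 1.6, 1.01],
    'distance': [1.2, 1.5, 1.74, 0.82, 1.01, 0.6],
})

# Each keyword is the output column; the tuple is (input column, aggregation).
named = mydf.groupby('cat').agg(
    total_energy=('energy', 'sum'),
    total_distance=('distance', 'sum'),
    average_distance=('distance', 'mean'),
)
```

The result has flat columns in keyword order, so no MultiIndex level needs to be dropped afterwards. This syntax did not exist when the comment above was written.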

garfieldthecat · 2017-10-12T10:55:24Z

> I'm assuming you saw the release note with the deprecation? A nested dictionary meant we had two behaviors for the renaming, either selecting columns, or assigning names.

Do you mean these notes? #14668
I read them but I'm not sure I understood the reason. Could you maybe try to re-explain?

garfieldthecat · 2017-10-12T12:23:03Z

Also, another problem with renaming is if you use more than one lambda function on the same column, e.g. to calculate the % of both sum(x) and count(x). In this case, you'd end up with multiple columns having the same name: "x_lambda". Quite a mess! You could rename the columns based on their position rather than their names, but it's extremely cumbersome and un-pythonic. All of this would, of course, be avoided by not deprecating.
Or is there maybe a better way I am missing?

zertrin · 2017-10-12T13:35:39Z

@TomAugspurger Thanks for proposing alternatives, however these fail to tackle the real issue.

The problem is not whether or not it is possible to do the renaming and how. The answer to that is yes it's possible and your proposed solutions do not bring more value than the other example above (like assigning directly to mydf.columns), except the burden of adding two more methods to the already long list of methods of the DataFrame class.

The real issue is that this change forces us to separate the place where the renaming is defined from the definition of the corresponding aggregate function.

Semantically this is really annoying, because now we have to keep track of two lists and keep them in sync when we want to add another aggregate column. We must track down in which order the new aggregate column will land and where in the renaming list to update after adding one more aggregate function...

So in a nutshell:

Before, column renaming and the definition of the operation were together, so they are naturally in sync.
Now, first you define the aggregate callable, and afterward you have to rename and be very careful about the resulting column order.

And no one has even begun to address the issue of using custom aggregates. Because these custom callables may have the same __name__ attribute and this results in an exception (partial functions inherit the name of the parent function, and one cannot define it at creation, and all lambda functions are named <lambda> and this is worse because afaik there's no way to define the name of a lambda).

Thus this is a backward incompatible change, and this one has no easy workaround. (There exist tricky workarounds to add the __name__ attribute.)
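The naming problem described here comes down to what `__name__` a callable carries. The sketch below is plain Python with no pandas involved; `spread`, `half_spread` and `upper` are hypothetical aggregates used only to show the attribute behavior.

```python
from functools import partial


def spread(values, scale=1.0):
    """Hypothetical aggregate: range of the values times a scale factor."""
    return (max(values) - min(values)) * scale


half_spread = partial(spread, scale=0.5)
upper = lambda values: max(values)

# What a name-based labelling scheme can see before any fix-up.
names_before = [getattr(f, '__name__', None)
                for f in (spread, half_spread, upper)]

# Both objects accept an explicit __name__, which gives each a distinct label.
half_spread.__name__ = 'half_spread'
upper.__name__ = 'upper'
names_after = [f.__name__ for f in (spread, half_spread, upper)]
```

Assigning `__name__` also works on a lambda, so the workaround is available for both kinds of callable, although it still has to be done by hand for each one.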

Slightly extended example from above with lambda and partial:

(please note, this is a crafted example for the purpose of demonstrating the problem, but all of the demonstrated issues here did bite me in real life since the change)

Before:

easy and works as expected

from functools import partial

import numpy as np
import statsmodels.robust as smrb

percentile17 = lambda x: np.percentile(x, 17)
mad_c1 = partial(smrb.mad, c=1)

mydf_agg = mydf.groupby('cat').agg({
    'energy': {
        'total_energy': 'sum',
        'energy_p98': lambda x: np.percentile(x, 98),
        'energy_p17': percentile17,
    },
    'distance': {
        'total_distance': 'sum',
        'average_distance': 'mean',
        'distance_mad': smrb.mad,
        'distance_mad_c1': mad_c1,
    },
})

results in

          energy                             distance
    total_energy energy_p98 energy_p17 total_distance average_distance distance_mad distance_mad_c1
cat
A           5.79     2.0364     1.8510           4.44            1.480     0.355825           0.240
B           2.85     1.5930     1.3095           1.83            0.915     0.140847           0.095
C           1.01     1.0100     1.0100           0.60            0.600     0.000000           0.000

and all that is left is:

# get rid of the first MultiIndex level in a pretty straightforward way
mydf_agg.columns = mydf_agg.columns.droplevel(level=0)

After

from functools import partial

import numpy as np
import statsmodels.robust as smrb

percentile17 = lambda x: np.percentile(x, 17)
mad_c1 = partial(smrb.mad, c=1)

mydf_agg = mydf.groupby('cat').agg({
    'energy': [
        'sum',
        lambda x: np.percentile(x, 98),
        percentile17
    ],
    'distance': [
        'sum',
        'mean',
        smrb.mad,
        mad_c1
    ],
})

The above breaks because the lambda functions will all result in columns named <lambda> which results in

SpecificationError: Function names must be unique, found multiple named <lambda>

Backward incompatible regression: one cannot apply two different lambdas to the same original column anymore.

If one removes the lambda x: np.percentile(x, 98) from above, we get the same issue with the partial function which inherits the function name from the original function:

SpecificationError: Function names must be unique, found multiple named mad

Finally, after overwriting the __name__ attribute of the partial (mad_c1.__name__ = 'mad_c1') we get:

    energy          distance
       sum <lambda>      sum   mean       mad mad_c1
cat
A     5.79   1.8510     4.44  1.480  0.355825  0.240
B     2.85   1.3095     1.83  0.915  0.140847  0.095
C     1.01   1.0100     0.60  0.600  0.000000  0.000

with still the renaming to deal with.
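One way to keep each output name next to its definition without the nested dict is to aggregate column by column and assemble the pieces. This is only a sketch of that idea, not a recipe endorsed in this thread; it swaps the `statsmodels` aggregates for `np.percentile` so that it runs with pandas and NumPy alone.

```python
from functools import partial

import numpy as np
import pandas as pd

mydf = pd.DataFrame({
    'cat': ['A', 'A', 'A', 'B', 'B', 'C'],
    'energy': [1.8, 1.95, 2.04, 1.25, 1.6, 1.01],
    'distance': [1.2, 1.5, 1.74, 0.82, 1.01, 0.6],
})

grouped = mydf.groupby('cat')

percentile17 = lambda x: np.percentile(x, 17)
percentile98 = partial(np.percentile, q=98)

# Each entry pairs the output name with exactly one aggregation, so the
# callables' own names never matter and cannot collide.
mydf_agg = pd.DataFrame({
    'total_energy': grouped['energy'].sum(),
    'energy_p98': grouped['energy'].agg(percentile98),
    'energy_p17': grouped['energy'].agg(percentile17),
    'total_distance': grouped['distance'].sum(),
    'average_distance': grouped['distance'].mean(),
})
```

The trade-off is one pass over the groups per output column instead of a single `.agg` call, which may matter on large frames.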

zertrin · 2017-11-02T04:39:33Z

@TomAugspurger @jreback do we need to open a separate issue to get this deprecation being reconsidered with all the new facts summarized above that were not initially considered when deciding this?

shoyer · 2017-11-02T05:38:14Z

If you feel strongly about this, then yes, a new issue would be appropriate. I agree that this API is not as expressive as what we had before, but the behavior we had before for .agg() was inconsistent and could not be explained with a simple set of rules. Please read the full discussion on #14668 for context.

I would be interested to see proposals for alternative APIs that solve your use-case without the full complexity of the deprecated GroupBy.agg() API. For example, one solution might be to handle the deprecated behavior (dict-of-dict) with a new dedicated method.

TomAugspurger · 2017-11-02T10:53:13Z

But yes, please open a new issue for discussion so that this isn't buried.


jaron-hivery · 2017-11-14T00:06:57Z

@zertrin did you open a new issue for this discussion?

zertrin · 2017-11-14T00:12:31Z

Not yet, have been pretty busy lately, but have a draft. Will try to finish it very soon.

zertrin · 2017-11-19T07:17:45Z

@jaron-hivery the new issue is #18366

pirsquared · 2017-11-30T20:22:28Z

I'm sure @zertrin saw an email but I provided an easy recipe to produce the same results using existing API.

#18366 (comment)

Grouped, rolled, and resample Series / DataFrames will now disallow dicts / nested dicts respectively as parameters to aggregation (was deprecated before). xref pandas-devgh-15931.

jreback added Deprecate Functionality to remove in pandas Groupby Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Apr 7, 2017

jreback added this to the 0.20.0 milestone Apr 7, 2017

jreback changed the title ~~DEPR: deprecate relabling dictionarys in groupby.agg~~ DEPR: deprecate relableling dicts in groupby.agg Apr 7, 2017

jreback force-pushed the af branch from ec6361f to 8d64882 Compare April 7, 2017 20:39

jreback force-pushed the af branch 2 times, most recently from a63da2d to d87e564 Compare April 12, 2017 10:20

jsexauer mentioned this pull request Apr 12, 2017

DEPR: Clean up list of deprecations from prior versions #6581

Closed

1 task

TomAugspurger reviewed Apr 12, 2017

jorisvandenbossche reviewed Apr 12, 2017

jreback and others added 4 commits April 12, 2017 18:17

DEPR: deprecate relabling dictionarys in groupby.agg

d0107af

give proper message for Dataframe with renaming keys

29f3ae6

docs & fix window test

7262515

update docs per review

ff1a5f6

jreback force-pushed the af branch from d87e564 to ff1a5f6 Compare April 12, 2017 22:43

jreback merged commit 1c4dacb into pandas-dev:master Apr 13, 2017

jreback mentioned this pull request Apr 13, 2017

TST: use checkstack level as per comments in groupby.agg with dicts depr testing #15992

Merged

jorisvandenbossche mentioned this pull request Apr 15, 2017

ENH/BUG: Rename of MultiIndex DataFrames does not work #4160

Open

TomAugspurger mentioned this pull request May 4, 2017

Additional parameterization for groupby.agg #2286

Closed

zertrin mentioned this pull request Nov 19, 2017

Deprecation of relabeling dicts in groupby.agg brings many issues #18366

Closed

gfyoung mentioned this pull request Oct 28, 2018

API: Disallow dict as agg parameter during groupby #23393

Closed

jreback mentioned this pull request Oct 28, 2018

DEPR: deprecations log for removed issues #13777

Closed

Khris777 mentioned this pull request Feb 28, 2019

"SpecificationError: nested dictionary is ambiguous in aggregation" in a certain case of groupby-aggregation #25471

Closed

ghost mentioned this pull request Jul 14, 2019

Discuss: transformation vs. aggregation in agg vs. transform #27389

Closed

OlivierCavadenti mentioned this pull request Oct 8, 2021

TST : add test for groupby aggregation dtype #43915

Merged

4 tasks

rhshadrach mentioned this pull request Jan 11, 2023

DEPR: SeriesGroupBy.agg with dict argument #50684

Closed

