-
-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DEPR: deprecate relableling dicts in groupby.agg #15931
Conversation
Codecov Report
@@ Coverage Diff @@
## master #15931 +/- ##
==========================================
- Coverage 91.03% 91% -0.03%
==========================================
Files 145 145
Lines 49587 49636 +49
==========================================
+ Hits 45141 45171 +30
- Misses 4446 4465 +19
Continue to review full report at Codecov.
|
@jorisvandenbossche @chris-b1 @shoyer thoughts (chris this was talked about at the meeting, to try to reduce the number of cases that we would handle). The idea is that we go ahead with this deprecation, then merge |
This seems like a reasonable deprecation, the current behavior is probably too overloaded and hard to think about. Might put the recommended way in the deprecation message? Would also be nice to have #4160 in 0.20, so the DataFrame case is more consistent. |
yes going to try to fix #4160 before the release as well. |
a63da2d
to
d87e564
Compare
@jorisvandenbossche @TomAugspurger @chris-b1 @shoyer if you'd have a look. going to merge later today. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on a quick skim. A few other places in the docs that may need updating:
- https://github.com/pandas-dev/pandas/blame/1751628adef96b913d0083a48e51658a70dac8c4/doc/source/computation.rst#L613
- remove second sentence here
- replace this example with the new syntax: https://github.com/jreback/pandas/blob/d87e564be48a3db2a4c55d78cbf23d4177cbac2d/pandas/core/groupby.py#L2787
doc/source/whatsnew/v0.20.0.txt
Outdated
1) We are deprecating passing a dict to a grouped/rolled/resampled ``Series``. This allowed | ||
one to ``rename`` the resulting aggregation, but this had a completely different | ||
meaning than passing a dictionary to a grouped ``DataFrame``, which accepts column-to-aggregations. | ||
2) We are deprecating passing a dict-of-dict to a grouped/rolled/resampled ``DataFrame`` in a similar manner. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dict-of-dicts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
doc/source/whatsnew/v0.20.0.txt
Outdated
.. code-block:: ipython | ||
|
||
In [6]: df.groupby('A').B.agg({'foo': 'count'}) | ||
FutureWarning: using a dictionary on a Series for aggregation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reminder to updated this with the new FutureWarning if we change the message
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
doc/source/whatsnew/v0.20.0.txt
Outdated
.. ipython:: python | ||
|
||
r = df.groupby('A').agg({'B': ['sum', 'max'], 'C': ['count', 'min']}) | ||
r.columns = r.columns.set_levels(['foo', 'bar'], level=0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think .rename
works here as well
In [11]: r.rename(columns={"B": "foo", "C": "bar"})
Out[11]:
foo bar
sum max count min
A
1 3 2 3 0
2 7 4 2 3
though I didn't realize .rename
worked like that on MI. I thought you'd get tuples.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, for the first level this indeed works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm that's nice actually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice whatsnew docs again!
There are some places in the docs that need updating as well. Eg here: http://pandas-docs.github.io/pandas-docs-travis/groupby.html#applying-multiple-functions-at-once and here: http://pandas-docs.github.io/pandas-docs-travis/timeseries.html#aggregation
doc/source/whatsnew/v0.20.0.txt
Outdated
(potentially different) aggregations. | ||
|
||
However, ``.agg(..)`` can *also* accept a dict that allows 'renaming' of the result columns. This is a complicated and confusing syntax, as well as not consistent | ||
between ``Series`` and ``DataFrame``. We are deprecating this 'renaming' functionarility. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo in functionarility
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
doc/source/whatsnew/v0.20.0.txt
Outdated
.. ipython:: python | ||
|
||
df.groupby('A').agg({'B': ['sum', 'max'], | ||
'C': ['count', 'min']}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might do the simpler thing of a dict of scalars instead of list of lists (to not complicate the example).
Eg
In [29]: df.groupby('A').agg({'B': 'sum','C': 'min'})
Out[29]:
C B
A
1 0 3
2 3 7
Then it even contrasts more with the series one, where the example is also a dict of scalar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
|
||
.. ipython:: python | ||
|
||
df.groupby('A').B.agg(['count']).rename({'count': 'foo'}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
df.groupby('A').B.agg('count').rename('foo')
would actually be simpler .. (but is not exactly equivalent to the dict case)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, there are a myriad of options!
doc/source/whatsnew/v0.20.0.txt
Outdated
.. ipython:: python | ||
|
||
r = df.groupby('A').agg({'B': ['sum', 'max'], 'C': ['count', 'min']}) | ||
r.columns = r.columns.set_levels(['foo', 'bar'], level=0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, for the first level this indeed works.
'C': range(5)}) | ||
|
||
with tm.assert_produces_warning(FutureWarning, | ||
check_stacklevel=False) as w: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is giving an error without the check_stacklevel=False
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah I never check the stacklevel, too hard to get it exactly right.
df.groupby('A').agg({'B': {'foo': ['sum', 'max']}, | ||
'C': {'bar': ['count', 'min']}}) | ||
assert "using a dict with renaming" in str(w[0].message) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is tested in other cases as well, but since this is a test specifically for the deprs, maybe also add the case of df.groupby('A')[['B', 'C']].agg({'ma': 'max'})
(then you have the different 'cases' that raise deprecation here)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
pandas/core/base.py
Outdated
def name(self): | ||
def _selection_name(self): | ||
""" return a name for myself; this would ideally be the 'name' property, but | ||
we cannot conflict with the Series.name property which can be set """ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This explanation is not fully clear to me (but maybe I am not familiar enough with the groupby codebase)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah this is purely internal.
pandas/core/groupby.py
Outdated
("using a dict on a Series for aggregation\n" | ||
"is deprecated and will be removed in a future " | ||
"version"), | ||
FutureWarning, stacklevel=7) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using python 3.5, this needed to be 3 instead of 7 for the example of the whatsnew docs (but may depend on the code path taken up to this function)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah I have no idea what these should be. I think I was playing with these when testing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the real issue is that the agg evaulation is recursive (somewhat), so I could figure it out I suppose but...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see you changed them all to 4, that is indeed good for the others, but for this one you need 3 for the case in those tests.
(patch based on the previous state of this PR, the first two are already OK):
diff --git a/pandas/core/base.py b/pandas/core/base.py
index 25ebb1d..451c773 100644
--- a/pandas/core/base.py
+++ b/pandas/core/base.py
@@ -502,7 +502,7 @@ pandas.DataFrame.%(name)s
("using a dict with renaming "
"is deprecated and will be removed in a future "
"version"),
- FutureWarning, stacklevel=3)
+ FutureWarning, stacklevel=4)
arg = new_arg
@@ -516,7 +516,7 @@ pandas.DataFrame.%(name)s
("using a dict with renaming "
"is deprecated and will be removed in a future "
"version"),
- FutureWarning, stacklevel=3)
+ FutureWarning, stacklevel=4)
from pandas.tools.concat import concat
diff --git a/pandas/core/groupby.py b/pandas/core/groupby.py
index 978f444..5591ce4 100644
--- a/pandas/core/groupby.py
+++ b/pandas/core/groupby.py
@@ -2843,7 +2843,7 @@ class SeriesGroupBy(GroupBy):
("using a dict on a Series for aggregation\n"
"is deprecated and will be removed in a future "
"version"),
- FutureWarning, stacklevel=7)
+ FutureWarning, stacklevel=3)
columns = list(arg.keys())
arg = list(arg.items())
The above gives correct warnings for the three cases in this explicit test for deprecation warnings.
So I would try to remove check_stacklevel=False
for those three (and leave it for all other, as indeed with some other code paths, it might be different ..)
|
||
raise ValueError("cannot perform both aggregation " | ||
"and transformation operations " | ||
"simultaneously") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a new error message? (and in case so, just checking if there is a test added for it?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this are used in .agg changes (which are on top of this). They aren't used in this PR, but were in the same file so left them.
ok docs are updated. note that in the post-PR #14668 I already rewrote a lot of the docs (and links) for agg/transform. |
merging, will fix up any additional comments in #14668 |
Hi, sorry for digging this up, but even if I understand the rationale for the deprecation, and after reading the What's New and the documentation, I still don't see how to replace the following use case. (The documentation is only covering the simple case where one either apply exactly one aggregator per column, or the same set of aggregators over all columns, but not when different sets of aggregator are applied to different columns): Input Dataframe:
Cool aggregation and rename in one step (but DEPRECATED):
Resulting in a MultiIndex columns
Just have to drop the upper level to get to my resulting dataframe with the renamed columns:
Of course this is a toy example, in a typical usecase there can be many more columns/aggregator functions. So my question is: could you provide an example of the currently recommended way to achieve the exact same result (last Dataframe) in the case where different sets of aggregator are applied to different columns. |
Oh i see that that was originally documented, but subsequently simplified: However by trying this approach, I'm still blocked:
But then, how can I rename with a mapping of (
Or even better, by using some kind of callable like this:
|
Sorry for the spam (this is the last one) but I just found an interesting discussion and solutions here: https://stackoverflow.com/questions/19078325/naming-returned-columns-in-pandas-aggregate-function (don't look at the accepted answer) In particular, the missing piece of information for me was the existence of the
More generally I think this is good to leave a link to this stackoverflow thread here, since after seeing the deprecation message, this GitHub pull request is one of the first place where to look for solutions (after the docs and what's new). Maybe some of Joel Ostblom's answer and/or of Gadi Oron's answer could make their way into the docs as an example for all of us that relied previously on this relabeling functionality with .agg() ? In particular, with this deprecation, the use of lambda functions in .agg() is directly impacted (cf Joel Ostblom's answer above) and could warrant a notice in the docs. |
@zertrin if you want to show a more extended / complex example in the docs that would be great. push up a PR and will comment. |
Please, please, pretty please, do NOT deprecate this. Not only is removing backward compatibility always an issue, and one of the key obstacles in the adoption of Python for data science - it makes it way more cumbersome to run what should be extremely banal, i.e. a groupby where different aggregate functions are applied to different columns (sum of x, avg of x, min of y, etc), and where you have the explicit need to rename the resulting field (e.g. sum_x won't do). The way you are going, you are forcing people to rename fields manually after the groupby - surely this is as non-pythonic as it gets? I do not understand in what way removing this feature would possibly clean anything up, or make anything clearer. How would you answer this question now? https://stackoverflow.com/questions/32374620/python-pandas-applying-different-aggregate-functions-to-different-columns How would you recommend rewriting this very simple and IMHO pythonically elegant line of code?
|
I assume you meant In [21]: df.groupby('qtr').agg({"realgdp": {"mean_gdp": "mean", "std_gdp": "std"},
...: "unemp": {"mean_unemp": "mean"}})
...:
/Users/taugspurger/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/groupby.py:4139: FutureWarning: using a dict with renaming is deprecated and will be removed in a future version
return super(DataFrameGroupBy, self).aggregate(arg, *args, **kwargs)
Out[21]:
realgdp unemp
mean_gdp std_gdp mean_unemp
qtr
1 1692.0 115.258405 4.95
2 1658.8 NaN 5.60
3 1723.0 NaN 4.60
4 1753.9 NaN 4.20 The recommendation in the whatsnew gets you most of the way there: In [30]: r = df.groupby("qtr").agg({"realgdp": ['mean', 'std'], "unemp": ['mean']})
In [32]: r
Out[32]:
realgdp unemp
mean std mean
qtr
1 1692.0 115.258405 4.95
2 1658.8 NaN 5.60
3 1723.0 NaN 4.60
4 1753.9 NaN 4.20 Does that work for you? At this point I typically get rid of the MI in the columns, since I find them awkward to work with. |
How would you recommend renaming the columns? so that the fields become: [ "x sum", "x min", "y sum"] etc. (or whatever the aggregate functions were) **Can someone please, please, please remind me why this is being deprecated? I see the downsides, I do not see any upside!** Removing backward compatibility should always be a last resort. Doing so when the new approach becomes way longer and more convoluted, well, it just beggars belief! |
Yeah, this is one of issue with this change: it makes something trivially simple to do before much harder to do now. Especially when applying the same aggregate over many columns, you can't just drop the first level of the MI. In summary, this is what this changes results in: Before:straightforward, quite easy to understand, flexible (I have the choice for the name of the columns)
Result:
After:No way of really customizing the column names after aggregation, the best we can get is some combination of original column name and aggregate function's name:
Result:
Note that I couldn't really choose the name of the resulting columns.... If I want to, I need to find another way of replacing the name. Like a mapping like this (which is annoying to write):
Now we finally get the same result as before, just in a longer and more complicated way... And another annoying issue with this change: when using custom aggregation callables:
|
I'm assuming you saw the release note with the deprecation? A nested dictionary meant we had two behaviors for the renaming, either selecting columns, or assigning names. Thanks for the thoughtful writeup @zertrin. It sounds like the main difficulty is with the renaming. Would something like mydf.groupby('cat').agg({
'energy': 'sum',
'distance': ['sum', 'mean'],
}).collapse_levels(columns="_") # [-'.join(col) for col in df.columns] Work for you? That's when the "default" names are OK. For non-default names, maybe something like mydf.groupby('cat').agg({
...
}).relabel(columns=['c1', 'c2', 'c3']) |
Do you mean these notes? #14668 |
Also, another problem with renaming is if you use more than one lambda function on the same column, e.g. to calculate the % of both sum(x) and count(x). In this case, you'd end up with multiple columns having the same name: "x_lambda". Quite a mess! You could rename the columns based on their position rather than their names, but it's extremely cumbersome and un-pythonic. All of this would, of course, be avoided by not deprecating. |
@TomAugspurger Thanks for proposing alternatives, however these miss to tackle the real issue. The problem is not whether or not it is possible to do the renaming and how. The answer to that is yes it's possible and your proposed solutions do not bring more value than the other example above (like assigning directly to The real issue is that this change forces us to separate the place where the renaming is defined from the definition of the corresponding aggregate function. Semantically this is really annoying, because now we have to keep track of two lists and keep them in sync when we want to add another aggregate column. We must track down in which order the new aggregate column will land and where in the renaming list to update after adding one more aggregate function... So in a nutshell:
And no one has even begun to address the issue of using custom aggregates. Because these custom callables may have the same Thus this is a backward incompatible change, and this one has no easy workaround. (there exists tricky workarounds to add the Slightly extended example from above with lambda and partial: (please note, this is a crafted example for the purpose of demonstrating the problem, but all of the demonstrated issues here did bite me in real life since the change) Before:easy and works as expected import numpy as np
import statsmodels.robust as smrb
percentile17 = lambda x: np.percentile(x, 17)
mad_c1 = partial(smrb.mad, c=1)
mydf_agg = mydf.groupby('cat').agg({
'energy': {
'total_energy': 'sum',
'energy_p98': lambda x: np.percentile(x, 98),
'energy_p17': percentile17,
},
'distance': {
'total_distance': 'sum',
'average_distance': 'mean',
'distance_mad': smrb.mad,
'distance_mad_c1': mad_c1,
},
}) results in
and all is left is: # get rid of the first MultiIndex level in a pretty straightforward way
mydf_agg.columns = mydf_agg.columns.droplevel(level=0) Afterimport numpy as np
import statsmodels.robust as smrb
percentile17 = lambda x: np.percentile(x, 17)
mad_c1 = partial(smrb.mad, c=1)
mydf_agg = mydf.groupby('cat').agg({
'energy': [
'sum',
lambda x: np.percentile(x, 98),
percentile17
],
'distance': [
'sum',
'mean',
smrb.mad,
mad_c1
],
}) The above breaks because the lambda functions will all result in columns named
Backward incompatible regression: one cannot apply two different lambdas to the same original column anymore. If one removes the
Finally, after overwriting the
with still the renaming to deal with. |
@TomAugspurger @jreback do we need to open a separate issue to get this deprecation being reconsidered with all the new facts summarized above that were not initially considered when deciding this? |
If you feel strongly about this, then yes, a new issue would be appropriate. I agree that this API is not as expressive as what we had before, but the behavior we had before for I would be interested to see proposals for alternative APIs that solve your use-case without the full complexity of the deprecated |
But yes, please open a new issue for discussion so that this isn't buried.
…On Thu, Nov 2, 2017 at 12:38 AM, Stephan Hoyer ***@***.***> wrote:
If you feel strongly about this, then yes, a new issue would be
appropriate. I agree that this API is not as expressive as what we had
before, but the behavior we had before for .agg() was inconsistent and
could not be explained with a simple set of rules. Please read the full
discussion on #14668 <#14668>
for context.
I would be interested to see proposals for alternative APIs that solve
your use-case without the full complexity of the deprecated GroupBy.agg()
API. For example, one solution might be to handle the deprecated behavior
(dict-of-dict) with a new dedicated method.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#15931 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABQHInnCQicjV9USJ3E3nmQ5obvMqhEYks5syVVKgaJpZM4M2ZQ4>
.
|
@zertrin did you open a new issue for this discussion? |
Not yet, have been pretty busy lately, but have a draft. Will try to finish it very soon. |
@jaron-hivery the new issue is #18366 |
I'm sure @zertrin saw an email but I provided an easy recipe to produce the same results using existing API. |
Grouped, rolled, and resample Series / DataFrames will now disallow dicts / nested dicts respectively as parameters to aggregation (was deprecated before). xref pandas-devgh-15931.
Grouped, rolled, and resample Series / DataFrames will now disallow dicts / nested dicts respectively as parameters to aggregation (was deprecated before). xref pandas-devgh-15931.
pre-curser to #14668
This is basically in the whatsnew, but:
This is good; multiple aggregations on a dataframe with a dict-of-lists
This is a dict on a grouped Series -> deprecated
Further this has to go as well, a nested dict that does renaming.
Note once we fix #4160 (renaming with a level); the following becomes almost trivial to rename in-line.
Note: I will fix this message (as it doesn't actually apply here)