ENH: Support nested renaming / selection #26399
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -568,6 +568,67 @@ For a grouped ``DataFrame``, you can rename in a similar manner: | |
'mean': 'bar', | ||
'std': 'baz'})) | ||
|
||
.. _groupby.aggregate.named: | ||
|
||
Named Aggregation | ||
~~~~~~~~~~~~~~~~~ | ||
|
||
.. versionadded:: 0.25.0 | ||
|
||
To support column-specific aggregation *with control over the output column names*, pandas | ||
accepts the special syntax in :meth:`GroupBy.agg`, known as "named aggregation", where | ||
|
||
- The keywords are the *output* column names | ||
- The values are tuples whose first element is the column to select | ||
and the second element is the aggregation to apply to that column. Pandas | ||
provides the ``pandas.NamedAgg`` namedtuple with the fields ``['column', 'aggfunc']`` | ||
to make it clearer what the arguments are. As usual, the aggregation can | ||
be a callable or a string alias. | ||
|
||
.. ipython:: python | ||
|
||
animals = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'], | ||
'height': [9.1, 6.0, 9.5, 34.0], | ||
'weight': [7.9, 7.5, 9.9, 198.0]}) | ||
animals | ||
|
||
animals.groupby("kind").agg( | ||
min_height=pd.NamedAgg(column='height', aggfunc='min'), | ||
max_height=pd.NamedAgg(column='height', aggfunc='max'), | ||
average_weight=pd.NamedAgg(column='weight', aggfunc=np.mean), | ||
) | ||
|
||
|
||
``pandas.NamedAgg`` is just a ``namedtuple``. Plain tuples are allowed as well. | ||
|
||
.. ipython:: python | ||
|
||
animals.groupby("kind").agg( | ||
min_height=('height', 'min'), | ||
max_height=('height', 'max'), | ||
average_weight=('weight', np.mean), | ||
Review comment: I would not show the example in a mixed form (as this is something we really don't want to recommend, I think?). I would maybe just show it twice, e.g. first with tuples and then with comment |
||
) | ||
|
||
|
||
If your desired output column names are not valid Python identifiers, construct a dictionary | ||
Review comment: I guess these are technically identifiers instead of keywords |
||
and unpack the keyword arguments | ||
|
||
.. ipython:: python | ||
|
||
animals.groupby("kind").agg(**{ | ||
'total weight': pd.NamedAgg(column='weight', aggfunc=sum), | ||
}) | ||
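A small sketch of why the unpacking form works: Python accepts any string as a key in ``**`` unpacking, even one that is not a valid identifier. The ``collect`` helper below is hypothetical and only stands in for ``agg``.

```python
def collect(**kwargs):
    """Hypothetical stand-in for ``agg``: return the keyword arguments as received."""
    return kwargs


# 'total weight' contains a space, so it cannot be written as a bare keyword,
# but it is accepted when unpacked from a dictionary.
named = collect(**{'total weight': ('weight', 'sum')})
print(named)
```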
|
||
Additional keyword arguments are not passed through to the aggregation functions. Only pairs | ||
of ``(column, aggfunc)`` should be passed as ``**kwargs``. If your aggregation functions | ||
require additional arguments, partially apply them with :func:`functools.partial`. | ||
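A minimal sketch of the ``functools.partial`` approach, assuming a pandas version that supports named aggregation (0.25.0 or later). ``scaled_range`` and its ``scale`` parameter are hypothetical and exist only for illustration; the frame mirrors the ``animals`` example.

```python
from functools import partial

import pandas as pd


def scaled_range(values, scale):
    """Hypothetical aggregation: the spread of a group multiplied by ``scale``."""
    return (values.max() - values.min()) * scale


animals = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
                        'height': [9.1, 6.0, 9.5, 34.0],
                        'weight': [7.9, 7.5, 9.9, 198.0]})

# ``scale`` is bound up front, so ``agg`` only ever sees a one-argument callable.
result = animals.groupby('kind').agg(
    double_height_range=('height', partial(scaled_range, scale=2)),
)
print(result)
```

Binding the argument outside ``agg`` keeps the keyword namespace reserved for output column names.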
|
||
.. note:: | ||
|
||
For Python 3.5 and earlier, the order of ``**kwargs`` in a function was not | ||
preserved. This means that the output column ordering would not be | ||
consistent. To ensure consistent ordering, the keys (and so output columns) | ||
will always be sorted for Python 3.5. | ||
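The sorting fallback described in the note can be sketched in plain Python. ``ordered_outputs`` is a hypothetical helper, and its ``preserves_order`` flag stands in for the interpreter version check.

```python
from collections import OrderedDict


def ordered_outputs(preserves_order, **kwargs):
    """Return output names in the order the result columns would take."""
    if not preserves_order:
        # Python 3.5 and earlier: keyword order is not guaranteed, so sort by name.
        kwargs = OrderedDict(sorted(kwargs.items()))
    return list(kwargs)


spec = {'min_height': ('height', 'min'), 'average_weight': ('weight', 'mean')}
as_written = ordered_outputs(True, **spec)
sorted_names = ordered_outputs(False, **spec)
print(as_written)
print(sorted_names)
```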
|
||
Applying different functions to DataFrame columns | ||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
@@ -588,19 +649,6 @@ must be either implemented on GroupBy or available via :ref:`dispatching | |
|
||
grouped.agg({'C': 'sum', 'D': 'std'}) | ||
|
||
-.. note:: | ||
|
||
-If you pass a dict to ``aggregate``, the ordering of the output columns is | ||
-non-deterministic. If you want to be sure the output columns will be in a specific | ||
-order, you can use an ``OrderedDict``. Compare the output of the following two commands: | ||
|
||
-.. ipython:: python | ||
|
||
-from collections import OrderedDict | ||
|
||
-grouped.agg({'D': 'std', 'C': 'mean'}) | ||
-grouped.agg(OrderedDict([('D', 'std'), ('C', 'mean')])) | ||
|
||
.. _groupby.aggregate.cython: | ||
|
||
Cython-optimized aggregation functions | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,4 @@ | ||
-from pandas.core.groupby.groupby import GroupBy # noqa: F401 | ||
from pandas.core.groupby.generic import ( # noqa: F401 | ||
-SeriesGroupBy, DataFrameGroupBy) | ||
+DataFrameGroupBy, NamedAgg, SeriesGroupBy) | ||
+from pandas.core.groupby.groupby import GroupBy # noqa: F401 | ||
from pandas.core.groupby.grouper import Grouper # noqa: F401 |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -6,15 +6,18 @@ | |
which here returns a DataFrameGroupBy object. | ||
""" | ||
|
||
from collections import OrderedDict, abc | ||
from collections import OrderedDict, abc, namedtuple | ||
import copy | ||
from functools import partial | ||
from textwrap import dedent | ||
import typing | ||
from typing import Any, Callable, List, Union | ||
import warnings | ||
|
||
import numpy as np | ||
|
||
from pandas._libs import Timestamp, lib | ||
from pandas.compat import PY36 | ||
from pandas.errors import AbstractMethodError | ||
from pandas.util._decorators import Appender, Substitution | ||
|
||
|
@@ -41,6 +44,10 @@ | |
|
||
from pandas.plotting._core import boxplot_frame_groupby | ||
|
||
NamedAgg = namedtuple("NamedAgg", ["column", "aggfunc"]) | ||
Review comment: can you add a comment on what this is (can we add a doc-string)?
Reply: We could using |
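A short sketch of why a plain tuple and ``NamedAgg`` are interchangeable, using only the standard library. The definition is copied from the diff above; the field values are illustrative.

```python
from collections import namedtuple

# Same definition as in the diff above.
NamedAgg = namedtuple("NamedAgg", ["column", "aggfunc"])

spec = NamedAgg(column="height", aggfunc="min")

# A namedtuple is still a tuple: it compares equal to the plain form,
# has length 2, and unpacks positionally.
column, aggfunc = spec
print(spec == ("height", "min"), isinstance(spec, tuple), len(spec))
```

Because of this, any check written against two-element tuples accepts both spellings without special casing.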
||
# TODO(typing) the return value on this callable should be any *scalar*. | ||
AggScalar = Union[str, Callable[..., Any]] | ||
Review comment: I think this should be a TypeVar instead of a Union |
||
|
||
|
||
class NDFrameGroupBy(GroupBy): | ||
|
||
|
@@ -144,8 +151,18 @@ def _cython_agg_blocks(self, how, alt=None, numeric_only=True, | |
return new_items, new_blocks | ||
|
||
def aggregate(self, func, *args, **kwargs): | ||
|
||
_level = kwargs.pop('_level', None) | ||
|
||
relabeling = func is None and _is_multi_agg_with_relabel(**kwargs) | ||
if relabeling: | ||
func, columns, order = _normalize_keyword_aggregation(kwargs) | ||
|
||
kwargs = {} | ||
elif func is None: | ||
# nicer error message | ||
raise TypeError("Must provide 'func' or tuples of " | ||
"'(column, aggfunc)'.") | ||
|
||
result, how = self._aggregate(func, _level=_level, *args, **kwargs) | ||
if how is None: | ||
return result | ||
|
@@ -179,6 +196,10 @@ def aggregate(self, func, *args, **kwargs): | |
self._insert_inaxis_grouper_inplace(result) | ||
result.index = np.arange(len(result)) | ||
|
||
if relabeling: | ||
result = result[order] | ||
result.columns = columns | ||
|
||
return result._convert(datetime=True) | ||
|
||
agg = aggregate | ||
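The relabeling step (``result[order]`` followed by assigning ``result.columns``) can be sketched on a hand-built frame. The values below are illustrative, and ``order`` and ``columns`` only mirror the shape of what ``_normalize_keyword_aggregation`` returns.

```python
import pandas as pd

# Intermediate result as the ordinary multi-function aggregation would label it:
# one (input column, function name) pair per output column.
result = pd.DataFrame(
    [[9.1, 9.5, 8.9], [6.0, 34.0, 102.75]],
    index=pd.Index(['cat', 'dog'], name='kind'),
    columns=pd.MultiIndex.from_tuples(
        [('height', 'min'), ('height', 'max'), ('weight', 'mean')]),
)

# Illustrative values in the order the user wrote the keywords.
order = [('weight', 'mean'), ('height', 'min'), ('height', 'max')]
columns = ['average_weight', 'min_height', 'max_height']

relabeled = result[order]     # reorder to match the keyword order
relabeled.columns = columns   # replace the two-level labels with the output names
print(relabeled)
```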
|
@@ -791,11 +812,8 @@ def _aggregate_multiple_funcs(self, arg, _level): | |
# list of functions / function names | ||
columns = [] | ||
for f in arg: | ||
if isinstance(f, str): | ||
columns.append(f) | ||
else: | ||
# protect against callables without names | ||
columns.append(com.get_callable_name(f)) | ||
columns.append(com.get_callable_name(f) or f) | ||
|
||
|
||
arg = zip(columns, arg) | ||
|
||
results = OrderedDict() | ||
|
@@ -1296,6 +1314,26 @@ class DataFrameGroupBy(NDFrameGroupBy): | |
A | ||
1 1 2 0.590716 | ||
2 3 4 0.704907 | ||
|
||
To control the output names with different aggregations per column, | ||
pandas supports "named aggregation" | ||
|
||
>>> df.groupby("A").agg( | ||
... b_min=pd.NamedAgg(column="B", aggfunc="min"), | ||
... c_sum=pd.NamedAgg(column="C", aggfunc="sum")) | ||
b_min c_sum | ||
A | ||
1 1 -1.956929 | ||
2 3 -0.322183 | ||
|
||
- The keywords are the *output* column names | ||
- The values are tuples whose first element is the column to select | ||
and the second element is the aggregation to apply to that column. | ||
Pandas provides the ``pandas.NamedAgg`` namedtuple with the fields | ||
``['column', 'aggfunc']`` to make it clearer what the arguments are. | ||
As usual, the aggregation can be a callable or a string alias. | ||
|
||
See :ref:`groupby.aggregate.named` for more. | ||
""") | ||
|
||
@Substitution(see_also=_agg_see_also_doc, | ||
|
@@ -1304,7 +1342,7 @@ class DataFrameGroupBy(NDFrameGroupBy): | |
klass='DataFrame', | ||
axis='') | ||
@Appender(_shared_docs['aggregate']) | ||
-def aggregate(self, arg, *args, **kwargs): | ||
+def aggregate(self, arg=None, *args, **kwargs): | ||
|
||
return super().aggregate(arg, *args, **kwargs) | ||
|
||
agg = aggregate | ||
|
@@ -1577,3 +1615,77 @@ def groupby_series(obj, col=None): | |
return results | ||
|
||
boxplot = boxplot_frame_groupby | ||
|
||
|
||
def _is_multi_agg_with_relabel(**kwargs): | ||
""" | ||
Check whether the kwargs passed to .agg look like multi-agg with relabeling. | ||
|
||
Parameters | ||
---------- | ||
**kwargs : dict | ||
|
||
Returns | ||
------- | ||
bool | ||
|
||
Examples | ||
-------- | ||
>>> _is_multi_agg_with_relabel(a='max') | ||
False | ||
>>> _is_multi_agg_with_relabel(a_max=('a', 'max'), | ||
... a_min=('a', 'min')) | ||
True | ||
>>> _is_multi_agg_with_relabel() | ||
|
||
False | ||
""" | ||
return all( | ||
isinstance(v, tuple) and len(v) == 2 | ||
for v in kwargs.values() | ||
) and len(kwargs) > 0 | ||
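A standalone sketch of the same check, written so that it always returns a strict ``bool``. The name ``looks_like_named_aggregation`` is hypothetical; the rule it applies (at least one keyword, and every value a two-element tuple) follows the docstring examples above.

```python
def looks_like_named_aggregation(**kwargs):
    """Hypothetical standalone version of the check; always returns a bool."""
    return bool(kwargs) and all(
        isinstance(v, tuple) and len(v) == 2 for v in kwargs.values()
    )


print(looks_like_named_aggregation(a='max'))                                # False
print(looks_like_named_aggregation(a_max=('a', 'max'), a_min=('a', 'min'))) # True
print(looks_like_named_aggregation())                                       # False
```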
|
||
|
||
def _normalize_keyword_aggregation(kwargs): | ||
Review comment: can you add a doc-string here |
||
""" | ||
Normalize user-provided "named aggregation" kwargs. | ||
|
||
Transforms from the new ``Dict[str, NamedAgg]`` style kwargs | ||
to the old ``OrderedDict[str, List[scalar]]``. | ||
|
||
Parameters | ||
---------- | ||
kwargs : dict | ||
|
||
Returns | ||
------- | ||
aggspec : dict | ||
The transformed kwargs. | ||
columns : List[str] | ||
The user-provided keys. | ||
order : List[Tuple[str, str]] | ||
Pairs of the input and output column names. | ||
|
||
Examples | ||
-------- | ||
>>> _normalize_keyword_aggregation({'output': ('input', 'sum')}) | ||
(OrderedDict([('input', ['sum'])]), ('output',), [('input', 'sum')]) | ||
""" | ||
if not PY36: | ||
kwargs = OrderedDict(sorted(kwargs.items())) | ||
|
||
# Normalize the aggregation functions as Dict[column, List[func]], | ||
|
||
# process normally, then fixup the names. | ||
# TODO(Py35): When we drop python 3.5, change this to | ||
# defaultdict(list) | ||
aggspec = OrderedDict() # type: typing.OrderedDict[str, List[AggScalar]] | ||
order = [] | ||
columns, pairs = list(zip(*kwargs.items())) | ||
|
||
for name, (column, aggfunc) in zip(columns, pairs): | ||
if column in aggspec: | ||
aggspec[column].append(aggfunc) | ||
else: | ||
aggspec[column] = [aggfunc] | ||
order.append((column, | ||
com.get_callable_name(aggfunc) or aggfunc)) | ||
return aggspec, columns, order |
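A pure-Python sketch of the normalization, showing what happens when two outputs draw on the same input column. ``callable_name`` is a hypothetical stand-in for ``com.get_callable_name``, and ``normalize`` is a simplified version of the function above without the Python 3.5 sorting branch.

```python
from collections import OrderedDict


def callable_name(func):
    """Hypothetical stand-in for com.get_callable_name: a callable's name, else None."""
    return getattr(func, '__name__', None)


def normalize(kwargs):
    """Group the aggregations by input column and record the requested order."""
    aggspec = OrderedDict()
    order = []
    columns, pairs = list(zip(*kwargs.items()))
    for column, aggfunc in pairs:
        # Several outputs may use the same input column, so collect a list.
        aggspec.setdefault(column, []).append(aggfunc)
        # String aliases have no __name__, so fall back to the alias itself.
        order.append((column, callable_name(aggfunc) or aggfunc))
    return aggspec, columns, order


aggspec, columns, order = normalize({
    'min_height': ('height', 'min'),
    'max_height': ('height', 'max'),
    'total_weight': ('weight', sum),
})
print(aggspec)
print(columns)
print(order)
```

``aggspec`` is what the existing dict-of-lists code path consumes, while ``order`` and ``columns`` are kept aside to select and rename the result afterwards.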
Review comment: can give a :class:`NamedAgg`?
Review comment: is this added in the reference.rst ?
Review comment: The docstring autogenerated by the namedtuple isn't super helpful