-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Support nested renaming / selection #26399
ENH: Support nested renaming / selection #26399
Conversation
Thanks for bringing this back up as this had certainly gone stale. I might not have been as vocal as I should have and may be in the minority but I rather dislike the mapping of dict keys to labels for the following reasons:
I unfortunately have a travel conflict for the meeting Thursday but just providing these points as food for thought. I'll be sure to check the follow ups - thanks again! |
@WillAyd For clarification: you mean the mapping of "keyword argument names" instead of "dict keys" ? (because it is currently that we use dict keys for it) |
That’s right
…Sent from my iPhone
On May 14, 2019, at 10:57 PM, Joris Van den Bossche ***@***.***> wrote:
I rather dislike the mapping of dict keys to labels
For clarification: you mean the mapping of "keyword argument names" instead of "dict keys" ? (because it is currently that we use dict keys for it)
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.
|
For some of your points about limitation what you can use: note that you can still pass it as a dict (and then do |
Thanks @WillAyd
I agree that it's awkward. In particular, I don't like how idiosyncratic this is; it really is a special mode with its own set of rules. But our set of options (this, nothing, or some other API), I think this is best. I need to read through the discussion again, but do we agree that we need to do something to replace the deprecated feature?
I think this isn't an issue, or it isn't made any worse by this PR. I believe that in "normal" mode (user passing an aggfunc) pandas promises to pass through kwargs to the aggfunc. So I don't think we'd be able to add keywords to
I try to document this. You need to use the (ugly)
If you have a chance, can you make an example? We kinda have that with In [6]: df.groupby('kind').agg(_level=('height', 'sum'))
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-6-45280b886040> in <module>
----> 1 df.groupby('kind').agg(_level=('height', 'sum'))
~/sandbox/pandas/pandas/core/groupby/generic.py in aggregate(self, arg, *args, **kwargs)
1335 @Appender(_shared_docs['aggregate'])
1336 def aggregate(self, arg=None, *args, **kwargs):
-> 1337 return super().aggregate(arg, *args, **kwargs)
1338
1339 agg = aggregate
~/sandbox/pandas/pandas/core/groupby/generic.py in aggregate(self, func, *args, **kwargs)
166 elif func is None:
167 # nicer error message
--> 168 raise TypeError("Must provide 'func' or tuples of "
169 "'(column, aggfunc).")
170
TypeError: Must provide 'func' or tuples of '(column, aggfunc). |
How do people feel about the tuple We could expose a namedtuple, like In [16]: NamedAgg = namedtuple("NamedAgg", ["column", "aggfunc"])
In [17]: NamedAgg('height', 'sum')
Out[17]: NamedAgg(column='height', aggfunc='sum') but I'm not sure where in the pandas namespace that would fit. One other concern: in the issue someone raise an annoyance with multiple lambdas. Currently we raise a SpecificationError In [23]: df.groupby('kind').agg(a=('height', lambda x: 1), b=('height', lambda x: 2))
---------------------------------------------------------------------------
SpecificationError Traceback (most recent call last)
~/sandbox/pandas/pandas/core/base.py in _aggregate(self, arg, *args, **kwargs)
483 try:
--> 484 result = _agg(arg, _agg_1dim)
485 except SpecificationError:
~/sandbox/pandas/pandas/core/base.py in _agg(arg, func)
434 for fname, agg_how in arg.items():
--> 435 result[fname] = func(fname, agg_how)
436 return result
~/sandbox/pandas/pandas/core/base.py in _agg_1dim(name, how, subset)
417 "in aggregation")
--> 418 return colg.aggregate(how, _level=(_level or 0) + 1)
419
~/sandbox/pandas/pandas/core/groupby/generic.py in aggregate(self, func_or_funcs, *args, **kwargs)
766 ret = self._aggregate_multiple_funcs(func_or_funcs,
--> 767 (_level or 0) + 1)
768 else:
~/sandbox/pandas/pandas/core/groupby/generic.py in _aggregate_multiple_funcs(self, arg, _level)
829 'Function names must be unique, found multiple named '
--> 830 '{}'.format(name))
831
SpecificationError: Function names must be unique, found multiple named <lambda> Addressing that will require a bit of work; I've tried to keep this PR small by reusing existing machinery where possible. |
If it is between this and still doing nothing I'd vote for the latter without offering anything better. I think this will be pretty buggy and not provide that great of a user experience anyway, so hesitant to start down this path |
FYI, this seems somewhat hard to support in Python 3.5, since before 3.6 the order of
And how about if I restrict your choices this or nothing? 😄 If you sort our issues by most +1s, this is the 3rd-most upvoted issue right now. So I'm inclined to do something. Right now this seems like the least-bad something to me :) |
Ha tough one! I'd still say nothing in that case. From my (very biased) workflow I couldn't imagine ever really using this instead of just assigning the columns I want after the fact. (caveat: I don't use pipe / method chaining) |
Do we want to keep the fundamental discussion on the issue? (so about to do this or not, not the technical details of the PR) |
Sure, I'll do a short summary of this PR over there. |
It's mainly Will's arguments that are relevant there.
But don't we have the same issue now already with the dict of dicts approach? In those cases we simply sort the keys, so why not here? |
I suppose sorting is best for python 3.5. |
Codecov Report
@@ Coverage Diff @@
## master #26399 +/- ##
==========================================
- Coverage 91.69% 91.68% -0.01%
==========================================
Files 174 174
Lines 50739 50763 +24
==========================================
+ Hits 46524 46542 +18
- Misses 4215 4221 +6
Continue to review full report at Codecov.
|
1 similar comment
Codecov Report
@@ Coverage Diff @@
## master #26399 +/- ##
==========================================
- Coverage 91.69% 91.68% -0.01%
==========================================
Files 174 174
Lines 50739 50763 +24
==========================================
+ Hits 46524 46542 +18
- Misses 4215 4221 +6
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #26399 +/- ##
==========================================
- Coverage 91.77% 91.76% -0.01%
==========================================
Files 174 174
Lines 50649 50677 +28
==========================================
+ Hits 46483 46505 +22
- Misses 4166 4172 +6
Continue to review full report at Codecov.
|
@TomAugspurger : thanks for keeping this alive, but I think 'ENH' is underselling it, it was essentially an outright BUG (#18366) to deprecate existing behavior in @WillAyd : can you please be more specific and quantifiable about your objections? Yes, dict-based-renaming limits to valid ASCII(/Unicode?) names. If a user wants an arbitrary name, they can either name-mangle, or just use a different method than dict-based-renaming. But for 99+% of users this functionality will work out-of-the-box. Why compromise the API for the sake of 0.01%(?) of users, or whatever? Can you actually ballpark-estimate what % of pandas user code has non-compliant names? Dict-based-renaming names moreover cannot be existing kwargs of that command; so yeah we need a Warning which checks name-collision of the dict with kwargs, the warning should default to on, and be disableable for speed. (That's not too hard, right?). Obviously we should strive for kwarg naming consistency across all commands that would support dict-based-renaming to minimize such collisions. That addresses all your objections, right?
Yes, that's just life in API-land. This one has huge tangible benefits, it's worth the price.
I personally disagree, anyway that's subjective. And R supports this sort of thing for 10+ years; this behavior is expected and achievable. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
syntax looks reasonable to me. I don't think you should sort for 3.5, though do we for .assign?
Yeah, that was my initial preference between the two, but it's not strong. NamedAgg / "named aggregation" emphasizes the "what"l more, KeywordAgg / "keyword aggregation" emphasizes the "how" more. @mroeschke did you have a strong reason for KeywordAgg? |
Opened up a couple issues for SeriesGroupBy.agg and NDFrame.agg. I'll do the last round of doc fixups here once we decide on KeywordAgg vs. NamedAgg, (also gives conda a chance to fix itself). |
i think NamedAgg might be more descriptive here |
KeywordAgg was just a preference over Agg. In general, I am in favor of any naming that's more descriptive than Agg. |
Switched to NamedAgg and updated the docs a bit. |
For the documentation I think it’s important to include the namedtuple. I surveyed a few friends and they were split on whether the keyword was the column selection or the output name.
I view it similar to attribute access of column names. In my own throwaway code, I’ll use . to select a column. But in library code I’ll use getitem.
… On May 24, 2019, at 20:49, zertrin ***@***.***> wrote:
@zertrin commented on this pull request.
In pandas/core/groupby/generic.py:
> @@ -1296,6 +1314,26 @@ class DataFrameGroupBy(NDFrameGroupBy):
A
1 1 2 0.590716
2 3 4 0.704907
+
+ To control the output names with different aggregations per column,
+ pandas supports "keyword aggregation"
+
+ >>> df.groupby("A").agg(
+ ... b_min=pd.KeywordAgg(column="B", aggfunc="min"),
+ ... c_sum=pd.KeywordAgg(column="C", aggfunc="sum"))
Speaking as a user, no I don't think that's the best.
The old way was delightfully concise (dict of dicts) but not very explicity, the new way with keywords is slightly less concise (repetition of the selection column) but not by much and more consistent so it's pretty OK.
But the named tuple version is very verbose and - - unless my code was designed for other people and need to be self documenting - - in the usual case of using pandas in Notebooks for my own work, once the user understands the concept, using the namedtuple version is just making the code less readable due to the lower ratio of user content (column names and aggfunc names) / boilerplate content (pd.NamedAgg, column=, aggfunc=).
Moreover, letting the simple keyword-tuple version having the most emphasis is more likely to help intermediate and advanced python user think that this system of aggregation can be easily made more dynamic inside code through usage of the **{} construct, where the dict to unpack is constructed by the code, and only is a dict of tuples.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.
|
@ previous comment: see answer in the thread here #26399 (comment) |
Is is arguable whether the NamedTuple really solves that particular confusion: the keyword is still there, and in the namedtuple the first argument is just called In the end, I think it is much more important that the examples in the documentation are clear as to which is which. The current example are good enough already in my opinion ( Once the user see and understand the feature from the documentation once, she should not have much problems herself in the future to understand it. Ultimately, the user is also responsible to give output columns meaningful names that aren't confusing. Using namedtuples doesn't solve every problem if the user gives confusing names (for example something like And lastly but most importantly in my eyes, a huge number of python coders do not know exactly what a namedtuple is (most beginner to intermediate python coders might never have heard of it). At first glance, something like My point is: emphasising/encouraging usage of the namedtuple might lead users to think that this So in summary, I think that we should do the reverse: show the simple keyword-tuple in most examples, and have one section with one example showing the namedtuple as an alternative (clearly stating that it doesn't do anything more than the simple keyword-tuple version). |
One last thought: on the other hand, I must concede that the NamedAgg solution has one advantage: easier discovery. If a user stumble upon it and wonders what it is, (in the future) she can google for |
- The keywords are the *output* column names | ||
- The values are tuples whose first element is the column to select | ||
and the second element is the aggregation to apply to that column. Pandas | ||
provides the ``pandas.NamedAgg`` namedtuple with the fields ``['column', 'aggfunc']`` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can give a :class:`NamedAgg`
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this added in the reference.rst ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -41,6 +44,10 @@ | |||
|
|||
from pandas.plotting._core import boxplot_frame_groupby | |||
|
|||
NamedAgg = namedtuple("NamedAgg", ["column", "aggfunc"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a comment on what this is (can we add a doc-string)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could using typing.NamedTuple
.
I think the remaining comments are about how to included NamedAgg in our API docs. I think our options are to
I'd prefer to leave that as a followup, along with or after |
All green. If we're comfortable with it, I have the following followups to do: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optional nitpicky stuff but OK with this as is and with follow ups. Nice work @TomAugspurger
@@ -41,6 +44,10 @@ | |||
|
|||
from pandas.plotting._core import boxplot_frame_groupby | |||
|
|||
NamedAgg = namedtuple("NamedAgg", ["column", "aggfunc"]) | |||
# TODO(typing) the return value on this callable should be any *scalar*. | |||
AggScalar = Union[str, Callable[..., Any]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be a TypeVar instead of a Union
) | ||
|
||
|
||
If your desired output column names are not valid python keywords, construct a dictionary |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess these are technically identifiers instead of keywords
thanks @TomAugspurger nice work. yeah happy to merge this and have issues for the followups. |
This adds special support for controlling the output column names when performing column-specific groupby aggregations. We use kwargs, using the keywords as the output names, and expecting tuples of
(selection, aggfunc)
.Closes #18366
Putting this up as a WIP before the meeting on Thursday.
TODO:
cc @jorisvandenbossche @WillAyd