-
-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API: categorical grouping will no longer return the cartesian product #20583
Conversation
this makes categorical groupers work like other groupers and should make things more performant and intuitive. It is somewhat walking back a change from when we first introduced categorical groupers. But the information is preserved (meaning the categories are still there), just removes the automatic re-indexing (which was causing memory to blow up). still need some increased test coverage. note the first commit actually has almost all of the changes, the next are just cleaning up tests. |
Just making sure, this affects grouping by a single categorical as well? |
yes this affects just a single categorical column also |
Codecov Report
@@ Coverage Diff @@
## master #20583 +/- ##
==========================================
+ Coverage 91.78% 91.79% +<.01%
==========================================
Files 153 153
Lines 49341 49371 +30
==========================================
+ Hits 45287 45319 +32
+ Misses 4054 4052 -2
Continue to review full report at Codecov.
|
d53e6f6
to
582da12
Compare
so this is all ready to go if any comments. @TomAugspurger @jorisvandenbossche |
That doesn't seem to be the case? With this branch, and the example of the whatsnew docs:
Further a general comment (will try to do more detailed review later this week): I am not sure we can just change this. First, it is a API breaking change in several places, eg also pivot_table*. And second, I think the current behaviour can actually useful in certain cases and it would be nice to have a way to keep this behaviour. * but I agree we should look into it and try to make this more consistent. As, for example, |
So I could also change this for a single grouper. That breaks a couple of tests. I am inclined to do this actually as then it makes the multi and single case consistent. We cannot support a full cartesian product for multi-groupers as this will blow up memory and kill performance. So either we go for:
|
I’d favor an option. I think the default should eventually be to exclude unobserved categories (after a deprecation cycle)
…________________________________
From: Jeff Reback <notifications@github.com>
Sent: Thursday, April 12, 2018 5:39:16 AM
To: pandas-dev/pandas
Cc: Tom Augspurger; Mention
Subject: Re: [pandas-dev/pandas] API: categorical grouping will no longer return the cartesian product (#20583)
So I could also change this for a single grouper. That breaks a couple of tests. I am inclined to do this actually as then it makes the multi and single case consistent.
We cannot support a full cartesian product for multi-groupers as this will blow up memory and kill performance. So either we go for:
* an option to turn this on/off.
* differing behavior for single vs multi
* change single grouping behavior as well
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub<#20583 (comment)>, or mute the thread<https://github.com/notifications/unsubscribe-auth/ABQHItJEfaXLGyuztAiN7hpsUYEzKUqRks5tny7UgaJpZM4TDk6b>.
|
|
||
|
||
@pytest.mark.xfail(reason="failing with observed") | ||
def test_observed_failing(observed): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if anyone wants to take a crack at this test. Its the only one that uses an IntervalIndex as its category, though this might be a red herring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haven't been able to follow all the way through yet but I think this is a regression with the below method:
Line 127 in 41db527
def decons_group_index(comp_labels, shape): |
Using the failing test example this returns [array(1, 1, 0, 0], ...
but on master that same object would return [array(1, 1, 2, 2], ...
. As a result, I think the reconstructed group labels are getting swapped in _wrap_agged_blocks
and causing the failure.
Will try to find time next day or so to walk through in more detail but sharing in case it helps anyone else reviewing the issue
422a6be
to
7cd56cd
Compare
this is read for a look |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Main thing is the change to pivot_table
.
It'd be good to document whether the unobserved categories are present in the resulting index's type.
i.e. is
In [7]: pd.Series([1, 1, 1]).groupby(pd.Categorical(['a', 'a', 'a'], categories=['a', 'b'])).count().index.dtype
Out[7]: CategoricalDtype(categories=['a', 'b'], ordered=False)
or
Out[7]: CategoricalDtype(categories=['a'], ordered=False)
This could go in the whatsnew and either groupby.rst or categorical.rst probably. I think either behavior is fine.
Still going through the test changes.
doc/source/whatsnew/v0.23.0.txt
Outdated
|
||
.. ipython:: python | ||
|
||
.. code-block:: python |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe have an example showing the future warning? So just this without observed=False
, and an :okwarning:
directive. Then you can say "use observed=False
to retain the previous behavior and silence the warning.
doc/source/whatsnew/v0.23.0.txt
Outdated
df.groupby(['A', 'B', 'C'], observed=False).count() | ||
|
||
|
||
New Behavior (show only observed values): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New -> Future?
pandas/core/groupby/groupby.py
Outdated
msg = ("pass observed=True to ensure that a " | ||
"categorical grouper only returns the " | ||
"observed groupers, or\n" | ||
"observed=False to return NA for non-observed" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some things like 'count' return 0 for unobserved. You could rephrase as "observed=False to include unobserved categories."
@@ -79,7 +79,7 @@ def pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', | |||
pass | |||
values = list(values) | |||
|
|||
grouped = data.groupby(keys) | |||
grouped = data.groupby(keys, observed=dropna) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to think about this a bit. Are we all OK with "overloading" dropna
to serve two purposes? I think it's ok...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, but that's the meaning of the dropna
now here anyhow
@@ -241,10 +241,13 @@ def _all_key(key): | |||
return (key, margins_name) + ('',) * (len(cols) - 1) | |||
|
|||
if len(rows) > 0: | |||
margin = data[rows + values].groupby(rows).agg(aggfunc) | |||
margin = data[rows + values].groupby( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
observed=True
for these changes are all backwards compatible?
@@ -488,12 +488,12 @@ def test_agg_structs_series(structure, expected): | |||
|
|||
|
|||
@pytest.mark.xfail(reason="GH-18869: agg func not called on empty groups.") | |||
def test_agg_category_nansum(): | |||
def test_agg_category_nansum(observed): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to be xfailed for observed=True
? I think it may be the right answer (not sure though).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed so that observed=True is XPASS (not sure how to xfail on a particular fixture value)
pandas/tests/groupby/conftest.py
Outdated
@@ -4,6 +4,11 @@ | |||
from pandas.util import testing as tm | |||
|
|||
|
|||
@pytest.fixture(params=[True, False]) | |||
def observed(request): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Docstring would be nice.
names=['A', 'B']) | ||
expected = DataFrame( | ||
{'values': [1, 2, np.nan, 3, 4, np.nan, np.nan, np.nan, np.nan]}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, so those changes in pivot are API breaking? I'd prefer that this goes through the same deprecation cycle. On the other hand, this would mean an additional keyword, so that we can do the deprecation cycle properly...
Do we think pivot(..., dropna=True, observed=False)
(i.e. the current default) is a useful combination? I could see it being desired.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
observed works like dropna
, so I am not sure we need this. I can make the default work like the existing I think (which is effectively dropna=False
)
def test_groupby_sort_categorical(): | ||
def test_sort(): | ||
|
||
# http://stackoverflow.com/questions/23814368/sorting-pandas-categorical-labels-after-groupby |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this line too long? https://stackoverflow.com/q/23814368/1889400 is a short link to the same q.
http://pandas-docs.github.io/pandas-docs-travis/categorical.html#operations would be a good place to show About |
I will try to give this a more detailed look tomorrow. But general comment: now that we have the keyword to specify the behaviour (for now with It's difficult to judge, as I personally don't run much in such situations, but my gut feeling says that the original issue that motivated the change (the combinatorial explosion with multiple categorical groupers) is not the majority usage pattern of categoricals. And if you don't want to include unobserved categoricals in your analyses in general (groupby, pivot, value_counts, plotting, ..), you always have the easy functionality of cc @jankatins |
@TomAugspurger the pivot was a red herring, was not passing things thru. and @jorisvandenbossche I disagree. You almost never want a cartesian product of all of the groupers. It can easily blow you up and shouldn't be the default. (its also easy to create if you need). Not this has nothing to do with the actual categories that are returned, they are in BOTH cases indicated on the dtype of the level of that index, (observed or not), its the groupers that are the issue. |
@jorisvandenbossche I also think the default should change. Categoricals are mostly an implementation detail, and should behave as similarly as possible as ordinary |
But in most cases you have no cartesian product, you only have a single categorical key
It's also easy to get the version with only the observed ones (certainly now there is the keyword), so that is not really an argument IMO
Yes, and that is certainly a good thing. But for me it is still the question what we want the visible output to be. |
My original motivation to work on categoricals were stuff like lickert scales ("completly agree ... completly disagree", 5 to 7 values). For that seing unobserved categoricals in group bys are a good thing. Since then a lot of changes have been made to make categoricals usefull in other situations (like as a memory efficient string replacement or to analyse genes?). This change feels like it is a change to make working with the latter easier but will make the former harder. |
I gave it as granted that if we change
True... I just think your use case is more rare (but I have no data to support this statement, other than personal experience). Anyway, with the new argument both things will be pretty easy anyway, it is mainly a matter of which use case should require awareness from the user. |
I changed the default back to |
If you have a defined order, where would you put the new categorical into the orders? E.g.
This usecase might also be satisfied by a |
FYI, discussion for a new dict-encoded / interned values type can go at #20899 so we don't lose it. Reviewing this PR one more time, but I think people are all on board with having the default be |
@@ -396,6 +396,58 @@ documentation. If you build an extension array, publicize it on our | |||
|
|||
.. _cyberpandas: https://cyberpandas.readthedocs.io/en/latest/ | |||
|
|||
.. _whatsnew_0230.enhancements.categorical_grouping: | |||
|
|||
Categorical Groupers has gained an observed keyword |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
has -> have? Because "categorical Groupers" is plural right?
@@ -671,6 +678,26 @@ def _codes_for_groupby(self, sort): | |||
categories in the original order. | |||
""" | |||
|
|||
# we only care about observed values | |||
if observed: | |||
unique_codes = unique1d(self.codes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haven't thought this through, but can this if
block be replaced with self.remove_unused_cateogories()._codes_for_groupby(sort=sort, observed=False)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, you actually need the uniques
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback I would like to ask, again, can you please only add new commits when updating for reviews instead of amending different parts to different previous commits?
I wanted to start complaining that you didn't update for the bug I pointed out in _codes_for_groupby
because I didn't see it in the new commits, but I see you actually fixed it.
But this way it is really hard to see that you updated it and added tests for it, and to see what you actually changed in that function compared to the previous time I reviewed.
Categorical Groupers has gained an observed keyword | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
In previous versions, grouping by 1 or more categorical columns would result in an index that was the cartesian product of all of the categories for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To repeat my previous comment: I would not use the "cartesian product" to introduce this. The actual change is about whether to include ubobserved categories or not, and the consequence of that is that for multiple groupers this results in a cartesian product or not (but I would start with the first thing).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't change this on purpose, this is more correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Cartesian product" really only makes sense in the 2 or more case, right? But you say "1 or more" above. I would phrase it as
"Grouping by a categorical includes the unobserved categories in the output. When grouping by multiple categories, this means you get the cartesian product of all the categories, including combinations where there are no observations, which can result in high memory usage."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, the explanation of Tom is exactly what I meant.
@jreback I have no problem at all with that you don't agree with a comment (it would be strange otherwise :-)) and thus not update for it, but can you then answer to that comment noting that? Otherwise I cannot know that I should not repeat a comment (or that I shouldn't get annoyed with my comments being ignored :))
Handling of (un)observed Categorical values | ||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
||
When using a ``Categorical`` grouper (as a single or as part of multipler groupers), the ``observed`` keyword |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't use "grouper" as terminology in our documentation (except for the pd.Grouper
object), so I would write "groupby key" or "to group by"
also "multipler" -> "multiple"
|
||
When using a ``Categorical`` grouper (as a single or as part of multipler groupers), the ``observed`` keyword | ||
controls whether to return a cartesian product of all possible groupers values (``observed=False``) or only those | ||
that are observed groupers (``observed=True``). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"or only those that are observed groupers" -> "or only the observed categories"
|
||
pd.Series([1, 1, 1]).groupby(pd.Categorical(['a', 'a', 'a'], categories=['a', 'b']), observed=True).count() | ||
|
||
The returned dtype of the grouped will *always* include *all* of the catergories that were grouped. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
catergories -> categories
|
||
.. ipython:: python | ||
|
||
pd.Series([1, 1, 1]).groupby(pd.Categorical(['a', 'a', 'a'], categories=['a', 'b']), observed=False).count() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would maybe just create s
and cat
to avoid repeating this a few times
@@ -6632,6 +6632,13 @@ def groupby(self, by=None, axis=0, level=None, as_index=True, sort=True, | |||
squeeze : boolean, default False | |||
reduce the dimensionality of the return type if possible, | |||
otherwise return a consistent type | |||
observed : boolean, default None | |||
if True: only show observed values for categorical groupers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
capital If (below as well)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, can you start this explanation with noting this keyword is only when grouping by categorical values?
if True: only show observed values for categorical groupers. | ||
if False: show all values for categorical groupers. | ||
if None: if any categorical groupers, show a FutureWarning, | ||
default to False. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no identation for rst formatting
@@ -2898,14 +2907,16 @@ class Grouping(object): | |||
""" | |||
|
|||
def __init__(self, index, grouper=None, obj=None, name=None, level=None, | |||
sort=True, in_axis=False): | |||
sort=True, observed=None, in_axis=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not have it as False
default? (if we want to deprecate in the future, we can then just use None
?)
|
||
# TODO(jreback): remove completely | ||
# when observed parameter is defaulted to True | ||
# gh-20583 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this comment be removed for now?
My remaining comments are mainly doc related, so I am fine with merging this now for 0.23rc, if @jreback does a follow-up PR. |
and I did push new ones. |
Well, if I look at the diff for only the commits you added the last day (https://github.com/pandas-dev/pandas/pull/20583/files/19c9cf7871847de8f0a8504e9f121ad1460512d0..bdf7525812ca670f9406ab8df333030d36d30947), there is no change in the |
@TomAugspurger if you want to do the RC now, then i'll have to followup on comments later. |
If we're doing this for 0.23 then it should go in the RC I think. I can wait a bit longer if you plan to push more changes. |
changes will only be cosmetic so u can merge if u want now |
That's what I thought, thanks. |
Great work! Looking forward to trying this out. :+1 |
I think there is a bug with observed=True. The below statement removed the unobserved categorical values, but the dataframe returned wrong counts. df.groupby('categorical_column', observed=True, as_index=False).count() |
Try with pandas 0.24.0 if you're not already on it. There were a couple bug
fixes in this area.
If it still persists, search for open issues / open a new one with a
minimal example.
…On Wed, Jan 30, 2019 at 11:42 AM abdullah-online ***@***.***> wrote:
I think there is a bug with observed=True. The below statement removed the
unobserved categorical values, but the dataframe returned wrong counts.
The categorical values were not sorted and the counts were mismatched with
the categories.
df.groupby('categorical_column', observed=True, as_index=False).count()
—
You are receiving this because you modified the open/close state.
Reply to this email directly, view it on GitHub
<#20583 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABQHIjzDDTxRSKcgL86j5bjm_fabMd5oks5vIdlrgaJpZM4TDk6b>
.
|
closes #14942
closes #15217
closes #17594
closes #8869
xref #8138