-
-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Fix groupby with "as_index" for categorical multi #13204 #13394
Conversation
Current coverage is 84.34%@@ master #13394 diff @@
==========================================
Files 138 138
Lines 51113 51120 +7
Methods 0 0
Messages 0 0
Branches 0 0
==========================================
+ Hits 43109 43116 +7
Misses 8004 8004
Partials 0 0
|
result = result.reindex(**d).sortlevel(axis=self.axis) | ||
|
||
if not self.as_index: | ||
idx = result.index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is pretty kludgey - maybe an do this inside the Index (not exactly sure what you are trying to do here)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is kludgey. I'll try to make it cleaner but I'm not sure if I can hide the ugly part. Some of the old code (3 lines?) can go to MultiIndex (refactoring Categorical levels) but the new part consists of operation on a data frame.
The outline:
- the old code is:
# group keys are already in result.index
new_index = refactor(old_index) # expanding categorical levels
result.reindex(new_index)
- the new code (naive version):
if self.as_index == False:
# result has a flat index, group keys are in result.columns
result.set_index(group keys) # (1) now, result.index == old_index
result.reindex(new_index) # (2) the old part
result.reset_index() # (3)
(1) and (3) involve, respectively, dropping and inserting columns into result
and they don't go well with hierachical columns (if result
has them). So instead, I set index with an equivalent object in (1) and copy index levels to respective columns (if possible) in (3).
I'm not quite happy with the last commit and would like to change it. But I'm not sure on the general direction. In particular, whether to preserve categorical dtypes in MultiIndex. In other words, if columns
then should/may/must not Some possible ways to proceed:
Any comments will be very helpful. |
anything else is a bug - pretty big on trying to preserve types if at all possible this is pretty general - but if u have a specific situation so ears ideally all of the preservation will be done in Index or subclasses / higher level things should not know/care about dtypes - they just. call methods - so any messiness should be in Index |
@jreback Thanks for the answer. An example where cidx = pd.CategoricalIndex(['a', 'b'])
pd.MultiIndex.from_product([cidx, cidx]).levels[0]
Out[84]: Index(['a', 'b'], dtype='object')
# while it preserves all other types
pd.MultiIndex.from_product([pd.datetime(2011,1,1), cidx]).levels[0]
Out[85]: DatetimeIndex(['2011-01-01'], dtype='datetime64[ns]', freq=None) So, I would expect that it returns This can be easily changed in Now, I think that much more natural is to deal with it inside # preserves all dtypes inside "categories" as an appropriate Index
pd.Categorical(pd.datetime(2011,1,1)).categories
Out[93]: DatetimeIndex(['2011-01-01'], dtype='datetime64[ns]', freq=None)
# except categorical:
pd.Categorical(cidx).categories
Out[94]: Index(['a', 'b'], dtype='object')
pd.Categorical(pd.Categorical(['a','b'])).categories
Out[95]: Index(['a', 'b'], dtype='object') I'd expect the last two calls return I know that categorical dtype is different than other low level dtypes. So probably the only way to keep it is to have I'd be grateful for more guidance on this (i.e. categorical dtypes): Thanks |
|
||
indexer = self.grouper.result_index.get_indexer(index) | ||
result = result.reindex(index=indexer, copy=False) | ||
result.index = range(len(result)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.reset_index(drop=True)
is more idiomatic instead of range(len(result))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, thanks.
can you update |
@jreback Sure, but what to do about this fishiness? |
Its hard to see what you did, push what you have an i'll take another look. If its MultiIndex specific then move to a MI method. |
OK, I'll push it soon. I don't do anything with multiindexes (anything other than |
ad38d1c
to
7e7d3b2
Compare
d = {self.obj._get_axis_name(self.axis): index, 'copy': False} | ||
return result.reindex(**d) | ||
|
||
indexer = self.grouper.result_index.get_indexer(index) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok I guess this is fine. can you put a comment on what is going on here & why its needed
and the issue number.
result = result.reindex(index=indexer, copy=False) | ||
result.reset_index(drop=True, inplace=True) | ||
# Grouper columns got garbled with NaN's, | ||
# so update them with `index` levels. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the comments.
@@ -2249,7 +2249,7 @@ def __init__(self, index, grouper=None, obj=None, name=None, level=None, | |||
self.grouper = to_timedelta(self.grouper) | |||
|
|||
def __repr__(self): | |||
return 'Grouping(%s)' % self.name | |||
return 'Grouping(%s)' % str(self.name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(This one is actually not related, so I can fix it separately if needed.)
It failed if self.name
was a tuple. For example:
mi = pd.MultiIndex.from_arrays([list("AAB"), list("aba")])
df = pd.DataFrame([[1,2,3]], columns=mi)
gr = df.groupby(df[('A', 'a')])
gr.grouper.groupings
...
TypeError: not all arguments converted during string formatting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
using .format
will solve these, and is the canonical approach
@jreback I forgot to ping after the update. If the comments are ok now, should I resubmit it? |
yes |
@jreback Done |
I think this is wrong (from the issue).
|
I'm not sure if I understand. Column Do you mean testing whether |
yes the index could should not be there - I am not sure if u r testing this |
I'm not testing this. I can do it, no problem. But something like this seems impossible. With Moreover, in this case (with a categorical grouper, here column 'C'), |
for name in [None, 'X', 'B', 'cat']: | ||
df.index = Index(list("abc"), name=name) | ||
result = df.groupby(['cat', 'A'], as_index=False).sum() | ||
tm.assert_frame_equal(result, expected, check_index_type=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback Added the tests. (Hope this is what you meant.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes. what needed changing in the code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nothing. Just added these tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this example is wrong though. #13394 (comment) There shouldn't be an index column at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't quite understand. With as_index=False
, the aggregated output has (edit: actually, it's RangeIndex
Int64Index([0, 1, 2])
here) as an index:
df = pd.DataFrame({"A": [1,1,1], "B": [2,2,2], "C": [1,2,3]})
df.groupby(["A","C"], as_index=False).sum()
Out[34]:
A C B
0 1 1 2
1 1 2 2
2 1 3 2
and now, after .reset_index()
it's just reset as a new column.
df.groupby(["A","C"], as_index=False).sum().reset_index()
Out[35]:
index A C B
0 0 1 1 2
1 1 1 2 2
2 2 1 3 2
It has nothing to do with groupers being categorical.
Didn't you mean rather as_index=True
?
df.groupby(["A","C"],as_index=True).sum().reset_index()
Out[36]:
A C B
0 1 1 2
1 1 2 2
2 1 3 2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this example is wrong though. #13394 (comment) There shouldn't be an index column at all.
Are you still convinced something may be wrong here? I think it's correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I was looking at something weird. looks ok.
@pijucha I think an error in your tests, pls rebase / update and ping when green. lgtm. |
result = result.drop(labels=list(g_names), axis=1) | ||
|
||
# Set a temp index and reindex (possibly expanding) | ||
result = result.set_index(self.grouper.result_index).reindex(index) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if you should pass copy=False
to the .reindex
in this doesn't actually do anything on the .reindex
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, it makes sense.
2a74283
to
6f38955
Compare
@jreback updated |
ok minor change. ping when green. |
…dev#13204 BUG: Fix string repr of Grouping
@jreback Done and updated |
thanks @pijucha |
git diff upstream/master | flake8 --diff
Fixes a bug that returns all nan's for groupby(as_index=False) with
multiple column groupers containing a categorical one (#13204).
Also:
fixes an internal bug in the string representation of
Grouping
.