BUG: Fix groupby with "as_index" for categorical multi #13204 #13394

pijucha · 2016-06-07T23:50:50Z

closes groupby with index = False returns NANs when column is categorical. #13204
tests added / passed
passes git diff upstream/master | flake8 --diff
whatsnew entry

Fixes a bug that returns all nan's for groupby(as_index=False) with
multiple column groupers containing a categorical one (#13204).

Also:
fixes an internal bug in the string representation of Grouping.

mi = pd.MultiIndex.from_arrays([list("AAB"), list("aba")])
df = pd.DataFrame([[1,2,3]], columns=mi)
gr = df.groupby(df[('A', 'a')])

gr.grouper.groupings
...
TypeError: not all arguments converted during string formatting

codecov-io · 2016-06-08T01:01:11Z

Current coverage is 84.34%

Merging #13394 into master will increase coverage by <.01%

@@             master     #13394   diff @@
==========================================
  Files           138        138          
  Lines         51113      51120     +7   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43109      43116     +7   
  Misses         8004       8004          
  Partials          0          0

Powered by Codecov. Last updated by a01644c...374402c

jreback · 2016-06-08T01:04:11Z

pandas/core/groupby.py

+        result = result.reindex(**d).sortlevel(axis=self.axis)
+
+        if not self.as_index:
+            idx = result.index


this is pretty kludgey - maybe an do this inside the Index (not exactly sure what you are trying to do here)

It is kludgey. I'll try to make it cleaner but I'm not sure if I can hide the ugly part. Some of the old code (3 lines?) can go to MultiIndex (refactoring Categorical levels) but the new part consists of operation on a data frame.

The outline:

the old code is:

# group keys are already in result.index new_index = refactor(old_index) # expanding categorical levels result.reindex(new_index)

the new code (naive version):

if self.as_index == False: # result has a flat index, group keys are in result.columns result.set_index(group keys) # (1) now, result.index == old_index result.reindex(new_index) # (2) the old part result.reset_index() # (3)

(1) and (3) involve, respectively, dropping and inserting columns into result and they don't go well with hierachical columns (if result has them). So instead, I set index with an equivalent object in (1) and copy index levels to respective columns (if possible) in (3).

pijucha · 2016-06-10T01:30:58Z

I'm not quite happy with the last commit and would like to change it. But I'm not sure on the general direction. In particular, whether to preserve categorical dtypes in MultiIndex. In other words, if columns df['A'] is categorical and

out = df.groupby(['A', 'B']).sum()

then should/may/must not out.index.levels[0] be a CategoricalIndex?

Some possible ways to proceed:

keep changes as minimal as possible (no new public methods, no messing with dtypes)
ok to change dtypes of MultiIndex levels, but no need for new public methods (expand_all)
just cosmetic changes to the last commit
take it further and change MultiIndex.from_product and MultiIndex.from_arrays so that they preserve Categorical dtypes in levels (?)
abandon it completely since the code is messy and not much of it can be salvaged

Any comments will be very helpful.

jreback · 2016-06-10T06:02:28Z

is prob not tested but should preserve (or we would like it to do so )

anything else is a bug - pretty big on trying to preserve types if at all possible
of course want to limit complexity

this is pretty general - but if u have a specific situation so ears

ideally all of the preservation will be done in Index or subclasses / higher level things should not know/care about dtypes - they just. call methods - so any messiness should be in Index

pijucha · 2016-06-10T20:44:03Z

@jreback Thanks for the answer.

An example where pd.MultiIndex.from_product doesn't preserve categorical types in levels:

cidx = pd.CategoricalIndex(['a', 'b'])

pd.MultiIndex.from_product([cidx, cidx]).levels[0]
Out[84]: Index(['a', 'b'], dtype='object')

# while it preserves all other types
pd.MultiIndex.from_product([pd.datetime(2011,1,1), cidx]).levels[0]
Out[85]: DatetimeIndex(['2011-01-01'], dtype='datetime64[ns]', freq=None)

So, I would expect that it returns CategoricalIndex if an argument is categorical.

This can be easily changed in pd.MultiIndex.from_product and pd.MultiIndex.from_arrays (same issue) but the change is a bit hacky (passing categories=CategoricalIndex(input.categories) to Categorical constructor).

Now, I think that much more natural is to deal with it inside Categorical itself. (Of course, if such a behaviour is desired.) Currently, we have for example:

# preserves all dtypes inside "categories" as an appropriate Index
pd.Categorical(pd.datetime(2011,1,1)).categories
Out[93]: DatetimeIndex(['2011-01-01'], dtype='datetime64[ns]', freq=None)

# except categorical:
pd.Categorical(cidx).categories
Out[94]: Index(['a', 'b'], dtype='object')
pd.Categorical(pd.Categorical(['a','b'])).categories
Out[95]: Index(['a', 'b'], dtype='object')

I'd expect the last two calls return CategoricalIndex(['a', 'b']).

I know that categorical dtype is different than other low level dtypes. So probably the only way to keep it is to have categories keep it.

I'd be grateful for more guidance on this (i.e. categorical dtypes):
A. change it in Categorical,
B. change it in MultiIndex
C. start a new issue for discussion
D. do nothing

Thanks

jreback · 2016-06-12T16:52:51Z

pandas/core/groupby.py

+
+        indexer = self.grouper.result_index.get_indexer(index)
+        result = result.reindex(index=indexer, copy=False)
+        result.index = range(len(result))


.reset_index(drop=True) is more idiomatic instead of range(len(result))

Yes, thanks.

jreback · 2016-06-17T12:00:03Z

can you update

pijucha · 2016-06-17T12:28:46Z

@jreback Sure, but what to do about this fishiness?
Is the snippet from my last comment acceptable? Or should I return just an unexpanded data frame?

jreback · 2016-06-17T12:32:37Z

Its hard to see what you did, push what you have an i'll take another look. If its MultiIndex specific then move to a MI method.

pijucha · 2016-06-17T13:09:47Z

OK, I'll push it soon.
It's now a very minimal commit (as the last one): only the expansion of a data frame.

I don't do anything with multiindexes (anything other than _reindex_output was already doing) or dtypes - it's independent and maybe I'll get back to it in a separate issue.

jreback · 2016-06-17T22:06:42Z

pandas/core/groupby.py

+            d = {self.obj._get_axis_name(self.axis): index, 'copy': False}
+            return result.reindex(**d)
+
+        indexer = self.grouper.result_index.get_indexer(index)


ok I guess this is fine. can you put a comment on what is going on here & why its needed
and the issue number.

pijucha · 2016-06-18T15:20:15Z

pandas/core/groupby.py

+        result = result.reindex(index=indexer, copy=False)
+        result.reset_index(drop=True, inplace=True)
+        # Grouper columns got garbled with NaN's,
+        # so update them with `index` levels.


Updated the comments.

pijucha · 2016-06-19T04:25:45Z

pandas/core/groupby.py

@@ -2249,7 +2249,7 @@ def __init__(self, index, grouper=None, obj=None, name=None, level=None,
                self.grouper = to_timedelta(self.grouper)

    def __repr__(self):
-        return 'Grouping(%s)' % self.name
+        return 'Grouping(%s)' % str(self.name)


(This one is actually not related, so I can fix it separately if needed.)

It failed if self.name was a tuple. For example:

mi = pd.MultiIndex.from_arrays([list("AAB"), list("aba")]) df = pd.DataFrame([[1,2,3]], columns=mi) gr = df.groupby(df[('A', 'a')]) gr.grouper.groupings ... TypeError: not all arguments converted during string formatting

using .format will solve these, and is the canonical approach

pijucha · 2016-06-22T22:05:11Z

@jreback I forgot to ping after the update. If the comments are ok now, should I resubmit it?

jreback · 2016-06-22T22:15:37Z

yes

pijucha · 2016-06-23T00:34:33Z

@jreback Done

jreback · 2016-06-23T10:12:30Z

I think this is wrong (from the issue). index should not appear here. Note that maybe add a test with a non-integer index as well (it should be dropped), maybe also way one with a name (again should be dropped), and a conflicting name to a column.

In [4]: >>> df.groupby(["A","C"], as_index=False).sum().reset_index()
Out[4]: 
   index  A  C  B
0      0  1  1  2
1      1  1  2  2
2      2  1  3  2

pijucha · 2016-06-23T13:44:10Z

I'm not sure if I understand. Column index is not the index of the original df.

Do you mean testing whether df.index is not accidentally carried over to df.groupby(["A","C"], as_index=False).sum()?

jreback · 2016-06-23T15:02:17Z

yes the index could should not be there - I am not sure if u r testing this

pijucha · 2016-06-23T15:30:05Z

I'm not testing this. I can do it, no problem.

But something like this seems impossible. With as_index=False, the original df.index is discarded somewhere inside groupby or aggregation, regardless of whether df contains categoricals - so such a bug would have appeared earlier. (_reindex_output doesn't access df.index.)

Moreover, in this case (with a categorical grouper, here column 'C'), _reindex_output is returning result.reset_index(drop=True). So, data frame df.groupby(["A","C"], as_index=False).sum() has a RangeIndex index, which is then (I believe correctly) reset to a column named index by reset_index().

pijucha · 2016-06-23T19:34:54Z

pandas/tests/test_groupby.py

+        for name in [None, 'X', 'B', 'cat']:
+            df.index = Index(list("abc"), name=name)
+            result = df.groupby(['cat', 'A'], as_index=False).sum()
+            tm.assert_frame_equal(result, expected, check_index_type=True)


@jreback Added the tests. (Hope this is what you meant.)

yes. what needed changing in the code?

Nothing. Just added these tests.

this example is wrong though. #13394 (comment) There shouldn't be an index column at all.

I don't quite understand. With as_index=False, the aggregated output has ~~RangeIndex~~ (edit: actually, it's Int64Index([0, 1, 2]) here) as an index:

df = pd.DataFrame({"A": [1,1,1], "B": [2,2,2], "C": [1,2,3]}) df.groupby(["A","C"], as_index=False).sum() Out[34]: A C B 0 1 1 2 1 1 2 2 2 1 3 2

and now, after .reset_index() it's just reset as a new column.

df.groupby(["A","C"], as_index=False).sum().reset_index() Out[35]: index A C B 0 0 1 1 2 1 1 1 2 2 2 2 1 3 2

It has nothing to do with groupers being categorical.

Didn't you mean rather as_index=True?

df.groupby(["A","C"],as_index=True).sum().reset_index() Out[36]: A C B 0 1 1 2 1 1 2 2 2 1 3 2

this example is wrong though. #13394 (comment) There shouldn't be an index column at all.

Are you still convinced something may be wrong here? I think it's correct.

I think I was looking at something weird. looks ok.

jreback · 2016-06-30T10:32:40Z

@pijucha I think an error in your tests, pls rebase / update and ping when green. lgtm.

jreback · 2016-06-30T10:34:00Z

pandas/core/groupby.py

+        result = result.drop(labels=list(g_names), axis=1)
+
+        # Set a temp index and reindex (possibly expanding)
+        result = result.set_index(self.grouper.result_index).reindex(index)


I wonder if you should pass copy=False to the .reindex in this doesn't actually do anything on the .reindex

OK, it makes sense.

pijucha · 2016-06-30T23:58:09Z

@jreback updated

jreback · 2016-07-01T12:14:00Z

ok minor change. ping when green.

…dev#13204 BUG: Fix string repr of Grouping

pijucha · 2016-07-02T03:40:51Z

@jreback Done and updated

jreback · 2016-07-03T23:28:58Z

thanks @pijucha

jreback reviewed Jun 8, 2016
View reviewed changes

jreback added Bug Groupby Categorical Categorical Data Type labels Jun 8, 2016

pijucha force-pushed the groupbycat13204 branch from 43789f3 to c864209 Compare June 9, 2016 02:36

pijucha force-pushed the groupbycat13204 branch from c864209 to 4a6621f Compare June 12, 2016 15:53

jreback reviewed Jun 12, 2016
View reviewed changes

pijucha force-pushed the groupbycat13204 branch 2 times, most recently from ad38d1c to 7e7d3b2 Compare June 17, 2016 18:33

jreback reviewed Jun 17, 2016
View reviewed changes

pijucha force-pushed the groupbycat13204 branch from 7e7d3b2 to 7178ffd Compare June 18, 2016 04:52

pijucha reviewed Jun 18, 2016
View reviewed changes

pijucha force-pushed the groupbycat13204 branch from 7178ffd to b0a383e Compare June 19, 2016 04:20

pijucha reviewed Jun 19, 2016
View reviewed changes

pijucha force-pushed the groupbycat13204 branch from b0a383e to 9cacefb Compare June 20, 2016 01:08

pijucha force-pushed the groupbycat13204 branch from 9cacefb to 8ad5dea Compare June 22, 2016 23:29

jreback added this to the 0.18.2 milestone Jun 23, 2016

pijucha force-pushed the groupbycat13204 branch from 8ad5dea to f1017fc Compare June 23, 2016 17:57

pijucha reviewed Jun 23, 2016
View reviewed changes

jreback reviewed Jun 30, 2016
View reviewed changes

pijucha force-pushed the groupbycat13204 branch 2 times, most recently from 2a74283 to 6f38955 Compare June 30, 2016 22:53

BUG: Fix groupby with as_index for categorical multi groupers pandas-…

374402c

…dev#13204 BUG: Fix string repr of Grouping

pijucha force-pushed the groupbycat13204 branch from 6f38955 to 374402c Compare July 2, 2016 02:33

jreback closed this in 8f8d75d Jul 3, 2016

pijucha deleted the groupbycat13204 branch July 11, 2016 17:41

martijnvermaat mentioned this pull request Jul 21, 2016

groupby on multiple columns does not preserve (categorical) dtype #13743

Closed

BUG: Fix groupby with "as_index" for categorical multi #13204 #13394

BUG: Fix groupby with "as_index" for categorical multi #13204 #13394

Conversation

pijucha commented Jun 7, 2016 • edited Loading

codecov-io commented Jun 8, 2016 • edited Loading

Current coverage is 84.34%

Choose a reason for hiding this comment

pijucha Jun 8, 2016 • edited Loading

Choose a reason for hiding this comment

pijucha commented Jun 10, 2016

jreback commented Jun 10, 2016

pijucha commented Jun 10, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback commented Jun 17, 2016

pijucha commented Jun 17, 2016

jreback commented Jun 17, 2016

pijucha commented Jun 17, 2016 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pijucha Jun 19, 2016 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pijucha commented Jun 22, 2016

jreback commented Jun 22, 2016

pijucha commented Jun 23, 2016

jreback commented Jun 23, 2016

pijucha commented Jun 23, 2016

jreback commented Jun 23, 2016

pijucha commented Jun 23, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pijucha Jun 23, 2016 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback commented Jun 30, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pijucha commented Jun 30, 2016

jreback commented Jul 1, 2016

pijucha commented Jul 2, 2016

jreback commented Jul 3, 2016

pijucha commented Jun 7, 2016 •

edited

Loading

codecov-io commented Jun 8, 2016 •

edited

Loading

pijucha Jun 8, 2016 •

edited

Loading

pijucha commented Jun 17, 2016 •

edited

Loading

pijucha Jun 19, 2016 •

edited

Loading

pijucha Jun 23, 2016 •

edited

Loading