Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Fix groupby with "as_index" for categorical multi #13204 #13394

Closed
wants to merge 1 commit into from

Conversation

pijucha
Copy link
Contributor

@pijucha pijucha commented Jun 7, 2016

Fixes a bug that returns all nan's for groupby(as_index=False) with
multiple column groupers containing a categorical one (#13204).


Also:
fixes an internal bug in the string representation of Grouping.

mi = pd.MultiIndex.from_arrays([list("AAB"), list("aba")])
df = pd.DataFrame([[1,2,3]], columns=mi)
gr = df.groupby(df[('A', 'a')])

gr.grouper.groupings
...
TypeError: not all arguments converted during string formatting

@codecov-io
Copy link

codecov-io commented Jun 8, 2016

Current coverage is 84.34%

Merging #13394 into master will increase coverage by <.01%

@@             master     #13394   diff @@
==========================================
  Files           138        138          
  Lines         51113      51120     +7   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43109      43116     +7   
  Misses         8004       8004          
  Partials          0          0          

Powered by Codecov. Last updated by a01644c...374402c

result = result.reindex(**d).sortlevel(axis=self.axis)

if not self.as_index:
idx = result.index
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is pretty kludgey - maybe an do this inside the Index (not exactly sure what you are trying to do here)

Copy link
Contributor Author

@pijucha pijucha Jun 8, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is kludgey. I'll try to make it cleaner but I'm not sure if I can hide the ugly part. Some of the old code (3 lines?) can go to MultiIndex (refactoring Categorical levels) but the new part consists of operation on a data frame.

The outline:

  • the old code is:
# group keys are already in result.index
new_index = refactor(old_index)       # expanding categorical levels
result.reindex(new_index)
  • the new code (naive version):
if self.as_index == False:
# result has a flat index, group keys are in result.columns
    result.set_index(group keys)              # (1)  now, result.index == old_index
    result.reindex(new_index)                 # (2)  the old part
    result.reset_index()                      # (3)

(1) and (3) involve, respectively, dropping and inserting columns into result and they don't go well with hierachical columns (if result has them). So instead, I set index with an equivalent object in (1) and copy index levels to respective columns (if possible) in (3).

@pijucha
Copy link
Contributor Author

pijucha commented Jun 10, 2016

I'm not quite happy with the last commit and would like to change it. But I'm not sure on the general direction. In particular, whether to preserve categorical dtypes in MultiIndex. In other words, if columns df['A'] is categorical and

out = df.groupby(['A', 'B']).sum()

then should/may/must not out.index.levels[0] be a CategoricalIndex?

Some possible ways to proceed:

  1. keep changes as minimal as possible (no new public methods, no messing with dtypes)
  2. ok to change dtypes of MultiIndex levels, but no need for new public methods (expand_all)
  3. just cosmetic changes to the last commit
  4. take it further and change MultiIndex.from_product and MultiIndex.from_arrays so that they preserve Categorical dtypes in levels (?)
  5. abandon it completely since the code is messy and not much of it can be salvaged

Any comments will be very helpful.

@jreback
Copy link
Contributor

jreback commented Jun 10, 2016

  1. is prob not tested but should preserve (or we would like it to do so )

anything else is a bug - pretty big on trying to preserve types if at all possible
of course want to limit complexity

this is pretty general - but if u have a specific situation so ears

ideally all of the preservation will be done in Index or subclasses / higher level things should not know/care about dtypes - they just. call methods - so any messiness should be in Index

@pijucha
Copy link
Contributor Author

pijucha commented Jun 10, 2016

@jreback Thanks for the answer.

An example where pd.MultiIndex.from_product doesn't preserve categorical types in levels:

cidx = pd.CategoricalIndex(['a', 'b'])

pd.MultiIndex.from_product([cidx, cidx]).levels[0]
Out[84]: Index(['a', 'b'], dtype='object')

# while it preserves all other types
pd.MultiIndex.from_product([pd.datetime(2011,1,1), cidx]).levels[0]
Out[85]: DatetimeIndex(['2011-01-01'], dtype='datetime64[ns]', freq=None)

So, I would expect that it returns CategoricalIndex if an argument is categorical.

This can be easily changed in pd.MultiIndex.from_product and pd.MultiIndex.from_arrays (same issue) but the change is a bit hacky (passing categories=CategoricalIndex(input.categories) to Categorical constructor).

Now, I think that much more natural is to deal with it inside Categorical itself. (Of course, if such a behaviour is desired.) Currently, we have for example:

# preserves all dtypes inside "categories" as an appropriate Index
pd.Categorical(pd.datetime(2011,1,1)).categories
Out[93]: DatetimeIndex(['2011-01-01'], dtype='datetime64[ns]', freq=None)

# except categorical:
pd.Categorical(cidx).categories
Out[94]: Index(['a', 'b'], dtype='object')
pd.Categorical(pd.Categorical(['a','b'])).categories
Out[95]: Index(['a', 'b'], dtype='object')

I'd expect the last two calls return CategoricalIndex(['a', 'b']).

I know that categorical dtype is different than other low level dtypes. So probably the only way to keep it is to have categories keep it.

I'd be grateful for more guidance on this (i.e. categorical dtypes):
A. change it in Categorical,
B. change it in MultiIndex
C. start a new issue for discussion
D. do nothing

Thanks


indexer = self.grouper.result_index.get_indexer(index)
result = result.reindex(index=indexer, copy=False)
result.index = range(len(result))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.reset_index(drop=True) is more idiomatic instead of range(len(result))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, thanks.

@jreback
Copy link
Contributor

jreback commented Jun 17, 2016

can you update

@pijucha
Copy link
Contributor Author

pijucha commented Jun 17, 2016

@jreback Sure, but what to do about this fishiness?
Is the snippet from my last comment acceptable? Or should I return just an unexpanded data frame?

@jreback
Copy link
Contributor

jreback commented Jun 17, 2016

Its hard to see what you did, push what you have an i'll take another look. If its MultiIndex specific then move to a MI method.

@pijucha
Copy link
Contributor Author

pijucha commented Jun 17, 2016

OK, I'll push it soon.
It's now a very minimal commit (as the last one): only the expansion of a data frame.

I don't do anything with multiindexes (anything other than _reindex_output was already doing) or dtypes - it's independent and maybe I'll get back to it in a separate issue.

@pijucha pijucha force-pushed the groupbycat13204 branch 2 times, most recently from ad38d1c to 7e7d3b2 Compare June 17, 2016 18:33
d = {self.obj._get_axis_name(self.axis): index, 'copy': False}
return result.reindex(**d)

indexer = self.grouper.result_index.get_indexer(index)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok I guess this is fine. can you put a comment on what is going on here & why its needed
and the issue number.

result = result.reindex(index=indexer, copy=False)
result.reset_index(drop=True, inplace=True)
# Grouper columns got garbled with NaN's,
# so update them with `index` levels.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the comments.

@@ -2249,7 +2249,7 @@ def __init__(self, index, grouper=None, obj=None, name=None, level=None,
self.grouper = to_timedelta(self.grouper)

def __repr__(self):
return 'Grouping(%s)' % self.name
return 'Grouping(%s)' % str(self.name)
Copy link
Contributor Author

@pijucha pijucha Jun 19, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(This one is actually not related, so I can fix it separately if needed.)

It failed if self.name was a tuple. For example:

mi = pd.MultiIndex.from_arrays([list("AAB"), list("aba")])
df = pd.DataFrame([[1,2,3]], columns=mi)
gr = df.groupby(df[('A', 'a')])

gr.grouper.groupings
...
TypeError: not all arguments converted during string formatting

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using .format will solve these, and is the canonical approach

@pijucha
Copy link
Contributor Author

pijucha commented Jun 22, 2016

@jreback I forgot to ping after the update. If the comments are ok now, should I resubmit it?

@jreback
Copy link
Contributor

jreback commented Jun 22, 2016

yes

@pijucha
Copy link
Contributor Author

pijucha commented Jun 23, 2016

@jreback Done

@jreback jreback added this to the 0.18.2 milestone Jun 23, 2016
@jreback
Copy link
Contributor

jreback commented Jun 23, 2016

I think this is wrong (from the issue). index should not appear here. Note that maybe add a test with a non-integer index as well (it should be dropped), maybe also way one with a name (again should be dropped), and a conflicting name to a column.

In [4]: >>> df.groupby(["A","C"], as_index=False).sum().reset_index()
Out[4]: 
   index  A  C  B
0      0  1  1  2
1      1  1  2  2
2      2  1  3  2

@pijucha
Copy link
Contributor Author

pijucha commented Jun 23, 2016

I'm not sure if I understand. Column index is not the index of the original df.

Do you mean testing whether df.index is not accidentally carried over to df.groupby(["A","C"], as_index=False).sum()?

@jreback
Copy link
Contributor

jreback commented Jun 23, 2016

yes the index could should not be there - I am not sure if u r testing this

@pijucha
Copy link
Contributor Author

pijucha commented Jun 23, 2016

I'm not testing this. I can do it, no problem.

But something like this seems impossible. With as_index=False, the original df.index is discarded somewhere inside groupby or aggregation, regardless of whether df contains categoricals - so such a bug would have appeared earlier. (_reindex_output doesn't access df.index.)

Moreover, in this case (with a categorical grouper, here column 'C'), _reindex_output is returning result.reset_index(drop=True). So, data frame df.groupby(["A","C"], as_index=False).sum() has a RangeIndex index, which is then (I believe correctly) reset to a column named index by reset_index().

for name in [None, 'X', 'B', 'cat']:
df.index = Index(list("abc"), name=name)
result = df.groupby(['cat', 'A'], as_index=False).sum()
tm.assert_frame_equal(result, expected, check_index_type=True)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback Added the tests. (Hope this is what you meant.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. what needed changing in the code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing. Just added these tests.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this example is wrong though. #13394 (comment) There shouldn't be an index column at all.

Copy link
Contributor Author

@pijucha pijucha Jun 23, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite understand. With as_index=False, the aggregated output has RangeIndex (edit: actually, it's Int64Index([0, 1, 2]) here) as an index:

df = pd.DataFrame({"A": [1,1,1], "B": [2,2,2], "C": [1,2,3]})
df.groupby(["A","C"], as_index=False).sum()
Out[34]: 
   A  C  B
0  1  1  2
1  1  2  2
2  1  3  2

and now, after .reset_index() it's just reset as a new column.

df.groupby(["A","C"], as_index=False).sum().reset_index()
Out[35]: 
   index  A  C  B
0      0  1  1  2
1      1  1  2  2
2      2  1  3  2

It has nothing to do with groupers being categorical.

Didn't you mean rather as_index=True?

df.groupby(["A","C"],as_index=True).sum().reset_index()
Out[36]: 
   A  C  B
0  1  1  2
1  1  2  2
2  1  3  2

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this example is wrong though. #13394 (comment) There shouldn't be an index column at all.

Are you still convinced something may be wrong here? I think it's correct.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I was looking at something weird. looks ok.

@jreback
Copy link
Contributor

jreback commented Jun 30, 2016

@pijucha I think an error in your tests, pls rebase / update and ping when green. lgtm.

result = result.drop(labels=list(g_names), axis=1)

# Set a temp index and reindex (possibly expanding)
result = result.set_index(self.grouper.result_index).reindex(index)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if you should pass copy=False to the .reindex in this doesn't actually do anything on the .reindex

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, it makes sense.

@pijucha pijucha force-pushed the groupbycat13204 branch 2 times, most recently from 2a74283 to 6f38955 Compare June 30, 2016 22:53
@pijucha
Copy link
Contributor Author

pijucha commented Jun 30, 2016

@jreback updated

@jreback
Copy link
Contributor

jreback commented Jul 1, 2016

ok minor change. ping when green.

@pijucha
Copy link
Contributor Author

pijucha commented Jul 2, 2016

@jreback Done and updated

@jreback jreback closed this in 8f8d75d Jul 3, 2016
@jreback
Copy link
Contributor

jreback commented Jul 3, 2016

thanks @pijucha

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Categorical Categorical Data Type Groupby
Projects
None yet
Development

Successfully merging this pull request may close these issues.

groupby with index = False returns NANs when column is categorical.
4 participants