BUG: Preserve categorical dtypes in MultiIndex levels (#13743) #13854

pijucha · 2016-07-31T00:08:00Z

closes groupby on multiple columns does not preserve (categorical) dtype #13743
tests added / passed
passes git diff upstream/master | flake8 --diff
whatsnew entry

This commit modifies MultiIndex.from_arrays and MultiIndex.from_product.

Example:

cat = pd.Categorical(['a', 'b'], categories=list("bac"), ordered=True)
mi = pd.MultiIndex.from_arrays([cat, cat])

mi.levels[0]
Out[55]: CategoricalIndex(['b', 'a', 'c'], categories=['b', 'a', 'c'], ordered=True, dtype='category')

mi.get_level_values(0)
Out[56]: CategoricalIndex(['a', 'b'], categories=['b', 'a', 'c'], ordered=True, dtype='category')

Previously, the results were:

mi.levels[0]
Out[345]: Index(['b', 'a', 'c'], dtype='object')

mi.get_level_values(0)
Out[346]: Index(['a', 'b'], dtype='object')

This modification makes groupby, pivot, and set_index preserve categorical types in indexes.
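A minimal sketch of how the effect described above might be checked, assuming a recent pandas release. The frame `df` and its column names are made up for illustration and are not part of the PR.

```python
import pandas as pd

# Hypothetical data: one ordered categorical column and one plain column.
cat = pd.Categorical(['a', 'b'], categories=list("bac"), ordered=True)
df = pd.DataFrame({'key': cat, 'other': ['x', 'y'], 'val': [1, 2]})

# set_index with several columns builds a MultiIndex from them.
mi = df.set_index(['key', 'other']).index

# The level built from the categorical column should keep its dtype,
# its categories, and its ordering.
level = mi.levels[0]
print(type(level).__name__, list(level.categories), level.ordered)

# The level built from the plain column is not categorical.
print(isinstance(mi.levels[1], pd.CategoricalIndex))
```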

codecov-io · 2016-07-31T00:37:10Z

Current coverage is 85.27% (diff: 100%)

Merging #13854 into master will decrease coverage by <.01%

@@             master     #13854   diff @@
==========================================
  Files           139        139          
  Lines         50555      50561     +6   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43111      43116     +5   
- Misses         7444       7445     +1   
  Partials          0          0

Powered by Codecov. Last update ccec504...99e4a52

jreback · 2016-08-01T10:19:52Z

doc/source/whatsnew/v0.19.0.txt

@@ -855,3 +855,4 @@ Bug Fixes

 - Bug in ``.to_excel()`` when DataFrame contains a MultiIndex which contains a label with a NaN value (:issue:`13511`)
 - Bug in ``pd.read_csv`` in Python 2.x with non-UTF8 encoded, multi-character separated data (:issue:`3404`)
+- Bug in ``MultiIndex.from_array`` and ``.from_product`` doesn't preserve categorical dtypes in ``MultiIndex`` levels and, consequently, in results of ``groupby`` and ``set_index`` (:issue:`13743`)


just say MultiIndex constructor

pijucha · 2016-08-02T03:12:38Z

Other places where a categorical dtype is lost in similar circumstances.

cidx = pd.CategoricalIndex(['y', 'x'], categories=list("xyz"), ordered=True)
cidx_nonunique = pd.CategoricalIndex(['y', 'x', 'y'], categories=list("xyz"), ordered=True)

1. concat

df = pd.DataFrame([[10, 11, 12]])

pd.concat([df, df], keys=cidx).index.levels[0]
Out[32]: Index(['y', 'x'], dtype='object')

2. stack with a non-unique index/multi-index

df = pd.DataFrame([[10, 11, 12]], columns=cidx_nonunique)

df.stack().index.levels[1]
Out[35]: Index(['x', 'y', 'z'], dtype='object')

3. get_dummies

pd.get_dummies(cidx).columns
Out[36]: Index(['x', 'y', 'z'], dtype='object')

4. make_axis_dummies with transform

df = pd.DataFrame([[10, 11]], columns=cidx)
ldf = pd.Panel({'A': df, 'B': df}).to_frame()

pd.core.reshape.make_axis_dummies(ldf, transform=lambda x: x).columns
Out[53]: Index(['x', 'y', 'z'], dtype='object')

5. panel_index

pi = pd.core.panel.panel_index([0, 1, 2], cidx)
pi.levels[1]
Out[57]: Index(['x', 'y'], dtype='object', name='panel')

6. pytables: LegacyTable.read()
No quick example yet.
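To audit cases like the ones listed above, a small helper along these lines might be useful. It is only a sketch: `level_is_categorical` is a hypothetical name, not a pandas function, and it assumes a pandas version that exposes `pd.CategoricalDtype` at the top level.

```python
import pandas as pd

def level_is_categorical(index, level=0):
    """Return True if the given level of `index` has a categorical dtype.

    Hypothetical helper for illustration only.
    """
    # For a MultiIndex, inspect the stored level; for a flat Index, inspect it directly.
    target = index.levels[level] if isinstance(index, pd.MultiIndex) else index
    return isinstance(target.dtype, pd.CategoricalDtype)

cidx = pd.CategoricalIndex(['y', 'x'], categories=list("xyz"), ordered=True)
mi_cat = pd.MultiIndex.from_arrays([cidx, [0, 1]])        # first level from a CategoricalIndex
mi_obj = pd.MultiIndex.from_arrays([['y', 'x'], [0, 1]])  # first level from plain strings

print(level_is_categorical(mi_cat))
print(level_is_categorical(mi_obj))
print(level_is_categorical(cidx))
```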

jreback · 2016-08-02T10:30:38Z

@pijucha 5 & 6 you can ignore

jreback · 2016-08-02T10:32:56Z

@pijucha this should exist as a private function, separate from the Categorical constructor; pandas.core.categorical is prob not a bad location for it. maybe _create_categoricals_from_arrays (or similar)? I.e. it's a 'categorical' function, but it returns labels/levels (and not exactly a cat).

can certainly merge this fix and then do a followup with a more far-reaching name change. lmk.

jreback · 2016-08-02T10:34:25Z

pandas/indexes/multi.py

+    if is_categorical(values):
+        if isinstance(values, (ABCCategoricalIndex, ABCSeries)):
+            values = values._values
+        categories = CategoricalIndex(values.categories,


why is this a CI and not just a Categorical?

For consistency. The else part returns cat.categories, which is an Index.
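The consistency argument can be illustrated with a rough sketch. This is not the code from the PR: `factorize_sketch` is a hypothetical name, and it uses only public pandas calls (assuming a recent release) to return codes plus an Index-like object on both branches.

```python
import pandas as pd

def factorize_sketch(values):
    """Return (codes, categories) for `values`. Hypothetical illustration only."""
    if isinstance(getattr(values, "dtype", None), pd.CategoricalDtype):
        # Categorical input: unwrap a Series or CategoricalIndex to a Categorical.
        cat = values if isinstance(values, pd.Categorical) else pd.Categorical(values)
        # Wrap the categories in a CategoricalIndex so the level keeps the dtype.
        categories = pd.CategoricalIndex(cat.categories,
                                         categories=cat.categories,
                                         ordered=cat.ordered)
    else:
        # Non-categorical input: the categories are already a plain Index.
        cat = pd.Categorical(values)
        categories = cat.categories
    return cat.codes, categories

cat = pd.Categorical(['a', 'b'], categories=list("bac"), ordered=True)
codes_cat, levels_cat = factorize_sketch(cat)
codes_obj, levels_obj = factorize_sketch(['b', 'a', 'b'])
print(type(levels_cat).__name__, type(levels_obj).__name__)
```

Either way the second return value is some kind of Index, so callers such as a MultiIndex constructor can treat both branches uniformly.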

pijucha · 2016-08-02T12:54:55Z

@jreback OK, Thanks.

pijucha · 2016-08-17T04:28:59Z

Sorry for a bit of delay. I fixed stack, get_dummies, make_axis_dummies (2-4 in the list above) and opened a separate issue for concat (and for two other issues I came across).

jreback · 2016-08-17T10:43:00Z

pandas/tests/test_categorical.py

@@ -1607,6 +1607,113 @@ def test_map(self):
        result = c.map(lambda x: 1)
        tm.assert_numpy_array_equal(result, np.array([1] * 5, dtype=np.int64))

+    def test_groupby_preserve_dtype(self):


move to test_groupby (for the groupby tests)

jreback · 2016-08-17T10:46:53Z

small changes. looks pretty good.

pijucha · 2016-08-22T01:30:56Z

I moved some tests to files I thought were more appropriate (though, I'm not 100% sure).

jreback · 2016-08-25T10:55:32Z

pandas/indexes/multi.py

@@ -864,9 +862,9 @@ def from_arrays(cls, arrays, sortorder=None, names=None):
            if len(arrays[i]) != len(arrays[i - 1]):
                raise ValueError('all arrays must be same length')

-        cats = [Categorical.from_array(arr, ordered=True) for arr in arrays]


do we use Categorical.from_array any longer? (in the codebase)

There are just a few places: internals of concat and unstack, LegacyTable in pytables, and panel_index. I can replace them all with _factorize - it's pretty straightforward. It wouldn't automatically fix concat and unstack (that's why I opened separate issues for them) but wouldn't hurt either.

If we replace them, what should be done to the definition of Categorical.from_array? Remove completely or rather add a deprecation warning?

ideally we should replace these and deprecate the constructor. but can do that later / another PR if desired.

OK. I'll try to do it today if it goes smoothly. Otherwise, I'll do a follow up as soon as this PR is merged in.

Just a question:
Categorical.from_array should emit a FutureWarning with a comment like this: "Categorical.from_array is deprecated, use Categorical instead"?
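For reference, a deprecation of this kind is commonly written with the standard `warnings` module. The sketch below uses a made-up class `Cat` rather than pandas itself, so the names and message are illustrative assumptions only.

```python
import warnings

class Cat:
    """Hypothetical stand-in for a class with a deprecated alternate constructor."""

    def __init__(self, values):
        self.values = list(values)

    @classmethod
    def from_array(cls, values):
        # stacklevel=2 attributes the warning to the caller, not to this method.
        warnings.warn("Cat.from_array is deprecated, use Cat instead",
                      FutureWarning, stacklevel=2)
        return cls(values)

# Record warnings so the deprecation can be inspected rather than just printed.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    obj = Cat.from_array(['a', 'b'])

print(obj.values, [w.category.__name__ for w in caught])
```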

jreback · 2016-08-26T10:54:42Z

@pijucha test split looks good.

@sinhrks any comments?

pijucha · 2016-08-27T12:54:35Z

Deprecated Categorical.from_array.

pijucha · 2016-08-31T04:07:15Z

update

jreback · 2016-08-31T13:06:39Z

lgtm. @sinhrks @jorisvandenbossche

jreback · 2016-09-02T11:33:33Z

@pijucha can you rebase

pijucha · 2016-09-02T14:47:02Z

@jreback Done (rebase + small update to tests/test_reshape.py).

One build on travis has probably stalled - should I resubmit?

jreback · 2016-09-02T21:34:00Z

@pijucha i restarted the build. ping when green.

jreback · 2016-09-02T22:08:51Z

thanks @pijucha really nice PR. touches lots of parts!

jreback added Bug Groupby MultiIndex Categorical Categorical Data Type labels Aug 1, 2016

jreback reviewed Aug 1, 2016

jreback reviewed Aug 2, 2016

pijucha mentioned this pull request Aug 16, 2016

BUG: concat with categorical keys doesn't preserve categorical dtype #14016

Open

pijucha force-pushed the catdtype branch from ee6b6c1 to 87af41c Compare August 16, 2016 23:01

This was referenced Aug 16, 2016

BUG: Inconsistency in make_axis_dummies (and/or Panel.to_frame()) with categorical index level #14017

Closed

BUG: unstack doesn't preserve categorical dtype #14018

Closed

jreback reviewed Aug 17, 2016

jreback mentioned this pull request Aug 21, 2016

BUG: Categoricals shouldn't allow non-strings when object dtype is passed (#13919) #14047

Closed

4 tasks

jorisvandenbossche added this to the 0.19.0 milestone Aug 21, 2016

pijucha force-pushed the catdtype branch from 87af41c to cd1a83b Compare August 21, 2016 22:49

jreback reviewed Aug 25, 2016

pijucha force-pushed the catdtype branch from bfe6087 to b5b9a24 Compare August 27, 2016 04:53

pijucha force-pushed the catdtype branch from b5b9a24 to a82bb83 Compare August 31, 2016 03:37

jsexauer mentioned this pull request Aug 31, 2016

DEPR: Clean up list of deprecations from prior versions #6581

Closed

1 task

jreback mentioned this pull request Aug 31, 2016

RLS: 0.19.0 #13991

Closed

jorisvandenbossche modified the milestones: 0.19.0, 0.19.0rc Sep 1, 2016

BUG/DEPR: Categorical: keep dtype in MultiIndex (pandas-dev#13743), d…

99e4a52

…eprecate .from_array Now, categorical dtype is preserved also in `groupby`, `set_index`, `stack`, `get_dummies`, and `make_axis_dummies`.

pijucha force-pushed the catdtype branch from a82bb83 to 99e4a52 Compare September 2, 2016 13:50

jreback closed this in d26363b Sep 2, 2016

pijucha deleted the catdtype branch September 4, 2016 14:37

jorisvandenbossche modified the milestones: 0.19.0rc, 0.19.0 Sep 7, 2016

gfyoung added a commit to forking-repos/pandas that referenced this pull request Dec 5, 2017

CLN: Remove Categorical.from_array

3ab8d8f

Deprecated in 0.19.0 xref pandas-devgh-13854.

gfyoung mentioned this pull request Dec 5, 2017

CLN: Remove Categorical.from_array #18642

Merged

jreback mentioned this pull request Dec 5, 2017

DEPR: deprecations log for removed issues #13777

Closed

jreback pushed a commit that referenced this pull request Dec 5, 2017

CLN: Remove Categorical.from_array (#18642)

c3c04e2

Deprecated in 0.19.0 xref gh-13854.
