groupby on multiple columns does not preserve (categorical) dtype #13743

martijnvermaat · 2016-07-21T17:51:57Z

When doing a groupby on more than one column, the resulting MultiIndex does not seem to preserve the original column dtypes. I noticed it when working with Categorical columns, expecting CategoricalIndex when grouping on them, but this is only the case when grouping on just one column.

I did see that the behaviour was discussed in a PR, but it ultimately was not addressed.

Code Sample, a copy-pastable example if possible

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({
   ...:     'a': pd.Series(list('xyxxyz')).astype('category', categories=list('xyz')),
   ...:     'b': pd.Series(list('yzzyxz')).astype('category', categories=list('xyz')),
   ...:     'c': [1,2,3,4,5,6]
   ...: })

In [3]: df.groupby('a').sum().reset_index().dtypes
Out[3]: 
a    category
c       int64
dtype: object

In [4]: df.groupby(['a', 'b']).sum().reset_index().dtypes
Out[4]: 
a     object
b     object
c    float64
dtype: object

Expected Output

In [4]: df.groupby(['a', 'b']).sum().reset_index().dtypes
Out[4]: 
a    category
b    category
c       int64
dtype: object
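The `astype('category', categories=...)` spelling used in the sample above is not accepted by later pandas releases, so here is a rough sketch of the same check for a newer version. It assumes `pd.CategoricalDtype` is available and that the installed release already includes a fix for this issue. The names `cat`, `single` and `multi` are illustrative only, and `observed=True` is passed explicitly because its default has varied between versions.

```python
import pandas as pd

# Same data as in the report; the categorical dtype is built explicitly
# (assumption: a pandas version that provides pd.CategoricalDtype).
cat = pd.CategoricalDtype(categories=list("xyz"))
df = pd.DataFrame({
    "a": pd.Series(list("xyxxyz")).astype(cat),
    "b": pd.Series(list("yzzyxz")).astype(cat),
    "c": [1, 2, 3, 4, 5, 6],
})

# Group on one categorical column, then on two, summing only column "c".
single = df.groupby("a", observed=True)["c"].sum().reset_index()
multi = df.groupby(["a", "b"], observed=True)["c"].sum().reset_index()

# In a fixed release both key columns should come back as category.
print(single.dtypes)
print(multi.dtypes)
```

Selecting `["c"]` before `sum()` keeps the aggregation away from the categorical columns, which cannot be summed.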

output of `pd.show_versions()`

In [5]: pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.13
machine: x86_64
processor: 
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.18.1+240.gbb6b5e5
nose: None
pip: 8.1.2
setuptools: 19.4
Cython: 0.24.1
numpy: 1.11.1
scipy: 0.17.1
statsmodels: 0.6.1
xarray: None
IPython: 5.0.0
sphinx: None
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: 0.9.3
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.14
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: None
pandas_datareader: None


martijnvermaat · 2016-07-21T18:28:50Z

I thought I'd quickly work around it by converting the resulting MultiIndex to one with two CategoricalIndex levels via reset_index() and set_index(), but it seems that set_index similarly forgets the column dtypes:

In [6]: df.groupby(['a', 'b']).sum().reset_index().assign(
   ...:     a=lambda df: df.a.astype('category', categories=list('xyz')),
   ...:     b=lambda df: df.b.astype('category', categories=list('xyz'))
   ...: ).set_index(['a', 'b']).reset_index().dtypes
Out[6]: 
a     object
b     object
c    float64
dtype: object

So I guess my bug report is now for groupby as well as for set_index.
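For the `set_index` half of the report, a minimal round-trip check could look like the sketch below. It assumes a pandas release that includes the fix (the issue was placed on the 0.19.0 milestone) and that `pd.CategoricalDtype` is available. The names `frame`, `indexed`, `level_types` and `round_trip` are made up for illustration.

```python
import pandas as pd

# Small frame with two categorical key columns (illustrative data).
cat = pd.CategoricalDtype(categories=list("xyz"))
frame = pd.DataFrame({
    "a": pd.Series(list("xyz")).astype(cat),
    "b": pd.Series(list("zyx")).astype(cat),
    "c": [1.0, 2.0, 3.0],
})

# Move both categorical columns into a MultiIndex.
indexed = frame.set_index(["a", "b"])

# Inspect the type of each MultiIndex level.
level_types = [type(level).__name__ for level in indexed.index.levels]
print(level_types)

# Move the levels back into columns and look at the resulting dtypes.
round_trip = indexed.reset_index().dtypes
print(round_trip)
```

If the levels are reported as `CategoricalIndex` and the round-tripped columns as `category`, the dtype survived both `set_index` and `reset_index`.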

pijucha · 2016-07-25T03:00:47Z

> I did see that the behaviour was discussed in a PR, but it ultimately was not addressed.

I still have it in mind and will submit a fix soon.

…eprecate .from_array Now, categorical dtype is preserved also in `groupby`, `set_index`, `stack`, `get_dummies`, and `make_axis_dummies`.

sinhrks added the Groupby, MultiIndex, and Categorical labels Jul 22, 2016

pijucha mentioned this issue Jul 31, 2016

BUG: Preserve categorical dtypes in MultiIndex levels (#13743) #13854

Closed

4 tasks

jreback added this to the 0.19.0 milestone Aug 1, 2016

jreback added the Bug label Aug 1, 2016

pijucha mentioned this issue Aug 16, 2016

BUG: unstack doesn't preserve categorical dtype #14018

Closed

jsexauer mentioned this issue Aug 31, 2016

DEPR: Clean up list of deprecations from prior versions #6581

Closed

1 task

jreback closed this as completed in d26363b Sep 2, 2016
