groupby with index = False returns NANs when column is categorical. #13204

toasteez · 2016-05-17T15:07:21Z

Please see the Stack Overflow question below for an example of the issue:

http://stackoverflow.com/questions/37279260/why-doesnt-pandas-allow-a-categorical-column-to-be-used-in-groupby?noredirect=1#comment62084780_37279260

>>> pd.__version__
'0.18.1'
>>> 

# import the pandas module
import pandas as pd

# Create an example dataframe
raw_data = {'Date': ['2016-05-13', '2016-05-13', '2016-05-13', '2016-05-13', '2016-05-13','2016-05-13', '2016-05-13', '2016-05-13', '2016-05-13', '2016-05-13', '2016-05-13', '2016-05-13', '2016-05-13', '2016-05-13', '2016-05-13', '2016-05-13', '2016-05-13'],
    'Portfolio': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B','B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C'],
    'Duration': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3],
    'Yield': [0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1],}

df = pd.DataFrame(raw_data, columns = ['Date', 'Portfolio', 'Duration', 'Yield'])

df['Portfolio'] = pd.Categorical(df['Portfolio'],['C', 'B', 'A'])
df=df.sort_values('Portfolio')

dfs = df.groupby(['Date','Portfolio'], as_index =False).sum()

print(dfs)

                        Date    Portfolio   Duration   Yield
Date        Portfolio               
13/05/2016  C           NaN     NaN         NaN        NaN
            B           NaN     NaN         NaN        NaN
            A           NaN     NaN         NaN        NaN

jreback · 2016-05-17T15:15:09Z

pls post an example & show_versions. SO links are nice, but an in-line example much better.

dsm054 · 2016-05-17T16:54:39Z

FWIW my example would be something like

>>> pd.__version__
'0.18.1'
>>> 
>>> df = pd.DataFrame({"A": [1,1,1], "B": [2,2,2], "C": pd.Categorical([1,2,3])})
>>> df.groupby(["A","C"]).sum().reset_index()
   A  C  B
0  1  1  2
1  1  2  2
2  1  3  2
>>> df.groupby(["A","C"],as_index=False).sum()
      A   C   B
A C            
1 1 NaN NaN NaN
  2 NaN NaN NaN
  3 NaN NaN NaN

jreback · 2016-05-17T17:05:34Z

yeah, this is reindexing I think somewhere inside and is prob not setting it up right. pull-requests welcome.

pijucha · 2016-06-06T15:37:53Z

Looks quite easy to fix. Function _reindex_output() doesn't take account of the variable self.as_index.

Another issue in the same function. The multiindex loses information about dtypes. For example:

df = pd.DataFrame({'cat': pd.Categorical([5,6,6,7,7], [5,6,7,8]),
                  'i1' : [10, 11, 11, 10, 11],
                  'i2' : [101,102,102,102,103]})

df.groupby(['cat', 'i1']).sum().reset_index().dtypes
Out[12]: 
cat      int64
i1       int64
i2     float64
dtype: object

While for a usual one level index:

df.groupby(['cat']).sum().reset_index().dtypes
Out[13]: 
cat    category
i1      float64
i2      float64
dtype: object
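A minimal sketch of how one might check for the dtype loss described above and restore the categorical dtype if needed. This is only an illustration, not the fix proposed for pandas itself. The explicit `observed=False` argument is an assumption: it exists only in later pandas releases, not in 0.18.1, and the outcome of the dtype check depends on the pandas version in use.

```python
import pandas as pd

df = pd.DataFrame({'cat': pd.Categorical([5, 6, 6, 7, 7], [5, 6, 7, 8]),
                   'i1': [10, 11, 11, 10, 11],
                   'i2': [101, 102, 102, 102, 103]})

# Group on two keys, then flatten the MultiIndex back into columns.
# observed=False (an assumption, later pandas only) keeps unused categories.
result = df.groupby(['cat', 'i1'], observed=False).sum().reset_index()

# If the round trip through the MultiIndex dropped the categorical dtype,
# rebuild it from the categories of the original column.
if not isinstance(result['cat'].dtype, pd.CategoricalDtype):
    result['cat'] = pd.Categorical(result['cat'],
                                   categories=df['cat'].cat.categories)

print(result.dtypes)
```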

And I guess df.groupby(..., as_index=False).agg(...) should be consistent with df.groupby(..., as_index=True).agg(...).reset_index().
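The consistency expectation can be written down as a small check. The sketch below reuses dsm054's frame and compares the two spellings. Passing `observed=True` is an assumption on my part (the argument does not exist in 0.18.1); it is used here so the comparison only covers key combinations that actually occur in the data.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 1],
                   "B": [2, 2, 2],
                   "C": pd.Categorical([1, 2, 3])})

# Spelling 1: ask groupby for a flat result directly.
flat = df.groupby(["A", "C"], as_index=False, observed=True).sum()

# Spelling 2: aggregate with the keys in the index, then flatten.
via_reset = df.groupby(["A", "C"], observed=True).sum().reset_index()

# Both spellings should give the same columns and values, with no NaNs.
print(flat)
print(via_reset)
print("any NaN in flat result:", bool(flat.isna().any().any()))
```

On a version affected by this bug, the first spelling is the one that comes back filled with NaN, so the second spelling doubles as a workaround.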

Edit: On second thought, I'd rather leave the index as it is. If a change is needed, it'd better be done in MultiIndex constructor, I suppose.

I'll prepare a PR for it later.

BTW, I couldn't find any info whether the following behaviour of categoricals in DataFrame is by design or just a side effect:

# df - same as above
df.sum()
Out[14]: 
cat     31.0
i1      53.0
i2     510.0
dtype: float64

df[['cat']].sum()
Out[15]: 
cat    31
dtype: int64

# while for Series:
df['cat'].sum()
...
TypeError: Categorical cannot perform the operation sum

Shouldn't categoricals rather be excluded when aggregating, as is done with datetime columns?
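Until the behaviour is settled, one way to sidestep the ambiguity is to select the numeric columns explicitly before aggregating. This is a sketch of a user-side workaround, not a statement of how pandas decides which columns to aggregate; `select_dtypes(include="number")` leaves out the categorical column regardless of the values stored in it.

```python
import pandas as pd

df = pd.DataFrame({'cat': pd.Categorical([5, 6, 6, 7, 7], [5, 6, 7, 8]),
                   'i1': [10, 11, 11, 10, 11],
                   'i2': [101, 102, 102, 102, 103]})

# Keep only columns with a numeric dtype; the 'cat' column is dropped
# because its dtype is category, even though its values are integers.
numeric_only = df.select_dtypes(include="number")

totals = numeric_only.sum()
print(totals)
```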

jreback added Bug Groupby Difficulty Intermediate labels May 17, 2016

jreback added this to the Next Major Release milestone May 17, 2016

pijucha mentioned this issue Jun 7, 2016

BUG: Fix groupby with "as_index" for categorical multi #13204 #13394

Closed

4 tasks

pijucha mentioned this issue Jun 9, 2016

Inconsistencies in groupby aggregation with non-numeric types #13416

Closed

jreback modified the milestones: 0.18.2, Next Major Release Jun 17, 2016

pijucha added a commit to pijucha/pandas that referenced this issue Jul 2, 2016

BUG: Fix groupby with as_index for categorical multi groupers pandas-…

374402c

…dev#13204 BUG: Fix string repr of Grouping

jreback closed this as completed in 8f8d75d Jul 3, 2016

clham mentioned this issue Jan 2, 2017

groupby with category column and two additional columns eats up all main memory #14942

Closed

dragoljub mentioned this issue Jan 25, 2017

Categorical Column GroupBy agg with as_index=False produces NaN rows 7.5X Slower with unexpected extra Cardinality #15217

Closed
