
groupby with category column and two additional columns eats up all main memory #14942

Closed
ephes opened this issue Dec 21, 2016 · 9 comments · Fixed by #20583
ephes commented Dec 21, 2016

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'cat': np.random.randint(0, 255, size=3000000),
    'int_id': np.random.randint(0, 255, size=3000000),
    'other_id': np.random.randint(0, 10000, size=3000000),
    'foo': 0
})
df['cat'] = df.cat.astype(str).astype('category')

# killed after 6 minutes of 100% CPU and 90 GB peak main-memory usage
grouped = df.groupby(['cat', 'int_id', 'other_id']).count()

Problem description

The problem occurs only when I group by at least three columns; with one or two columns it works.

If I replace the categorical column with an integer one, the groupby takes only about 2 seconds and uses far less memory. This is also the workaround I currently use when one of the groupby columns has type category, but it is kind of ugly.
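The integer workaround can be sketched as below (a scaled-down version of the frame from the report; sizes are illustrative). Note that pandas 0.23, via #20583 (the PR that closed this issue), later added an `observed` keyword to `groupby` that makes the detour unnecessary:

```python
import numpy as np
import pandas as pd

# Smaller version of the frame from the report.
n = 100_000
df = pd.DataFrame({
    'cat': np.random.randint(0, 255, size=n),
    'int_id': np.random.randint(0, 255, size=n),
    'other_id': np.random.randint(0, 10_000, size=n),
    'foo': 0,
})
df['cat'] = df['cat'].astype(str).astype('category')

# Workaround: group on the integer codes of the categorical instead of
# the categorical column itself.
df['cat_codes'] = df['cat'].cat.codes
grouped = df.groupby(['cat_codes', 'int_id', 'other_id']).count()

# On pandas >= 0.23, observed=True gives only the observed combinations
# directly, without the integer detour:
# grouped = df.groupby(['cat', 'int_id', 'other_id'], observed=True).count()
```

The result has at most one row per observed key combination, so memory stays proportional to the data rather than to the product of all category levels.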

Expected Output

import numpy as np
from collections import defaultdict

class CategoryColGrouper:
    """Map category values to small unsigned ints so groupby stays cheap."""
    def __init__(self):
        self.lookups = defaultdict(dict)
        self.max_num = {}
        self.seen_cat_cols = set()

    def get_data_type(self, max_num):
        # Smallest unsigned integer dtype that can hold max_num.
        if max_num < 2 ** 8:
            return np.uint8
        elif max_num < 2 ** 16:
            return np.uint16
        elif max_num < 2 ** 32:
            return np.uint32
        else:
            return np.uint64

    def get_cat_cols(self, df):
        cat_cols = []
        for col in df.columns:
            if str(df[col].dtype) == 'category':
                cat_cols.append(col)
        return cat_cols

    def get_lookup_for_col(self, df, col):
        col_lookup = self.lookups[col]
        max_num = self.max_num.get(col, 0)

        missing = []
        for value in df[col].unique():
            if value not in col_lookup:
                missing.append(value)

        for num, value in enumerate(missing):
            new_num = num + max_num
            col_lookup[value] = new_num
            self.max_num[col] = new_num

        return col_lookup

    def make_groupable(self, df):
        # Replace each categorical column with compact unsigned-int codes.
        cat_cols = self.get_cat_cols(df)
        self.seen_cat_cols = self.seen_cat_cols.union(cat_cols)
        for cat_col in cat_cols:
            col_lookup = self.get_lookup_for_col(df, cat_col)
            df[cat_col] = df[cat_col].apply(lambda x: col_lookup[x])
            data_type = self.get_data_type(self.max_num[cat_col])
            df[cat_col] = df[cat_col].astype(data_type)
        return df

    def restore_category_cols(self, df):
        # Map the codes back to the original category values.
        for cat_col in self.seen_cat_cols:
            col_lookup = self.lookups[cat_col]
            rcol_lookup = {v: k for k, v in col_lookup.items()}
            df[cat_col] = df[cat_col].apply(lambda x: rcol_lookup[x])
            df[cat_col] = df[cat_col].astype('category')
        return df

ccg = CategoryColGrouper()
df = ccg.make_groupable(df)

grouped = df.groupby(['cat', 'int_id', 'other_id']).count()

df = ccg.restore_category_cols(df)

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 3.2.0-4-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: C
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.19.1
nose: None
pip: 9.0.1
setuptools: 32.1.0.post20161217
Cython: 0.25.2
numpy: 1.11.3
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 2.0.0rc2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.4
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

jreback (Contributor) commented Dec 21, 2016

This is a bug in the internals. It happens with a much smaller frame than in the example; a size of e.g. 10000 is enough to trigger it. Welcome to have you step through and see where.

@jreback jreback added this to the Next Major Release milestone Dec 21, 2016
clham (Contributor) commented Dec 31, 2016

@jreback I've run this down to something that seems intentional here. When we groupby using categoricals, we intentionally add the unused bins back in. The result is that the bucket count explodes (the Cartesian product of ALL categories with the other used indices). From an API perspective, I'm unclear why this is either useful or desired. Happy to try to clean this up, either as a code fix or as a documentation fix -- I just need some background on the intent.
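A small illustration of the explosion clham describes, written with the `observed` keyword pandas 0.23 later added (so both behaviors can be compared side by side): with `observed=False` the result is reindexed to the Cartesian product of the categories with every other key combination, whether or not it occurs in the data.

```python
import pandas as pd

df = pd.DataFrame({
    'cat': pd.Categorical(['a', 'a', 'b', 'b']),
    'int_id': [1, 1, 2, 2],
    'other_id': [7, 8, 7, 8],
    'foo': 0,
})

# All bins, observed or not: 2 categories * 2 int_ids * 2 other_ids = 8 rows.
full = df.groupby(['cat', 'int_id', 'other_id'], observed=False).count()

# Only the 4 key combinations that actually occur in the data.
seen = df.groupby(['cat', 'int_id', 'other_id'], observed=True).count()
```

With 255 categories and thousands of distinct ids, as in the original report, this multiplicative growth is exactly what exhausts main memory.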

jreback (Contributor) commented Jan 1, 2017

ea0a13c was the original change.

IIRC the logic was something like this.

We are grouping on one or more columns/groupands, some of which could be categorical. We get a result set, and then need to construct the result index, which is a MultiIndex (if we have multiple groupands).

I think the issue was that we needed to make sure the original categories appear in the output (for each level of the MultiIndex that was categorical to start with).

I don't really remember why we did a Cartesian product on this. I don't think it is necessary; we can just use the categories (levels) of the original groupand. I suspect if you change this something will break, and you can then dig in and see why.

clham (Contributor) commented Jan 2, 2017

Current tests explicitly expect the product -- this is different behavior than non-categorical groupbys. Are you comfortable with the API change (making categorical groupbys smell the same as non-categorical ones), or is this better left alone and documented as a gotcha?

jreback (Contributor) commented Jan 2, 2017

Can you point to the tests that we would need to change?

Even though I put up an explanation above, I am not sure the reindex to the Cartesian product is actually necessary.

@JanSchulz any thoughts here?

clham (Contributor) commented Jan 2, 2017

Here is where the test expects it: test_categorical.py:L316-L325

Some level of reindexing seems needed (#13204), but perhaps something more like this fix.

jreback (Contributor) commented Jan 2, 2017

Yeah, this seems like we should simply reindex each level as needed (if needed). Give it a try.
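The per-level reindex jreback suggests can be sketched as follows. This is a hypothetical illustration of the idea, not the actual pandas internals: extend only the categorical level to its full set of categories, instead of forming the Cartesian product across all levels.

```python
import pandas as pd

# A grouped result whose (single) key level is categorical; category 'c'
# is unused in the data.
result = pd.Series(
    [5, 7],
    index=pd.CategoricalIndex(['a', 'b'], categories=['a', 'b', 'c'], name='key'),
)

# Reindex just this level to its full category set.
full_level = pd.CategoricalIndex(
    result.index.categories, categories=result.index.categories, name='key'
)
reindexed = result.reindex(full_level)
# The index grows by one row (the unused category 'c'), not multiplicatively.
```

With multiple key levels, the same extension would be applied level by level, so the output size stays additive in the number of unused categories rather than multiplying across levels.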

clham (Contributor) commented Jan 2, 2017

xref: #10484 and the groupby docs on NaN handling

jankatins (Contributor) commented:
sorry, no idea :-/

@jreback jreback modified the milestones: 0.20.0, Next Major Release Jan 25, 2017
@jreback jreback modified the milestones: Next Minor Release, 0.20.0 Mar 29, 2017
@jreback jreback modified the milestones: Interesting Issues, Next Major Release Nov 26, 2017
jreback added three commits to jreback/pandas that referenced this issue Apr 2, 2018
@jreback jreback removed this from the Next Major Release milestone Apr 9, 2018
@jreback jreback added this to the 0.23.0 milestone Apr 9, 2018
jreback added commits to jreback/pandas that referenced this issue on Apr 9, Apr 20, Apr 21, Apr 26, Apr 27, Apr 29, and May 1 (twice), 2018
TomAugspurger pushed a commit that referenced this issue May 1, 2018