
groupby with category column and two additional columns eats up all main memory #14942

Closed
ephes opened this issue Dec 21, 2016 · 9 comments · Fixed by #20583
ephes commented Dec 21, 2016

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'cat': np.random.randint(0, 255, size=3000000),
    'int_id': np.random.randint(0, 255, size=3000000),
    'other_id': np.random.randint(0, 10000, size=3000000),
    'foo': 0
})
df['cat'] = df.cat.astype(str).astype('category')

# killed after 6 minutes of 100% CPU and 90 GB peak main-memory usage
grouped = df.groupby(['cat', 'int_id', 'other_id']).count()

Problem description

The problem occurs only when I group by at least three columns; with one or two columns it works.

If I replace the categorical column with an integer one, the groupby takes only about 2 seconds and uses far less memory. This is also the workaround I currently use when one of the groupby columns has type category, but it is kind of ugly.
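The integer workaround can be sketched as below (a scaled-down version of the frame from the report; sizes are illustrative). Note that pandas 0.23, via #20583 (the PR that closed this issue), later added an `observed` keyword to `groupby` that makes the detour unnecessary:

```python
import numpy as np
import pandas as pd

# Smaller version of the frame from the report.
n = 100_000
df = pd.DataFrame({
    'cat': np.random.randint(0, 255, size=n),
    'int_id': np.random.randint(0, 255, size=n),
    'other_id': np.random.randint(0, 10_000, size=n),
    'foo': 0,
})
df['cat'] = df['cat'].astype(str).astype('category')

# Workaround: group on the integer codes of the categorical instead of
# the categorical column itself.
df['cat_codes'] = df['cat'].cat.codes
grouped = df.groupby(['cat_codes', 'int_id', 'other_id']).count()

# On pandas >= 0.23, observed=True gives only the observed combinations
# directly, without the integer detour:
# grouped = df.groupby(['cat', 'int_id', 'other_id'], observed=True).count()
```

The result has at most one row per observed key combination, so memory stays proportional to the data rather than to the product of all category levels.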

Expected Output

import numpy as np
from collections import defaultdict

class CategoryColGrouper:
    """Map category values to small unsigned ints so groupby stays cheap."""
    def __init__(self):
        self.lookups = defaultdict(dict)
        self.max_num = {}
        self.seen_cat_cols = set()

    def get_data_type(self, max_num):
        # Smallest unsigned integer dtype that can hold max_num.
        if max_num < 2 ** 8:
            return np.uint8
        elif max_num < 2 ** 16:
            return np.uint16
        elif max_num < 2 ** 32:
            return np.uint32
        else:
            return np.uint64

    def get_cat_cols(self, df):
        cat_cols = []
        for col in df.columns:
            if str(df[col].dtype) == 'category':
                cat_cols.append(col)
        return cat_cols

    def get_lookup_for_col(self, df, col):
        col_lookup = self.lookups[col]
        max_num = self.max_num.get(col, 0)

        missing = []
        for value in df[col].unique():
            if value not in col_lookup:
                missing.append(value)

        for num, value in enumerate(missing):
            new_num = num + max_num
            col_lookup[value] = new_num
            self.max_num[col] = new_num

        return col_lookup

    def make_groupable(self, df):
        # Replace each categorical column with compact unsigned-int codes.
        cat_cols = self.get_cat_cols(df)
        self.seen_cat_cols = self.seen_cat_cols.union(cat_cols)
        for cat_col in cat_cols:
            col_lookup = self.get_lookup_for_col(df, cat_col)
            df[cat_col] = df[cat_col].apply(lambda x: col_lookup[x])
            data_type = self.get_data_type(self.max_num[cat_col])
            df[cat_col] = df[cat_col].astype(data_type)
        return df

    def restore_category_cols(self, df):
        # Map the codes back to the original category values.
        for cat_col in self.seen_cat_cols:
            col_lookup = self.lookups[cat_col]
            rcol_lookup = {v: k for k, v in col_lookup.items()}
            df[cat_col] = df[cat_col].apply(lambda x: rcol_lookup[x])
            df[cat_col] = df[cat_col].astype('category')
        return df

ccg = CategoryColGrouper()
df = ccg.make_groupable(df)

grouped = df.groupby(['cat', 'int_id', 'other_id']).count()

df = ccg.restore_category_cols(df)

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 3.2.0-4-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: C
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.19.1
nose: None
pip: 9.0.1
setuptools: 32.1.0.post20161217
Cython: 0.25.2
numpy: 1.11.3
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 2.0.0rc2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.4
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

jreback (Contributor) commented Dec 21, 2016

This is a bug in the internals. It happens with a much smaller frame than in the example; a size of e.g. 10000 is enough to trigger it. Welcome to have you step through and see where.

@jreback jreback added this to the Next Major Release milestone Dec 21, 2016
clham (Contributor) commented Dec 31, 2016

@jreback I've run this down to something that seems intentional here. When we groupby using categoricals, we intentionally add the unused bins back in. The result is that the bucket count explodes (the Cartesian product of ALL categories with the other used indices). From an API perspective, I'm unclear why this is either useful or desired. Happy to try to clean this up, either as a code fix or as a documentation fix -- I just need some background on the intent.
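A small illustration of the explosion clham describes, written with the `observed` keyword pandas 0.23 later added (so both behaviors can be compared side by side): with `observed=False` the result is reindexed to the Cartesian product of the categories with every other key combination, whether or not it occurs in the data.

```python
import pandas as pd

df = pd.DataFrame({
    'cat': pd.Categorical(['a', 'a', 'b', 'b']),
    'int_id': [1, 1, 2, 2],
    'other_id': [7, 8, 7, 8],
    'foo': 0,
})

# All bins, observed or not: 2 categories * 2 int_ids * 2 other_ids = 8 rows.
full = df.groupby(['cat', 'int_id', 'other_id'], observed=False).count()

# Only the 4 key combinations that actually occur in the data.
seen = df.groupby(['cat', 'int_id', 'other_id'], observed=True).count()
```

With 255 categories and thousands of distinct ids, as in the original report, this multiplicative growth is exactly what exhausts main memory.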

jreback (Contributor) commented Jan 1, 2017

ea0a13c was the original change.

IIRC the logic was something like this.

We are grouping on one or more columns/groupands, some of which could be categorical. We get a result set, and then need to construct the result index, which is a MultiIndex (if we have multiple groupands).

I think the issue was that we needed to make sure the original categories appear in the output (for each level of the MultiIndex that was categorical to start with).

I don't really remember why we did a Cartesian product on this. I don't think it is necessary; we can just use the categories (levels) of the original groupand. I suspect if you change this something will break, and you can then dig in and see why.

clham (Contributor) commented Jan 2, 2017

Current tests explicitly expect the product -- this is different behavior than non-categorical groupbys. Are you comfortable with the API change (making categorical groupbys smell the same as non-categorical ones), or is this better left alone and documented as a gotcha?

jreback (Contributor) commented Jan 2, 2017

Can you point to the tests that we would need to change?

Even though I put up an explanation above, I am not sure the reindex to the Cartesian product is actually necessary.

@JanSchulz any thoughts here?

clham (Contributor) commented Jan 2, 2017

Here is where the test expects it: test_categorical.py:L316-L325

Some level of reindexing seems needed (#13204), but perhaps something more like this fix.

jreback (Contributor) commented Jan 2, 2017

Yeah, this seems like we should simply reindex each level as needed (if needed). Give it a try.
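The per-level reindex jreback suggests can be sketched as follows. This is a hypothetical illustration of the idea, not the actual pandas internals: extend only the categorical level to its full set of categories, instead of forming the Cartesian product across all levels.

```python
import pandas as pd

# A grouped result whose (single) key level is categorical; category 'c'
# is unused in the data.
result = pd.Series(
    [5, 7],
    index=pd.CategoricalIndex(['a', 'b'], categories=['a', 'b', 'c'], name='key'),
)

# Reindex just this level to its full category set.
full_level = pd.CategoricalIndex(
    result.index.categories, categories=result.index.categories, name='key'
)
reindexed = result.reindex(full_level)
# The index grows by one row (the unused category 'c'), not multiplicatively.
```

With multiple key levels, the same extension would be applied level by level, so the output size stays additive in the number of unused categories rather than multiplying across levels.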

clham (Contributor) commented Jan 2, 2017

xref: #10484 and the groupby docs on NaN handling

jankatins (Contributor) commented:
sorry, no idea :-/

@jreback jreback modified the milestones: 0.20.0, Next Major Release Jan 25, 2017
@jreback jreback modified the milestones: Next Minor Release, 0.20.0 Mar 29, 2017
@jreback jreback modified the milestones: Interesting Issues, Next Major Release Nov 26, 2017
jreback added three commits to jreback/pandas that referenced this issue Apr 2, 2018
@jreback jreback removed this from the Next Major Release milestone Apr 9, 2018
@jreback jreback added this to the 0.23.0 milestone Apr 9, 2018
jreback added commits to jreback/pandas that referenced this issue on Apr 9, Apr 20, Apr 21, Apr 26, Apr 27, Apr 29, and May 1 (twice), 2018
TomAugspurger pushed a commit that referenced this issue May 1, 2018