groupby with category column and two additional columns eats up all main memory #14942
Comments
this is a bug in the internals. A value much lower than the size of the frame (e.g. 10000) is enough to trigger it. welcome to have you step thru and see where.
@jreback I've run this down to something that seems intentional here. When we …
ea0a13c was the original change. IIRC the logic was something like this: we are grouping on one or more columns (groupands), some of which could be categorical. We get a result set, and now we need to construct the result index, which is a MultiIndex (if we have multiple groupands). I think the issue was that we needed to make sure the original categories appear in the output (for each level of the MultiIndex that was categorical to start). I don't really remember why we did a cartesian product on this; I don't think it is necessary. We can just use the categories (levels) of the original groupand. I suspect if you change this, something will break, and you can then dig in and see why.
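To make the blow-up concrete, here is a minimal sketch (column names and sizes are illustrative, not from this issue): only 100 groups are observed, but on the affected versions the result is reindexed to the full cartesian product of the levels.

```python
import numpy as np
import pandas as pd

# Illustrative blow-up: only 100 (a, b, c) combinations occur in the
# data, but on the affected versions the result index is expanded to
# all 100 * 100 * 100 combinations.
n = 100
df = pd.DataFrame({
    "a": pd.Categorical(np.arange(n)),  # categorical groupand, n categories
    "b": np.arange(n),
    "c": np.arange(n),
    "x": 1.0,
})
res = df.groupby(["a", "b", "c"])["x"].sum()
print(len(df))   # 100 rows -> 100 observed groups
print(len(res))  # 1_000_000 rows after reindexing to the cartesian product
```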
Current tests are explicitly set to expect the product -- this is different behavior from non-categorical groupbys. Are you comfortable with the API change (making categorical groupbys smell the same as non-categorical ones), or is this better left alone and documented as a gotcha?
can you point to the tests that we would need to change? Even though I put up an explanation above, I am not sure this is actually necessary (to reindex to the cartesian product). @JanSchulz any thoughts here?
Here is where the test is expecting it: test_categorical.py:L316-L325. Some level of reindexing seems needed (#13204), but perhaps something more like this fix.
yeah, this seems like we should simply reindex each level as needed (if needed). give it a try.
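For what per-level reindexing might look like, a hedged sketch follows (the helper name and signature are hypothetical, not the eventual fix): only the categorical level is widened to carry the full set of categories, instead of reindexing the rows to the product of all levels.

```python
import pandas as pd

# Hypothetical helper (not the eventual pandas fix): widen just the
# categorical level of a result MultiIndex to the full categories,
# assuming `categories` is a superset of the observed values, instead
# of reindexing the rows to the product of all levels.
def widen_categorical_level(index: pd.MultiIndex, level: int,
                            categories) -> pd.MultiIndex:
    observed = index.levels[level]
    widened = pd.CategoricalIndex(observed, categories=categories)
    return index.set_levels(widened, level=level)
```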
xref: #10484 and …
sorry, no idea :-/ |
Code Sample, a copy-pastable example if possible
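A hedged reconstruction of the kind of frame described below (all names and sizes are assumptions, not the reporter's actual data): one high-cardinality category column plus two ordinary integer columns, grouped by all three.

```python
import numpy as np
import pandas as pd

# Hypothetical reconstruction (names/sizes are assumptions): one
# high-cardinality category column plus two ordinary integer columns.
n = 1_000_000
df = pd.DataFrame({
    "cat": pd.Categorical(np.random.randint(0, 10_000, n)),
    "a": np.random.randint(0, 100, n),
    "b": np.random.randint(0, 100, n),
    "val": np.random.randn(n),
})

# On the affected versions this builds a result index with
# 10_000 * 100 * 100 = 100_000_000 entries and eats up main memory.
result = df.groupby(["cat", "a", "b"])["val"].sum()
```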
Problem description
The problem occurs only when I try to group by at least three columns; with one or two columns it works.
If I replace the categorical column with an integer one, the groupby takes only about 2 seconds and does not use nearly as much memory. This is also the workaround I currently use when I have to group by columns where one of them has the type category, but it is kind of ugly.
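A sketch of that workaround, reusing the hypothetical `df` from the code sample above: group on the integer codes instead of the categorical column, then map the codes in the result index back to the categories.

```python
# Workaround sketch, reusing the hypothetical `df` from the code
# sample: group on the integer codes instead of the categorical
# column, so only observed groups appear in the result.
df["cat_code"] = df["cat"].cat.codes
result = df.groupby(["cat_code", "a", "b"])["val"].sum()

# Map the codes in the first index level back to the categories.
result.index = result.index.set_levels(
    df["cat"].cat.categories[result.index.levels[0]], level=0
)
```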
Expected Output
Output of pd.show_versions()
pandas: 0.19.1
nose: None
pip: 9.0.1
setuptools: 32.1.0.post20161217
Cython: 0.25.2
numpy: 1.11.3
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 2.0.0rc2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.4
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None