BUG: aggregation on ordered categorical column drops grouping index or crashes, depending on context #27800
Thanks for the report. Are you interested in debugging to see where things go wrong?
@kpflugshaupt I ran your code samples on pandas 0.24.2 and did not get any errors. Some later update must have broken this; I can look into it. @TomAugspurger, is there an easy way to see which changes between 0.24.2 and 0.25.0 would have affected this functionality? The whatsnew docs (could be a lot of combing)?
Hi @ncernek, I also cannot recall seeing this before 0.25.0, so the upgrade has probably introduced it. As to how to tackle it: given time, I would debug the call and either catch where it goes wrong (comparing against a working call), or list all the visited files and cross-reference the whatsnew file. Thanks for looking into this! I may also have a go, but not in the next few days -- too much work. Cheers
You can also use git bisect to find the faulty commit. Might be able to look through PRs in the git log that mention groupby or categorical to narrow down the bisect range. |
I went the
Result of
Looks like a lot of thinking went into that PR -- maybe those of you who worked on it (e.g. @jreback) want to look into this bug further?
Thanks @ncernek. At least for this specific one, it's from how concat handles a mix of categorical & non-categorical indexes:

```
In [2]: a = pd.DataFrame({"A": [1, 2]})

In [3]: b = pd.DataFrame({"B": [3, 4]}, index=pd.CategoricalIndex(['a', 'b']))

In [4]: pd.concat([a, b], axis=1)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-a98a78ec2995> in <module>
----> 1 pd.concat([a, b], axis=1)

~/sandbox/pandas/pandas/core/reshape/concat.py in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    256 )
    257
--> 258 return op.get_result()
    259
    260

~/sandbox/pandas/pandas/core/reshape/concat.py in get_result(self)
    466 obj_labels = mgr.axes[ax]
    467 if not new_labels.equals(obj_labels):
--> 468 indexers[ax] = obj_labels.reindex(new_labels)[1]
    469
    470 mgrs_indexers.append((obj._data, indexers))

~/sandbox/pandas/pandas/core/indexes/category.py in reindex(self, target, method, level, limit, tolerance)
    615 # coerce to a regular index here!
    616 result = Index(np.array(self), name=self.name)
--> 617 new_target, indexer, _ = result._reindex_non_unique(np.array(target))
    618 else:
    619

~/sandbox/pandas/pandas/core/indexes/base.py in _reindex_non_unique(self, target)
   3388
   3389 target = ensure_index(target)
-> 3390 indexer, missing = self.get_indexer_non_unique(target)
   3391 check = indexer != -1
   3392 new_labels = self.take(indexer[check])

~/sandbox/pandas/pandas/core/indexes/base.py in get_indexer_non_unique(self, target)
   4751 tgt_values = target._ndarray_values
   4752
-> 4753 indexer, missing = self._engine.get_indexer_non_unique(tgt_values)
   4754 return ensure_platform_int(indexer), missing
   4755

~/sandbox/pandas/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_indexer_non_unique()
    305 # increasing, then use binary search for each starget
    306 for starget in stargets:
--> 307 start = values.searchsorted(starget, side='left')
    308 end = values.searchsorted(starget, side='right')
    309 if start != end:

TypeError: '<' not supported between instances of 'str' and 'int'
```
I think that should coerce both to object-dtype index before going too far.
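As a user-side workaround in the same spirit as that suggestion, the coercion can be done before calling concat: cast both indexes to a plain object-dtype Index so the engine never compares an integer label against a string label. A minimal sketch (this is not the library fix, just the coercion applied by hand):

```python
import pandas as pd

a = pd.DataFrame({"A": [1, 2]})  # default integer index
b = pd.DataFrame({"B": [3, 4]}, index=pd.CategoricalIndex(["a", "b"]))

# Coerce both indexes to object dtype up front, mirroring the suggested fix.
a2 = a.set_axis(a.index.astype(object))
b2 = b.set_axis(b.index.astype(object))

# Outer-joins to four rows (0, 1, 'a', 'b') instead of raising TypeError.
result = pd.concat([a2, b2], axis=1)
```

With both indexes already object dtype, the union of `[0, 1]` and `['a', 'b']` no longer needs an ordered comparison between str and int.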
Here is another bug which is probably another face of this one: aggregating a categorical column in a groupby yields a Categorical object rather than a Series. Here is the code:
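The snippet itself did not survive in this copy of the thread, but the shape of the complaint can be reconstructed. A minimal sketch, with hypothetical column and category names, of what such a call looks like and what it should return:

```python
import pandas as pd

df = pd.DataFrame({
    "grp": ["x", "x", "y"],
    "cat": pd.Categorical(["low", "high", "low"],
                          categories=["low", "high"], ordered=True),
})

# The reporter saw a bare Categorical come back on 0.25; the expected
# result is a Series indexed by the group keys.
result = df.groupby("grp")["cat"].agg("first")
```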
I suspect that this bug was introduced by the following change in 0.25: aggregation in groupby in pandas 0.24 dropped the categorical dtype, replacing it with object (i.e. strings), and 0.25 introduced additional code to restore the original dtype. It seems to me that this new code is to blame for the bug. I also forgot to mention that aggregating categorical columns takes a huge amount of time on large datasets, far more than for other column types. For this reason I bypassed it entirely in my code, aggregating string columns instead, despite the memory cost.
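The bypass the commenter describes can be sketched roughly like this (column names are hypothetical): convert the categorical to plain strings, aggregate, then restore the ordered categorical dtype on the result, trading memory for speed:

```python
import pandas as pd

df = pd.DataFrame({
    "grp": ["x", "x", "y"],
    "cat": pd.Categorical(["low", "high", "low"],
                          categories=["low", "high"], ordered=True),
})

dtype = df["cat"].dtype  # remember the ordered categorical dtype

# Aggregate on plain strings (faster in 0.25, more memory), then cast back.
agg = (
    df.assign(cat=df["cat"].astype(str))
      .groupby("grp")["cat"]
      .first()
      .astype(dtype)
)
```

Note that order-sensitive aggregations like min/max would compare the strings lexicographically here, not in category order, so this trick only applies cleanly to order-insensitive aggregations such as first/last.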
Looks like the examples work on master. Could use tests.
This seems solved in the current production release. In my testing, all cases from the initial report came out correctly, with no regressions. Output of
Works fine on 1.0.4. |
@kpflugshaupt we'll want to add a regression test to ensure that this issue doesn't occur again. Are you interested in submitting a PR with that test?
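A regression test along these lines could cover the combined-aggregation case that used to raise. This is only a sketch (the data and column names are hypothetical, and the real test would live in the pandas test suite and use its fixtures):

```python
import pandas as pd

def test_combined_agg_on_ordered_categorical():
    # Regression check for GH#27800: combined aggregations over a numeric
    # and an ordered categorical column must neither raise a TypeError nor
    # drop the grouping index.
    df = pd.DataFrame({
        "grp": ["x", "x", "y"],
        "num": [1, 2, 3],
        "cat": pd.Categorical(["low", "high", "low"],
                              categories=["low", "high"], ordered=True),
    })
    result = df.groupby("grp").agg({"num": "sum", "cat": ["min", "max"]})
    assert list(result.index) == ["x", "y"]   # grouping index preserved
    assert result.shape == (2, 3)             # sum + min + max columns

test_combined_agg_on_ordered_categorical()
```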
@mroeschke Not sure I'll have the time -- work & family tend to get in the way. Possibly in the week starting July 20th. I'll try! |
take |
Code Sample
Build the model data frame:
When grouping, single aggregations on a numeric column work:
Single aggregations on an ordered categorical column work, but drop the grouping index:
Combined single aggregations on a numeric and an ordered categorical column work:
Multiple aggregations on an ordered categorical column work, but drop the grouping index:
Combined aggregations on a numeric (single) and an ordered categorical column (multiple) fail with a TypeError:
Combined aggregations on a numeric (multiple) and an ordered categorical column (single) also fail with the same TypeError:
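The code cells for the steps above were lost in this copy of the issue. A minimal reconstruction of the setup and the calls, with hypothetical column names (`grp`, `num`, `cat`), annotated with the behavior reported on 0.25:

```python
import pandas as pd

# Model frame: grouping column, numeric column, ordered categorical column.
df = pd.DataFrame({
    "grp": ["x", "x", "y", "y"],
    "num": [1, 2, 3, 4],
    "cat": pd.Categorical(["low", "high", "high", "low"],
                          categories=["low", "high"], ordered=True),
})
g = df.groupby("grp")

single_num = g["num"].agg("mean")    # numeric single aggregation: worked
single_cat = g["cat"].agg("first")   # worked, but dropped the grouping index on 0.25

# Combined numeric (single) + categorical (multiple): raised TypeError on 0.25.
combined = g.agg({"num": "mean", "cat": ["first", "last"]})
```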
Problem description
Aggregations on ordered categoricals drop the grouping index, or crash, as shown above.
This makes it hard to calculate combined aggregations over big data sets correctly and efficiently.
Expected Output
Aggregations on ordered categoricals should work as on non-categorical columns.
Output of pd.show_versions():
```
INSTALLED VERSIONS
------------------
commit           : None
python           : 3.7.3.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
machine          : AMD64
processor        : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : None.None

pandas           : 0.25.0
numpy            : 1.16.4
pytz             : 2019.1
dateutil         : 2.8.0
pip              : 19.1.1
setuptools       : 41.0.1
Cython           : 0.29.12
pytest           : 5.0.1
hypothesis       : None
sphinx           : 2.1.2
blosc            : None
feather          : None
xlsxwriter       : 1.1.8
lxml.etree       : 4.3.4
html5lib         : 1.0.1
pymysql          : None
psycopg2         : None
jinja2           : 2.10.1
IPython          : 7.7.0
pandas_datareader: None
bs4              : 4.7.1
bottleneck       : 1.2.1
fastparquet      : None
gcsfs            : None
lxml.etree       : 4.3.4
matplotlib       : 3.1.1
numexpr          : 2.6.9
odfpy            : None
openpyxl         : 2.6.2
pandas_gbq       : None
pyarrow          : 0.11.1
pytables         : None
s3fs             : None
scipy            : 1.3.0
sqlalchemy       : 1.3.5
tables           : 3.5.2
xarray           : None
xlrd             : 1.2.0
xlwt             : 1.3.0
xlsxwriter       : 1.1.8
```