BUG: Empty CategoricalIndex fails with boolean categories #22710
Hello @pganssle! Thanks for submitting the PR.
Nice!
doc/source/whatsnew/v0.23.5.txt
Outdated
@@ -26,7 +26,7 @@ Fixed Regressions
- Calling :meth:`DataFrameGroupBy.rank` and :meth:`SeriesGroupBy.rank` with empty groups
  and ``pct=True`` was raising a ``ZeroDivisionError`` due to `c1068d9
  <https://github.com/pandas-dev/pandas/commit/c1068d9d242c22cb2199156f6fb82eb5759178ae>`_ (:issue:`22519`)
-
- Constructing a :class:`pd.CategoricalIndex` with empty values and boolean categories was raising a ``ValueError`` after a change to dtype coercion in `78184868 <https://github.com/pandas-dev/pandas/commit/7818486859d1aba53ce359b93cfc772e688958e5>`_ (:issue:`22702`).
Because this wasn't introduced in 0.23.x (but rather in 0.21.x), let's move this to 0.24.0.
OK, moved.
f1fec08
to
8d85e9d
Compare
pandas/core/arrays/categorical.py
Outdated
# These may be Index, in which case their dtype would be coerced
# as part of _get_data_algo; in that case, we should retrieve the
# underlying numpy array instead.
values = getattr(values, 'values', values)
I think you can just use np.asarray here, as we have to re-infer anyhow. But please check the perf of this. The comment can be a bit simpler and should reference the GH issue number.
In [4]: categories = pd.Index([True, False], dtype='object')
In [5]: values = np.array([], dtype='object')
In [6]: %timeit np.asarray(values)
260 ns ± 1.41 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [7]: %timeit np.asarray(categories)
2.2 µs ± 21.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [8]: %timeit getattr(values, 'values', values)
312 ns ± 1.26 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [9]: %timeit getattr(categories, 'values', categories)
410 ns ± 4.23 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [10]: categories = pd.Index(list('abcdefghijklmnopqrstuvwxyz'), dtype='object')
In [11]: categories
Out[11]:
Index(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n',
       'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z'],
      dtype='object')
In [12]: %timeit np.asarray(categories)
2.09 µs ± 9.76 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [13]: %timeit getattr(categories, 'values', categories)
409 ns ± 6.44 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Looks like the way I've done it is faster, but np.asarray doesn't seem to be scaling with the number of categories, so I don't think it's a huge deal to use np.asarray. That said, I don't love the idea of using np.asarray in this situation, because although I can't think of a time when it would present a problem, np.asarray is allowed to infer dtype. The way I've done it you're guaranteed to either get the original thing back or the underlying numpy array, which is pretty much as light a touch as I can imagine.
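A minimal sketch of the "light touch" described in this comment, assuming a recent pandas and NumPy. The helper name unwrap is hypothetical and only wraps the getattr call from the diff; it is not part of pandas.

```python
import numpy as np
import pandas as pd


def unwrap(obj):
    """Return obj.values when that attribute exists, else obj itself.

    Hypothetical helper: it only wraps the getattr call from the diff.
    """
    return getattr(obj, "values", obj)


values = np.array([], dtype=object)
categories = pd.Index([True, False], dtype=object)

# A plain ndarray has no .values attribute, so it comes back unchanged.
unwrapped_values = unwrap(values)
# An Index exposes its backing NumPy array through .values.
unwrapped_categories = unwrap(categories)

print(unwrapped_values is values)
print(type(unwrapped_categories).__name__, unwrapped_categories.dtype)
```

Either the same object or its backing array is returned, so no new dtype inference step is introduced by the call itself.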
8d85e9d
to
9905063
Compare
I've pushed new changes that address all comments (though I did not switch over to |
tiny code change, otherwise lgtm. ping on green.
pandas/core/arrays/categorical.py
Outdated
@@ -2439,11 +2439,15 @@ def _get_codes_for_values(values, categories):
    """
    utility routine to turn values into codes given the specified categories
    """

    from pandas.core.algorithms import _get_data_algo, _hashtables
    if not is_dtype_equal(values.dtype, categories.dtype):
        values = ensure_object(values)
can you flip this block around, e.g.
if is_dtype_equal(...):
    # your added code
else:
    # ensure object
I can do that if you want. One thing to note is that I believe the if not is_dtype_equal case is probably the "common case", since it doesn't occur with many dtypes like integers or strings.
I don't think there's any speed advantage to being in the if branch vs. the else branch, but depending on convention you may prefer the common case to be the first block and the rare case to be the else block. Up to you.
It's not about speed or anything else; it's about consistency in our code in how we do these kinds of tests.
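For reference, the flipped layout requested above might look roughly like the sketch below. This is an illustration under stated assumptions, not the code that was merged: prepare_for_lookup is a hypothetical name, is_dtype_equal is taken from the public pandas.api.types module, and np.asarray(..., dtype=object) stands in for the internal ensure_object helper.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_dtype_equal


def prepare_for_lookup(values, categories):
    """Hypothetical stand-in for the dtype handling in _get_codes_for_values."""
    if is_dtype_equal(values.dtype, categories.dtype):
        # Equal dtypes: take the underlying NumPy arrays so that an Index
        # is not re-coerced to a different dtype later on (see gh-22702).
        values = getattr(values, "values", values)
        categories = getattr(categories, "values", categories)
    else:
        # Differing dtypes: fall back to object arrays on both sides.
        # np.asarray(..., dtype=object) is an assumption standing in for
        # the internal ensure_object call.
        values = np.asarray(values, dtype=object)
        categories = np.asarray(categories, dtype=object)
    return values, categories


same_values, same_categories = prepare_for_lookup(
    np.array([], dtype=object), pd.Index([True, False], dtype=object)
)
mixed_values, mixed_categories = prepare_for_lookup(
    np.array([1, 2]), pd.Index(["a", "b"])
)
print(same_categories.dtype, mixed_values.dtype, mixed_categories.dtype)
```

Both branches hand back plain NumPy arrays, which keeps the positive test first as the review asked.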
9905063
to
cfeb81e
Compare
Codecov Report
@@            Coverage Diff            @@
##           master   #22710     +/-   ##
==========================================
+ Coverage   92.17%   92.17%   +<.01%
==========================================
  Files         169      169
  Lines       50769    50771       +2
==========================================
+ Hits        46798    46800       +2
  Misses       3971     3971
Continue to review full report at Codecov.
All issues should be addressed now. I'm guessing the azure pipeline failure is unrelated. I can rebase against master if it's fixed in master.
Currently working on the azure stuff, you can safely ignore it.
lgtm. merging after azure updates by @TomAugspurger
Sorry about all the noise, your PR is my guinea pig :) Things should be OK now. Ping on green.
c1ff9e0
to
58c9d54
Compare
@TomAugspurger Green now.
Thanks!
…#22710) * TST: Add failing test for empty bool Categoricals * BUG: Failure in empty boolean CategoricalIndex Fixes GH pandas-dev#22702.
Fixes #22702.
This bug was introduced in 7818486859d1aba53; per my comment, the problem is here:
When categories is Index([True], dtype='object') and values is array([], dtype='object'), the ensure_object call is bypassed, but in _get_data_algo, an Index consisting entirely of boolean objects will be coerced to uint64, which violates the assumption that values and categories have the same dtype.
I felt that retrieving the underlying numpy arrays (if any exist) is the safest way to handle this without having too many wide-reaching effects across the rest of the codebase, but there might be a better way to enforce that these are not coerced into different data types.
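A short check of the case described above, assuming a pandas version that includes this fix. On affected versions the constructor call is reported in the issue to raise a ValueError instead.

```python
import pandas as pd

# Empty values with boolean categories: the case from issue 22702.
ci = pd.CategoricalIndex([], categories=[True, False])

# Sort before comparing so the check does not depend on category order.
categories = sorted(ci.categories.tolist())

print(len(ci))
print(categories)
```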
git diff upstream/master -u -- "*.py" | flake8 --diff