BUG: Empty CategoricalIndex fails with boolean categories #22710

pganssle · 2018-09-14T14:02:41Z

This bug was introduced in 7818486859d1aba53; per my comment, the problem is here:

    if not is_dtype_equal(values.dtype, categories.dtype):
        values = ensure_object(values)
        categories = ensure_object(categories)

    (hash_klass, vec_klass), vals = _get_data_algo(values, _hashtables)
    (_, _), cats = _get_data_algo(categories, _hashtables)

When categories is Index([True], dtype='object') and values is array([], dtype='object'), the ensure_object call is bypassed, but in _get_data_algo, an Index consisting entirely of boolean objects will be coerced to uint64, which violates the assumption that values and categories have the same dtype.
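A minimal reproduction of the case described above might look like the sketch below. It assumes a pandas release that already includes this fix; on affected releases the constructor call is reported to raise a ValueError. The variable name ci is only illustrative.

```python
import pandas as pd

# Empty values combined with boolean categories: the combination
# that triggered the failure described above.
ci = pd.CategoricalIndex([], categories=[True, False])

print(len(ci))              # number of values in the index
print(list(ci.categories))  # the categories that were passed in
```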

I felt that retrieving the underlying numpy arrays (if any exist) is the safest way to handle this without having too many wide-reaching effects across the rest of the codebase, but there might be a better way to enforce that these are not coerced into different data types.
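The "retrieve the underlying array if there is one" idea can be sketched in plain Python. FakeIndex and unwrap are hypothetical names used only for illustration and are not part of pandas; the only point shown is that getattr with a default returns either the wrapped payload or the original object unchanged.

```python
class FakeIndex:
    """Hypothetical stand-in for an Index-like wrapper (not a pandas class)."""

    def __init__(self, data):
        # The "underlying array" that the wrapper exposes as .values.
        self.values = list(data)


def unwrap(obj):
    # Return obj.values when that attribute exists, otherwise obj itself.
    return getattr(obj, "values", obj)


wrapped = FakeIndex([True])
plain = []  # a bare container with no .values attribute

unwrapped_index = unwrap(wrapped)  # the list held inside FakeIndex
unwrapped_plain = unwrap(plain)    # the very same list object, untouched
```

Because nothing is converted or re-inferred, the call cannot change the type of an object that is already bare.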

closes #xxxx
tests added
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

pep8speaks · 2018-09-14T14:02:45Z

Hello @pganssle! Thanks for submitting the PR.

There are no PEP8 issues in the file pandas/core/arrays/categorical.py !
There are no PEP8 issues in the file pandas/tests/arrays/categorical/test_constructors.py !

gfyoung

Nice!

cc @TomAugspurger @jreback

gfyoung · 2018-09-14T21:24:09Z

doc/source/whatsnew/v0.23.5.txt

@@ -26,7 +26,7 @@ Fixed Regressions
 - Calling :meth:`DataFrameGroupBy.rank` and :meth:`SeriesGroupBy.rank` with empty groups
  and ``pct=True`` was raising a ``ZeroDivisionError`` due to `c1068d9
  <https://github.com/pandas-dev/pandas/commit/c1068d9d242c22cb2199156f6fb82eb5759178ae>`_ (:issue:`22519`)
-
+- Constructing a :class:`pd.CategoricalIndex` with empty values and boolean categories was raising a ``ValueError`` after a change to dtype coercion in `78184868 <https://github.com/pandas-dev/pandas/commit/7818486859d1aba53ce359b93cfc772e688958e5>`_ (:issue:`22702`).


Because this wasn't introduced in 0.23.x (but rather in 0.21.x), let's move this to 0.24.0.

pandas/tests/arrays/categorical/test_constructors.py

jreback · 2018-09-15T12:03:21Z

pandas/core/arrays/categorical.py

+        # These may be Index, in which case their dtype would be coerced
+        # as part of _get_data_algo; in that case, we should retrieve the
+        # underlying numpy array instead.
+        values = getattr(values, 'values', values)


I think you can just use np.asarray here, as we have to re-infer anyhow. But please check the performance of this. The comment can be a bit simpler and should reference the GH issue number.

    In [4]: categories = pd.Index([True, False], dtype='object')

    In [5]: values = np.array([], dtype='object')

    In [6]: %timeit np.asarray(values)
    260 ns ± 1.41 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

    In [7]: %timeit np.asarray(categories)
    2.2 µs ± 21.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

    In [8]: %timeit getattr(values, 'values', values)
    312 ns ± 1.26 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

    In [9]: %timeit getattr(categories, 'values', categories)
    410 ns ± 4.23 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

    In [10]: categories = pd.Index(list('abcdefghijklmnopqrstuvwxyz'), dtype='object')

    In [11]: categories
    Out[11]: Index(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z'], dtype='object')

    In [12]: %timeit np.asarray(categories)
    2.09 µs ± 9.76 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

    In [13]: %timeit getattr(categories, 'values', categories)
    409 ns ± 6.44 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Looks like the way I've done it is faster, but np.asarray doesn't seem to be scaling with the number of categories, so I don't think it's a huge deal to use np.asarray. That said, I don't love the idea of using np.asarray in this situation, because although I can't think of a time when it would present a problem, np.asarray is allowed to infer dtype. The way I've done it you're guaranteed to either get the original thing back or the underlying numpy array, which is pretty much as light a touch as I can imagine.

doc/source/whatsnew/v0.24.0.txt

pganssle · 2018-09-15T14:16:10Z

I've pushed new changes that address all comments (though I did not switch over to asarray).

jreback

tiny code change, otherwise lgtm. ping on green.

jreback · 2018-09-18T11:20:03Z

pandas/core/arrays/categorical.py

@@ -2439,11 +2439,15 @@ def _get_codes_for_values(values, categories):
    """
    utility routine to turn values into codes given the specified categories
    """
-
    from pandas.core.algorithms import _get_data_algo, _hashtables
    if not is_dtype_equal(values.dtype, categories.dtype):
        values = ensure_object(values)


can you flip this block around, e.g.

    if is_dtype_equal(...):
        # your added code
    else:
        # ensure object

I can do that if you want. One thing to note is that I believe the if not is_dtype_equal case is probably the "common case", since it doesn't occur with many dtypes like integers or strings.

I don't think there's any speed advantage to being in the if branch vs. the else branch, but depending on convention you may prefer the common case to be the first block and the rare case to be the else block. Up to you.

It's not about speed or anything else; it's about consistency in our code in how we do these kinds of tests.
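For reference, the flipped layout requested above could be sketched as follows. Arr, Idx, to_object, and prepare are hypothetical stand-ins written in plain Python, not pandas names: the comparison values.dtype == categories.dtype stands in for is_dtype_equal, and to_object stands in for ensure_object. This only illustrates the branch order, not the code that was merged.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class Arr:
    """Hypothetical bare array: a payload plus a dtype label."""
    data: List[Any]
    dtype: str


@dataclass
class Idx:
    """Hypothetical Index-like wrapper exposing the bare array as .values."""
    values: Arr

    @property
    def dtype(self) -> str:
        return self.values.dtype


def to_object(obj):
    # Hypothetical stand-in for ensure_object: unwrap, then relabel as object.
    arr = getattr(obj, "values", obj)
    return Arr(list(arr.data), "object")


def prepare(values, categories):
    if values.dtype == categories.dtype:
        # Equal dtypes: take the underlying arrays so nothing is re-inferred.
        values = getattr(values, "values", values)
        categories = getattr(categories, "values", categories)
    else:
        # Mismatched dtypes: fall back to object for both sides.
        values = to_object(values)
        categories = to_object(categories)
    return values, categories


same_v, same_c = prepare(Arr([], "object"), Idx(Arr([True], "object")))
diff_v, diff_c = prepare(Arr([1], "int64"), Idx(Arr(["a"], "object")))
```

In both branches the two results end up as bare arrays that share a dtype label, which is the invariant the rest of the function relies on.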

codecov · 2018-09-19T13:12:49Z

Codecov Report

Merging #22710 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #22710      +/-   ##
==========================================
+ Coverage   92.17%   92.17%   +<.01%     
==========================================
  Files         169      169              
  Lines       50769    50771       +2     
==========================================
+ Hits        46798    46800       +2     
  Misses       3971     3971

Flag	Coverage Δ
#multiple	`90.59% <100%> (ø)`	⬆️
#single	`42.32% <100%> (ø)`	⬆️

Impacted Files	Coverage Δ
pandas/core/arrays/categorical.py	`95.75% <100%> (+0.01%)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 40dfadd...58c9d54. Read the comment docs.

pganssle · 2018-09-19T14:04:11Z

All issues should be addressed now. I'm guessing the azure pipeline failure is unrelated. I can rebase against master if it's fixed in master.

TomAugspurger · 2018-09-19T14:07:27Z

Currently working on the azure stuff, you can safely ignore it.

jreback · 2018-09-19T15:07:21Z

lgtm. merging after azure updates by @TomAugspurger

TomAugspurger · 2018-09-19T15:50:38Z

Sorry about all the noise, your PR is my guinea pig :)

Things should be OK now. Ping on green.

Fixes GH pandas-dev#22702.

pganssle · 2018-09-20T13:33:32Z

@TomAugspurger Green now.

TomAugspurger · 2018-09-20T13:40:13Z

Thanks!

…#22710) * TST: Add failing test for empty bool Categoricals * BUG: Failure in empty boolean CategoricalIndex Fixes GH pandas-dev#22702.

pganssle added a commit to pganssle/pandas that referenced this pull request Sep 14, 2018

Add whatsnew entry for PR pandas-dev#22710

f1fec08

gfyoung added the Bug, Dtype Conversions (Unexpected or buggy dtype conversions), and Categorical (Categorical Data Type) labels Sep 14, 2018

gfyoung approved these changes Sep 14, 2018

gfyoung reviewed Sep 14, 2018

pganssle force-pushed the categorical_bools branch from f1fec08 to 8d85e9d Compare September 14, 2018 23:44

pganssle added a commit to pganssle/pandas that referenced this pull request Sep 14, 2018

Add whatsnew entry for PR pandas-dev#22710

8d85e9d

jschendel reviewed Sep 15, 2018

pandas/tests/arrays/categorical/test_constructors.py (outdated)

jreback requested changes Sep 15, 2018

pganssle force-pushed the categorical_bools branch from 8d85e9d to 9905063 Compare September 15, 2018 14:13

pganssle added a commit to pganssle/pandas that referenced this pull request Sep 15, 2018

Add whatsnew entry for PR pandas-dev#22710

9905063

jreback requested changes Sep 18, 2018

jreback added this to the 0.24.0 milestone Sep 18, 2018

pganssle force-pushed the categorical_bools branch from 9905063 to cfeb81e Compare September 19, 2018 13:12

pganssle added a commit to pganssle/pandas that referenced this pull request Sep 19, 2018

Add whatsnew entry for PR pandas-dev#22710

cfeb81e

jreback approved these changes Sep 19, 2018

pganssle force-pushed the categorical_bools branch from c1ff9e0 to 58c9d54 Compare September 19, 2018 16:55

pganssle added 3 commits September 19, 2018 12:55

TST: Add failing test for empty bool Categoricals

ad4eefd

BUG: Failure in empty boolean CategoricalIndex

be101ae

Fixes GH pandas-dev#22702.

Add whatsnew entry for PR pandas-dev#22710

58c9d54

TomAugspurger merged commit 117d0b1 into pandas-dev:master Sep 20, 2018
