Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Empty CategoricalIndex fails with boolean categories #22710

Merged
merged 3 commits into from
Sep 20, 2018

Conversation

pganssle
Copy link
Contributor

@pganssle pganssle commented Sep 14, 2018

Fixes #22702.

This bug was introduced in 7818486859d1aba53; per my comment, the problem is here:

    if not is_dtype_equal(values.dtype, categories.dtype):
        values = ensure_object(values)
        categories = ensure_object(categories)

    (hash_klass, vec_klass), vals = _get_data_algo(values, _hashtables)
    (_, _), cats = _get_data_algo(categories, _hashtables)

When categories is Index([True], dtype='object') and values is array([], dtype='object'), the ensure_object call is bypassed, but in _get_data_algo, an Index consisting entirely of boolean objects will be coerced to uint64, which violates the assumption that values and categories have the same dtype.

I felt that retrieving the underlying numpy arrays (if any exist) is the safest way to handle this without having too many wide-reaching effects across the rest of the codebase, but there might be a better way to enforce that these are not coerced into different data types.

  • closes #xxxx
  • tests added
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

@pep8speaks
Copy link

Hello @pganssle! Thanks for submitting the PR.

pganssle added a commit to pganssle/pandas that referenced this pull request Sep 14, 2018
@gfyoung gfyoung added Bug Dtype Conversions Unexpected or buggy dtype conversions Categorical Categorical Data Type labels Sep 14, 2018
Copy link
Member

@gfyoung gfyoung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -26,7 +26,7 @@ Fixed Regressions
- Calling :meth:`DataFrameGroupBy.rank` and :meth:`SeriesGroupBy.rank` with empty groups
and ``pct=True`` was raising a ``ZeroDivisionError`` due to `c1068d9
<https://github.com/pandas-dev/pandas/commit/c1068d9d242c22cb2199156f6fb82eb5759178ae>`_ (:issue:`22519`)
-
- Constructing a :class:`pd.CategoricalIndex` with empty values and boolean categories was raising a ``ValueError`` after a change to dtype coercion in `78184868 <https://github.com/pandas-dev/pandas/commit/7818486859d1aba53ce359b93cfc772e688958e5>`_ (:issue:`22702`).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because this wasn't introduced in 0.23.x (but rather in 0.21.x), let's move this to 0.24.0.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, moved.

pganssle added a commit to pganssle/pandas that referenced this pull request Sep 14, 2018
# These may be Index, in which case their dtype would be coerced
# as part of _get_data_algo; in that case, we should retrieve the
# underlying numpy array instead.
values = getattr(values, 'values', values)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can just use np.asarray here as we have to re-infer anyhow. But pls check perf of this. The comment can be a bit simpler, and reference the gh issue number

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In [4]: categories = pd.Index([True, False], dtype='object')

In [5]: values = np.array([], dtype='object')

In [6]: %timeit np.asarray(values)
260 ns ± 1.41 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [7]: %timeit np.asarray(categories)
2.2 µs ± 21.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [8]: %timeit getattr(values, 'values', values)
312 ns ± 1.26 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [9]: %timeit getattr(categories, 'values', categories)
410 ns ± 4.23 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [10]: categories = pd.Index(list('abcdefghijklmnopqrstuvwxyz'), dtype='object')

In [11]: categories
Out[11]: 
Index(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n',
       'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z'],
      dtype='object')

In [12]: %timeit np.asarray(categories)
2.09 µs ± 9.76 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [13]: %timeit getattr(categories, 'values', categories)
409 ns ± 6.44 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each

Looks like the way I've done it is faster, but np.asarray doesn't seem to be scaling with the number of categories, so I don't think it's a huge deal to use np.asarray. That said, I don't love the idea of using np.asarray in this situation, because although I can't think of a time when it would present a problem, np.asarray is allowed to infer dtype. The way I've done it you're guaranteed to either get the original thing back or the underlying numpy array, which is pretty much as light a touch as I can imagine.

doc/source/whatsnew/v0.24.0.txt Outdated Show resolved Hide resolved
pganssle added a commit to pganssle/pandas that referenced this pull request Sep 15, 2018
@pganssle
Copy link
Contributor Author

I've pushed new changes that address all comments (though I did not switch over to asarray).

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tiny code change, otherwise lgtm. ping on green.

@@ -2439,11 +2439,15 @@ def _get_codes_for_values(values, categories):
"""
utility routine to turn values into codes given the specified categories
"""

from pandas.core.algorithms import _get_data_algo, _hashtables
if not is_dtype_equal(values.dtype, categories.dtype):
values = ensure_object(values)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you flip this block around, e.g.

if is_dtype_equal(...):
    # your added code
else:
    # ensure object

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can do that if you want. One thing to note is that I believe if not is_dtype_equal case is probably the "common case", since it doesn't occur with many dtypes like integers or strings.

I don't think there's any speed advantage to being in the if branch vs. the else branch, but depending on convention you may prefer the common case to be the first block and the rare case to be the else block. Up to you.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its not about speed or anything else, its about consistency in our code of how we do these kinds of tests.

@jreback jreback added this to the 0.24.0 milestone Sep 18, 2018
pganssle added a commit to pganssle/pandas that referenced this pull request Sep 19, 2018
@codecov
Copy link

codecov bot commented Sep 19, 2018

Codecov Report

Merging #22710 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #22710      +/-   ##
==========================================
+ Coverage   92.17%   92.17%   +<.01%     
==========================================
  Files         169      169              
  Lines       50769    50771       +2     
==========================================
+ Hits        46798    46800       +2     
  Misses       3971     3971
Flag Coverage Δ
#multiple 90.59% <100%> (ø) ⬆️
#single 42.32% <100%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/arrays/categorical.py 95.75% <100%> (+0.01%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 40dfadd...58c9d54. Read the comment docs.

@pganssle
Copy link
Contributor Author

All issues should be addressed now. I'm guessing the azure pipeline failure is unrelated. I can rebase against master if it's fixed in master.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Sep 19, 2018 via email

@jreback
Copy link
Contributor

jreback commented Sep 19, 2018

lgtm. merging after azure updates by @TomAugspurger

@TomAugspurger
Copy link
Contributor

Sorry about all the noise, your PR is my guinea pig :)

Things should be OK now. Ping on green.

@pganssle
Copy link
Contributor Author

@TomAugspurger Green now.

@TomAugspurger TomAugspurger merged commit 117d0b1 into pandas-dev:master Sep 20, 2018
@TomAugspurger
Copy link
Contributor

Thanks!

Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018
…#22710)

* TST: Add failing test for empty bool Categoricals

* BUG: Failure in empty boolean CategoricalIndex

Fixes GH pandas-dev#22702.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Categorical Categorical Data Type Dtype Conversions Unexpected or buggy dtype conversions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ValueError in Categorical Constructor with empty data and boolean categories
6 participants