Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

is_bool_dtype for ExtensionArrays #22667

Merged
merged 12 commits into from
Sep 20, 2018

Conversation

TomAugspurger
Copy link
Contributor

Closes #22326
Closes #22665

Two commits: fe35002 partially implements an Arrow-backed ExtensionArray. We need this as we don't currently have any boolean-values EAs.

04029ac has the fix for the bug.

@TomAugspurger TomAugspurger added Indexing Related to indexing on series/frames, not to indexes themselves Dtype Conversions Unexpected or buggy dtype conversions ExtensionArray Extending pandas with custom dtypes or arrays. labels Sep 11, 2018
@TomAugspurger TomAugspurger added this to the 0.24.0 milestone Sep 11, 2018
@pep8speaks
Copy link

pep8speaks commented Sep 11, 2018

Hello @TomAugspurger! Thanks for updating the PR.

Line 128:13: W504 line break after binary operator

Comment last updated on September 20, 2018 at 14:00 Hours UTC

@TomAugspurger
Copy link
Contributor Author

The bot is wrong, right?

@jschendel
Copy link
Member

Looks like there's an inconsistency in is_bool_indexer with how NaN is handled.

Returns True for a Categorical with NaN:

In [2]: cat = pd.Categorical([True, False, np.nan])

In [3]: cat
Out[3]:
[True, False, NaN]
Categories (2, object): [False, True]

In [4]: pd.core.common.is_bool_indexer(cat)
Out[4]: True

Raises for an Index with NaN:

In [5]: idx = pd.Index([True, False, np.nan])

In [6]: idx
Out[6]: Index([True, False, nan], dtype='object')

In [7]: pd.core.common.is_bool_indexer(idx)
---------------------------------------------------------------------------
ValueError: cannot index with vector containing NA / NaN values

Also, it doesn't look like we have any existing tests for is_bool_indexer? Might be nice to have some tests around it, but can probably create a separate issue/pr for that.

@TomAugspurger
Copy link
Contributor Author

Hmm, that's unfortunate... Ideally we could avoid scanning the values but maybe that 's not possible here.

@codecov
Copy link

codecov bot commented Sep 12, 2018

Codecov Report

Merging #22667 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #22667      +/-   ##
==========================================
+ Coverage   92.17%   92.18%   +<.01%     
==========================================
  Files         169      169              
  Lines       50780    50794      +14     
==========================================
+ Hits        46809    46823      +14     
  Misses       3971     3971
Flag Coverage Δ
#multiple 90.59% <100%> (ø) ⬆️
#single 42.33% <50%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/dtypes/base.py 100% <100%> (ø) ⬆️
pandas/core/dtypes/dtypes.py 96.11% <100%> (+0.03%) ⬆️
pandas/core/dtypes/common.py 95.02% <100%> (+0.08%) ⬆️
pandas/core/common.py 97.44% <100%> (+0.05%) ⬆️
pandas/core/arrays/categorical.py 95.74% <0%> (-0.02%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 117d0b1...29b1370. Read the comment docs.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we document how we decide something is considered boolean? (I think it is now the dtype.kind == 'b' check?)

@TomAugspurger
Copy link
Contributor Author

Yes, fixed.

@@ -1626,6 +1640,9 @@ def is_bool_dtype(arr_or_dtype):
# guess this
return (arr_or_dtype.is_object and
arr_or_dtype.inferred_type == 'boolean')
elif is_extension_array_dtype(arr_or_dtype):
dtype = getattr(arr_or_dtype, 'dtype', arr_or_dtype)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should use is_bool_dtype

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is is_bool_dtype 😄

@@ -169,6 +169,9 @@ def kind(self):
the extension type cannot be represented as a built-in NumPy
type.

This affect whether the ExtensionArray can be used as a boolean
mask. ExtensionArrays with ``kind == 'b'`` can be boolean masks.
Copy link
Contributor

@jreback jreback Sep 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we should add _is_boolean as a property to EA base class (and default False), similar to how we have numeric introspection via _is_numeric=True for Integer & Decimal types

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this more directly tests the EA array (alternatively these could be on the Dtype itself), however I am -1 on using .kind == 'b' directlry for this purpose (and would prefer a method / property for this

Copy link
Contributor Author

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

however I am -1 on using .kind == 'b' directlry for this purpose

Why's that? It's the stated purpose of .kind. To be clear, I kind of share your opinion. My main reason for maybe not using .kind is if the typecodes used by NumPy aren't flexible enough for our needs. But I'm not sure.

So I'm not against adding a _is_boolean property, but would like a bit more convincing that it isn't covered by .kind already :)

@@ -1626,6 +1640,9 @@ def is_bool_dtype(arr_or_dtype):
# guess this
return (arr_or_dtype.is_object and
arr_or_dtype.inferred_type == 'boolean')
elif is_extension_array_dtype(arr_or_dtype):
dtype = getattr(arr_or_dtype, 'dtype', arr_or_dtype)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is is_bool_dtype 😄

@TomAugspurger
Copy link
Contributor Author

@jreback changed .kind to use a new _is_boolean property (False by default)

@TomAugspurger
Copy link
Contributor Author

All green.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good. small question.

and contains missing values.
"""
na_msg = 'cannot index with vector containing NA / NaN values'
if (isinstance(key, (ABCSeries, np.ndarray, ABCIndex)) or
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this if clause necessary? e.g. an EA type cannot match key.dtype == np.object_ (which actually should be is_object_dtype

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It my be redundant with the is_array_like(key). But IIRC the tests here were fairly light, and I don't want to risk breaking the old behavior.

if isinstance(arr_or_dtype, CategoricalDtype):
arr_or_dtype = arr_or_dtype.categories
# now we use the special definition for Index

if isinstance(arr_or_dtype, ABCIndexClass):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could be elif here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Has to be an if so that if someone passes a Categorical we go Categorical -> Categorical.categories (index) to this block. We want to go down here since Index has special rules.

import pandas.util.testing as tm
from pandas.tests.extension import base

pytest.importorskip('pyarrow', minversion="0.10.0")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

woa didn't realize this worked now :>

return pd.isna(self._data.to_pandas())

def take(self, indices, allow_fill=False, fill_value=None):
from pandas.core.algorithms import take
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could put import at the top

pandas/tests/extension/arrow/bool.py Show resolved Hide resolved
@TomAugspurger
Copy link
Contributor Author

Going to fix the whatsnew conflict on merge.

Speaking of which... are we going to wait around for circleci to catch up today, or are people OK with merging smallish PRs that haven't finished there?

@jorisvandenbossche
Copy link
Member

I would go ahead and merge

@TomAugspurger TomAugspurger merged commit e568fb0 into pandas-dev:master Sep 20, 2018
@TomAugspurger TomAugspurger deleted the ea-is-bool-dtype branch September 20, 2018 14:02
Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions ExtensionArray Extending pandas with custom dtypes or arrays. Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: CategoricalIndex can't be a boolen mask Clarify is_bool_indexer for Extension dtypes
5 participants