is_bool_dtype for ExtensionArrays #22667

Merged: 12 commits, Sep 20, 2018
2 changes: 2 additions & 0 deletions doc/source/whatsnew/v0.24.0.txt
@@ -484,6 +484,7 @@ ExtensionType Changes
- ``ExtensionArray`` has gained the abstract methods ``.dropna()`` (:issue:`21185`)
- ``ExtensionDtype`` has gained the ability to instantiate from string dtypes, e.g. ``decimal`` would instantiate a registered ``DecimalDtype``; furthermore
the ``ExtensionDtype`` has gained the method ``construct_array_type`` (:issue:`21185`)
- An ``ExtensionArray`` with a boolean dtype now works correctly as a boolean indexer. :meth:`pandas.api.types.is_bool_dtype` now properly considers them boolean (:issue:`22326`)
- Added ``ExtensionDtype._is_numeric`` for controlling whether an extension dtype is considered numeric (:issue:`22290`).
- The ``ExtensionArray`` constructor, ``_from_sequence``, now takes the keyword arg ``copy=False`` (:issue:`21185`)
- Bug in :meth:`Series.get` for ``Series`` using ``ExtensionArray`` and integer index (:issue:`21257`)
@@ -609,6 +610,7 @@ Categorical
^^^^^^^^^^^

- Bug in :meth:`Categorical.from_codes` where ``NaN`` values in ``codes`` were silently converted to ``0`` (:issue:`21767`). In the future this will raise a ``ValueError``. Also changes the behavior of ``.from_codes([1.1, 2.0])``.
- Bug when indexing with a boolean-valued ``Categorical``. Now a boolean-valued ``Categorical`` is treated as a boolean mask (:issue:`22665`)

Datetimelike
^^^^^^^^^^^^
Expand Down
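The two whatsnew entries above (the ``ExtensionArray`` indexer change and the boolean-valued ``Categorical`` mask) can be illustrated with a short sketch. It assumes a pandas version that includes this change (0.24.0 or later); the variable names are illustrative and not taken from the PR.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_bool_dtype

s = pd.Series(range(3))
mask = pd.Categorical([True, False, True])

# A boolean-valued Categorical is reported as having a boolean dtype ...
assert is_bool_dtype(mask)

# ... and is treated as a boolean mask when used as an indexer.
result = s[mask]

# The same selection with an explicit NumPy boolean mask, for comparison.
expected = s[np.asarray(mask, dtype=bool)]
assert result.equals(expected)
```

Comparing against an explicit NumPy mask mirrors the approach of the test added in this PR, which compares against ``idx.astype('object')``.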
40 changes: 35 additions & 5 deletions pandas/core/common.py
@@ -15,7 +15,9 @@
from pandas import compat
from pandas.compat import iteritems, PY36, OrderedDict
from pandas.core.dtypes.generic import ABCSeries, ABCIndex, ABCIndexClass
-from pandas.core.dtypes.common import is_integer
+from pandas.core.dtypes.common import (
+    is_integer, is_bool_dtype, is_extension_array_dtype, is_array_like
+)
from pandas.core.dtypes.inference import _iterable_not_string
from pandas.core.dtypes.missing import isna, isnull, notnull # noqa
from pandas.core.dtypes.cast import construct_1d_object_array_from_listlike
@@ -100,17 +102,45 @@ def maybe_box_datetimelike(value):


def is_bool_indexer(key):
-if isinstance(key, (ABCSeries, np.ndarray, ABCIndex)):
# type: (Any) -> bool
"""
Check whether `key` is a valid boolean indexer.

Parameters
----------
key : Any
Only list-likes may be considered boolean indexers.
All other types are not considered a boolean indexer.
For array-like input, boolean ndarrays or ExtensionArrays
with a boolean kind are considered boolean indexers.

Returns
-------
bool

Raises
------
ValueError
When the array is an object-dtype ndarray or ExtensionArray
and contains missing values.
"""
na_msg = 'cannot index with vector containing NA / NaN values'
if (isinstance(key, (ABCSeries, np.ndarray, ABCIndex)) or
Contributor

is this if clause necessary? e.g. an EA type cannot match key.dtype == np.object_ (which actually should be is_object_dtype)

Contributor Author

It may be redundant with the is_array_like(key). But IIRC the tests here were fairly light, and I don't want to risk breaking the old behavior.

(is_array_like(key) and is_extension_array_dtype(key.dtype))):
if key.dtype == np.object_:
key = np.asarray(values_from_object(key))

if not lib.is_bool_array(key):
if isna(key).any():
-raise ValueError('cannot index with vector containing '
-'NA / NaN values')
+raise ValueError(na_msg)
return False
return True
-elif key.dtype == np.bool_:
+elif is_bool_dtype(key.dtype):
# an ndarray with bool-dtype by definition has no missing values.
# So we only need to check for NAs in ExtensionArrays
if is_extension_array_dtype(key.dtype):
if np.any(key.isna()):
raise ValueError(na_msg)
return True
elif isinstance(key, list):
try:
Expand Down
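The decision flow that the new `is_bool_indexer` docstring describes can be modelled in a few lines. The following is a simplified sketch, not the pandas implementation: `is_bool_mask`, `_is_missing`, and the toy `MaskedBool` class are hypothetical names, missing values are reduced to `None` and float NaN, and the list branch (truncated in the diff above) is left out.

```python
import numpy as np

NA_MSG = "cannot index with vector containing NA / NaN values"


def _is_missing(x):
    # None and float NaN count as missing in this simplified model.
    return x is None or (isinstance(x, float) and x != x)


def is_bool_mask(key):
    """Hypothetical, simplified model of the check in this diff."""
    dtype = getattr(key, "dtype", None)
    if dtype is None:
        # Objects without a dtype are not handled by this sketch.
        return False
    if dtype == np.object_:
        values = np.asarray(key, dtype=object)
        if len(values) > 0 and all(
            isinstance(x, (bool, np.bool_)) for x in values
        ):
            return True
        if any(_is_missing(x) for x in values):
            raise ValueError(NA_MSG)
        return False
    if getattr(dtype, "kind", None) == "b":
        # NumPy bool arrays cannot hold missing values; extension-style
        # arrays expose isna(), so only those need the extra check.
        isna = getattr(key, "isna", None)
        if callable(isna) and np.any(isna()):
            raise ValueError(NA_MSG)
        return True
    return False


class MaskedBool:
    """Toy stand-in for a boolean ExtensionArray (not a pandas class)."""

    class _Dtype:
        kind = "b"

    dtype = _Dtype()

    def __init__(self, values):
        self._values = list(values)

    def isna(self):
        return np.array([v is None for v in self._values])


print(is_bool_mask(np.array([True, False])))    # True
print(is_bool_mask(np.array([1, 2])))           # False
print(is_bool_mask(MaskedBool([True, False])))  # True
```

The point of the model is the asymmetry between the two branches: object-dtype input has to be scanned element by element, whereas for a boolean dtype only an extension-style array can contain missing values at all.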
3 changes: 3 additions & 0 deletions pandas/core/dtypes/base.py
@@ -169,6 +169,9 @@ def kind(self):
the extension type cannot be represented as a built-in NumPy
type.

This affects whether the ExtensionArray can be used as a boolean
mask. ExtensionArrays with ``kind == 'b'`` can be boolean masks.
Contributor

@jreback, Sep 13, 2018

maybe we should add _is_boolean as a property to EA base class (and default False), similar to how we have numeric introspection via _is_numeric=True for Integer & Decimal types


See Also
--------
numpy.dtype.kind
Expand Down
17 changes: 17 additions & 0 deletions pandas/core/dtypes/common.py
@@ -1592,6 +1592,11 @@ def is_bool_dtype(arr_or_dtype):
-------
boolean : Whether or not the array or dtype is of a boolean dtype.

Notes
-----
An ExtensionArray is considered boolean when the ``.kind`` of the
dtype is ``'b'``.

Examples
--------
>>> is_bool_dtype(str)
@@ -1608,6 +1613,8 @@ def is_bool_dtype(arr_or_dtype):
False
>>> is_bool_dtype(np.array([True, False]))
True
>>> is_bool_dtype(pd.Categorical([True, False]))
True
"""

if arr_or_dtype is None:
@@ -1618,6 +1625,13 @@ def is_bool_dtype(arr_or_dtype):
# this isn't even a dtype
return False

if isinstance(arr_or_dtype, (ABCCategorical, ABCCategoricalIndex)):
arr_or_dtype = arr_or_dtype.dtype

if isinstance(arr_or_dtype, CategoricalDtype):
arr_or_dtype = arr_or_dtype.categories
# now we use the special definition for Index

if isinstance(arr_or_dtype, ABCIndexClass):
Contributor

could be elif here

Contributor Author

Has to be an if so that if someone passes a Categorical we go Categorical -> Categorical.categories (index) to this block. We want to go down here since Index has special rules.


# TODO(jreback)
@@ -1626,6 +1640,9 @@ def is_bool_dtype(arr_or_dtype):
# guess this
return (arr_or_dtype.is_object and
arr_or_dtype.inferred_type == 'boolean')
elif is_extension_array_dtype(arr_or_dtype):
dtype = getattr(arr_or_dtype, 'dtype', arr_or_dtype)
Contributor

should use is_bool_dtype

Contributor Author

This is is_bool_dtype 😄

return dtype.kind == 'b'

return issubclass(tipo, np.bool_)

Expand Down
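The fall-through discussed in the review thread (Categorical, then its dtype, then its categories, which are an Index and therefore reach the Index-specific rules) means that the array, its dtype, and a ``CategoricalIndex`` should all give the same answer. A minimal sketch, assuming a pandas version that includes this change; the variable names are illustrative:

```python
import pandas as pd
from pandas.api.types import is_bool_dtype

bool_cat = pd.Categorical([True, False])
str_cat = pd.Categorical(["a", "b"])

# The array, its dtype, and a CategoricalIndex all resolve to the
# categories, so they are expected to agree with one another.
from_array = is_bool_dtype(bool_cat)
from_dtype = is_bool_dtype(bool_cat.dtype)
from_index = is_bool_dtype(pd.CategoricalIndex(bool_cat))

# Categories that are not boolean do not make the Categorical boolean.
from_strings = is_bool_dtype(str_cat)

print(from_array, from_dtype, from_index, from_strings)
```

This is why the author keeps a plain ``if`` rather than ``elif`` for the Index check: the earlier branches only rewrite ``arr_or_dtype`` and rely on the later branch to make the decision.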
27 changes: 26 additions & 1 deletion pandas/tests/arrays/categorical/test_indexing.py
@@ -5,7 +5,8 @@
import numpy as np

import pandas.util.testing as tm
-from pandas import Categorical, Index, CategoricalIndex, PeriodIndex
+from pandas import Categorical, Index, CategoricalIndex, PeriodIndex, Series
import pandas.core.common as com
from pandas.tests.arrays.categorical.common import TestCategorical


@@ -121,3 +122,27 @@ def test_get_indexer_non_unique(self, idx_values, key_values, key_class):

tm.assert_numpy_array_equal(expected, result)
tm.assert_numpy_array_equal(exp_miss, res_miss)


@pytest.mark.parametrize("index", [True, False])
def test_mask_with_boolean(index):
s = Series(range(3))
idx = Categorical([True, False, True])
if index:
idx = CategoricalIndex(idx)

assert com.is_bool_indexer(idx)
result = s[idx]
expected = s[idx.astype('object')]
tm.assert_series_equal(result, expected)


@pytest.mark.parametrize("index", [True, False])
def test_mask_with_boolean_raises(index):
s = Series(range(3))
idx = Categorical([True, False, None])
if index:
idx = CategoricalIndex(idx)

with tm.assert_raises_regex(ValueError, 'NA / NaN'):
s[idx]
Empty file.
99 changes: 99 additions & 0 deletions pandas/tests/extension/arrow/bool.py
@@ -0,0 +1,99 @@
import copy
import itertools

import numpy as np
import pyarrow as pa
import pandas as pd
from pandas.api.extensions import (
ExtensionDtype, ExtensionArray
)


# @register_extension_dtype
class ArrowBoolDtype(ExtensionDtype):

type = np.bool_
kind = 'b'
name = 'arrow_bool'
na_value = pa.NULL

@classmethod
def construct_from_string(cls, string):
if string == cls.name:
return cls()
else:
raise TypeError("Cannot construct a '{}' from "
"'{}'".format(cls, string))

@classmethod
def construct_array_type(cls):
return ArrowBoolArray


class ArrowBoolArray(ExtensionArray):
def __init__(self, values):
if not isinstance(values, pa.ChunkedArray):
raise ValueError

assert values.type == pa.bool_()
self._data = values
self._dtype = ArrowBoolDtype()

def __repr__(self):
return "ArrowBoolArray({})".format(repr(self._data))

@classmethod
def from_scalars(cls, values):
arr = pa.chunked_array([pa.array(np.asarray(values))])
return cls(arr)

@classmethod
def from_array(cls, arr):
assert isinstance(arr, pa.Array)
return cls(pa.chunked_array([arr]))

@classmethod
def _from_sequence(cls, scalars, dtype=None, copy=False):
return cls.from_scalars(scalars)

def __getitem__(self, item):
return self._data.to_pandas()[item]

def __len__(self):
return len(self._data)

@property
def dtype(self):
return self._dtype

@property
def nbytes(self):
return sum(x.size for chunk in self._data.chunks
for x in chunk.buffers()
if x is not None)

def isna(self):
return pd.isna(self._data.to_pandas())

def take(self, indices, allow_fill=False, fill_value=None):
from pandas.core.algorithms import take
Contributor

could put import at the top

data = self._data.to_pandas()

if allow_fill and fill_value is None:
fill_value = self.dtype.na_value

result = take(data, indices, fill_value=fill_value,
allow_fill=allow_fill)
return self._from_sequence(result, dtype=self.dtype)

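def copy(self, deep=False):
# Wrap the copied data so that copy() returns an ArrowBoolArray
# rather than the underlying pyarrow ChunkedArray.
if deep:
return type(self)(copy.deepcopy(self._data))
else:
return type(self)(copy.copy(self._data))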

@classmethod
def _concat_same_type(cls, to_concat):
chunks = list(itertools.chain.from_iterable(x._data.chunks
for x in to_concat))
arr = pa.chunked_array(chunks)
return cls(arr)
48 changes: 48 additions & 0 deletions pandas/tests/extension/arrow/test_bool.py
@@ -0,0 +1,48 @@
import numpy as np
import pytest
import pandas as pd
import pandas.util.testing as tm
from pandas.tests.extension import base

pytest.importorskip('pyarrow', minversion="0.10.0")
Contributor

woa didn't realize this worked now :>


from .bool import ArrowBoolDtype, ArrowBoolArray


@pytest.fixture
def dtype():
return ArrowBoolDtype()


@pytest.fixture
def data():
return ArrowBoolArray.from_scalars(np.random.randint(0, 2, size=100,
dtype=bool))


class BaseArrowTests(object):
pass


class TestDtype(BaseArrowTests, base.BaseDtypeTests):
def test_array_type_with_arg(self, data, dtype):
pytest.skip("GH-22666")


class TestInterface(BaseArrowTests, base.BaseInterfaceTests):
def test_repr(self, data):
pytest.skip("TODO")


class TestConstructors(BaseArrowTests, base.BaseConstructorsTests):
def test_from_dtype(self, data):
pytest.skip("GH-22666")


def test_is_bool_dtype(data):
assert pd.api.types.is_bool_dtype(data)
assert pd.core.common.is_bool_indexer(data)
s = pd.Series(range(len(data)))
result = s[data]
expected = s[np.asarray(data)]
tm.assert_series_equal(result, expected)