-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
is_bool_dtype for ExtensionArrays #22667
is_bool_dtype for ExtensionArrays #22667
Conversation
Hello @TomAugspurger! Thanks for updating the PR.
Comment last updated on September 20, 2018 at 14:00 Hours UTC |
The bot is wrong, right? |
Looks like there's an inconsistency in Returns In [2]: cat = pd.Categorical([True, False, np.nan])
In [3]: cat
Out[3]:
[True, False, NaN]
Categories (2, object): [False, True]
In [4]: pd.core.common.is_bool_indexer(cat)
Out[4]: True Raises for an In [5]: idx = pd.Index([True, False, np.nan])
In [6]: idx
Out[6]: Index([True, False, nan], dtype='object')
In [7]: pd.core.common.is_bool_indexer(idx)
---------------------------------------------------------------------------
ValueError: cannot index with vector containing NA / NaN values Also, it doesn't look like we have any existing tests for |
Hmm, that's unfortunate... Ideally we could avoid scanning the values but maybe that 's not possible here. |
Codecov Report
@@ Coverage Diff @@
## master #22667 +/- ##
==========================================
+ Coverage 92.17% 92.18% +<.01%
==========================================
Files 169 169
Lines 50780 50794 +14
==========================================
+ Hits 46809 46823 +14
Misses 3971 3971
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we document how we decide something is considered boolean? (I think it is now the dtype.kind == 'b' check?)
Yes, fixed. |
@@ -1626,6 +1640,9 @@ def is_bool_dtype(arr_or_dtype): | |||
# guess this | |||
return (arr_or_dtype.is_object and | |||
arr_or_dtype.inferred_type == 'boolean') | |||
elif is_extension_array_dtype(arr_or_dtype): | |||
dtype = getattr(arr_or_dtype, 'dtype', arr_or_dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should use is_bool_dtype
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is is_bool_dtype
😄
pandas/core/dtypes/base.py
Outdated
@@ -169,6 +169,9 @@ def kind(self): | |||
the extension type cannot be represented as a built-in NumPy | |||
type. | |||
|
|||
This affect whether the ExtensionArray can be used as a boolean | |||
mask. ExtensionArrays with ``kind == 'b'`` can be boolean masks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we should add _is_boolean
as a property to EA base class (and default False), similar to how we have numeric introspection via _is_numeric=True
for Integer & Decimal types
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this more directly tests the EA array (alternatively these could be on the Dtype itself), however I am -1 on using .kind == 'b'
directlry for this purpose (and would prefer a method / property for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
however I am -1 on using .kind == 'b' directlry for this purpose
Why's that? It's the stated purpose of .kind
. To be clear, I kind of share your opinion. My main reason for maybe not using .kind
is if the typecodes used by NumPy aren't flexible enough for our needs. But I'm not sure.
So I'm not against adding a _is_boolean
property, but would like a bit more convincing that it isn't covered by .kind
already :)
@@ -1626,6 +1640,9 @@ def is_bool_dtype(arr_or_dtype): | |||
# guess this | |||
return (arr_or_dtype.is_object and | |||
arr_or_dtype.inferred_type == 'boolean') | |||
elif is_extension_array_dtype(arr_or_dtype): | |||
dtype = getattr(arr_or_dtype, 'dtype', arr_or_dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is is_bool_dtype
😄
@jreback changed |
All green. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good. small question.
and contains missing values. | ||
""" | ||
na_msg = 'cannot index with vector containing NA / NaN values' | ||
if (isinstance(key, (ABCSeries, np.ndarray, ABCIndex)) or |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this if clause necessary? e.g. an EA type cannot match key.dtype == np.object_ (which actually should be is_object_dtype
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It my be redundant with the is_array_like(key)
. But IIRC the tests here were fairly light, and I don't want to risk breaking the old behavior.
if isinstance(arr_or_dtype, CategoricalDtype): | ||
arr_or_dtype = arr_or_dtype.categories | ||
# now we use the special definition for Index | ||
|
||
if isinstance(arr_or_dtype, ABCIndexClass): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could be elif here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Has to be an if
so that if someone passes a Categorical
we go Categorical
-> Categorical.categories (index)
to this block. We want to go down here since Index has special rules.
import pandas.util.testing as tm | ||
from pandas.tests.extension import base | ||
|
||
pytest.importorskip('pyarrow', minversion="0.10.0") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
woa didn't realize this worked now :>
pandas/tests/extension/arrow/bool.py
Outdated
return pd.isna(self._data.to_pandas()) | ||
|
||
def take(self, indices, allow_fill=False, fill_value=None): | ||
from pandas.core.algorithms import take |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could put import at the top
Going to fix the whatsnew conflict on merge. Speaking of which... are we going to wait around for circleci to catch up today, or are people OK with merging smallish PRs that haven't finished there? |
I would go ahead and merge |
Closes #22326
Closes #22665
Two commits: fe35002 partially implements an Arrow-backed ExtensionArray. We need this as we don't currently have any boolean-values EAs.
04029ac has the fix for the bug.