is_bool_dtype for ExtensionArrays #22667

TomAugspurger · 2018-09-11T19:28:36Z

Two commits: fe35002 partially implements an Arrow-backed ExtensionArray. We need this as we don't currently have any boolean-valued EAs.

04029ac has the fix for the bug.

Closes pandas-dev#22665 Closes pandas-dev#22326

pep8speaks · 2018-09-11T19:28:39Z

Hello @TomAugspurger! Thanks for updating the PR.

In the file pandas/core/common.py, following are the PEP8 issues :

Line 128:13: W504 line break after binary operator

There are no PEP8 issues in the file pandas/core/dtypes/base.py !
There are no PEP8 issues in the file pandas/core/dtypes/common.py !
There are no PEP8 issues in the file pandas/core/dtypes/dtypes.py !
There are no PEP8 issues in the file pandas/tests/arrays/categorical/test_indexing.py !
There are no PEP8 issues in the file pandas/tests/dtypes/test_dtypes.py !
There are no PEP8 issues in the file pandas/tests/extension/arrow/bool.py !
There are no PEP8 issues in the file pandas/tests/extension/arrow/test_bool.py !

Comment last updated on September 20, 2018 at 14:00 Hours UTC

TomAugspurger · 2018-09-11T19:33:52Z

The bot is wrong, right?

jschendel · 2018-09-11T21:58:32Z

Looks like there's an inconsistency in is_bool_indexer with how NaN is handled.

Returns True for a Categorical with NaN:

In [2]: cat = pd.Categorical([True, False, np.nan])

In [3]: cat
Out[3]:
[True, False, NaN]
Categories (2, object): [False, True]

In [4]: pd.core.common.is_bool_indexer(cat)
Out[4]: True

Raises for an Index with NaN:

In [5]: idx = pd.Index([True, False, np.nan])

In [6]: idx
Out[6]: Index([True, False, nan], dtype='object')

In [7]: pd.core.common.is_bool_indexer(idx)
---------------------------------------------------------------------------
ValueError: cannot index with vector containing NA / NaN values

Also, it doesn't look like we have any existing tests for is_bool_indexer? Might be nice to have some tests around it, but can probably create a separate issue/pr for that.
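
To make the inconsistency concrete, here is a small dependency-free sketch of one consistent policy: any container of bools plus missing values raises, regardless of container type. The helper name `classify_mask` and the policy itself are assumptions for illustration; this is not how pandas implements `is_bool_indexer`.

```python
import math


def _is_missing(value):
    """Return True for None or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def classify_mask(values):
    """Hypothetical helper: decide whether ``values`` can act as a boolean mask.

    Returns False as soon as a non-missing, non-bool element is found.
    Otherwise raises ValueError if any element is missing, and returns
    True when every element is a bool.
    """
    saw_missing = False
    for value in values:
        if _is_missing(value):
            saw_missing = True
        elif not isinstance(value, bool):
            return False
    if saw_missing:
        raise ValueError("cannot index with vector containing NA / NaN values")
    return True
```

Under this policy `classify_mask([True, False])` returns True and `classify_mask([True, False, float("nan")])` raises, whatever container the values came from. Note that it has to scan the values, which is the cost discussed in the reply below.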

TomAugspurger · 2018-09-12T16:25:06Z

Hmm, that's unfortunate... Ideally we could avoid scanning the values, but maybe that's not possible here.

codecov · 2018-09-12T18:35:07Z

Codecov Report

Merging #22667 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #22667      +/-   ##
==========================================
+ Coverage   92.17%   92.18%   +<.01%     
==========================================
  Files         169      169              
  Lines       50780    50794      +14     
==========================================
+ Hits        46809    46823      +14     
  Misses       3971     3971

| Flag | Coverage Δ | |
|---|---|---|
| #multiple | `90.59% <100%> (ø)` | ⬆️ |
| #single | `42.33% <50%> (-0.01%)` | ⬇️ |

| Impacted Files | Coverage Δ | |
|---|---|---|
| pandas/core/dtypes/base.py | `100% <100%> (ø)` | ⬆️ |
| pandas/core/dtypes/dtypes.py | `96.11% <100%> (+0.03%)` | ⬆️ |
| pandas/core/dtypes/common.py | `95.02% <100%> (+0.08%)` | ⬆️ |
| pandas/core/common.py | `97.44% <100%> (+0.05%)` | ⬆️ |
| pandas/core/arrays/categorical.py | `95.74% <0%> (-0.02%)` | ⬇️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 117d0b1...29b1370.

jorisvandenbossche

Should we document how we decide something is considered boolean? (I think it is now the dtype.kind == 'b' check?)

TomAugspurger · 2018-09-13T11:39:08Z

Yes, fixed.

jreback · 2018-09-13T11:59:53Z

pandas/core/dtypes/common.py

@@ -1626,6 +1640,9 @@ def is_bool_dtype(arr_or_dtype):
        # guess this
        return (arr_or_dtype.is_object and
                arr_or_dtype.inferred_type == 'boolean')
+    elif is_extension_array_dtype(arr_or_dtype):
+        dtype = getattr(arr_or_dtype, 'dtype', arr_or_dtype)


should use is_bool_dtype

jreback · 2018-09-13T16:18:38Z

pandas/core/dtypes/base.py

@@ -169,6 +169,9 @@ def kind(self):
        the extension type cannot be represented as a built-in NumPy
        type.

+        This affect whether the ExtensionArray can be used as a boolean
+        mask. ExtensionArrays with ``kind == 'b'`` can be boolean masks.


maybe we should add _is_boolean as a property to EA base class (and default False), similar to how we have numeric introspection via _is_numeric=True for Integer & Decimal types

jreback

this more directly tests the EA array (alternatively these could be on the Dtype itself), however I am -1 on using .kind == 'b' directly for this purpose (and would prefer a method / property for this)

TomAugspurger

however I am -1 on using .kind == 'b' directly for this purpose

Why's that? It's the stated purpose of .kind. To be clear, I kind of share your opinion. My main reason for maybe not using .kind is if the typecodes used by NumPy aren't flexible enough for our needs. But I'm not sure.

So I'm not against adding a _is_boolean property, but would like a bit more convincing that it isn't covered by .kind already :)
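
The trade-off under discussion can be sketched with toy classes. Everything below (`BaseDtype`, `ToyBoolDtype`, `ToyCategoricalDtype`, and the two checker functions) is hypothetical and only models the idea; none of it is the pandas implementation. It shows a case where a fixed NumPy-style `kind` code and an opt-in `_is_boolean` property give different answers.

```python
class BaseDtype:
    """Toy stand-in for an extension dtype base class (not pandas' class)."""

    kind = "O"  # NumPy-style type code; 'O' means generic object

    @property
    def _is_boolean(self):
        # Default: a dtype is not boolean unless it opts in.
        return False


class ToyBoolDtype(BaseDtype):
    """Toy boolean dtype: both the kind code and the property agree."""

    kind = "b"

    @property
    def _is_boolean(self):
        return True


class ToyCategoricalDtype(BaseDtype):
    """Toy categorical dtype: kind stays 'O', booleanness depends on categories."""

    def __init__(self, categories):
        self.categories = list(categories)

    @property
    def _is_boolean(self):
        # Computed per instance, which a single class-level kind code cannot do.
        return bool(self.categories) and all(
            isinstance(c, bool) for c in self.categories
        )


def is_bool_by_kind(dtype):
    """Check based only on the NumPy-style kind code."""
    return dtype.kind == "b"


def is_bool_by_property(dtype):
    """Check based on the opt-in property."""
    return dtype._is_boolean
```

For `ToyCategoricalDtype([False, True])` the kind-based check says no while the property-based check says yes, which is one way the NumPy typecodes might turn out to be less flexible than a dedicated property.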

TomAugspurger · 2018-09-13T16:28:02Z

pandas/core/dtypes/common.py

@@ -1626,6 +1640,9 @@ def is_bool_dtype(arr_or_dtype):
        # guess this
        return (arr_or_dtype.is_object and
                arr_or_dtype.inferred_type == 'boolean')
+    elif is_extension_array_dtype(arr_or_dtype):
+        dtype = getattr(arr_or_dtype, 'dtype', arr_or_dtype)


This is is_bool_dtype 😄

TomAugspurger · 2018-09-17T16:56:20Z

@jreback I changed the check from .kind to a new _is_boolean property (False by default).

TomAugspurger · 2018-09-17T21:43:39Z

All green.

jreback

looks good. small question.

jreback · 2018-09-18T11:30:36Z

pandas/core/common.py

+        and contains missing values.
+    """
+    na_msg = 'cannot index with vector containing NA / NaN values'
+    if (isinstance(key, (ABCSeries, np.ndarray, ABCIndex)) or


is this if clause necessary? e.g. an EA type cannot match key.dtype == np.object_ (which actually should be is_object_dtype)

TomAugspurger · 2018-09-20

It may be redundant with the is_array_like(key). But IIRC the tests here were fairly light, and I don't want to risk breaking the old behavior.

jreback · 2018-09-18T11:32:08Z

pandas/core/dtypes/common.py

+    if isinstance(arr_or_dtype, CategoricalDtype):
+        arr_or_dtype = arr_or_dtype.categories
+        # now we use the special definition for Index
+
    if isinstance(arr_or_dtype, ABCIndexClass):


could be elif here

TomAugspurger · 2018-09-20

Has to be an if so that if someone passes a Categorical we go Categorical -> Categorical.categories (index) to this block. We want to go down here since Index has special rules.
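
The control-flow point in this reply can be illustrated with a toy version of the two branches. `ToyIndex`, `ToyCategorical`, and both functions are hypothetical stand-ins, not pandas code; the sketch only shows why `elif` would skip the Index rules once the categorical branch has run.

```python
class ToyIndex:
    """Toy stand-in for an Index: boolean if every label is a bool."""

    def __init__(self, labels):
        self.labels = list(labels)


class ToyCategorical:
    """Toy stand-in for a categorical dtype whose categories are a ToyIndex."""

    def __init__(self, categories):
        self.categories = ToyIndex(categories)


def _index_is_bool(index):
    return bool(index.labels) and all(isinstance(x, bool) for x in index.labels)


def is_bool_with_if(obj):
    if isinstance(obj, ToyCategorical):
        obj = obj.categories  # swap in the categories...
    if isinstance(obj, ToyIndex):  # ...and fall through to the Index rules
        return _index_is_bool(obj)
    return False


def is_bool_with_elif(obj):
    if isinstance(obj, ToyCategorical):
        obj = obj.categories
    elif isinstance(obj, ToyIndex):  # skipped whenever the first branch ran
        return _index_is_bool(obj)
    return False
```

With `ToyCategorical([True, False])`, `is_bool_with_if` returns True because the swapped-in categories reach the Index branch, while `is_bool_with_elif` returns False because that branch is never evaluated.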

jreback · 2018-09-18T11:33:48Z

pandas/tests/extension/arrow/test_bool.py

+import pandas.util.testing as tm
+from pandas.tests.extension import base
+
+pytest.importorskip('pyarrow', minversion="0.10.0")


woa didn't realize this worked now :>

jreback · 2018-09-18T11:34:11Z

pandas/tests/extension/arrow/bool.py

+        return pd.isna(self._data.to_pandas())
+
+    def take(self, indices, allow_fill=False, fill_value=None):
+        from pandas.core.algorithms import take


could put import at the top

TomAugspurger · 2018-09-20T13:54:00Z

Going to fix the whatsnew conflict on merge.

Speaking of which... are we going to wait around for circleci to catch up today, or are people OK with merging smallish PRs that haven't finished there?

jorisvandenbossche · 2018-09-20T13:57:21Z

I would go ahead and merge

TomAugspurger added 2 commits September 11, 2018 14:26

TST: Arrow-backed BoolArray

1f87ddd

BUG: EA-backed boolean indexers

47da6d3

Closes pandas-dev#22665 Closes pandas-dev#22326

TomAugspurger added the Indexing, Dtype Conversions, and ExtensionArray labels Sep 11, 2018

TomAugspurger added this to the 0.24.0 milestone Sep 11, 2018

lint and skip

9d4eab6

Handle NAs

35f0575

jorisvandenbossche reviewed Sep 13, 2018

TomAugspurger added 2 commits September 13, 2018 06:32

Merge remote-tracking branch 'upstream/master' into ea-is-bool-dtype

412bd22

Document

20b2add

Merge remote-tracking branch 'upstream/master' into ea-is-bool-dtype

27b8b68

jreback requested changes Sep 13, 2018

jreback reviewed Sep 13, 2018

TomAugspurger commented Sep 13, 2018

TomAugspurger added 2 commits September 17, 2018 11:47

Merge remote-tracking branch 'upstream/master' into ea-is-bool-dtype

c94d235

kind -> attribute

b9c45bd

jreback requested changes Sep 18, 2018

TomAugspurger added 2 commits September 20, 2018 06:18

update

4d09509

Merge remote-tracking branch 'upstream/master' into ea-is-bool-dtype

d8bd054

jorisvandenbossche approved these changes Sep 20, 2018

Merge remote-tracking branch 'upstream/master' into ea-is-bool-dtype

29b1370

TomAugspurger merged commit e568fb0 into pandas-dev:master Sep 20, 2018

TomAugspurger deleted the ea-is-bool-dtype branch September 20, 2018 14:02

Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018

is_bool_dtype for ExtensionArrays (pandas-dev#22667)

4f44f84

Closes pandas-dev#22665 Closes pandas-dev#22326
