BUG/Perf: Support ExtensionArrays in where #24114
Conversation
We need some way to do `.where` on EA objects for DatetimeArray. Adding it to the interface is, I think, the easiest way. Initially I started to write a version on ExtensionBlock, but it proved to be unwieldy to write a version that performed well for all types. It *may* be possible to do using `_ndarray_values`, but we'd need a few more things around that (missing values, converting an arbitrary array to the "same" ndarray_values, error handling, re-constructing). It seemed easier to push this down to the array. The implementation on ExtensionArray is readable, but likely slow since it'll involve a conversion to object-dtype. Closes pandas-dev#24077
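The object-dtype fallback described above can be sketched roughly like this (illustrative only, not the PR's exact code; `_from_sequence` is pandas' internal extension-array constructor, and `np.where` densifies the array to object dtype before reconstruction):

```python
import numpy as np
import pandas as pd

# Naive default: np.where densifies the extension array (object dtype
# for Categorical), then _from_sequence rebuilds an array of the
# original dtype. Readable, but slow for large arrays.
arr = pd.Categorical(["a", "b", "c"])
cond = np.array([True, False, True])
result = pd.Categorical._from_sequence(np.where(cond, arr, "a"),
                                       dtype=arr.dtype)
print(result)
```

The round-trip through an object ndarray is exactly why this default is slow, and why subclasses are expected to override it.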
Hello @TomAugspurger! Thanks for submitting the PR.
categorical perf: master
PR:
Codecov Report
@@ Coverage Diff @@
## master #24114 +/- ##
==========================================
+ Coverage 92.2% 92.2% +<.01%
==========================================
Files 162 162
Lines 51714 51782 +68
==========================================
+ Hits 47682 47747 +65
- Misses 4032 4035 +3
Continue to review full report at Codecov.
Codecov Report
@@ Coverage Diff @@
## master #24114 +/- ##
==========================================
+ Coverage 92.21% 92.21% +<.01%
==========================================
Files 162 162
Lines 51723 51761 +38
==========================================
+ Hits 47694 47731 +37
- Misses 4029 4030 +1
Continue to review full report at Codecov.
Hmm I don't like the return dtype depending on the values.
Perhaps we do this with a deprecation warning?
…On Wed, Dec 5, 2018 at 1:30 PM Jeff Reback ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In doc/source/whatsnew/v0.24.0.rst
<#24114 (comment)>:
> @@ -1262,6 +1264,7 @@ Categorical
- In meth:`Series.unstack`, specifying a ``fill_value`` not present in the categories now raises a ``TypeError`` rather than ignoring the ``fill_value`` (:issue:`23284`)
- Bug when resampling :meth:`Dataframe.resample()` and aggregating on categorical data, the categorical dtype was getting lost. (:issue:`23227`)
- Bug in many methods of the ``.str``-accessor, which always failed on calling the ``CategoricalIndex.str`` constructor (:issue:`23555`, :issue:`23556`)
+- Bug in :meth:`Series.where` losing the categorical dtype for categorical data (:issue:`24077`)
if it's in the categories then this should work and return categorical; if it's not, then i think coercing to object is ok
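The first case (masking with an NA `other`, which is always "in" the categories in the sense that it never needs coercion) can be illustrated with a minimal sketch of the post-fix behaviour:

```python
import pandas as pd

# With the fix, masking a categorical Series with the default NA `other`
# keeps the categorical dtype instead of coercing to object.
s = pd.Series(pd.Categorical(["a", "b", "c"]))
result = s.where(s != "b")
print(result.dtype)  # category
```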
pandas/core/arrays/base.py
Outdated
        Series.where : Similar method for Series.
        DataFrame.where : Similar method for DataFrame.
        """
        return type(self)._from_sequence(np.where(cond, self, other),
hmm this turns it into an array. we have much special handling for this (e.g. see .where for DTI). i think this needs to dispatch somehow.
oh I see you override things. ok then.
@jorisvandenbossche do you have any objections to adding `where` to the interface?
Just for context: how is this different from `eaarray[cond] = other`? The behaviour change to keep the categorical dtype is certainly fine.
I suppose that `np.where` would work on items that don't implement `__setitem__`. But other than that they should be identical for 1-d arrays, right?
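That equivalence for plain 1-d ndarrays can be checked directly (a small sketch, not PR code):

```python
import numpy as np

# Functional (np.where) vs in-place (__setitem__ on a copy): for 1-d
# arrays the two strategies give the same result.
a = np.array([1, 2, 3])
cond = np.array([True, False, True])
functional = np.where(cond, a, -1)
inplace = a.copy()
inplace[~cond] = -1
print(functional, inplace)
```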
…On Thu, Dec 6, 2018 at 9:44 AM Joris Van den Bossche < ***@***.***> wrote:
Just for context: how is this different from eaarray[cond] = other ?
The behaviour change to keep the categorical dtype is certainly fine.
And since EAs are 1D, and our internal EAs support setitem, why is the new code needed? Or what in setitem is not working as it should right now? (maybe I am missing some context)
This came out of the DatetimeArray refactor. I'll have to take another look at exactly what the failures were. They were pretty deep in the internals.
…On Thu, Dec 6, 2018 at 9:53 AM Joris Van den Bossche < ***@***.***> wrote:
But other than that they should be identical for 1-d arrays, right?
And since EAs are 1D, and our internal EAs support setitem, why is the new
code needed? Or what in setitem is not working as it should right now?
(maybe I am missing some context)
pandas/core/arrays/base.py
Outdated
@@ -661,6 +662,42 @@ def take(self, indices, allow_fill=False, fill_value=None):
        # pandas.api.extensions.take
        raise AbstractMethodError(self)

    def where(self, cond, other):
The other implementations of `where` (`DataFrame.where`, `Index.where`, etc.) have `other` default to NA. Do we want to maintain that convention here too?
pandas/core/arrays/interval.py
Outdated
            lother = other.left
            rother = other.right
        left = np.where(cond, self.left, lother)
        right = np.where(cond, self.right, rother)
`left`/`right` should have a `where` method, so might be a bit safer to do something like:

left = self.left.where(cond, lother)
right = self.right.where(cond, rother)

`np.where` looks like it can cause some problems depending on what `left`/`right` are:

In [2]: left = pd.date_range('2018', periods=3); left
Out[2]: DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03'], dtype='datetime64[ns]', freq='D')
In [3]: np.where([True, False, True], left, pd.NaT)
Out[3]: array([1514764800000000000, NaT, 1514937600000000000], dtype=object)
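For comparison, `Index.where` keeps the datetime64 dtype where `np.where` falls back to an object ndarray (a small sketch of the behaviour on recent pandas; exact object-array contents vary by version):

```python
import numpy as np
import pandas as pd

left = pd.date_range('2018', periods=3)
lost = np.where([True, False, True], left, pd.NaT)   # object ndarray
kept = left.where([True, False, True])               # stays DatetimeIndex
print(lost.dtype, kept.dtype)
```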
pandas/core/arrays/interval.py
Outdated
@@ -777,6 +777,17 @@ def take(self, indices, allow_fill=False, fill_value=None, axis=None,

        return self._shallow_copy(left_take, right_take)

    def where(self, cond, other):
Would be nice to have `IntervalIndex` use this implementation instead of the naive object array based implementation that it currently uses. Can certainly leave that for a follow-up PR though, and I'd be happy to do it.
pandas/core/arrays/interval.py
Outdated
@@ -777,6 +777,17 @@ def take(self, indices, allow_fill=False, fill_value=None, axis=None,

        return self._shallow_copy(left_take, right_take)

    def where(self, cond, other):
        if is_scalar(other) and isna(other):
            lother = rother = other
To be safe, I think this should be `lother = rother = self.left._na_value` to ensure that we're filling `left`/`right` with the correct NA value. If we use `left/right.where` instead of `np.where` this should be handled automatically iirc, so could maybe just do that instead.
pandas/core/arrays/interval.py
Outdated
    def where(self, cond, other):
        if is_scalar(other) and isna(other):
            lother = rother = other
        else:
Can you make this an `elif` that checks that `other` is interval-like (something like `isinstance(other, Interval) or is_interval_dtype(other)`), then have an `else` clause that raises a `ValueError` saying `other` must be interval-like? As written I think this would raise a somewhat unclear `AttributeError` in `self._check_closed_matches`, since it assumes `other.closed` exists.
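The suggested validation could look roughly like this (a hypothetical sketch of the reviewer's suggestion; `validate_other` is an invented helper name, not PR code):

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_interval_dtype

def validate_other(other):
    # Hypothetical helper: accept NA scalars and interval-like values,
    # raise a clear ValueError for anything else.
    if np.ndim(other) == 0 and pd.isna(other):
        return other, other
    elif isinstance(other, pd.Interval) or is_interval_dtype(other):
        return other.left, other.right
    else:
        raise ValueError("'other' must be interval-like")

print(validate_other(pd.Interval(0, 1)))
```

This turns the unclear `AttributeError` into an explicit `ValueError` at the validation boundary.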
On further reflection, I realize that ndarrays don't have a `where` method. I'll see if setitem on a copy is sufficient.
@@ -501,10 +501,13 @@ def _can_reindex(self, indexer):

    @Appender(_index_shared_docs['where'])
    def where(self, cond, other=None):
        # TODO: Investigate an alternative implementation with
            # for the type
            other = self.dtype.na_value

        if is_sparse(self.values):
Without this, we fail in the `result = self._holder._from_sequence(np.where(cond, self.values, other), dtype=dtype)` call, since the `where` may change the dtype if NaN is introduced. Implementing `SparseArray.__setitem__` would allow us to remove this block.
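The dtype change being described can be seen directly (a sketch, not PR code):

```python
import numpy as np
import pandas as pd

# Introducing NaN via np.where changes the underlying dtype, so the
# original Sparse[int64] dtype can no longer be reused as-is.
sp = pd.arrays.SparseArray([1, 2, 3])
print(sp.dtype)                                   # Sparse[int64, ...]
dense = np.where([True, False, True], sp, np.nan)
print(dense.dtype)                                # float64
```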
this should be an overriding method in Sparse then, not here
We don't have a SparseBlock anymore. I can add one back if you want, but I figured it'd be easier not to, since implementing `SparseArray.__setitem__` will remove the need for this, and we'd just have to remove SparseBlock again.
this is pretty hacky. This was why we originally had a `.get_values()` method on Sparse to do things like this. We need something to give back the underlying type of the object, which is useful for Categorical as well. Would rather create a generalized solution than hack it like this.
Actually, we don't need this. I think we can just re-infer the dtype from the output of `np.where`.
so is this changing?
Changing from master? Yes, in the sense that it'll return a SparseArray. But it still densifies when `np.where` is called.
If you mean "is this changing in the future", yes, it'll be removed when `SparseArray.__setitem__` is implemented.
oh ok, can you add a TODO comment
@jorisvandenbossche OK, here's some context :)

The most immediate failure is mismatched block dimensions / shapes:

In [8]: df = pd.DataFrame({"A": pd.Categorical([1, 2, 3])})
In [9]: df.where(pd.DataFrame({"A": [True, False, True]}))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-9-56dcebf7e672> in <module>
----> 1 df.where(pd.DataFrame({"A": [True, False, True]}))
...
~/sandbox/pandas/pandas/core/internals/blocks.py in __init__(self, values, placement, ndim)
     84                 raise ValueError(
     85                     'Wrong number of items passed {val}, placement implies '
---> 86                     '{mgr}'.format(val=len(self.values), mgr=len(self.mgr_locs)))
     87
     88     def _check_ndim(self, values, ndim):
ValueError: Wrong number of items passed 3, placement implies 1

The broadcasting is all messed up since the shapes aren't right (we're using np.where there):

ipdb> cond
array([[ True],
       [False],
       [ True]])
ipdb> values
[1, 2, 3]
Categories (3, int64): [1, 2, 3]
ipdb> other
nan

A hacky, but shorter fix is to use the following (this is in Block.where):

diff --git a/pandas/core/internals/blocks.py b/pandas/core/internals/blocks.py
index 618b9eb12..2356a226d 100644
--- a/pandas/core/internals/blocks.py
+++ b/pandas/core/internals/blocks.py
@@ -1319,12 +1319,20 @@ class Block(PandasObject):
         values = self.values
         orig_other = other
+        if not self._can_consolidate:
+            transpose = False
+
         if transpose:
             values = values.T

         other = getattr(other, '_values', getattr(other, 'values', other))
         cond = getattr(cond, 'values', cond)
+        if not self._can_consolidate:
+            if cond.ndim == 2:
+                assert cond.shape[-1] == 1
+                cond = cond.ravel()
+
         # If the default broadcasting would go in the wrong direction, then
         # explicitly reshape other instead
         if getattr(other, 'ndim', 0) >= 1:

That fixes most of the issues I was having on the DTA branch. Still running the tests to see if any were re-broken.

So, in summary
pandas/core/internals/blocks.py
Outdated
        else:
            dtype = self.dtype

        # rough heuristic to see if the other array implements setitem
again you don't actually need to do this here, rather override in the appropriate class
We will still need the check for extension, even if we create SparseBlock again.
@@ -122,6 +162,60 @@ def test_get_indexer_non_unique(self, idx_values, key_values, key_class):
        tm.assert_numpy_array_equal(expected, result)
        tm.assert_numpy_array_equal(exp_miss, res_miss)

    def test_where_unobserved_nan(self):
is this where all of the where tests are?
There weren't any previously (we used to fall back to object).
Updated. Main outstanding point is whether or not we should create a SparseBlock just for this. I don't have a preference.
        if isinstance(other, (ABCIndexClass, ABCSeries)):
            other = other.array

        elif isinstance(other, ABCDataFrame):
can you add some comments here
pandas/core/internals/blocks.py
Outdated

        # rough heuristic to see if the other array implements setitem
        if (self._holder.__setitem__ == ExtensionArray.__setitem__
                or self._holder.__setitem__ == SparseArray.__setitem__):
what the heck is this?
The general block is to check if the block implements `__setitem__`. That specific line is backwards compat for `SparseArray`, which implements `__setitem__` to raise a TypeError instead of a NotImplementedError.
I suppose it'd be cleaner to do this in a `try / except` block...
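That try/except alternative might look like this (a hypothetical sketch; `masked_set` is an invented name, not PR code):

```python
import numpy as np

def masked_set(arr, cond, other):
    # Try the masked setitem on a copy; fall back to np.where for
    # array types whose __setitem__ raises.
    result = arr.copy()
    try:
        result[~np.asarray(cond)] = other
    except (TypeError, NotImplementedError):
        result = np.where(cond, arr, other)
    return result

print(masked_set(np.array([1, 2, 3]), np.array([True, False, True]), -1))
```

The advantage over comparing `__setitem__` attributes is that any array type signalling "unsupported" by raising is handled uniformly, without hard-coding class names.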
Cleaned things up a bit I think.
All green.
looks pretty reasonable. question about the sparse checks.
pandas/core/internals/blocks.py
Outdated
            if lib.is_scalar(other):
                msg = object_msg.format(other)
            else:
                msg = compat.reprlib.repr(other)
why is this needed?
So we don't blow up with a long message for large categoricals. I messed it up though, one sec.
I've removed all this stuff and just print out the text of the message.
With a bit of effort we could figure out exactly which of the new values is causing the fallback to object, but that'd take some work (we don't know the exact type/dtype of `other` here, so there will be a lot of conditions). Not a high priority.
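For reference, the truncation that was being done via `compat.reprlib.repr` in the diff above comes from the standard library; a minimal sketch of why it keeps warning messages short for large categoricals:

```python
import reprlib

# reprlib.repr truncates the repr of large containers (default: show
# the first 6 list items), keeping messages short.
msg = reprlib.repr(list(range(10_000)))
print(msg)  # '[0, 1, 2, 3, 4, 5, ...]'
```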
k cool
small additional comments, lgtm otherwise. ping on green.
    return np.random.choice(list(string.ascii_letters), size=100)
    while True:
        values = np.random.choice(list(string.ascii_letters), size=100)
        # ensure we meet the requirement
no repeated values but duplicates allowed?
Just that the first two are distinct, since the where test requires that `data[0] != data[1]`.
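A complete version of that fixture body might look like this (a sketch; the loop just resamples until the requirement `data[0] != data[1]` holds):

```python
import string

import numpy as np

# Resample until the first two values are distinct, as the where tests
# require (data[0] != data[1]).
while True:
    values = np.random.choice(list(string.ascii_letters), size=100)
    if values[0] != values[1]:
        break
print(values[:2])
```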
@@ -11,7 +11,11 @@ def dtype():

@pytest.fixture
def data():
    """Length-100 array for this type."""
    """Length-100 array for this type.
can you copy this doc-string to the categorical one
@@ -2658,6 +2708,32 @@ def concat_same_type(self, to_concat, placement=None):
            values, placement=placement or slice(0, len(values), 1),
            ndim=self.ndim)

    def where(self, other, cond, align=True, errors='raise',
              try_cast=False, axis=0, transpose=False):
        # This can all be deleted in favor of ExtensionBlock.where once
can you add TODO(EA) or something here so we know to remove this
All green.
thanks!
Closes #24142
Closes #16983