NumPyBackedExtensionArray #24227
Conversation
Hello @TomAugspurger! Thanks for updating the PR. Cheers! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on December 28, 2018 at 17:40 UTC
this is ultimately going to be pretty non-performant, meaning now columns are separated. but it is where we are going. let's not do this for 0.24 :<
This is about all we need to ensure that a numpy EA never ends up in a Series / Index under normal conditions. Doing frame now.

diff --git a/pandas/core/arrays/numpy_.py b/pandas/core/arrays/numpy_.py
index f9b88b325..67314826e 100644
--- a/pandas/core/arrays/numpy_.py
+++ b/pandas/core/arrays/numpy_.py
@@ -56,6 +56,7 @@ class NumPyExtensionDtype(ExtensionDtype):
class NumPyExtensionArray(ExtensionArray, ExtensionOpsMixin):
+ _typ = "npy_extension"
__array_priority__ = 1000
def __init__(self, values):
diff --git a/pandas/core/dtypes/generic.py b/pandas/core/dtypes/generic.py
index 7a3ff5d29..7bdcdabef 100644
--- a/pandas/core/dtypes/generic.py
+++ b/pandas/core/dtypes/generic.py
@@ -67,7 +67,11 @@ ABCExtensionArray = create_pandas_abc_type("ABCExtensionArray", "_typ",
("extension",
"categorical",
"periodarray",
+ "npy_extension",
))
+ABCNumPyExtensionArray = create_pandas_abc_type("ABCNumPyExtensionArray",
+ "_typ",
+ ("npy_extension",))
class _ABCGeneric(type):
diff --git a/pandas/core/indexes/base.py b/pandas/core/indexes/base.py
index 811d66c74..352c23189 100644
--- a/pandas/core/indexes/base.py
+++ b/pandas/core/indexes/base.py
@@ -27,7 +27,7 @@ import pandas.core.dtypes.concat as _concat
from pandas.core.dtypes.generic import (
ABCDataFrame, ABCDateOffset, ABCDatetimeIndex, ABCIndexClass,
ABCMultiIndex, ABCPeriodIndex, ABCSeries, ABCTimedeltaArray,
- ABCTimedeltaIndex)
+ ABCTimedeltaIndex, ABCNumPyExtensionArray)
from pandas.core.dtypes.missing import array_equivalent, isna
from pandas.core import ops
@@ -261,6 +261,9 @@ class Index(IndexOpsMixin, PandasObject):
return cls._simple_new(data, name)
from .range import RangeIndex
+ if isinstance(data, ABCNumPyExtensionArray):
+ # ensure users don't accidentally put a NumPyEA in an index.
+ data = data._ndarray
# range
if isinstance(data, RangeIndex):
diff --git a/pandas/core/internals/construction.py b/pandas/core/internals/construction.py
index c43745679..f08fbe2e7 100644
--- a/pandas/core/internals/construction.py
+++ b/pandas/core/internals/construction.py
@@ -24,11 +24,13 @@ from pandas.core.dtypes.common import (
is_integer_dtype, is_iterator, is_list_like, is_object_dtype, pandas_dtype)
from pandas.core.dtypes.generic import (
ABCDataFrame, ABCDatetimeIndex, ABCIndexClass, ABCPeriodIndex, ABCSeries,
- ABCTimedeltaIndex)
+ ABCTimedeltaIndex, ABCNumPyExtensionArray)
from pandas.core.dtypes.missing import isna
from pandas.core import algorithms, common as com
-from pandas.core.arrays import Categorical, ExtensionArray, period_array
+from pandas.core.arrays import (
+ Categorical, ExtensionArray, period_array,
+)
from pandas.core.index import (
Index, _get_objs_combined_axis, _union_indexes, ensure_index)
from pandas.core.indexes import base as ibase
@@ -577,6 +579,9 @@ def sanitize_array(data, index, dtype=None, copy=False,
# we will try to copy be-definition here
subarr = _try_cast(data, True, dtype, copy, raise_cast_failure)
+ elif isinstance(data, ABCNumPyExtensionArray):
+ # don't let people put NumPy EAs into Series.
+ subarr = data._ndarray
elif isinstance(data, ExtensionArray):
subarr = data
diff --git a/pandas/tests/indexes/test_base.py b/pandas/tests/indexes/test_base.py
index 2580a47e8..7c52a8a3e 100644
--- a/pandas/tests/indexes/test_base.py
+++ b/pandas/tests/indexes/test_base.py
@@ -260,6 +260,12 @@ class TestIndex(Base):
with pytest.raises(ValueError, match=msg):
Index(data, dtype=dtype)
+ def test_constructor_no_numpy_backed_ea(self):
+ ser = pd.Series([1, 2, 3])
+ result = pd.Index(ser.array)
+ expected = pd.Index([1, 2, 3])
+ tm.assert_index_equal(result, expected)
+
@pytest.mark.parametrize("klass,dtype,na_val", [
(pd.Float64Index, np.float64, np.nan),
(pd.DatetimeIndex, 'datetime64[ns]', pd.NaT)
diff --git a/pandas/tests/series/test_constructors.py b/pandas/tests/series/test_constructors.py
index f5a445e2c..6b0f0b02e 100644
--- a/pandas/tests/series/test_constructors.py
+++ b/pandas/tests/series/test_constructors.py
@@ -21,6 +21,7 @@ from pandas import (
Categorical, DataFrame, Index, IntervalIndex, MultiIndex, NaT, Series,
Timestamp, date_range, isna, period_range, timedelta_range)
from pandas.api.types import CategoricalDtype
+from pandas.core.internals.blocks import IntBlock
from pandas.core.arrays import period_array
import pandas.util.testing as tm
from pandas.util.testing import assert_series_equal
@@ -1238,3 +1239,9 @@ class TestSeriesConstructors():
result = Series(dt_list)
expected = Series(dt_list, dtype=object)
tm.assert_series_equal(result, expected)
+
+ def test_constructor_no_numpy_backed_ea(self):
+ ser = pd.Series([1, 2, 3])
+ result = pd.Series(ser.array)
+ tm.assert_series_equal(ser, result)
+ assert isinstance(result._data.blocks[0], IntBlock)
Do you think it's likely to delay things? Since pandas isn't using
Codecov Report
@@ Coverage Diff @@
## master #24227 +/- ##
===========================================
- Coverage 92.21% 42.99% -49.23%
===========================================
Files 162 163 +1
Lines 51763 51938 +175
===========================================
- Hits 47733 22330 -25403
- Misses 4030 29608 +25578
Continue to review full report at Codecov.
Codecov Report
@@ Coverage Diff @@
## master #24227 +/- ##
==========================================
+ Coverage 92.29% 92.3% +<.01%
==========================================
Files 163 165 +2
Lines 51948 52181 +233
==========================================
+ Hits 47945 48165 +220
- Misses 4003 4016 +13
Continue to review full report at Codecov.
a1fecf4 has the changes that ensure these don't enter pandas by normal means (you can still create the block directly, or monkeypatch as we do in the tests).
Will look into the py2 compat later. Looks like an integer division difference.
pandas/core/arrays/numpy_.py (outdated):

    def all(self, skipna=True):
        return nanops.nanall(self._ndarray, skipna=skipna)

    def sum(self, skipna=True, min_count=0):
This turned up a slight issue with the interface. Without min_count, the groupby tests were failing because pandas tried to pass min_count to sum.

I haven't isolated the cause yet, but I suspect defining the method sum is meaningful (instead of just doing it in _reduce).
Do you have an idea why this is not an issue for the other EAs? (maybe we should add those reduction functions to the interface to be clear about the expected signature -> but for another issue I suppose)
Just verified, it is indeed because we define PandasArray.sum. IOW, this fails
diff --git a/pandas/core/arrays/numpy_.py b/pandas/core/arrays/numpy_.py
index a5a572d42..39a4088af 100644
--- a/pandas/core/arrays/numpy_.py
+++ b/pandas/core/arrays/numpy_.py
@@ -233,6 +233,8 @@ class PandasArray(ExtensionArray, ExtensionOpsMixin):
# Reductions
def _reduce(self, name, skipna=True, **kwargs):
+ # if name == 'sum':
+ # return self.sum_(skipna=skipna)
meth = getattr(self, name, None)
if meth is None:
# raise from the parent
@@ -254,9 +256,8 @@ class PandasArray(ExtensionArray, ExtensionOpsMixin):
def all(self, skipna=True):
return nanops.nanall(self._ndarray, skipna=skipna)
- def sum(self, skipna=True, min_count=0):
- return nanops.nansum(self._ndarray, skipna=skipna,
- min_count=min_count)
+ def sum(self, skipna=True):
+ return nanops.nansum(self._ndarray, skipna=skipna)
def mean(self, skipna=True):
return nanops.nanmean(self._ndarray, skipna=skipna)
but this passes
diff --git a/pandas/core/arrays/numpy_.py b/pandas/core/arrays/numpy_.py
index a5a572d42..d8a7b9a4e 100644
--- a/pandas/core/arrays/numpy_.py
+++ b/pandas/core/arrays/numpy_.py
@@ -233,6 +233,8 @@ class PandasArray(ExtensionArray, ExtensionOpsMixin):
# Reductions
def _reduce(self, name, skipna=True, **kwargs):
+ if name == 'sum':
+ return self.sum_(skipna=skipna)
meth = getattr(self, name, None)
if meth is None:
# raise from the parent
@@ -254,9 +256,8 @@ class PandasArray(ExtensionArray, ExtensionOpsMixin):
def all(self, skipna=True):
return nanops.nanall(self._ndarray, skipna=skipna)
- def sum(self, skipna=True, min_count=0):
- return nanops.nansum(self._ndarray, skipna=skipna,
- min_count=min_count)
+ def sum_(self, skipna=True):
+ return nanops.nansum(self._ndarray, skipna=skipna)
def mean(self, skipna=True):
return nanops.nanmean(self._ndarray, skipna=skipna)
None of the other EAs (in particular, IntegerArray) define .sum. Instead they just do the op from _reduce.
doc/source/api.rst (outdated):

    @@ -2657,6 +2657,7 @@ objects.
    api.extensions.register_index_accessor
    api.extensions.ExtensionDtype
    api.extensions.ExtensionArray
    arrays.NumPyExtensionArray
This is where #23581 is collecting our arrays.
On the name, what do people think about just
A few questions
I like the separate PandasArray. That makes it easier to understand that it
gets unboxed as a special case, and we can save NumpyArray for the boxed
version. I don't think there are good reasons to limit its scope -- there
are probably some use cases for wrapping the ExtensionArray interface
directly.
…On Tue, Dec 11, 2018 at 8:00 PM Tom Augspurger ***@***.***> wrote:
A few questions
1. What downsides are we missing here? What can this break that we
aren't expecting? On the one hand this feels like a large change for this
late in the release cycle, but not much is using Series/Index.array
yet so very few changes were needed to pass the test suite.
2. Should we intentionally limit the scope of PandasArray for now?
(remove ops, reductions) in the spirit of keeping things small?
3. Do we want to do this?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#24227 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABKS1u9qm-kzqBhP0dpS2igb0i7_jwd5ks5u4H9pgaJpZM4ZNoQZ>
.
I share some of that sentiment, but I still think this is good for 0.24. We have a new API in Plus... I think it's done :)
I needed a break from staring at our JSON serializer :) But I'll get back to that now.
lol
I think this is a very, very light integration, mainly as an export on .array. This is not in line with our model though. We store ExtensionArrays separately. So this is yet another break with the implementation. This makes understanding what is going on very confusing. Furthermore, you have changed this in very few locations (e.g. to recognize this EA), but there are very many touch points internally where we expect a numpy array (that is just the innards of a Series).
I would like to see this go thru an entire cycle to find bugs / issues / perf problems. (meaning 0.25)
The light integration is intentional.
The catching of

To me, it comes down to balancing some additional maintenance burden (the checks for

So, my vote is for either

The 3rd option is adding
Yes, that's what I more or less was commenting here: #23581 (comment). If doing this, I was assuming that we would not actually store them as ExtensionBlocks, but still as consolidated blocks. It's certainly an added complexity that

Regarding 0.24/0.25: if we want to do this, I think it is the most logical to do it now for 0.24, since it is now that we are adding the
Just to verify, we should do a release candidate with DatetimeArray ASAP, right? And then 1-2 weeks on master while the RC is out?
A few quick comments
Now that we've had some time to think on it, what are people's thoughts? @shoyer and @jorisvandenbossche are I think +1 (in broad terms; may disagree with certain parts of the implementation). I just think that

I think those benefits outweigh the costs of this PR (the checking for a

It'd be nice to get a go / no go on this, so that I can update #23581 accordingly. I've actually already done the work for
I am not against this idea generally at all. Rather, this needs quite some integration. I worry that this will be implemented, then we just don't change anything else, leaving the internals just to rot.
I guess I'm not sure how PandasArray changes that, other than us implementing PandasArray later, after internals starts using
Just because I was curious, I added support for

The main thing this required was changing the signature of methods that numpy dispatches to like

I'm really excited about the combination of
All green.
The tests have been moved. I don't think either of the suggestions can be done easily. Moving

In [2]: a = pd.Series([1, 2])
In [3]: pd.Series(a, index=[1, 2, 3])
Out[3]:
1 2.0
2 NaN
3 NaN
dtype: float64 since we would call In [4]: pd.Series(a.values, index=[1, 2, 3])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-4-bc5ad1bde43b> in <module>
----> 1 pd.Series(a.values, index=[1, 2, 3])
~/sandbox/pandas/pandas/core/series.py in __init__(self, data, index, dtype, name, copy, fastpath)
245 'Length of passed values is {val}, '
246 'index implies {ind}'
--> 247 .format(val=len(data), ind=len(index)))
248 except TypeError:
249 pass
ValueError: Length of passed values is 2, index implies 3

which we don't want.
All green.
    self._dtype = PandasDtype(values.dtype)

    @classmethod
    def _from_sequence(cls, scalars, dtype=None, copy=False):
these should be python or numpy scalars. are we allowing things, e.g. a Timestamp here? I think that would be weird.
Why would Timestamp be weird? It subclasses datetime.datetime.
because these are numpy extension arrays, they should be pretty true to that
I don't understand. The type on _from_sequence is Sequence[T] where T is scalars of the dtype. Categorical[T], for example, meets that type.

Why would we choose to exclude some sequences here?
I think I was confusing a comment from __init__ with this one. This one is about the type of the scalars, not the container, so ignore my last comment.
But, NumPy handles Timestamps with object dtype
In [2]: pd.arrays.PandasArray._from_sequence([pd.Timestamp('2000'), pd.Timestamp('2000')])
Out[2]:
<PandasArray>
[Timestamp('2000-01-01 00:00:00'), Timestamp('2000-01-01 00:00:00')]
Length: 2, dtype: object
I don't think that inspecting the values in the sequence is a good idea. We should just pass them through to numpy.
I also don't really see a case where this would actually matter. If we're calling PandasArray._from_sequence then we know we want a PandasArray.
Note that there's an interesting NumPy issue for requiring object-dtype to be opt-in numpy/numpy#5353. That might be worth exploring (but it's blocked by strings for now).
that issue is from 4 years ago. it's a good idea and should have been done a while ago.
Open source, someone needs to do it. I might be able to once we have a strong need for it.
you know I agree!
All green.
    from . import base

    @pytest.fixture
you need to test for all numpy dtypes i think
That would create ~7,000 tests. Probably not what we want.

It also won't work for types like int, since ops casting to float will mean some of the base tests' expected arrays will be incorrect. That would require special-casing which test methods are skipped based on the dtype.
ok, the problem is this doesn't test for int (as you have noted), nor str, not to mention things like datetime. So either we should just restrict this, or test it.
Note that the very basics are tested for every dtype in https://github.com/pandas-dev/pandas/pull/24227/files#diff-941bc2d6a7667d26acf010e1072c134bR1199.
Do you have suggestions here? Exploding the number of tests is a no-go I think, so can we be more targeted?
Scanning through PandasArray, it looks like the things that are dtype-dependent are:

- __setitem__
- PandasDtype._is_numeric
- PandasDtype._is_boolean

Do you see any others?
I think adding just construction tests would be enough for now
Added parametrized tests for construction, setitem, and is_numeric / is_boolean in cac2a8b.
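Roughly the shape of those checks, as a minimal sketch: `Wrapped` is a hypothetical stand-in for PandasArray, the dtype list is illustrative, and a plain loop stands in for pytest.mark.parametrize so the snippet is self-contained:

```python
import numpy as np

class Wrapped:
    """Hypothetical stand-in for PandasArray: a thin wrapper over an ndarray."""

    def __init__(self, values):
        self._ndarray = np.asarray(values)
        self.dtype = self._ndarray.dtype  # the real class wraps this in PandasDtype

    def __setitem__(self, key, value):
        self._ndarray[key] = value

# Illustrative dtype list; the real tests parametrize over pandas' dtype fixtures.
for dtype in ["int64", "float64", "bool", "object", "complex128"]:
    arr = Wrapped(np.array([0, 1], dtype=dtype))
    assert arr.dtype == np.dtype(dtype)   # construction preserves the dtype
    arr[0] = arr._ndarray[1]              # setitem round-trips for each dtype
print("all dtypes ok")
```

Keeping the dtype-dependent surface this small (construction, setitem, and the two PandasDtype predicates) is what makes per-dtype coverage tractable without exploding the test count.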
    import pandas.util.testing as tm

    @pytest.fixture(params=[
It doesn't make sense to reuse any_numpy_dtype or combine this with L37/L54 below because they have different values for the expected.
lgtm. ping on green.
thanks @TomAugspurger very nice.
Thanks.
Adds a NumPyBackedExtensionArray, a thin wrapper around ndarray implementing the EA interface.
We use this to ensure that Series.array -> ExtensionArray, rather than a Union[ndarray, ExtensionArray]. xref #23995
cc @jreback @jorisvandenbossche @shoyer.
Some idle thoughts:

- Series.array always being an ExtensionArray. It gives us so much more freedom going forward.

TODO:

- NumPyBackedExtensionArray.to_numpy()?
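For the to_numpy TODO, a minimal sketch of what the unwrapping method might look like. This is hypothetical; `NumPyBackedSketch` and the exact signature (dtype/copy keywords) are assumptions, not the final API:

```python
import numpy as np

class NumPyBackedSketch:
    """Thin ndarray wrapper; to_numpy is the hypothetical unwrapping method."""

    def __init__(self, values):
        self._ndarray = np.asarray(values)

    def to_numpy(self, dtype=None, copy=False):
        # Hand back the backing ndarray, optionally casting and/or copying.
        result = self._ndarray
        if dtype is not None:
            result = result.astype(dtype, copy=False)
        if copy and result is self._ndarray:
            result = result.copy()
        return result


arr = NumPyBackedSketch([1, 2, 3])
print(arr.to_numpy(dtype="float64"))  # [1. 2. 3.]
```

Since the wrapper is zero-copy by default, to_numpy with no arguments is just an attribute access; the copy flag exists so callers can opt out of sharing the buffer.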