
NumPyBackedExtensionArray #24227

Merged Dec 28, 2018 (16 commits)

Conversation

TomAugspurger
Contributor

@TomAugspurger TomAugspurger commented Dec 11, 2018

Adds a NumPyBackedExtensionArray, a thin wrapper around ndarray implementing
the EA interface.

We use this to ensure that Series.array -> ExtensionArray, rather than
a Union[ndarray, ExtensionArray].
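The idea can be sketched in a few lines. This is an illustrative toy, not the PR's actual class (which lives in pandas/core/arrays/numpy_.py); all names here are hypothetical:

```python
import numpy as np

class NumPyBackedSketch:
    """Illustrative thin ndarray wrapper (toy, not pandas' actual class)."""

    def __init__(self, values):
        if not isinstance(values, np.ndarray):
            raise TypeError("expected an ndarray")
        self._ndarray = values  # the wrapped data

    @classmethod
    def _from_sequence(cls, scalars, dtype=None, copy=False):
        # EA-interface style constructor: build from an arbitrary sequence
        return cls(np.asarray(scalars, dtype=dtype))

    def __len__(self):
        return len(self._ndarray)

    def __getitem__(self, item):
        result = self._ndarray[item]
        if isinstance(result, np.ndarray):
            return type(self)(result)  # slices stay wrapped
        return result  # scalar access unboxes

    def to_numpy(self):
        # a public escape hatch back to the underlying ndarray
        return self._ndarray

arr = NumPyBackedSketch._from_sequence([1, 2, 3])
print(len(arr), arr.to_numpy().sum())  # 3 6
```

Because the wrapper holds the ndarray by reference, `to_numpy()` is zero-copy, which is also why in-place mutation of the returned array would be visible through the wrapper.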

xref #23995


cc @jreback @jorisvandenbossche @shoyer.

Some idle thoughts:

  1. I'm not sure whether we should do this. In theory, I like the idea of Series.array always being an ExtensionArray. It gives us much more freedom going forward.
  2. This is probably bad, but I think unavoidable (for now):
In [4]: ser = pd.Series([1, 2, 3])

In [5]: ser.array is ser.array
Out[5]: False
  3. Should add a method for going from a NumPyBackedExtensionArray to an ndarray.
  4. Temporarily patching the Series / frame constructors to box all ndarrays in an ExtensionArray will be a good way to ensure that all of pandas supports EAs.
  5. Should the Series / Index constructors unbox NumPyBackedExtensionArrays, so that you can't "accidentally" put one in? (I'm going to explore this quickly.)

TODO:

  • name (NumPyBackedExtensionArray is a bit long)
  • more docs
  • add public API for the ndarray? Maybe NumPyBackedExtensionArray.to_numpy()?

@pep8speaks

pep8speaks commented Dec 11, 2018

Hello @TomAugspurger! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on December 28, 2018 at 17:40 UTC

@TomAugspurger TomAugspurger mentioned this pull request Dec 11, 2018
2 tasks
@jreback
Contributor

jreback commented Dec 11, 2018

this is ultimately going to be pretty non-performant, meaning columns are now stored separately. but it is where we are going. let's not do this for 0.24 :<

@TomAugspurger
Contributor Author

TomAugspurger commented Dec 11, 2018

This is about all we need to ensure that a numpy EA never ends up in a Series / index under normal conditions. Doing frame now.

diff --git a/pandas/core/arrays/numpy_.py b/pandas/core/arrays/numpy_.py
index f9b88b325..67314826e 100644
--- a/pandas/core/arrays/numpy_.py
+++ b/pandas/core/arrays/numpy_.py
@@ -56,6 +56,7 @@ class NumPyExtensionDtype(ExtensionDtype):
 
 
 class NumPyExtensionArray(ExtensionArray, ExtensionOpsMixin):
+    _typ = "npy_extension"
     __array_priority__ = 1000
 
     def __init__(self, values):
diff --git a/pandas/core/dtypes/generic.py b/pandas/core/dtypes/generic.py
index 7a3ff5d29..7bdcdabef 100644
--- a/pandas/core/dtypes/generic.py
+++ b/pandas/core/dtypes/generic.py
@@ -67,7 +67,11 @@ ABCExtensionArray = create_pandas_abc_type("ABCExtensionArray", "_typ",
                                            ("extension",
                                             "categorical",
                                             "periodarray",
+                                            "npy_extension",
                                             ))
+ABCNumPyExtensionArray = create_pandas_abc_type("ABCNumPyExtensionArray",
+                                                "_typ",
+                                                ("npy_extension",))
 
 
 class _ABCGeneric(type):
diff --git a/pandas/core/indexes/base.py b/pandas/core/indexes/base.py
index 811d66c74..352c23189 100644
--- a/pandas/core/indexes/base.py
+++ b/pandas/core/indexes/base.py
@@ -27,7 +27,7 @@ import pandas.core.dtypes.concat as _concat
 from pandas.core.dtypes.generic import (
     ABCDataFrame, ABCDateOffset, ABCDatetimeIndex, ABCIndexClass,
     ABCMultiIndex, ABCPeriodIndex, ABCSeries, ABCTimedeltaArray,
-    ABCTimedeltaIndex)
+    ABCTimedeltaIndex, ABCNumPyExtensionArray)
 from pandas.core.dtypes.missing import array_equivalent, isna
 
 from pandas.core import ops
@@ -261,6 +261,9 @@ class Index(IndexOpsMixin, PandasObject):
                 return cls._simple_new(data, name)
 
         from .range import RangeIndex
+        if isinstance(data, ABCNumPyExtensionArray):
+            # ensure users don't accidentally put a NumPyEA in an index.
+            data = data._ndarray
 
         # range
         if isinstance(data, RangeIndex):
diff --git a/pandas/core/internals/construction.py b/pandas/core/internals/construction.py
index c43745679..f08fbe2e7 100644
--- a/pandas/core/internals/construction.py
+++ b/pandas/core/internals/construction.py
@@ -24,11 +24,13 @@ from pandas.core.dtypes.common import (
     is_integer_dtype, is_iterator, is_list_like, is_object_dtype, pandas_dtype)
 from pandas.core.dtypes.generic import (
     ABCDataFrame, ABCDatetimeIndex, ABCIndexClass, ABCPeriodIndex, ABCSeries,
-    ABCTimedeltaIndex)
+    ABCTimedeltaIndex, ABCNumPyExtensionArray)
 from pandas.core.dtypes.missing import isna
 
 from pandas.core import algorithms, common as com
-from pandas.core.arrays import Categorical, ExtensionArray, period_array
+from pandas.core.arrays import (
+    Categorical, ExtensionArray, period_array,
+)
 from pandas.core.index import (
     Index, _get_objs_combined_axis, _union_indexes, ensure_index)
 from pandas.core.indexes import base as ibase
@@ -577,6 +579,9 @@ def sanitize_array(data, index, dtype=None, copy=False,
             # we will try to copy be-definition here
             subarr = _try_cast(data, True, dtype, copy, raise_cast_failure)
 
+    elif isinstance(data, ABCNumPyExtensionArray):
+        # don't let people put NumPy EAs into Series.
+        subarr = data._ndarray
     elif isinstance(data, ExtensionArray):
         subarr = data
 
diff --git a/pandas/tests/indexes/test_base.py b/pandas/tests/indexes/test_base.py
index 2580a47e8..7c52a8a3e 100644
--- a/pandas/tests/indexes/test_base.py
+++ b/pandas/tests/indexes/test_base.py
@@ -260,6 +260,12 @@ class TestIndex(Base):
         with pytest.raises(ValueError, match=msg):
             Index(data, dtype=dtype)
 
+    def test_constructor_no_numpy_backed_ea(self):
+        ser = pd.Series([1, 2, 3])
+        result = pd.Index(ser.array)
+        expected = pd.Index([1, 2, 3])
+        tm.assert_index_equal(result, expected)
+
     @pytest.mark.parametrize("klass,dtype,na_val", [
         (pd.Float64Index, np.float64, np.nan),
         (pd.DatetimeIndex, 'datetime64[ns]', pd.NaT)
diff --git a/pandas/tests/series/test_constructors.py b/pandas/tests/series/test_constructors.py
index f5a445e2c..6b0f0b02e 100644
--- a/pandas/tests/series/test_constructors.py
+++ b/pandas/tests/series/test_constructors.py
@@ -21,6 +21,7 @@ from pandas import (
     Categorical, DataFrame, Index, IntervalIndex, MultiIndex, NaT, Series,
     Timestamp, date_range, isna, period_range, timedelta_range)
 from pandas.api.types import CategoricalDtype
+from pandas.core.internals.blocks import IntBlock
 from pandas.core.arrays import period_array
 import pandas.util.testing as tm
 from pandas.util.testing import assert_series_equal
@@ -1238,3 +1239,9 @@ class TestSeriesConstructors():
         result = Series(dt_list)
         expected = Series(dt_list, dtype=object)
         tm.assert_series_equal(result, expected)
+
+    def test_constructor_no_numpy_backed_ea(self):
+        ser = pd.Series([1, 2, 3])
+        result = pd.Series(ser.array)
+        tm.assert_series_equal(ser, result)
+        assert isinstance(result._data.blocks[0], IntBlock)

but it is where we are going. let's not do this for 0.24 :<

Do you think it's likely to delay things? Since pandas isn't using .array in many places it's not touching that much.

@codecov

codecov bot commented Dec 11, 2018

Codecov Report

Merging #24227 into master will decrease coverage by 49.22%.
The diff coverage is 38.17%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master   #24227       +/-   ##
===========================================
- Coverage   92.21%   42.99%   -49.23%     
===========================================
  Files         162      163        +1     
  Lines       51763    51938      +175     
===========================================
- Hits        47733    22330    -25403     
- Misses       4030    29608    +25578
Flag Coverage Δ
#multiple ?
#single 42.99% <38.17%> (-0.02%) ⬇️
Impacted Files Coverage Δ
pandas/core/internals/blocks.py 52.07% <0%> (-41.74%) ⬇️
pandas/core/reshape/reshape.py 13.62% <0%> (-85.94%) ⬇️
pandas/core/arrays/__init__.py 100% <100%> (ø) ⬆️
pandas/core/arrays/categorical.py 41.8% <100%> (-53.51%) ⬇️
pandas/core/arrays/numpy_.py 36.74% <36.74%> (ø)
pandas/core/base.py 31.77% <60%> (-65.87%) ⬇️
pandas/core/dtypes/common.py 72.72% <71.42%> (-22.91%) ⬇️
pandas/io/formats/latex.py 0% <0%> (-100%) ⬇️
pandas/core/categorical.py 0% <0%> (-100%) ⬇️
pandas/io/sas/sas_constants.py 0% <0%> (-100%) ⬇️
... and 127 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c994e98...a1fecf4. Read the comment docs.

@codecov

codecov bot commented Dec 11, 2018

Codecov Report

Merging #24227 into master will increase coverage by <.01%.
The diff coverage is 93.87%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #24227      +/-   ##
==========================================
+ Coverage   92.29%    92.3%   +<.01%     
==========================================
  Files         163      165       +2     
  Lines       51948    52181     +233     
==========================================
+ Hits        47945    48165     +220     
- Misses       4003     4016      +13
Flag Coverage Δ
#multiple 90.72% <93.87%> (+0.01%) ⬆️
#single 42.96% <35.91%> (-0.03%) ⬇️
Impacted Files Coverage Δ
pandas/core/base.py 97.7% <100%> (+0.02%) ⬆️
pandas/core/internals/blocks.py 93.81% <100%> (-0.01%) ⬇️
pandas/core/arrays/categorical.py 95.47% <100%> (+0.12%) ⬆️
pandas/core/reshape/reshape.py 99.56% <100%> (ø) ⬆️
pandas/core/internals/construction.py 96.66% <100%> (+0.01%) ⬆️
pandas/core/arrays/__init__.py 100% <100%> (ø) ⬆️
pandas/core/dtypes/generic.py 100% <100%> (ø) ⬆️
pandas/core/indexes/base.py 96.28% <100%> (ø) ⬆️
pandas/core/internals/arrays.py 100% <100%> (ø)
pandas/core/arrays/numpy_.py 93.05% <93.05%> (ø)
... and 3 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ab55d05...35f50a5. Read the comment docs.

@TomAugspurger
Contributor Author

a1fecf4 has the changes that ensure these don't enter pandas by normal means (you can still create the block directly, or monkeypatch as we do in the tests).

Contributor Author

@TomAugspurger TomAugspurger left a comment

Will look into the py2 compat later. Looks like an integer division difference.

def all(self, skipna=True):
return nanops.nanall(self._ndarray, skipna=skipna)

def sum(self, skipna=True, min_count=0):
Contributor Author

This turned up a slight issue with the interface. Without min_count, the groupby tests were failing because pandas tried to pass min_count to sum.

I haven't isolated the cause yet, but I suspect defining the method sum is meaningful (instead of just doing it in _reduce).

Member

Do you have an idea why this is not an issue for the other EAs? (maybe we should add those reductions functions to the interface to be clear about expected signature -> but for another issue I suppose)

Contributor Author

Just verified, it is indeed because we define PandasArray.sum. IOW, this fails

diff --git a/pandas/core/arrays/numpy_.py b/pandas/core/arrays/numpy_.py
index a5a572d42..39a4088af 100644
--- a/pandas/core/arrays/numpy_.py
+++ b/pandas/core/arrays/numpy_.py
@@ -233,6 +233,8 @@ class PandasArray(ExtensionArray, ExtensionOpsMixin):
     # Reductions
 
     def _reduce(self, name, skipna=True, **kwargs):
+        # if name == 'sum':
+        #     return self.sum_(skipna=skipna)
         meth = getattr(self, name, None)
         if meth is None:
             # raise from the parent
@@ -254,9 +256,8 @@ class PandasArray(ExtensionArray, ExtensionOpsMixin):
     def all(self, skipna=True):
         return nanops.nanall(self._ndarray, skipna=skipna)
 
-    def sum(self, skipna=True, min_count=0):
-        return nanops.nansum(self._ndarray, skipna=skipna,
-                             min_count=min_count)
+    def sum(self, skipna=True):
+        return nanops.nansum(self._ndarray, skipna=skipna)
 
     def mean(self, skipna=True):
         return nanops.nanmean(self._ndarray, skipna=skipna)

but this passes

diff --git a/pandas/core/arrays/numpy_.py b/pandas/core/arrays/numpy_.py
index a5a572d42..d8a7b9a4e 100644
--- a/pandas/core/arrays/numpy_.py
+++ b/pandas/core/arrays/numpy_.py
@@ -233,6 +233,8 @@ class PandasArray(ExtensionArray, ExtensionOpsMixin):
     # Reductions
 
     def _reduce(self, name, skipna=True, **kwargs):
+        if name == 'sum':
+            return self.sum_(skipna=skipna)
         meth = getattr(self, name, None)
         if meth is None:
             # raise from the parent
@@ -254,9 +256,8 @@ class PandasArray(ExtensionArray, ExtensionOpsMixin):
     def all(self, skipna=True):
         return nanops.nanall(self._ndarray, skipna=skipna)
 
-    def sum(self, skipna=True, min_count=0):
-        return nanops.nansum(self._ndarray, skipna=skipna,
-                             min_count=min_count)
+    def sum_(self, skipna=True):
+        return nanops.nansum(self._ndarray, skipna=skipna)
 
     def mean(self, skipna=True):
         return nanops.nanmean(self._ndarray, skipna=skipna)

None of the other EAs (in particular, IntegerArray) define .sum. Instead they just do the op from _reduce.
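The mismatch can be shown with a simplified, self-contained illustration (toy class, not pandas' code): a `_reduce` in this style looks the op up by name and forwards any extra keyword arguments the caller supplied, so a `sum` defined without `min_count` breaks when a groupby-style caller passes it — hence the `min_count=0` parameter in the PR.

```python
import numpy as np

class ArrayA:
    """Toy EA-like class that defines sum *without* min_count."""

    def __init__(self, values):
        self._ndarray = np.asarray(values)

    def sum(self, skipna=True):
        return self._ndarray.sum()

    def _reduce(self, name, skipna=True, **kwargs):
        # mimics the PandasArray._reduce pattern: look the op up by
        # name and forward whatever extra kwargs the caller supplied
        meth = getattr(self, name, None)
        if meth is None:
            raise TypeError("cannot perform {}".format(name))
        return meth(skipna=skipna, **kwargs)

arr = ArrayA([1, 2, 3])
print(arr._reduce("sum"))  # 6

try:
    # a groupby-style call forwards min_count; the narrow
    # signature rejects the unexpected keyword argument
    arr._reduce("sum", min_count=0)
except TypeError as exc:
    print("TypeError:", exc)
```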

@TomAugspurger TomAugspurger added the ExtensionArray Extending pandas with custom dtypes or arrays. label Dec 11, 2018
@TomAugspurger TomAugspurger changed the title [WIP]NumPyBackedExtensionArray NumPyBackedExtensionArray Dec 11, 2018
@@ -2657,6 +2657,7 @@ objects.
api.extensions.register_index_accessor
api.extensions.ExtensionDtype
api.extensions.ExtensionArray
arrays.NumPyExtensionArray
Contributor Author

This is where #23581 is collecting our arrays.

@TomAugspurger
Contributor Author

On the name, what do people think about just PandasArray? That gives us more flexibility into the future, since we don't tie the name to the implementation.

@TomAugspurger
Contributor Author

A few questions

  1. What downsides are we missing here? What can this break that we aren't expecting? On the one hand this feels like a large change this late in the release cycle, but not much is using Series/Index.array yet, so very few changes were needed to pass the test suite.
  2. Should we intentionally limit the scope of PandasArray for now? (remove ops, reductions) in the spirit of keeping things small?
  3. Do we want to do this?

@shoyer
Member

shoyer commented Dec 12, 2018 via email

@jreback
Contributor

jreback commented Dec 12, 2018

PandasArray sgtm. But I will say this needs much more integration into pandas proper. I don't think introducing this in 0.24.0 at this late date is a good idea. Rather finish DTA and release.

@TomAugspurger
Contributor Author

I don't think introducing this in 0.24.0 at this late date is a good idea.

I share some of that sentiment, but I still think this is good for 0.24. We have a new API in .array, and this really seems like something that we'll wish we had done.

Plus... I think it's done :)

Rather finish DTA and release.

I needed a break from staring at our JSON serializer :) But I'll get back to that now.

@jreback
Contributor

jreback commented Dec 12, 2018

I needed a break from staring at our JSON serializer :) But I'll get back to that now.

lol

Contributor

@jreback jreback left a comment

I think this is a very light integration, mainly as an export on .array. This is not in line with our model, though. We store ExtensionArrays separately, so this is yet another break with the implementation, which makes understanding what is going on very confusing. Furthermore, you have changed this in very few locations (e.g. to recognize this EA), but there are very many touch points internally where we expect a numpy array (that is, just the innards of a Series).

I would like to see this go thru an entire cycle to find bugs / issues / perf problems. (meaning 0.25)

@TomAugspurger
Contributor Author

The light integration is intentional.

We store ExtensionArrays separately. So this is yet another break with the implementation. This makes understanding what is going on very confusing.

The catching of PandasArray specifically and unboxing it is indeed different, and confusing. No argument there. But, I think we also agree that these shouldn't actually end up inside a Series / DataFrame (IIRC @shoyer proposed that, not sure if @jorisvandenbossche weighed in).

To me, it comes down to balancing some additional maintenance burden (the checks for ABCNumPyExtensionArray) in order to not break user-code if / when we decide that re-writing the block manager is a good idea. I think that it's likely enough to happen that we should consider this.

So, my vote is for either

  1. Not doing this at all (and maybe having to break code in 2.x if that's when we re-do the block manager)
  2. Taking this PR roughly as is (.array returns a PandasArray, but it isn't allowed inside our containers under normal usage).

The 3rd option is adding PandasArray and allowing it in pandas (no ABCNumPyExtensionArray checks) but I don't think we should do that.

@jorisvandenbossche
Member

The catching of PandasArray specifically and unboxing it is indeed different, and confusing. No argument there. But, I think we also agree that these shouldn't actually end up inside a Series / DataFrame (IIRC @shoyer proposed that, not sure if @jorisvandenbossche weighed in).

Yes, that's more or less what I was commenting here: #23581 (comment). If doing this, I was assuming that we would not actually store them as ExtensionBlocks, but still as consolidated blocks.

It's certainly an added complexity that .array is sometimes the actual stored data, or sometimes a PandasArray with a view on (a part of) the stored numpy array, and indeed breaks the model. But the current situation where .array returns an ExtensionArray or a numpy ndarray depending on the dtype is also not that clean and consistent.

Regarding 0.24/0.25: if we want to do this, I think it is the most logical to do it now for 0.24, since it is now that we are adding the Series.array and pd.array.
And I think the argument of it needing more time in master is equally valid for DatetimeArray (for which IMO we will also need at least a couple of weeks in master).

@TomAugspurger
Contributor Author

(for which IMO we will also need at least a couple of weeks in master)

Just to verify, we should do a release candidate with DatetimeArray ASAP, right? And then 1-2 weeks on master while the RC is out?

Member

@jorisvandenbossche jorisvandenbossche left a comment

A few quick comments


@TomAugspurger
Contributor Author

Now that we've had some time to think on it, what are people's thoughts?

@shoyer and @jorisvandenbossche are I think +1 (in broad terms. May disagree with certain parts of the implementation).

I just think that

  1. Having .array -> ExtensionArray rather than -> Union[ndarray, ExtensionArray] is going to be more useful for people working with this.
  2. The basic principle of "inplace modification of Series.array will be reflected back in the Series" is still true (and still probably a bad idea).
  3. We can swap out the underlying data backing the .array in the future without breaking user code (at least for users doing "normal" things).

I think those benefits outweigh the costs of this PR (the checking for a PandasArray in the constructors).

It'd be nice to get a go / no go on this, so that I can update #23581 accordingly. I've actually already done the work for pd.array to return a PandasArray, just haven't pushed it :)

@jreback
Contributor

jreback commented Dec 13, 2018

I am not against this idea generally at all. Rather, this needs quite some integration. I worry that this will be implemented, then we just won't change anything else, leaving the internals to rot.

@TomAugspurger
Contributor Author

Rather this needs quite some integration. I worry that this will be implemented, then we just don't change anything else, leaving the internals just to rot.

I guess I'm not sure how PandasArray changes that, other than us implementing PandasArray later, after internals starts using .array, causing us to have to rewrite the parts of pandas' internals that were expecting .array to be an ndarray. If anything, we should be doing PandasArray before we start using .array all over the place.

@jorisvandenbossche jorisvandenbossche mentioned this pull request Dec 13, 2018
@TomAugspurger
Contributor Author

Just because I was curious, I added support for __array_ufunc__ to PandasArray. Not sure if we should go forward with it for 0.24, since we only require NumPy 1.12.

The main thing this required was changing the signature of methods that numpy dispatches to like sum to have the signature expected by NumPy.

I'm really excited about the combination of __array_ufunc__ and ExtensionArray. We should encourage EA authors to implement it.
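For context, `__array_ufunc__` was introduced in NumPy 1.13, which is why the 1.12 minimum is the sticking point. A minimal sketch of the dispatch pattern (hypothetical class, not the PR's code): unwrap our instances, let NumPy do the computation, and re-wrap the result.

```python
import numpy as np

class UfuncSketch:
    """Toy ndarray wrapper that participates in NumPy ufunc dispatch."""
    __array_priority__ = 1000

    def __init__(self, values):
        self._ndarray = np.asarray(values)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # unwrap any of our instances, let NumPy compute, re-wrap arrays
        unwrapped = [x._ndarray if isinstance(x, UfuncSketch) else x
                     for x in inputs]
        result = getattr(ufunc, method)(*unwrapped, **kwargs)
        if isinstance(result, np.ndarray):
            return type(self)(result)
        return result

a = UfuncSketch([1.0, 4.0, 9.0])
out = np.sqrt(a)        # dispatches through __array_ufunc__
mixed = np.add(a, 1)    # mixed wrapper/scalar inputs also work
print(out._ndarray, mixed._ndarray)
```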

@TomAugspurger
Contributor Author

All green.

@TomAugspurger
Contributor Author

TomAugspurger commented Dec 27, 2018

The tests have been moved.

I don't think either of the suggestions can be done easily. Moving extract_array under core/internals caused circular import issues. And we don't want to use extract_array when checking specifically for ABCPandasArray, because that wouldn't be the correct behavior for a Series / Index. For example, this would raise

In [2]: a = pd.Series([1, 2])

In [3]: pd.Series(a, index=[1, 2, 3])
Out[3]:
1    2.0
2    NaN
3    NaN
dtype: float64

since we would call extract_array on a and it'd be just like passing the ndarray

In [4]: pd.Series(a.values, index=[1, 2, 3])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-bc5ad1bde43b> in <module>
----> 1 pd.Series(a.values, index=[1, 2, 3])

~/sandbox/pandas/pandas/core/series.py in __init__(self, data, index, dtype, name, copy, fastpath)
    245                             'Length of passed values is {val}, '
    246                             'index implies {ind}'
--> 247                             .format(val=len(data), ind=len(index)))
    248                 except TypeError:
    249                     pass

ValueError: Length of passed values is 2, index implies 3

which we don't want.

@TomAugspurger
Contributor Author

All green.

self._dtype = PandasDtype(values.dtype)

@classmethod
def _from_sequence(cls, scalars, dtype=None, copy=False):
Contributor

these should be python or numpy scalars. are we allowing things, e.g. a Timestamp here? I think that would be weird.

Contributor Author

Why would Timestamp be weird? It subclasses datetime.datetime.

Contributor

because these are numpy extension arrays, they should be pretty true to that

Contributor Author

I don't understand. The type on _from_sequence is Sequence[T] where T is scalars of the dtype. Categorical[T], for example, meets that type.

Why would we choose to exclude some sequences here?

Contributor Author

I think I was confusing comment from __init__ with this one. This one is about the type of the scalars, not the container, so ignore my last comment.

But, NumPy handles Timestamps with object dtype

In [2]: pd.arrays.PandasArray._from_sequence([pd.Timestamp('2000'), pd.Timestamp('2000')])
Out[2]:
<PandasArray>
[Timestamp('2000-01-01 00:00:00'), Timestamp('2000-01-01 00:00:00')]
Length: 2, dtype: object

I don't think that inspecting the values in the sequence is a good idea. We should just pass them through to numpy.

Contributor Author

I also don't really see a case where this would actually matter. If we're calling PandasArray._from_sequence then we know we want a PandasArray.

Contributor Author

Note that there's an interesting NumPy issue for requiring object-dtype to be opt-in numpy/numpy#5353. That might be worth exploring (but it's blocked by strings for now).

Contributor

that issue is from 4 years ago. its a good idea and should have been done a while ago.

Contributor Author

Open source, someone needs to do it. I might be able to once we have a strong need for it.

Contributor

you know I agree!

@TomAugspurger
Contributor Author

All green.

from . import base


@pytest.fixture
Contributor

you need to test for all numpy dtypes i think

Contributor Author

That would create ~7,000 tests. Probably not what we want.

It also won't work for types like int, since ops casting to float will mean some of the base tests expected arrays will be incorrect. That would require special casing which test methods are skipped based on the dtype.

Contributor

ok, the problem is this doesn't test for int (as you have noted), nor str, not to mention things like datetime. So either we should just restrict this, or test it.

Contributor Author

Note that the very basics are tested for every dtype in https://github.com/pandas-dev/pandas/pull/24227/files#diff-941bc2d6a7667d26acf010e1072c134bR1199.

Do you have suggestions here? Exploding the number of tests is a no-go I think, so can we be more targeted?

Scanning through PandasArray it looks like the things that are dtype-dependent are

  • __setitem__

and

  • PandasDtype._is_numeric
  • PandasDtype._is_boolean

Do you see any others?

Contributor

I think adding just construction tests would be enough for now
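Such a targeted, parametrized construction test might look like the following sketch (the dtype list and `Wrapper` stand-in are illustrative, not the PR's actual fixtures or class):

```python
import numpy as np
import pytest

class Wrapper:
    """Hypothetical stand-in for PandasArray (illustration only)."""

    def __init__(self, values):
        if not isinstance(values, np.ndarray):
            raise TypeError("expected an ndarray")
        self._ndarray = values
        self.dtype = values.dtype

@pytest.mark.parametrize("dtype", ["int64", "float64", "bool", "object"])
def test_construction_preserves_dtype(dtype):
    data = np.array([1, 0, 1], dtype=dtype)
    arr = Wrapper(data)
    # construction should neither copy nor cast the input ndarray
    assert arr.dtype == data.dtype
    assert arr._ndarray is data
```

Parametrizing only the constructor keeps the dtype coverage broad without exploding the full base-test matrix.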

@TomAugspurger
Contributor Author

TomAugspurger commented Dec 28, 2018

Added parametrized tests for construction, setitem, and is_numeric / is_boolean in cac2a8b.

import pandas.util.testing as tm


@pytest.fixture(params=[
Contributor Author

@TomAugspurger TomAugspurger Dec 28, 2018

It doesn't make sense to reuse any_numpy_dtype or to combine this with L37/L54 below, because they have different expected values.

@jreback
Contributor

jreback commented Dec 28, 2018

lgtm. ping on green.

@jreback jreback merged commit 336b8d6 into pandas-dev:master Dec 28, 2018
@jreback
Contributor

jreback commented Dec 28, 2018

thanks @TomAugspurger very nice.

@TomAugspurger
Contributor Author

TomAugspurger commented Dec 28, 2018 via email

@TomAugspurger TomAugspurger deleted the numpy-ea branch December 28, 2018 20:03
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019