-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API: Public data for Series and Index: .array and .to_numpy() #23623
Conversation
This adds two new methods for working with EA-backed Series / Index. - `.array -> Union[ExtensionArray, ndarray]`: the actual backing array - `.to_numpy() -> ndarray`: A NumPy representation of the data `.array` is always a reference to the actual data stored in the container. Updating it inplace (not recommended) will be reflected in the Series (or Index for that matter, so really not recommended). `to_numpy()` may (or may not) require data copying / coercion. Closes pandas-dev#19954
Hello @TomAugspurger! Thanks for submitting the PR.
|
Not having read the diff, first thought is that I’d like to wait for DatetimeArray to see what we can simplify vis-a-vis values/_values |
Can you expand on that? `.values` isn't changing (aside from recommending
against it in the docs).
`.array` is implemented in terms of `._values` right now, but we can adjust
remove that level of indirection if we want,
and that will have to wait till DatetimeArray is done.
…On Sun, Nov 11, 2018 at 1:22 PM jbrockmendel ***@***.***> wrote:
Not having read the diff, first thought is that I’d like to wait for
DatetimeArray to see what we can simplify vis-a-vis values/_values
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#23623 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABQHIs7PBHbB1FrTd8xFy5ZmGStpU4KCks5uuHj8gaJpZM4YYert>
.
|
Not at all a well thought out opinion, just gut reaction. |
Split from pandas-dev#23623, where it was causing issues with infer_dtype.
BUG: Ensure that Index._data is an ndarray Split from pandas-dev#23623, where it was causing issues with infer_dtype.
commit e4b21f6 Author: Tom Augspurger <tom.w.augspurger@gmail.com> Date: Mon Nov 12 16:09:58 2018 -0600 TST: Change rops tests commit e903550 Author: Tom Augspurger <tom.w.augspurger@gmail.com> Date: Mon Nov 12 09:31:38 2018 -0600 Add note [ci skip] ***NO CI*** commit fa8934a Author: Tom Augspurger <tom.w.augspurger@gmail.com> Date: Mon Nov 12 06:16:53 2018 -0600 update errors commit 505970e Merge: a30bc02 3592a46 Author: Tom Augspurger <tom.w.augspurger@gmail.com> Date: Mon Nov 12 05:55:31 2018 -0600 Merge remote-tracking branch 'upstream/master' into index-ndarray-data commit a30bc02 Author: Tom Augspurger <tom.w.augspurger@gmail.com> Date: Sun Nov 11 15:14:46 2018 -0600 remove assert commit 1f23ebc Author: Tom Augspurger <tom.w.augspurger@gmail.com> Date: Sun Nov 11 15:01:13 2018 -0600 BUG: Ensure that Index._data is an ndarray BUG: Ensure that Index._data is an ndarray Split from pandas-dev#23623, where it was causing issues with infer_dtype.
Codecov Report
@@ Coverage Diff @@
## master #23623 +/- ##
==========================================
+ Coverage 92.31% 92.31% +<.01%
==========================================
Files 161 161
Lines 51513 51526 +13
==========================================
+ Hits 47554 47567 +13
Misses 3959 3959
Continue to review full report at Codecov.
|
@@ -1269,3 +1269,54 @@ def test_ndarray_values(array, expected): | |||
r_values = pd.Index(array)._ndarray_values | |||
tm.assert_numpy_array_equal(l_values, r_values) | |||
tm.assert_numpy_array_equal(l_values, expected) | |||
|
|||
|
|||
@pytest.mark.parametrize("array, attr", [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe put in pandas/tests/arrays/test_arrays.py?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't exist yet :), though my pd.array
PR is creating it.
This seemed a bit more appropriate since it's next to our tests for ndarray_values
.
Do time-zone naive datetimes use extension arrays? I'd love to have a clear rule for
|
That's being worked on right now. It's unclear to me how things will end up, but it's possible there will be an ExtensionArray between Series and the actual ndarray for tz-naive data. This will certainly be the case for DatetimeIndex. However, I don't think this fact needs to be exposed to the user. It's primarily for code reuse between datetime-tz, datetime, and timedelta. (We haven't really talked about
I like this rule. |
In that case I suppose it mostly comes down to:
|
"Ideally", if a user wants a numpy array, they use
I suppose that returning DatetimeArray instead of ndarray[datetime64[ns]] will be more future proof. But, that also relates to what I commented above (#23623 (comment)) related the back-compat guarantees we make. If we want to keep following the rule in the future, I have the feeling that we will need to change the return value at some point.
Some other possibilities that would go further than the above rule:
The rule might also whether it is implemented under the hood as an ExtensionArray or not. But this is course not necessarily clear to the user (it might be hidden somewhat), and @shoyer I assume that your proposal for a rule would be to have a clear expectation for the user? (so all the above rules that I mention would be more complicated). |
Would it make sense to make to consider having I imagine we could pretty quickly whip up an ExtensionArray for each NumPy dtype that simply defers to the underlying NumPy array for every operation. Internally, we could recognize these I guess we could also just clearly document: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, had the time to go through the full diff, and added some comments (and thanks a lot for the PR Tom!)
Additional comments:
- to what extent do we also want to mention
np.(as)array(..)
as alternative to.to_numpy()
in the docs? - I think we need to keep in some places the explanation about
values
, since it is not yet going away and users will encounter it in code (or at least mention something about it existing for historical reasons andDataFrame.values
being equivalent toto_numpy
, but not recommended anymore) - I would update the
Series.values
docstring as well, to add a note about its recommendation status, and to refer toarray
/to_numpy
(similar as what you did forDataFrame/Index.values
)
casting every value to a Python object. | ||
|
||
For ``df``, our :class:`DataFrame` of all floating-point values, | ||
:meth:`DataFrame.to_numpy` is fast and doesn't require copying data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reading this, should we have a copy
keyword to be able to force a copy? (can be added later)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good idea. Don't care whether we do it here or later.
I think we'll also want (type-specific?) keywords for controlling how the conversion is done (ndarray of Timestamps vs. datetime64[ns] for example). I'm not sure what the eventual signature should be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, if we decide to go for object array of Timestamps for datetimetz as default, it would be good to have the option to return datetime64
Regarding copy, would it actually make sense to have copy=True
the default? Then you have at least a consistent default (it is never a view on the data)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think copy=True
is a good default since it's the only one that can be ensured for all cases.
period (time spans) :class:`PeriodDtype` :class:`Period` :class:`arrays.PeriodArray` :ref:`timeseries.periods` | ||
sparse :class:`SparseDtype` (none) :class:`arrays.SparseArray` :ref:`sparse` | ||
intervals :class:`IntervalDtype` :class:`Interval` :class:`arrays.IntervalArray` :ref:`advanced.intervalindex` | ||
nullable integer :clsas:`Int64Dtype`, ... (none) :class:`arrays.IntegerArray` :ref:`integer_na` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where does this 'integer_na' point to? (I don't seem to find it in the docs)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#23617. I'm aiming for eventual consistency on the docs :)
pytest.skip("No index type for {}".format(array.dtype)) | ||
|
||
result = thing.to_numpy() | ||
tm.assert_numpy_array_equal(result, expected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also test for the case where it is not a copy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean here? (in case you missed it, the first case is a regular ndarray, so that won't be a copy. Though perhaps you're saying I should assert this for that case?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's what I meant. If we return a view, and people can rely on it, we should test it.
@shoyer I think the main problem is that would basically already get rid off the block-based internals? And I think that was more a change we were contemplating for 2.0 instead of 1.0. (apart from the additional work it would mean on the short term) |
This is all I was thinking of. |
Hmm, having |
What do people think about doing the remaining items as followup?
I think 2 and 3 will be easier to think about once we have a DatetimeArray. |
certainly fine as a followup. I am not sure 3 is actually a blocker for 0.24.0 (though 2 is) and 1) as-is is fine for now |
ping on green. |
All green. |
thanks @TomAugspurger very nice! |
Closes #19954.
TODO:
.values
in the docs to use.array
or.to_numpy()
.values
and the rest