API: Public data for Series and Index: .array and .to_numpy() #23623
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -46,8 +46,8 @@ of elements to display is five, but you may pass a custom number. | |
|
||
.. _basics.attrs: | ||
|
||
Attributes and the raw ndarray(s) | ||
--------------------------------- | ||
Attributes and Underlying Data | ||
------------------------------ | ||
|
||
pandas objects have a number of attributes enabling you to access the metadata | ||
|
||
|
@@ -65,14 +65,28 @@ Note, **these attributes can be safely assigned to**! | |
df.columns = [x.lower() for x in df.columns] | ||
df | ||
|
||
To get the actual data inside a data structure, one need only access the | ||
**values** property: | ||
Pandas objects (:class:`Index`, :class:`Series`, :class:`DataFrame`) can be | ||
thought of as containers for arrays, which hold the actual data and do the | ||
actual computation. For many types, the underlying array is a | ||
:class:`numpy.ndarray`. However, pandas and third-party libraries may *extend* | |
NumPy's type system to add support for custom arrays | ||
(see :ref:`basics.dtypes`). | ||
|
||
To get the actual data inside an :class:`Index` or :class:`Series`, use | |
the **array** property: | |
|
||
.. ipython:: python | ||
|
||
s.values | ||
df.values | ||
|
||
wp.values | ||
s.array | ||
s.index.array | ||
|
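To illustrate the distinction, here is a minimal sketch, assuming a pandas release that ships both accessors (the variable names are made up for this example). ``.array`` hands back the array pandas stores internally, whose exact class depends on the dtype and the pandas version, while ``.to_numpy()`` always produces a :class:`numpy.ndarray`.

```python
import numpy as np
import pandas as pd

# Hypothetical example data; not taken from the documentation being edited.
s = pd.Series([1, 2, 3], index=list("abc"))
cat = pd.Series(["x", "y", "x"], dtype="category")

# .array returns the backing array. For NumPy-backed data this is a thin
# pandas wrapper whose class name differs between pandas versions.
print(type(s.array).__name__)
print(type(s.index.array).__name__)

# For an extension dtype, .array returns the extension array itself.
print(type(cat.array).__name__)

# .to_numpy() always returns a plain NumPy ndarray.
as_ndarray = s.to_numpy()
print(type(as_ndarray).__name__)
```

The wrapper class names are printed rather than stated because they have changed across pandas releases.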
||
|
||
Getting the "raw data" inside a :class:`DataFrame` is possibly a bit more | ||
complex. When your ``DataFrame`` only has a single data type for all the | ||
columns, :meth:`DataFrame.to_numpy` will return the underlying data: | |
|
||
.. ipython:: python | ||
|
||
df.to_numpy() | ||
|
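A small sketch of the single-dtype case described above, assuming a frame whose columns are all ``float64`` (``df_float`` is a hypothetical name, not the ``df`` used elsewhere in this document):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one dtype (float64) across all columns.
df_float = pd.DataFrame({"one": [1.0, 2.0, 3.0], "two": [4.0, 5.0, 6.0]})

arr = df_float.to_numpy()

# The result is a 2-D ndarray with the columns' shared dtype.
print(arr.dtype)   # float64
print(arr.shape)   # (3, 2)
```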
||
If a DataFrame or Panel contains homogeneously-typed data, the ndarray can | ||
actually be modified in-place, and the changes will be reflected in the data | ||
|
@@ -541,7 +555,7 @@ will exclude NAs on Series input by default: | |
.. ipython:: python | ||
|
||
np.mean(df['one']) | ||
np.mean(df['one'].values) | ||
np.mean(df['one'].array) | ||
|
||
|
||
:meth:`Series.nunique` will return the number of unique non-NA values in a | ||
Series: | ||
|
@@ -839,7 +853,7 @@ Series operation on each column or row: | |
|
||
tsdf = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'], | ||
index=pd.date_range('1/1/2000', periods=10)) | ||
tsdf.values[3:7] = np.nan | ||
tsdf.iloc[3:7] = np.nan | ||
|
||
.. ipython:: python | ||
|
||
|
@@ -1875,17 +1889,29 @@ dtypes | |
------ | ||
|
||
For the most part, pandas uses NumPy arrays and dtypes for Series or individual | ||
columns of a DataFrame. The main types allowed in pandas objects are ``float``, | ||
``int``, ``bool``, and ``datetime64[ns]`` (note that NumPy does not support | ||
timezone-aware datetimes). | ||
|
||
In addition to NumPy's types, pandas :ref:`extends <extending.extension-types>` | ||
NumPy's type-system for a few cases. | ||
|
||
* :ref:`Categorical <categorical>` | ||
* :ref:`Datetime with Timezone <timeseries.timezone_series>` | ||
* :ref:`Period <timeseries.periods>` | ||
* :ref:`Interval <indexing.intervallindex>` | ||
columns of a DataFrame. NumPy provides support for ``float``, | ||
``int``, ``bool``, ``timedelta64[ns]`` and ``datetime64[ns]`` (note that NumPy | ||
does not support timezone-aware datetimes). | ||
|
||
Pandas and third-party libraries *extend* NumPy's type system in a few places. | ||
This section describes the extensions pandas has made internally. | ||
See :ref:`extending.extension-types` for how to write your own extension that | ||
works with pandas. See :ref:`ecosystem.extensions` for a list of third-party | ||
libraries that have implemented an extension. | ||
|
||
The following table lists all of pandas' extension types. See the respective | |
documentation sections for more on each type. | |
|
||
=================== ========================= ================== ============================= ============================= | ||
Kind of Data        Data Type                 Scalar             Array                         Documentation | |
=================== ========================= ================== ============================= ============================= | |
tz-aware datetime   :class:`DatetimeTZDtype`  :class:`Timestamp` :class:`arrays.DatetimeArray` :ref:`timeseries.timezone` | |
Categorical         :class:`CategoricalDtype` (none)             :class:`Categorical`          :ref:`categorical` | |
period (time spans) :class:`PeriodDtype`      :class:`Period`    :class:`arrays.PeriodArray`   :ref:`timeseries.periods` | |
sparse              :class:`SparseDtype`      (none)             :class:`arrays.SparseArray`   :ref:`sparse` | |
intervals           :class:`IntervalDtype`    :class:`Interval`  :class:`arrays.IntervalArray` :ref:`advanced.intervalindex` | |
nullable integer    :class:`Int64Dtype`, ...  (none)             :class:`arrays.IntegerArray`  :ref:`integer_na` | |
Review comment: where does this 'integer_na' point to? (I don't seem to find it in the docs)
Reply: #23617. I'm aiming for eventual consistency on the docs :) |
||
=================== ========================= ================== ============================= ============================= | ||
|
||
Pandas uses the ``object`` dtype for storing strings. | ||
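As a rough illustration of the extension types listed above, the sketch below builds one small Series per kind of data and prints its dtype and the class of its backing array. It assumes a reasonably recent pandas release. The dictionary and variable names are hypothetical, and the exact dtype strings are printed rather than stated because they can vary between versions.

```python
import pandas as pd

# One hypothetical Series per extension type from the table.
examples = {
    "tz-aware datetime": pd.Series(
        pd.date_range("2013-01-01", periods=3, tz="US/Eastern")
    ),
    "categorical": pd.Series(list("abc"), dtype="category"),
    "period": pd.Series(pd.period_range("2013-01", periods=3, freq="M")),
    "sparse": pd.Series(pd.arrays.SparseArray([0, 0, 1])),
    "intervals": pd.Series(pd.interval_range(start=0, end=3)),
    "nullable integer": pd.Series([1, None, 3], dtype="Int64"),
}

# Show the dtype and the class of the underlying array for each kind.
for kind, ser in examples.items():
    print(f"{kind:18} dtype={ser.dtype!s:30} array={type(ser.array).__name__}")
```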
|
||
|
@@ -1989,7 +2015,7 @@ force some *upcasting*. | |
|
||
.. ipython:: python | ||
|
||
df3.values.dtype | ||
df3.to_numpy().dtype | ||
|
||
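The upcasting behaviour can be sketched with two small frames (hypothetical names, not the ``df3`` from the surrounding documentation). When the columns have different dtypes, ``to_numpy()`` has to pick one dtype that can hold every value:

```python
import numpy as np
import pandas as pd

# Integer and float columns: the common dtype is float64.
mixed_numeric = pd.DataFrame({"ints": [1, 2], "floats": [0.5, 1.5]})
numeric_dtype = mixed_numeric.to_numpy().dtype
print(numeric_dtype)   # float64

# Adding a string column forces the most general dtype, object.
with_strings = mixed_numeric.assign(labels=["a", "b"])
object_dtype = with_strings.to_numpy().dtype
print(object_dtype)    # object
```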
|
||
astype | ||
~~~~~~ | ||
|
@@ -2211,11 +2237,11 @@ dtypes: | |
'float64': np.arange(4.0, 7.0), | ||
'bool1': [True, False, True], | ||
'bool2': [False, True, False], | ||
'dates': pd.date_range('now', periods=3).values, | ||
'dates': pd.date_range('now', periods=3), | ||
'category': pd.Series(list("ABC")).astype('category')}) | ||
df['tdeltas'] = df.dates.diff() | ||
df['uint64'] = np.arange(3, 6).astype('u8') | ||
df['other_dates'] = pd.date_range('20130101', periods=3).values | ||
df['other_dates'] = pd.date_range('20130101', periods=3) | ||
df['tz_aware_dates'] = pd.date_range('20130101', periods=3, tz='US/Eastern') | ||
df | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2436,22 +2436,22 @@ a convert on an aware stamp. | |
|
||
.. note:: | ||
|
||
Using the ``.values`` accessor on a ``Series``, returns an NumPy array of the data. | ||
Using :meth:`Series.to_numpy` on a ``Series`` returns a NumPy array of the data. | |
These values are converted to UTC, as NumPy does not currently support timezones (even though it is *printing* in the local timezone!). | ||
Review comment: This might need to be updated depending on the discussion of the default return value (object array of timestamps vs. UTC-converted naive datetime64). Could also use |
||
|
||
.. ipython:: python | ||
|
||
s_naive.values | ||
s_aware.values | ||
s_naive.to_numpy() | ||
s_aware.to_numpy() | ||
|
||
Further note that once converted to a NumPy array, these values lose their timezone information. | |
|
||
.. ipython:: python | ||
|
||
pd.Series(s_aware.values) | ||
pd.Series(s_aware.to_numpy()) | ||
|
||
However, these can be easily converted: | ||
|
||
.. ipython:: python | ||
|
||
pd.Series(s_aware.values).dt.tz_localize('UTC').dt.tz_convert('US/Eastern') | ||
pd.Series(s_aware.to_numpy()).dt.tz_localize('UTC').dt.tz_convert('US/Eastern') |
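Because the default return value for timezone-aware data was still under discussion in this PR, the sketch below passes ``dtype`` explicitly instead of relying on a default. It assumes a released pandas version in which ``Series.to_numpy`` accepts a ``dtype`` argument. ``s_aware`` is rebuilt here so the snippet is self-contained, and the other names are hypothetical.

```python
import numpy as np
import pandas as pd

s_aware = pd.Series(pd.date_range("2013-01-01", periods=3, tz="US/Eastern"))

# dtype=object keeps each element as a Timestamp carrying its timezone.
as_objects = s_aware.to_numpy(dtype=object)
print(as_objects[0])

# dtype="datetime64[ns]" yields naive values normalized to UTC.
as_utc = s_aware.to_numpy(dtype="datetime64[ns]")
print(as_utc[0])

# The UTC values can be localized and converted back to the original zone.
restored = pd.Series(as_utc).dt.tz_localize("UTC").dt.tz_convert("US/Eastern")
print((restored == s_aware).all())
```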
Review comment: Reading this, should we have a `copy` keyword to be able to force a copy? (can be added later)

Reply: This is a good idea. Don't care whether we do it here or later. I think we'll also want (type-specific?) keywords for controlling how the conversion is done (ndarray of Timestamps vs. datetime64[ns] for example). I'm not sure what the eventual signature should be.

Reply: Yeah, if we decide to go for object array of Timestamps for datetimetz as default, it would be good to have the option to return datetime64. Regarding copy, would it actually make sense to have `copy=True` the default? Then you have at least a consistent default (it is never a view on the data).

Reply: Yes, I think `copy=True` is a good default since it's the only one that can be ensured for all cases.
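For reference, a minimal sketch of forcing a copy, assuming a released pandas version in which `to_numpy` accepts a `copy` keyword (the variable names are hypothetical). Passing `copy=True` explicitly avoids depending on whatever the default turns out to be:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])

# Request an independent array; writes to it must not reach the Series.
independent = s.to_numpy(copy=True)
independent[0] = 99

print(s.iloc[0])       # 1
print(independent[0])  # 99
```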