API: Public data for Series and Index: .array and .to_numpy() #23623

Merged: 28 commits, Nov 29, 2018 (changes shown from 21 commits)

Commits:
- 7959eb6  API: Public data attributes for EA-backed containers (TomAugspurger, Oct 30, 2018)
- 5b15894  update (TomAugspurger, Nov 6, 2018)
- 4781a36  Merge remote-tracking branch 'upstream/master' into public-data (TomAugspurger, Nov 11, 2018)
- 15cc0b7  more notes (TomAugspurger, Nov 11, 2018)
- 888853f  update (TomAugspurger, Nov 11, 2018)
- 2cfca30  Merge remote-tracking branch 'upstream/master' into public-data (TomAugspurger, Nov 11, 2018)
- 3e76f02  Merge remote-tracking branch 'upstream/master' into public-data (TomAugspurger, Nov 13, 2018)
- 7e43cf0  Squashed commit of the following: (TomAugspurger, Nov 13, 2018)
- bceb612  DOC: updated docs (TomAugspurger, Nov 13, 2018)
- c19c9bb  Added DataFrame.to_numpy (TomAugspurger, Nov 17, 2018)
- fe813ff  Merge remote-tracking branch 'upstream/master' into public-data (TomAugspurger, Nov 17, 2018)
- 8619790  clean (TomAugspurger, Nov 17, 2018)
- 639b6fb  Merge remote-tracking branch 'upstream/master' into public-data (TomAugspurger, Nov 21, 2018)
- 95f19bc  doc update (TomAugspurger, Nov 21, 2018)
- 3292e43  Merge remote-tracking branch 'upstream/master' into public-data (TomAugspurger, Nov 21, 2018)
- 5a905ab  update (TomAugspurger, Nov 21, 2018)
- 1e6eed4  fixed doctest (TomAugspurger, Nov 21, 2018)
- 4545d93  Merge remote-tracking branch 'upstream/master' into public-data (TomAugspurger, Nov 26, 2018)
- 2d7abb4  Merge remote-tracking branch 'upstream/master' into public-data (TomAugspurger, Nov 27, 2018)
- a7a13a0  Fixed array error in docs (TomAugspurger, Nov 27, 2018)
- c0a63c0  update docs (TomAugspurger, Nov 27, 2018)
- 661b9eb  Fixup for feedback (TomAugspurger, Nov 28, 2018)
- 52f5407  Merge remote-tracking branch 'upstream/master' into public-data (TomAugspurger, Nov 28, 2018)
- 566a027  skip only on index box (TomAugspurger, Nov 28, 2018)
- 062c49f  Series.values (TomAugspurger, Nov 28, 2018)
- 78e5824  Merge remote-tracking branch 'upstream/master' into public-data (TomAugspurger, Nov 28, 2018)
- e805c26  remove stale todo (TomAugspurger, Nov 28, 2018)
- f9eee65  Merge remote-tracking branch 'upstream/master' into public-data (TomAugspurger, Nov 29, 2018)
31 changes: 29 additions & 2 deletions doc/source/10min.rst
@@ -113,13 +113,40 @@ Here is how to view the top and bottom rows of the frame:
df.head()
df.tail(3)

Display the index, columns, and the underlying NumPy data:
Display the index and columns:

.. ipython:: python

df.index
df.columns
df.values

:meth:`DataFrame.to_numpy` gives a NumPy representation of the underlying data.
Note that this can be an expensive operation when your :class:`DataFrame` has
columns with different data types, which comes down to a fundamental difference
between pandas and NumPy: **NumPy arrays have one dtype for the entire array,
while pandas DataFrames have one dtype per column**. When you call
:meth:`DataFrame.to_numpy`, pandas will find the NumPy dtype that can hold *all*
of the dtypes in the DataFrame. This may end up being ``object``, which requires
casting every value to a Python object.

For ``df``, our :class:`DataFrame` of all floating-point values,
:meth:`DataFrame.to_numpy` is fast and doesn't require copying data.
[Review comment: Member]
Reading this, should we have a ``copy`` keyword to be able to force a copy? (Can be added later.)

[Review comment: Contributor Author]
This is a good idea. I don't care whether we do it here or later.

I think we'll also want (type-specific?) keywords for controlling how the conversion is done (ndarray of Timestamps vs. ``datetime64[ns]``, for example). I'm not sure what the eventual signature should be.

[Review comment: Member]
Yeah, if we decide to go for an object array of Timestamps as the default for datetimetz, it would be good to have the option to return ``datetime64``.

Regarding ``copy``: would it actually make sense to have ``copy=True`` as the default? Then you at least have a consistent default (it is never a view on the data).

[Review comment: Contributor Author]
Yes, I think ``copy=True`` is a good default, since it's the only one that can be ensured for all cases.


.. ipython:: python

df.to_numpy()

For ``df2``, the :class:`DataFrame` with multiple dtypes,
:meth:`DataFrame.to_numpy` is relatively expensive.

.. ipython:: python

df2.to_numpy()

.. note::

:meth:`DataFrame.to_numpy` does *not* include the index or column
labels in the output.
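
To make the dtype rules above concrete, here is a minimal sketch (my own
illustration, not part of the PR's diff) contrasting a homogeneous and a
mixed-dtype frame:

.. code-block:: python

   import numpy as np
   import pandas as pd

   homogeneous = pd.DataFrame(np.random.randn(3, 2), columns=['a', 'b'])
   homogeneous.to_numpy().dtype   # dtype('float64'): one dtype covers every column

   mixed = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
   mixed.to_numpy().dtype         # dtype('O'): only object can hold ints and strings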

:func:`~DataFrame.describe` shows a quick statistic summary of your data:

2 changes: 1 addition & 1 deletion doc/source/advanced.rst
@@ -188,7 +188,7 @@ highly performant. If you want to see only the used levels, you can use the

.. ipython:: python

df[['foo', 'qux']].columns.values
df[['foo', 'qux']].columns.to_numpy()

# for a specific level
df[['foo', 'qux']].columns.get_level_values(0)
72 changes: 49 additions & 23 deletions doc/source/basics.rst
@@ -46,8 +46,8 @@ of elements to display is five, but you may pass a custom number.

.. _basics.attrs:

Attributes and the raw ndarray(s)
---------------------------------
Attributes and Underlying Data
------------------------------

pandas objects have a number of attributes enabling you to access the metadata

@@ -65,14 +65,28 @@ Note, **these attributes can be safely assigned to**!
df.columns = [x.lower() for x in df.columns]
df

To get the actual data inside a data structure, one need only access the
**values** property:
Pandas objects (:class:`Index`, :class:`Series`, :class:`DataFrame`) can be
thought of as containers for arrays, which hold the actual data and do the
actual computation. For many types, the underlying array is a
:class:`numpy.ndarray`. However, pandas and 3rd party libraries may *extend*
NumPy's type system to add support for custom arrays
(see :ref:`basics.dtypes`).

To get the actual data inside a :class:`Index` or :class:`Series`, use
the **array** property

.. ipython:: python

s.values
df.values
wp.values
s.array
s.index.array

Getting the "raw data" inside a :class:`DataFrame` is possibly a bit more
complex. When your ``DataFrame`` only has a single data type for all the
columns, :meth:`DataFrame.to_numpy` will return the underlying data:

.. ipython:: python

df.to_numpy()

If a DataFrame or Panel contains homogeneously-typed data, the ndarray can
actually be modified in-place, and the changes will be reflected in the data
@@ -541,7 +555,7 @@ will exclude NAs on Series input by default:
.. ipython:: python

np.mean(df['one'])
np.mean(df['one'].values)
np.mean(df['one'].array)

:meth:`Series.nunique` will return the number of unique non-NA values in a
Series:
@@ -839,7 +853,7 @@ Series operation on each column or row:

tsdf = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'],
                    index=pd.date_range('1/1/2000', periods=10))
tsdf.values[3:7] = np.nan
tsdf.iloc[3:7] = np.nan

.. ipython:: python

@@ -1875,17 +1889,29 @@ dtypes
------

For the most part, pandas uses NumPy arrays and dtypes for Series or individual
columns of a DataFrame. The main types allowed in pandas objects are ``float``,
``int``, ``bool``, and ``datetime64[ns]`` (note that NumPy does not support
timezone-aware datetimes).

In addition to NumPy's types, pandas :ref:`extends <extending.extension-types>`
NumPy's type-system for a few cases.

* :ref:`Categorical <categorical>`
* :ref:`Datetime with Timezone <timeseries.timezone_series>`
* :ref:`Period <timeseries.periods>`
* :ref:`Interval <indexing.intervallindex>`
columns of a DataFrame. NumPy provides support for ``float``,
``int``, ``bool``, ``timedelta64[ns]`` and ``datetime64[ns]`` (note that NumPy
does not support timezone-aware datetimes).

Pandas and third-party libraries *extend* NumPy's type system in a few places.
This section describes the extensions pandas has made internally.
See :ref:`extending.extension-types` for how to write your own extension that
works with pandas. See :ref:`ecosystem.extensions` for a list of third-party
libraries that have implemented an extension.

The following table lists all of pandas extension types. See the respective
documentation sections for more on each type.

=================== ========================= ================== ============================= =============================
Kind of Data        Data Type                 Scalar             Array                         Documentation
=================== ========================= ================== ============================= =============================
tz-aware datetime   :class:`DatetimeArray`    :class:`Timestamp` :class:`arrays.DatetimeArray` :ref:`timeseries.timezone`
Categorical         :class:`CategoricalDtype` (none)             :class:`Categorical`          :ref:`categorical`
period (time spans) :class:`PeriodDtype`      :class:`Period`    :class:`arrays.PeriodArray`   :ref:`timeseries.periods`
sparse              :class:`SparseDtype`      (none)             :class:`arrays.SparseArray`   :ref:`sparse`
intervals           :class:`IntervalDtype`    :class:`Interval`  :class:`arrays.IntervalArray` :ref:`advanced.intervalindex`
nullable integer    :class:`Int64Dtype`, ...  (none)             :class:`arrays.IntegerArray`  :ref:`integer_na`
=================== ========================= ================== ============================= =============================

[Review comment: Member]
Where does this ``integer_na`` reference point to? (I don't seem to find it in the docs.)

[Review comment: Contributor Author]
#23617. I'm aiming for eventual consistency on the docs :)

Pandas uses the ``object`` dtype for storing strings.
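
As a quick illustration of one row of the table above, a minimal sketch (my
own example, not from the PR's diff) using the nullable-integer extension
type:

.. code-block:: python

   import pandas as pd

   s = pd.Series([1, 2, None], dtype='Int64')  # Int64Dtype, an ExtensionDtype
   s.dtype       # Int64Dtype()
   s.array       # IntegerArray; the missing value stays a first-class NA
   s.to_numpy()  # object-dtype ndarray: array([1, 2, nan], dtype=object)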

@@ -1989,7 +2015,7 @@ force some *upcasting*.

.. ipython:: python

df3.values.dtype
df3.to_numpy().dtype

astype
~~~~~~
@@ -2211,11 +2237,11 @@ dtypes:
                   'float64': np.arange(4.0, 7.0),
                   'bool1': [True, False, True],
                   'bool2': [False, True, False],
                   'dates': pd.date_range('now', periods=3).values,
                   'dates': pd.date_range('now', periods=3),
                   'category': pd.Series(list("ABC")).astype('category')})
df['tdeltas'] = df.dates.diff()
df['uint64'] = np.arange(3, 6).astype('u8')
df['other_dates'] = pd.date_range('20130101', periods=3).values
df['other_dates'] = pd.date_range('20130101', periods=3)
df['tz_aware_dates'] = pd.date_range('20130101', periods=3, tz='US/Eastern')
df

4 changes: 2 additions & 2 deletions doc/source/categorical.rst
@@ -178,7 +178,7 @@ are consistent among all columns.

To perform table-wise conversion, where all labels in the entire ``DataFrame`` are used as
categories for each column, the ``categories`` parameter can be determined programmatically by
``categories = pd.unique(df.values.ravel())``.
``categories = pd.unique(df.to_numpy().ravel())``.
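
Spelled out, a minimal sketch of that table-wide conversion (the frame here is
my own illustration, not taken from the PR):

.. code-block:: python

   import pandas as pd

   df = pd.DataFrame({'A': list('abc'), 'B': list('bcd')})
   categories = pd.unique(df.to_numpy().ravel())  # array(['a', 'b', 'c', 'd'], ...)
   dtype = pd.api.types.CategoricalDtype(categories=categories)
   df_cat = df.astype(dtype)  # both columns now share one set of categories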

If you already have ``codes`` and ``categories``, you can use the
:func:`~pandas.Categorical.from_codes` constructor to save the factorize step
@@ -955,7 +955,7 @@ Use ``.astype`` or ``union_categoricals`` to get ``category`` result.
pd.concat([s1, s3])

pd.concat([s1, s3]).astype('category')
union_categoricals([s1.values, s3.values])
union_categoricals([s1.array, s3.array])


Following table summarizes the results of ``Categoricals`` related concatenations.
40 changes: 39 additions & 1 deletion doc/source/dsintro.rst
@@ -137,7 +137,43 @@ However, operations such as slicing will also slice the index.
s[[4, 3, 1]]
np.exp(s)

We will address array-based indexing in a separate :ref:`section <indexing>`.
.. note::

We will address array-based indexing like ``s[[4, 3, 1]]`` in the
:ref:`section on indexing <indexing>`.

Like a NumPy array, a pandas Series has a :attr:`~Series.dtype`.

.. ipython:: python

s.dtype

This is often a NumPy dtype. However, pandas and 3rd-party libraries
extend NumPy's type system in a few places, in which case the dtype would
be an :class:`~pandas.api.extensions.ExtensionDtype`. Some examples within
pandas are :ref:`categorical` and :ref:`integer_na`. See :ref:`basics.dtypes`
for more.

If you need the actual array backing a ``Series``, use :attr:`Series.array`.

.. ipython:: python

s.array

Again, this is often a NumPy array, but may instead be an
:class:`~pandas.api.extensions.ExtensionArray`. See :ref:`basics.dtypes` for more.
Accessing the array can be useful when you need to do some operation without the
index (to disable :ref:`automatic alignment <dsintro.alignment>`, for example).
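
For instance, a small sketch (my own example, not part of the diff) of the
label alignment that the index-free array avoids:

.. code-block:: python

   import pandas as pd

   a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
   b = pd.Series([10, 20, 30], index=['z', 'y', 'x'])

   a + b                        # aligned on labels: x=31, y=22, z=13
   a.to_numpy() + b.to_numpy()  # positional, no alignment: array([11, 22, 33])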

While Series is ndarray-like, if you need an *actual* ndarray, then use
:meth:`Series.to_numpy`.

.. ipython:: python

s.to_numpy()

Even if the Series is backed by an :class:`~pandas.api.extensions.ExtensionArray`,
:meth:`Series.to_numpy` will return a NumPy ndarray.
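
A brief sketch of that distinction (my own example, not from the diff), using
a categorical-backed Series:

.. code-block:: python

   import pandas as pd

   cat = pd.Series(['a', 'b', 'a'], dtype='category')
   cat.array       # a Categorical -- an ExtensionArray, not an ndarray
   cat.to_numpy()  # array(['a', 'b', 'a'], dtype=object) -- always an ndarray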

Series is dict-like
~~~~~~~~~~~~~~~~~~~
@@ -617,6 +653,8 @@ slicing, see the :ref:`section on indexing <indexing>`. We will address the
fundamentals of reindexing / conforming to new sets of labels in the
:ref:`section on reindexing <basics.reindexing>`.

.. _dsintro.alignment:

Data alignment and arithmetic
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

8 changes: 5 additions & 3 deletions doc/source/enhancingperf.rst
@@ -221,7 +221,7 @@ the rows, applying our ``integrate_f_typed``, and putting this in the zeros array

You can **not pass** a ``Series`` directly as a ``ndarray`` typed parameter
to a Cython function. Instead pass the actual ``ndarray`` using the
``.values`` attribute of the ``Series``. The reason is that the Cython
:meth:`Series.to_numpy`. The reason is that the Cython
definition is specific to an ndarray and not the passed ``Series``.

So, do not do this:
@@ -230,11 +230,13 @@ the rows, applying our ``integrate_f_typed``, and putting this in the zeros array

apply_integrate_f(df['a'], df['b'], df['N'])

But rather, use ``.values`` to get the underlying ``ndarray``:
But rather, use :meth:`Series.to_numpy` to get the underlying ``ndarray``:

.. code-block:: python

apply_integrate_f(df['a'].values, df['b'].values, df['N'].values)
apply_integrate_f(df['a'].to_numpy(),
                  df['b'].to_numpy(),
                  df['N'].to_numpy())

.. note::

2 changes: 1 addition & 1 deletion doc/source/extending.rst
@@ -186,7 +186,7 @@ Instead, you should detect these cases and return ``NotImplemented``.
When pandas encounters an operation like ``op(Series, ExtensionArray)``, pandas
will

1. unbox the array from the ``Series`` (roughly ``Series.values``)
1. unbox the array from the ``Series`` (``Series.array``)
2. call ``result = op(values, ExtensionArray)``
3. re-box the result in a ``Series``
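
In rough pseudocode, the cycle above might look like the following sketch
(``dispatch_op`` is a hypothetical name for illustration, not a pandas
function):

.. code-block:: python

   import pandas as pd

   def dispatch_op(op, series, other):
       values = series.array                 # 1. unbox the backing array
       result = op(values, other)            # 2. operate on the raw arrays
       return pd.Series(result,              # 3. re-box, keeping the metadata
                        index=series.index, name=series.name)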

2 changes: 1 addition & 1 deletion doc/source/indexing.rst
@@ -190,7 +190,7 @@ columns.

.. ipython:: python

df.loc[:,['B', 'A']] = df[['A', 'B']].values
df.loc[:,['B', 'A']] = df[['A', 'B']].to_numpy()
df[['A', 'B']]


2 changes: 1 addition & 1 deletion doc/source/missing_data.rst
@@ -678,7 +678,7 @@ Replacing more than one value is possible by passing a list.

.. ipython:: python

df00 = df.values[0, 0]
df00 = df.iloc[0, 0]
df.replace([1.5, df00], [np.nan, 'a'])
df[1].dtype

14 changes: 7 additions & 7 deletions doc/source/reshaping.rst
@@ -27,12 +27,12 @@ Reshaping by pivoting DataFrame objects
tm.N = 3

def unpivot(frame):
    N, K = frame.shape
    data = {'value': frame.values.ravel('F'),
            'variable': np.asarray(frame.columns).repeat(N),
            'date': np.tile(np.asarray(frame.index), K)}
    columns = ['date', 'variable', 'value']
    return pd.DataFrame(data, columns=columns)
    N, K = frame.shape
    data = {'value': frame.to_numpy().ravel('F'),
            'variable': np.asarray(frame.columns).repeat(N),
            'date': np.tile(np.asarray(frame.index), K)}
    columns = ['date', 'variable', 'value']
    return pd.DataFrame(data, columns=columns)

df = unpivot(tm.makeTimeDataFrame())

@@ -54,7 +54,7 @@ For the curious here is how the above ``DataFrame`` was created:

def unpivot(frame):
    N, K = frame.shape
    data = {'value': frame.values.ravel('F'),
    data = {'value': frame.to_numpy().ravel('F'),
            'variable': np.asarray(frame.columns).repeat(N),
            'date': np.tile(np.asarray(frame.index), K)}
    return pd.DataFrame(data, columns=['date', 'variable', 'value'])
4 changes: 2 additions & 2 deletions doc/source/text.rst
@@ -317,8 +317,8 @@ All one-dimensional list-likes can be combined in a list-like container (including

s
u
s.str.cat([u.values,
           u.index.astype(str).values], na_rep='-')
s.str.cat([u.array,
           u.index.astype(str).array], na_rep='-')

All elements must match in length to the calling ``Series`` (or ``Index``), except those having an index if ``join`` is not None:

10 changes: 5 additions & 5 deletions doc/source/timeseries.rst
@@ -2436,22 +2436,22 @@ a convert on an aware stamp.

.. note::

Using the ``.values`` accessor on a ``Series``, returns an NumPy array of the data.
Using :meth:`Series.to_numpy` on a ``Series`` returns a NumPy array of the data.
These values are converted to UTC, as NumPy does not currently support timezones (even though it is *printing* in the local timezone!).
[Review comment: Member]
This might need to be updated depending on the discussion of the default return value (object array of Timestamps vs. UTC-converted naive ``datetime64``).

Could also use ``np.asarray`` to make the point here, in case we go for the object array.


.. ipython:: python

s_naive.values
s_aware.values
s_naive.to_numpy()
s_aware.to_numpy()
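
As the review comment above suggests, ``np.asarray`` makes the same point; a
small sketch (my own example, assuming the 0.24-era behavior where tz-aware
values convert to UTC):

.. code-block:: python

   import numpy as np
   import pandas as pd

   s_aware = pd.Series(pd.date_range('20130101', periods=3, tz='US/Eastern'))
   np.asarray(s_aware)  # datetime64[ns] values, converted to UTC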

Further note that once converted to a NumPy array these values lose their tz information.

.. ipython:: python

pd.Series(s_aware.values)
pd.Series(s_aware.to_numpy())

However, these can be easily converted:

.. ipython:: python

pd.Series(s_aware.values).dt.tz_localize('UTC').dt.tz_convert('US/Eastern')
pd.Series(s_aware.to_numpy()).dt.tz_localize('UTC').dt.tz_convert('US/Eastern')