NumPyBackedExtensionArray #24227

Merged
merged 16 commits into from
Dec 28, 2018
Changes from 5 commits
1 change: 1 addition & 0 deletions doc/source/api.rst
@@ -2681,6 +2681,7 @@ objects.
api.extensions.register_index_accessor
api.extensions.ExtensionDtype
api.extensions.ExtensionArray
arrays.PandasArray

.. This is to prevent warnings in the doc build. We don't want to encourage
.. these methods.
41 changes: 32 additions & 9 deletions doc/source/basics.rst
@@ -71,8 +71,10 @@ the **array** property
s.array
s.index.array

Depending on the data type (see :ref:`basics.dtypes`), :attr:`~Series.array`
be either a NumPy array or an :ref:`ExtensionArray <extending.extension-type>`.
:attr:`~Series.array` will always be an :class:`~pandas.api.extensions.ExtensionArray`.
The exact details of what an ``ExtensionArray`` is and why pandas uses them are a bit
beyond the scope of this introduction. See :ref:`basics.dtypes` for more.

If you know you need a NumPy array, use :meth:`~Series.to_numpy`
or :meth:`numpy.asarray`.

@@ -81,10 +83,30 @@ or :meth:`numpy.asarray`.
s.to_numpy()
np.asarray(s)

For Series and Indexes backed by NumPy arrays (like we have here), this will
be the same as :attr:`~Series.array`. When the Series or Index is backed by
a :class:`~pandas.api.extension.ExtensionArray`, :meth:`~Series.to_numpy`
may involve copying data and coercing values.
When the Series or Index is backed by
an :class:`~pandas.api.extensions.ExtensionArray`, :meth:`~Series.to_numpy`
may involve copying data and coercing values. See :ref:`basics.dtypes` for more.
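
The copy-and-coerce behaviour described here can be seen with a small check. This is a minimal sketch, assuming a pandas version that ships the nullable ``Int64`` dtype; the variable names are illustrative only and are not part of the PR.

```python
import numpy as np
import pandas as pd

# A NumPy-backed Series: to_numpy() returns an ndarray of the same dtype.
plain = pd.Series([1, 2, 3])
plain_values = plain.to_numpy()

# A Series backed by an ExtensionArray (nullable integers with a missing
# value). NumPy has no integer dtype that can hold a missing value, so the
# values are coerced into a new object-dtype ndarray.
nullable = pd.Series([1, None, 3], dtype="Int64")
nullable_values = nullable.to_numpy()

print(plain_values.dtype)
print(nullable_values.dtype)
```

The second conversion has to build a new array, which is the cost the paragraph above warns about.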

:meth:`~Series.to_numpy` gives some control over the ``dtype`` of the
resulting :class:`ndarray`. For example, consider datetimes with timezones.
NumPy doesn't have a dtype to represent timezone-aware datetimes, so there
are two possibly useful representations:

1. An object-dtype :class:`ndarray` with :class:`Timestamp` objects, each
with the correct ``tz``
2. A ``datetime64[ns]``-dtype :class:`ndarray`, where the values have
been converted to UTC and the timezone discarded

Timezones may be preserved with ``dtype=object``

.. ipython:: python

ser = pd.Series(pd.date_range('2000', periods=2, tz="CET"))
ser.to_numpy(dtype=object)

Or thrown away with ``dtype='datetime64[ns]'``

ser.to_numpy(dtype="datetime64[ns]")
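
For readers who want to confirm what each representation contains, the following sketch inspects both results. It reuses the ``ser`` defined above; ``as_objects`` and ``as_utc`` are illustrative names introduced here, not part of the documentation change.

```python
import numpy as np
import pandas as pd

ser = pd.Series(pd.date_range("2000", periods=2, tz="CET"))

# dtype=object keeps every element as a timezone-aware Timestamp.
as_objects = ser.to_numpy(dtype=object)

# dtype="datetime64[ns]" converts the values to UTC and drops the timezone.
as_utc = ser.to_numpy(dtype="datetime64[ns]")

print(type(as_objects[0]).__name__, as_objects[0].tzinfo)
print(as_utc.dtype, as_utc[0])
```

Midnight CET on 2000-01-01 is one hour ahead of UTC, so the first ``datetime64[ns]`` value falls on the previous calendar day.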

:meth:`~Series.to_numpy` gives some control over the ``dtype`` of the
resulting :class:`ndarray`. For example, consider datetimes with timezones.
@@ -109,7 +131,7 @@ Or thrown away with ``dtype='datetime64[ns]'``

Getting the "raw data" inside a :class:`DataFrame` is possibly a bit more
complex. When your ``DataFrame`` only has a single data type for all the
columns, :attr:`DataFrame.to_numpy` will return the underlying data:
columns, :meth:`DataFrame.to_numpy` will return the underlying data:

.. ipython:: python

@@ -136,8 +158,9 @@ drawbacks:

1. When your Series contains an :ref:`extension type <extending.extension-type>`, it's
unclear whether :attr:`Series.values` returns a NumPy array or the extension array.
:attr:`Series.array` will always return the actual array backing the Series,
while :meth:`Series.to_numpy` will always return a NumPy array.
:attr:`Series.array` will always return an ``ExtensionArray``, and will never
copy data. :meth:`Series.to_numpy` will always return a NumPy array,
potentially at the cost of copying / coercing values.
2. When your DataFrame contains a mixture of data types, :attr:`DataFrame.values` may
involve copying data and coercing values to a common dtype, a relatively expensive
operation. :meth:`DataFrame.to_numpy`, being a method, makes it clearer that the
8 changes: 6 additions & 2 deletions doc/source/dsintro.rst
@@ -146,11 +146,15 @@ If you need the actual array backing a ``Series``, use :attr:`Series.array`.

s.array

Again, this is often a NumPy array, but may instead be a
:class:`~pandas.api.extensions.ExtensionArray`. See :ref:`basics.dtypes` for more.
Accessing the array can be useful when you need to do some operation without the
index (to disable :ref:`automatic alignment <dsintro.alignment>`, for example).

:attr:`Series.array` will always be an :class:`~pandas.api.extensions.ExtensionArray`.
Briefly, an ExtensionArray is a thin wrapper around one or more *concrete* arrays like a
:class:`numpy.ndarray`. Pandas knows how to take an ``ExtensionArray`` and
store it in a ``Series`` or a column of a ``DataFrame``.
See :ref:`basics.dtypes` for more.
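
A short sketch of both points made here: ``Series.array`` is an ``ExtensionArray``, and working on the arrays bypasses index alignment. The variable names are illustrative, and ``np.asarray`` is used for the arithmetic so the example does not rely on operators being defined on the wrapper itself.

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionArray

s = pd.Series([1.0, 2.0, 3.0])
reversed_s = s.iloc[::-1]  # same labels, opposite order

# .array is an ExtensionArray even for plain NumPy-backed data.
arr = s.array
print(type(arr).__name__, isinstance(arr, ExtensionArray))

# The wrapper is described as no-copy; shares_memory reports whether the
# wrapped buffer and the to_numpy() result overlap.
print(np.shares_memory(np.asarray(arr), s.to_numpy()))

# Series arithmetic aligns on the index labels first...
aligned = s + reversed_s
# ...while the backing arrays are combined purely by position.
positional = np.asarray(s.array) + np.asarray(reversed_s.array)

print(aligned.loc[0], positional[0])
```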

While Series is ndarray-like, if you need an *actual* ndarray, then use
:meth:`Series.to_numpy`.

9 changes: 6 additions & 3 deletions doc/source/whatsnew/v0.24.0.rst
@@ -65,8 +65,11 @@ If you need an actual NumPy array, use :meth:`Series.to_numpy` or :meth:`Index.t
idx.to_numpy()
pd.Series(idx).to_numpy()

For Series and Indexes backed by normal NumPy arrays, this will be the same thing (and the same
as ``.values``).
For Series and Indexes backed by normal NumPy arrays, :attr:`Series.array` will return a
new :class:`arrays.PandasArray`, which is a thin (no-copy) wrapper around a
:class:`numpy.ndarray`. :class:`arrays.PandasArray` isn't especially useful on its own,
but it does provide the same interface as any extension array defined in pandas or by
a third-party library.
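
The practical benefit of a shared interface is that code can treat every ``.array`` the same way. The sketch below uses a hypothetical helper, ``describe_array``, that touches only members of the documented ``ExtensionArray`` interface (``dtype``, ``__len__``, ``isna`` and ``take``); it is an illustration, not code from this PR.

```python
import pandas as pd


def describe_array(series):
    """Summarize a Series using only ExtensionArray-interface members."""
    arr = series.array
    return {
        "dtype": str(arr.dtype),
        "length": len(arr),
        "missing": int(arr.isna().sum()),
        "first": arr.take([0])[0],
    }


# The same helper works for NumPy-backed and extension-backed data.
numpy_backed = pd.Series([1.5, None, 3.0])
categorical = pd.Series(["x", None, "x"], dtype="category")

print(describe_array(numpy_backed))
print(describe_array(categorical))
```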

.. ipython:: python

@@ -75,7 +78,7 @@ as ``.values``).
ser.to_numpy()

We haven't removed or deprecated :attr:`Series.values` or :attr:`DataFrame.values`, but we
recommend and using ``.array`` or ``.to_numpy()`` instead.
highly recommend using ``.array`` or ``.to_numpy()`` instead.

See :ref:`Dtypes <basics.dtypes>` and :ref:`Attributes and Underlying Data <basics.attrs>` for more.

1 change: 1 addition & 0 deletions pandas/__init__.py
@@ -49,6 +49,7 @@
from pandas.io.api import *
from pandas.util._tester import test
import pandas.testing
import pandas.arrays

# use the closest tagged version if possible
from ._version import get_versions
11 changes: 11 additions & 0 deletions pandas/arrays/__init__.py
@@ -0,0 +1,11 @@
"""
All of pandas' ExtensionArrays.

See :ref:`extending.extension-types` for more.
"""
from pandas.core.arrays import PandasArray


__all__ = [
'PandasArray'
]
1 change: 1 addition & 0 deletions pandas/core/arrays/__init__.py
@@ -9,3 +9,4 @@
from .integer import ( # noqa
IntegerArray, integer_array)
from .sparse import SparseArray # noqa
from .numpy_ import PandasArray, PandasDtype # noqa
13 changes: 6 additions & 7 deletions pandas/core/arrays/categorical.py
@@ -16,11 +16,11 @@
from pandas.core.dtypes.cast import (
coerce_indexer_dtype, maybe_infer_to_datetimelike)
from pandas.core.dtypes.common import (
ensure_int64, ensure_object, ensure_platform_int, is_categorical,
is_categorical_dtype, is_datetime64_dtype, is_datetimelike, is_dict_like,
is_dtype_equal, is_extension_array_dtype, is_float_dtype, is_integer_dtype,
is_iterator, is_list_like, is_object_dtype, is_scalar, is_sequence,
is_timedelta64_dtype)
ensure_int64, ensure_object, ensure_platform_int, extract_array,
is_categorical, is_categorical_dtype, is_datetime64_dtype, is_datetimelike,
is_dict_like, is_dtype_equal, is_extension_array_dtype, is_float_dtype,
is_integer_dtype, is_iterator, is_list_like, is_object_dtype, is_scalar,
is_sequence, is_timedelta64_dtype)
from pandas.core.dtypes.dtypes import CategoricalDtype
from pandas.core.dtypes.generic import (
ABCCategoricalIndex, ABCIndexClass, ABCSeries)
@@ -2092,8 +2092,7 @@ def __setitem__(self, key, value):
`Categorical` does not have the same categories
"""

if isinstance(value, (ABCIndexClass, ABCSeries)):
value = value.array
value = extract_array(value, extract_numpy=True)

# require identical categories set
if isinstance(value, Categorical):
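
Review note: the diff does not show the body of ``extract_array``, so the following is only a sketch of the behaviour the call site suggests: unwrap a Series or Index to its backing array and, with ``extract_numpy=True``, unwrap a NumPy-backed wrapper further to the plain ndarray. ``extract_array_sketch`` and ``NumpyWrapper`` are hypothetical names, and the real helper may differ.

```python
import numpy as np
import pandas as pd

# The wrapper class is ``arrays.PandasArray`` in this PR; later pandas
# releases expose it as ``arrays.NumpyExtensionArray``, so look it up
# defensively. (Assumption: one of the two names is available.)
NumpyWrapper = (
    getattr(pd.arrays, "NumpyExtensionArray", None) or pd.arrays.PandasArray
)


def extract_array_sketch(obj, extract_numpy=False):
    """Hypothetical stand-in for pandas' internal ``extract_array``."""
    # Series and Index expose their backing ExtensionArray as ``.array``.
    if isinstance(obj, (pd.Series, pd.Index)):
        obj = obj.array
    # Optionally unwrap the NumPy-backed wrapper to the plain ndarray.
    if extract_numpy and isinstance(obj, NumpyWrapper):
        obj = obj.to_numpy()
    return obj


from_numpy = extract_array_sketch(pd.Series([1, 2]), extract_numpy=True)
from_cat = extract_array_sketch(
    pd.Series(["a", "b"], dtype="category"), extract_numpy=True
)
print(type(from_numpy).__name__, type(from_cat).__name__)
```

Under this reading, the ``isinstance(value, Categorical)`` check that follows still sees a ``Categorical`` when one is passed in wrapped in a Series, while NumPy-backed input reaches the rest of ``__setitem__`` as a plain ndarray.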