Integer NA docs #23617

Merged
merged 10 commits into from
Jan 1, 2019
24 changes: 22 additions & 2 deletions doc/source/gotchas.rst
@@ -215,8 +215,28 @@ arrays. For example:
s2.dtype

This trade-off is made largely for memory and performance reasons, and also so
that the resulting ``Series`` continues to be "numeric". One possibility is to
use ``dtype=object`` arrays instead.
that the resulting ``Series`` continues to be "numeric".

If you need to represent integers with possibly missing values, use one of
the nullable-integer extension dtypes provided by pandas:

* :class:`Int8Dtype`
* :class:`Int16Dtype`
* :class:`Int32Dtype`
* :class:`Int64Dtype`

.. ipython:: python

   s_int = pd.Series([1, 2, 3, 4, 5], index=list('abcde'),
                     dtype=pd.Int64Dtype())
   s_int
   s_int.dtype

   s2_int = s_int.reindex(['a', 'b', 'c', 'f', 'u'])
   s2_int
   s2_int.dtype

See :ref:`integer_na` for more.

``NA`` type promotions
~~~~~~~~~~~~~~~~~~~~~~
1 change: 1 addition & 0 deletions doc/source/index.rst.template
@@ -143,6 +143,7 @@ See the package overview for more detail about what's in the library.
timeseries
timedeltas
categorical
integer_na
visualization
style
io
101 changes: 101 additions & 0 deletions doc/source/integer_na.rst
@@ -0,0 +1,101 @@
.. currentmodule:: pandas

Review comment (Member): not important, but the .. currentmodule:: pandas is also included in the {{ header }} (we only kept it in the api.rst because the autosummaries need it before the header is rendered)


{{ header }}

.. _integer_na:

**************************
Nullable Integer Data Type
**************************

.. versionadded:: 0.24.0

In :ref:`missing_data`, we saw that pandas primarily uses ``NaN`` to represent
missing data. Because ``NaN`` is a float, this forces an array of integers with
any missing values to become floating point. In some cases, this may not matter
much. But if your integer column is, say, an identifier, casting to float can
be problematic. Some integers cannot even be represented as floating point
numbers.
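
As a rough illustration of that last point (a sketch, not part of the original
docs; the variable names are made up for the example), ``2**53 + 1`` is the
smallest positive integer that ``float64`` cannot store exactly, so it is
silently rounded once a missing value forces the float dtype. ``None`` is used
as the missing marker in the nullable case so that the input list is not
converted to floats before pandas sees it:

.. code-block:: python

   import numpy as np
   import pandas as pd

   big = 2**53 + 1  # 9007199254740993 has no exact float64 representation

   # Default inference: the missing value forces float64 and rounds ``big``.
   as_float = pd.Series([big, np.nan])
   float_round_trips = int(as_float.iloc[0]) == big

   # Nullable integer dtype: the value is stored as a 64-bit integer.
   as_int = pd.Series([big, None], dtype="Int64")
   int_round_trips = int(as_int.iloc[0]) == big

   print(as_float.dtype, float_round_trips)  # float64 False
   print(as_int.dtype, int_round_trips)      # Int64 True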

Pandas can represent integer data with possibly missing values using
:class:`arrays.IntegerArray`. This is an :ref:`extension type <extending.extension-types>`
implemented within pandas. It is not the default dtype for integers and will not be inferred;
you must explicitly pass the dtype into :meth:`array` or :class:`Series`:

Review comment (Member): You'll know better, but I had the impression that we use lowercase pandas even at the beginning of sentences.

.. ipython:: python

   arr = pd.array([1, 2, np.nan], dtype=pd.Int64Dtype())
   arr

Or the string alias ``"Int64"`` (note the capital ``"I"``, to differentiate from
NumPy's ``'int64'`` dtype):

.. ipython:: python

   pd.array([1, 2, np.nan], dtype="Int64")
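
To make the capitalization point concrete, here is a small sketch (not from the
original docs): the ``"Int64"`` alias resolves to the same dtype as
``pd.Int64Dtype()``, whereas requesting NumPy's lowercase ``"int64"`` for data
containing ``NaN`` is expected to raise a ``ValueError`` (or a subclass of it,
depending on the pandas version):

.. code-block:: python

   import numpy as np
   import pandas as pd

   nullable = pd.Series([1, 2, np.nan], dtype="Int64")
   same_dtype = nullable.dtype == pd.Int64Dtype()  # alias and class agree

   try:
       pd.Series([1, 2, np.nan], dtype="int64")  # NumPy dtype, no NA support
       lowercase_failed = False
   except ValueError:
       lowercase_failed = True

   print(same_dtype, lowercase_failed)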

This array can be stored in a :class:`DataFrame` or :class:`Series` like any
NumPy array.

.. ipython:: python

   pd.Series(arr)

You can also pass the list-like object to the :class:`Series` constructor
with the dtype.

.. ipython:: python

   s = pd.Series([1, 2, np.nan], dtype="Int64")
   s

By default (if you don't specify ``dtype``), NumPy is used, and you'll end
up with a ``float64`` dtype Series:

.. ipython:: python

   pd.Series([1, 2, np.nan])

Operations involving an integer array will behave similarly to NumPy arrays.
Missing values will be propagated, and the data will be coerced to another
dtype if needed.

.. ipython:: python

   # arithmetic
   s + 1

   # comparison
   s == 1

   # indexing
   s.iloc[1:3]

   # operate with other dtypes
   s + s.iloc[1:3].astype('Int8')

   # coerce when needed
   s + 0.01
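
A hedged sketch of how one might check these rules programmatically (the
variable names are illustrative, and the exact float dtype produced by the last
operation depends on the pandas version):

.. code-block:: python

   import pandas as pd

   s = pd.Series([1, 2, None], dtype="Int64")

   shifted = s + 1    # stays Int64; the missing value is propagated
   scaled = s + 0.01  # cannot stay integer, so a floating dtype is used

   print(shifted.dtype)                              # Int64
   print(shifted.isna().tolist())                    # [False, False, True]
   print(pd.api.types.is_float_dtype(scaled.dtype))  # True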

These dtypes can operate as part of a ``DataFrame``.

.. ipython:: python

   df = pd.DataFrame({'A': s, 'B': [1, 1, 3], 'C': list('aab')})
   df
   df.dtypes


These dtypes can be merged, reshaped, and cast.

.. ipython:: python

   pd.concat([df[['A']], df[['B', 'C']]], axis=1).dtypes
   df['A'].astype(float)

Reduction and groupby operations such as ``sum`` work as well.

.. ipython:: python

   df.sum()
   df.groupby('B').A.sum()
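
As a small sketch of what "work as well" means for a reduction (an
illustration, not output copied from the docs): with the default
``skipna=True``, missing entries are skipped rather than propagated into the
result.

.. code-block:: python

   import pandas as pd

   s = pd.Series([1, 2, None], dtype="Int64")

   total = s.sum()             # missing value skipped by default
   n_valid = s.count()         # number of non-missing entries
   n_missing = s.isna().sum()  # number of missing entries

   print(total, n_valid, n_missing)  # 3 2 1
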
69 changes: 43 additions & 26 deletions doc/source/missing_data.rst
@@ -19,32 +19,6 @@ pandas.

See the :ref:`cookbook<cookbook.missing_data>` for some advanced strategies.

Missing data basics
-------------------

When / why does data become missing?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Some might quibble over our usage of *missing*. By "missing" we simply mean
**NA** ("not available") or "not present for whatever reason". Many data sets simply arrive with
missing data, either because it exists and was not collected or it never
existed. For example, in a collection of financial time series, some of the time
series might start on different dates. Thus, values prior to the start date
would generally be marked as missing.

In pandas, one of the most common ways that missing data is **introduced** into
a data set is by reindexing. For example:

.. ipython:: python

   df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                     columns=['one', 'two', 'three'])
   df['four'] = 'bar'
   df['five'] = df['one'] > 0
   df
   df2 = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
   df2

Values considered "missing"
~~~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -62,6 +36,16 @@ arise and we wish to also consider that "missing" or "not available" or "NA".

.. _missing.isna:

.. ipython:: python

   df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                     columns=['one', 'two', 'three'])
   df['four'] = 'bar'
   df['five'] = df['one'] > 0
   df
   df2 = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
   df2

To make detecting missing values easier (and across different array dtypes),
pandas provides the :func:`isna` and
:func:`notna` functions, which are also methods on
@@ -90,6 +74,23 @@ Series and DataFrame objects:

df2['one'] == np.nan

Integer Dtypes and Missing Data
-------------------------------

Because ``NaN`` is a float, a column of integers with even one missing value
is cast to floating-point dtype (see :ref:`gotchas.intna` for more). Pandas
provides a nullable integer array, which can be used by explicitly requesting
the dtype:

.. ipython:: python

   pd.Series([1, 2, np.nan, 4], dtype=pd.Int64Dtype())

Alternatively, the string alias ``dtype='Int64'`` (note the capital ``"I"``) can be
used.

See :ref:`integer_na` for more.
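
If a column has already been promoted to ``float64``, one possible way back
(a sketch that assumes every non-missing value is a whole number; the data and
names are invented for illustration) is :meth:`Series.astype` with the nullable
dtype:

.. code-block:: python

   import numpy as np
   import pandas as pd

   ids = pd.Series([101.0, 102.0, np.nan])  # float64 because of the NaN
   ids_int = ids.astype("Int64")            # integers, missing value kept

   print(ids_int.dtype)            # Int64
   print(ids_int.isna().tolist())  # [False, False, True]
   print(int(ids_int.iloc[0]))     # 101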

Datetimes
---------

@@ -751,3 +752,19 @@ However, these can be filled in using :meth:`~DataFrame.fillna` and it will work

reindexed[crit.fillna(False)]
reindexed[crit.fillna(True)]

Pandas provides a nullable integer dtype, but you must explicitly request it
when creating the series or column. Notice that we use a capital "I" in
the ``dtype="Int64"``.

Review comment (Member): same, if correct

.. ipython:: python

   s = pd.Series([0, 1, np.nan, 3, 4], index=[0, 2, 4, 6, 7],
                 dtype="Int64")
   s > 0
   (s > 0).dtype
   crit = (s > 0).reindex(list(range(8)))
   crit
   crit.dtype

See :ref:`integer_na` for more.
4 changes: 3 additions & 1 deletion doc/source/whatsnew/v0.24.0.rst
@@ -159,7 +159,9 @@ Reduction and groupby operations such as 'sum' work.

.. warning::

The Integer NA support currently uses the captilized dtype version, e.g. ``Int8`` as compared to the traditional ``int8``. This may be changed at a future date.
The Integer NA support currently uses the capitalized dtype version, e.g. ``Int8`` as compared to the traditional ``int8``. This may be changed at a future date.

See :ref:`integer_na` for more.
datapythonista marked this conversation as resolved.

.. _whatsnew_0240.enhancements.array:
