Integer NA docs (pandas-dev#23617)
* wip

* DOC: Integer NA

Closes pandas-dev#22003

* subsection

* update

* fixup

* add back construction for docs
TomAugspurger authored and Pingviinituutti committed Feb 28, 2019
1 parent cda6c48 commit b6b343d
Showing 5 changed files with 170 additions and 29 deletions.
24 changes: 22 additions & 2 deletions doc/source/gotchas.rst
@@ -215,8 +215,28 @@ arrays. For example:
s2.dtype
This trade-off is made largely for memory and performance reasons, and also so
that the resulting ``Series`` continues to be "numeric". One possibility is to
use ``dtype=object`` arrays instead.
that the resulting ``Series`` continues to be "numeric".

If you need to represent integers with possibly missing values, use one of
the nullable-integer extension dtypes provided by pandas:

* :class:`Int8Dtype`
* :class:`Int16Dtype`
* :class:`Int32Dtype`
* :class:`Int64Dtype`

.. ipython:: python

   s_int = pd.Series([1, 2, 3, 4, 5], index=list('abcde'),
                     dtype=pd.Int64Dtype())
   s_int
   s_int.dtype

   s2_int = s_int.reindex(['a', 'b', 'c', 'f', 'u'])
   s2_int
   s2_int.dtype

See :ref:`integer_na` for more.

``NA`` type promotions
~~~~~~~~~~~~~~~~~~~~~~
1 change: 1 addition & 0 deletions doc/source/index.rst.template
@@ -143,6 +143,7 @@ See the package overview for more detail about what's in the library.
timeseries
timedeltas
categorical
integer_na
visualization
style
io
101 changes: 101 additions & 0 deletions doc/source/integer_na.rst
@@ -0,0 +1,101 @@
.. currentmodule:: pandas

{{ header }}

.. _integer_na:

**************************
Nullable Integer Data Type
**************************

.. versionadded:: 0.24.0

In :ref:`missing_data`, we saw that pandas primarily uses ``NaN`` to represent
missing data. Because ``NaN`` is a float, this forces an array of integers with
any missing values to become floating point. In some cases, this may not matter
much. But if your integer column is, say, an identifier, casting to float can
be problematic. Some integers cannot even be represented as floating point
numbers.
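The precision problem can be illustrated with a minimal sketch that is not part of the original commit. It assumes the usual 53-bit significand of ``float64``, so ``2**53 + 1`` is the smallest positive integer that cannot survive a cast to float. The variable names are illustrative only.

```python
import pandas as pd

# 2**53 + 1 needs one more bit of precision than float64 can provide.
big = 2**53 + 1

# Same data, once with the default int64 dtype and once as nullable Int64.
plain = pd.Series([big, 1], index=["a", "b"])
nullable = pd.Series([big, 1], index=["a", "b"], dtype="Int64")

# Reindexing with an unknown label introduces a missing value in both.
plain_re = plain.reindex(["a", "b", "c"])        # promoted to float64
nullable_re = nullable.reindex(["a", "b", "c"])  # stays Int64

lost = int(plain_re["a"])     # rounded to 2**53 by the float cast
kept = int(nullable_re["a"])  # still exactly 2**53 + 1

print(plain_re.dtype, lost == big)
print(nullable_re.dtype, kept == big)
```

For an identifier column, the rounded value would silently point at a different record, which is the situation the nullable dtype is meant to avoid.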

Pandas can represent integer data with possibly missing values using
:class:`arrays.IntegerArray`. This is an :ref:`extension type <extending.extension-types>`
implemented within pandas. It is not the default dtype for integers, and will not be inferred;
you must explicitly pass the dtype into :meth:`array` or :class:`Series`:

.. ipython:: python

   arr = pd.array([1, 2, np.nan], dtype=pd.Int64Dtype())
   arr

Or the string alias ``"Int64"`` (note the capital ``"I"``, to differentiate from
NumPy's ``'int64'`` dtype):

.. ipython:: python

   pd.array([1, 2, np.nan], dtype="Int64")

This array can be stored in a :class:`DataFrame` or :class:`Series` like any
NumPy array.

.. ipython:: python

   pd.Series(arr)

You can also pass the list-like object to the :class:`Series` constructor
with the dtype.

.. ipython:: python

   s = pd.Series([1, 2, np.nan], dtype="Int64")
   s

By default (if you don't specify ``dtype``), NumPy is used, and you'll end
up with a ``float64`` dtype Series:

.. ipython:: python

   pd.Series([1, 2, np.nan])

Operations involving an integer array will behave similarly to NumPy arrays.
Missing values will be propagated, and the data will be coerced to another
dtype if needed.

.. ipython:: python

   # arithmetic
   s + 1

   # comparison
   s == 1

   # indexing
   s.iloc[1:3]

   # operate with other dtypes
   s + s.iloc[1:3].astype('Int8')

   # coerce when needed
   s + 0.01

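As a hedged sketch of how missing values propagate (this block is an illustration, not part of the original commit), the snippet below rebuilds ``s`` and checks where missing values end up after arithmetic. The names ``plus_one`` and ``mixed`` are illustrative. The result of adding a float is left out because its dtype differs between pandas versions.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan], dtype="Int64")

# Adding a scalar keeps the nullable integer dtype; the missing slot stays missing.
plus_one = s + 1

# Series arithmetic aligns on the index. The right-hand operand only has
# labels 1 and 2, so label 0 has no partner and becomes missing as well.
mixed = s + s.iloc[1:3].astype("Int8")

print(plus_one.dtype, plus_one.isna().tolist())
print(mixed.isna().tolist())
```

Only label 1 has a value on both sides of the second addition, so it is the only position that yields a number.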
These dtypes can operate as part of a ``DataFrame``.

.. ipython:: python

   df = pd.DataFrame({'A': s, 'B': [1, 1, 3], 'C': list('aab')})
   df
   df.dtypes

These dtypes can be merged, reshaped, and cast.

.. ipython:: python

   pd.concat([df[['A']], df[['B', 'C']]], axis=1).dtypes
   df['A'].astype(float)

Reduction and groupby operations such as 'sum' work as well.

.. ipython:: python

   df.sum()
   df.groupby('B').A.sum()

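The following sketch, which is an illustration rather than part of the original commit, shows how a reduction treats the missing entry. It relies on the standard ``skipna`` argument of :meth:`Series.sum`, which defaults to ``True``. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan], dtype="Int64")
df = pd.DataFrame({"A": s, "B": [1, 1, 3]})

# By default the missing value is skipped, so only 1 and 2 are added.
total = s.sum()

# With skipna=False the missing value propagates into the result.
total_strict = s.sum(skipna=False)

# Group 1 contains the two non-missing values of column A.
per_group = df.groupby("B")["A"].sum()

print(total, pd.isna(total_strict))
print(per_group.loc[1])
```

Passing ``skipna=False`` is useful when a missing input should invalidate the aggregate instead of being ignored.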
69 changes: 43 additions & 26 deletions doc/source/missing_data.rst
@@ -19,32 +19,6 @@ pandas.

See the :ref:`cookbook<cookbook.missing_data>` for some advanced strategies.

Missing data basics
-------------------

When / why does data become missing?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Some might quibble over our usage of *missing*. By "missing" we simply mean
**NA** ("not available") or "not present for whatever reason". Many data sets simply arrive with
missing data, either because it exists and was not collected or it never
existed. For example, in a collection of financial time series, some of the time
series might start on different dates. Thus, values prior to the start date
would generally be marked as missing.

In pandas, one of the most common ways that missing data is **introduced** into
a data set is by reindexing. For example:

.. ipython:: python

   df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                     columns=['one', 'two', 'three'])
   df['four'] = 'bar'
   df['five'] = df['one'] > 0
   df
   df2 = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
   df2

Values considered "missing"
~~~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -62,6 +36,16 @@ arise and we wish to also consider that "missing" or "not available" or "NA".

.. _missing.isna:

.. ipython:: python

   df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                     columns=['one', 'two', 'three'])
   df['four'] = 'bar'
   df['five'] = df['one'] > 0
   df
   df2 = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
   df2

To make detecting missing values easier (and across different array dtypes),
pandas provides the :func:`isna` and
:func:`notna` functions, which are also methods on
@@ -90,6 +74,23 @@ Series and DataFrame objects:
df2['one'] == np.nan
Integer Dtypes and Missing Data
-------------------------------

Because ``NaN`` is a float, a column of integers with even one missing value
is cast to floating-point dtype (see :ref:`gotchas.intna` for more). Pandas
provides a nullable integer array, which can be used by explicitly requesting
the dtype:

.. ipython:: python

   pd.Series([1, 2, np.nan, 4], dtype=pd.Int64Dtype())

Alternatively, the string alias ``dtype='Int64'`` (note the capital ``"I"``) can be
used.

See :ref:`integer_na` for more.
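A column that has already been promoted to ``float64`` can usually be converted afterwards as well. The sketch below is an illustration that is not part of the original commit, and it assumes every non-missing value is a whole number (otherwise the cast is expected to fail). The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Without an explicit dtype the missing value forces float64.
as_float = pd.Series([1, 2, np.nan, 4])

# Request the nullable integer dtype after the fact with astype.
as_nullable = as_float.astype("Int64")

print(as_float.dtype)
print(as_nullable.dtype, as_nullable.isna().tolist())
```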

Datetimes
---------

@@ -751,3 +752,19 @@ However, these can be filled in using :meth:`~DataFrame.fillna` and it will work
reindexed[crit.fillna(False)]
reindexed[crit.fillna(True)]
Pandas provides a nullable integer dtype, but you must explicitly request it
when creating the series or column. Notice that we use a capital "I" in
the ``dtype="Int64"``.

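.. ipython:: python

   # Int64 requires integer-valued data; non-integer floats such as the
   # output of np.random.randn cannot be safely cast to it.
   s = pd.Series([-2, -1, 0, 1, 2], index=[0, 2, 4, 6, 7],
                 dtype="Int64")
   s > 0
   (s > 0).dtype
   crit = (s > 0).reindex(list(range(8)))
   crit
   crit.dtype

See :ref:`integer_na` for more.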
4 changes: 3 additions & 1 deletion doc/source/whatsnew/v0.24.0.rst
@@ -159,7 +159,9 @@ Reduction and groupby operations such as 'sum' work.
.. warning::

The Integer NA support currently uses the captilized dtype version, e.g. ``Int8`` as compared to the traditional ``int8``. This may be changed at a future date.
The Integer NA support currently uses the capitalized dtype version, e.g. ``Int8`` as compared to the traditional ``int8``. This may be changed at a future date.

See :ref:`integer_na` for more.

.. _whatsnew_0240.enhancements.array:
