Integer NA docs #23617
Merged: TomAugspurger merged 10 commits into pandas-dev:master from TomAugspurger:integer-na-docs on Jan 1, 2019
Commits:
06f8568 wip (TomAugspurger)
9e505a8 DOC: Integer NA (TomAugspurger)
4ae4f8d subsection (TomAugspurger)
a6a7ba7 Merge remote-tracking branch 'upstream/master' into integer-na-docs (TomAugspurger)
8d1d026 update (TomAugspurger)
15a7b65 Merge remote-tracking branch 'upstream/master' into integer-na-docs (TomAugspurger)
51c4353 fixup (TomAugspurger)
0ef696d Merge remote-tracking branch 'upstream/master' into integer-na-docs (TomAugspurger)
d2a624d Merge branch 'master' into PR_TOOL_MERGE_PR_23617 (jreback)
0c9995f add back construction for docs (jreback)
@@ -0,0 +1,101 @@ | ||
.. currentmodule:: pandas | ||
|
||
{{ header }} | ||
|
||
.. _integer_na: | ||
|
||
************************** | ||
Nullable Integer Data Type | ||
************************** | ||
|
||
.. versionadded:: 0.24.0 | ||
|
||
In :ref:`missing_data`, we saw that pandas primarily uses ``NaN`` to represent | ||
missing data. Because ``NaN`` is a float, this forces an array of integers with | ||
any missing values to become floating point. In some cases, this may not matter | ||
much. But if your integer column is, say, an identifier, casting to float can | ||
be problematic. Some integers cannot even be represented as floating point | ||
numbers. | ||
|
||
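As an illustration of the last point (this sketch is not part of the PR), the snippet below shows an integer that ``float64`` cannot hold exactly, and the same integer stored next to a missing value in a nullable integer array. The value ``2**53 + 1`` and the variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Illustrative value: float64 has a 53-bit significand, so 2**53 + 1
# has no exact float64 representation.
big = 2**53 + 1

# Casting to float64 silently rounds to a neighbouring representable value.
as_float = np.float64(big)
lost = int(as_float) != big

# A nullable integer array stores the exact integer alongside a missing value.
ids = pd.array([big, None], dtype="Int64")
kept = int(ids[0]) == big

print(lost, kept)
```

For identifier-like columns, this rounding is the kind of silent corruption the float cast can introduce.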
Pandas can represent integer data with possibly missing values using | ||
[Review comment] You'll know better, but I had the impression that we use lowercase
||
:class:`arrays.IntegerArray`. This is an :ref:`extension type <extending.extension-types>` | ||
implemented within pandas. It is not the default dtype for integers, and will not be inferred; | ||
you must explicitly pass the dtype into :meth:`array` or :class:`Series`: | ||
|
||
.. ipython:: python | ||
|
||
arr = pd.array([1, 2, np.nan], dtype=pd.Int64Dtype()) | ||
arr | ||
|
||
Or the string alias ``"Int64"`` (note the capital ``"I"``, to differentiate from | ||
NumPy's ``'int64'`` dtype): | ||
|
||
.. ipython:: python | ||
|
||
pd.array([1, 2, np.nan], dtype="Int64") | ||
|
||
This array can be stored in a :class:`DataFrame` or :class:`Series` like any | ||
NumPy array. | ||
|
||
.. ipython:: python | ||
|
||
pd.Series(arr) | ||
|
||
You can also pass the list-like object to the :class:`Series` constructor | ||
with the dtype. | ||
|
||
.. ipython:: python | ||
|
||
s = pd.Series([1, 2, np.nan], dtype="Int64") | ||
s | ||
|
||
By default (if you don't specify ``dtype``), NumPy is used, and you'll end | ||
up with a ``float64`` dtype Series: | ||
|
||
.. ipython:: python | ||
|
||
pd.Series([1, 2, np.nan]) | ||
|
||
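A minimal side-by-side sketch of the three cases described above (not part of the PR; the variable names are illustrative). It assumes that requesting NumPy's lowercase ``"int64"`` for data containing ``NaN`` raises a ``ValueError`` or a subclass of it.

```python
import numpy as np
import pandas as pd

values = [1, 2, np.nan]

# No dtype given: NumPy inference upcasts the integers to float64.
inferred = pd.Series(values)

# Capital "I": the nullable integer extension dtype keeps integers.
explicit = pd.Series(values, dtype="Int64")

# Lowercase "int64" is the plain NumPy dtype, which cannot hold NaN.
try:
    pd.Series(values, dtype="int64")
    numpy_int_failed = False
except ValueError:
    numpy_int_failed = True

print(inferred.dtype, explicit.dtype, numpy_int_failed)
```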
Operations involving an integer array behave similarly to those on NumPy arrays. | ||
Missing values are propagated, and the data is coerced to another | ||
dtype if needed. | ||
|
||
.. ipython:: python | ||
|
||
# arithmetic | ||
s + 1 | ||
|
||
# comparison | ||
s == 1 | ||
|
||
# indexing | ||
s.iloc[1:3] | ||
|
||
# operate with other dtypes | ||
s + s.iloc[1:3].astype('Int8') | ||
|
||
# coerce when needed | ||
s + 0.01 | ||
|
||
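The propagation and coercion rules can be checked directly. The sketch below is not part of the PR; it only inspects where missing values end up and whether the result is still an integer dtype. The exact dtype produced by the float operation may differ between pandas versions, so it is compared case-insensitively.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan], dtype="Int64")

shifted = s + 1    # integer operand: result stays a nullable integer
mixed = s + 0.01   # float operand: result is coerced to a floating dtype

# The missing value stays missing in both results.
print(shifted.dtype, shifted.isna().tolist())
print(mixed.dtype, mixed.isna().tolist())
```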
These dtypes can operate as part of a ``DataFrame``. | ||
|
||
.. ipython:: python | ||
|
||
df = pd.DataFrame({'A': s, 'B': [1, 1, 3], 'C': list('aab')}) | ||
df | ||
df.dtypes | ||
|
||
|
||
These dtypes can be merged, reshaped, and cast. | ||
|
||
.. ipython:: python | ||
|
||
pd.concat([df[['A']], df[['B', 'C']]], axis=1).dtypes | ||
df['A'].astype(float) | ||
|
||
Reduction and groupby operations such as 'sum' work as well. | ||
|
||
.. ipython:: python | ||
|
||
df.sum() | ||
df.groupby('B').A.sum() |
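A self-contained sketch of the reduction and groupby behaviour (not part of the PR; the frame mirrors the ``df`` built earlier but omits column ``C``). It assumes the default behaviour of ``sum``, which skips missing values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": pd.Series([1, 2, np.nan], dtype="Int64"),
    "B": [1, 1, 3],
})

# Column-wise reduction: the missing value is skipped, so 1 + 2 = 3.
total = df["A"].sum()

# Grouped reduction: the group B == 1 contains the values 1 and 2.
by_group = df.groupby("B")["A"].sum()

print(total)
print(by_group)
```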
|
@@ -19,32 +19,6 @@ pandas. | |
|
||
See the :ref:`cookbook<cookbook.missing_data>` for some advanced strategies. | ||
|
||
Missing data basics | ||
------------------- | ||
|
||
When / why does data become missing? | ||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
||
Some might quibble over our usage of *missing*. By "missing" we simply mean | ||
**NA** ("not available") or "not present for whatever reason". Many data sets simply arrive with | ||
missing data, either because it exists and was not collected or it never | ||
existed. For example, in a collection of financial time series, some of the time | ||
series might start on different dates. Thus, values prior to the start date | ||
would generally be marked as missing. | ||
|
||
In pandas, one of the most common ways that missing data is **introduced** into | ||
a data set is by reindexing. For example: | ||
|
||
.. ipython:: python | ||
|
||
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'], | ||
columns=['one', 'two', 'three']) | ||
df['four'] = 'bar' | ||
df['five'] = df['one'] > 0 | ||
df | ||
df2 = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']) | ||
df2 | ||
|
||
Values considered "missing" | ||
~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
||
|
@@ -62,6 +36,16 @@ arise and we wish to also consider that "missing" or "not available" or "NA". | |
|
||
.. _missing.isna: | ||
|
||
.. ipython:: python | ||
|
||
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'], | ||
columns=['one', 'two', 'three']) | ||
df['four'] = 'bar' | ||
df['five'] = df['one'] > 0 | ||
df | ||
df2 = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']) | ||
df2 | ||
|
||
To make detecting missing values easier (and across different array dtypes), | ||
pandas provides the :func:`isna` and | ||
:func:`notna` functions, which are also methods on | ||
|
@@ -90,6 +74,23 @@ Series and DataFrame objects: | |
|
||
df2['one'] == np.nan | ||
|
||
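The comparison above is the motivating case for :func:`isna`: ``NaN`` never compares equal to anything, including itself. A small sketch (not part of the PR; ``col`` is an illustrative stand-in for ``df2['one']``):

```python
import numpy as np
import pandas as pd

col = pd.Series([1.5, np.nan, 3.0])

# Equality against NaN is False everywhere, even at the missing position.
by_equality = (col == np.nan).tolist()

# isna / notna report missingness explicitly.
by_isna = col.isna().tolist()
by_notna = pd.notna(col).tolist()

print(by_equality, by_isna, by_notna)
```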
Integer Dtypes and Missing Data | ||
------------------------------- | ||
|
||
Because ``NaN`` is a float, a column of integers with even one missing value | ||
is cast to floating-point dtype (see :ref:`gotchas.intna` for more). Pandas | ||
provides a nullable integer array, which can be used by explicitly requesting | ||
the dtype: | ||
|
||
.. ipython:: python | ||
|
||
pd.Series([1, 2, np.nan, 4], dtype=pd.Int64Dtype()) | ||
|
||
Alternatively, the string alias ``dtype='Int64'`` (note the capital ``"I"``) can be | ||
used. | ||
|
||
See :ref:`integer_na` for more. | ||
|
||
Datetimes | ||
--------- | ||
|
||
|
@@ -751,3 +752,19 @@ However, these can be filled in using :meth:`~DataFrame.fillna` and it will work | |
|
||
reindexed[crit.fillna(False)] | ||
reindexed[crit.fillna(True)] | ||
|
||
Pandas provides a nullable integer dtype, but you must explicitly request it | ||
[Review comment] same, if correct
||
when creating the series or column. Note the capital "I" in | ||
``dtype="Int64"``. | ||
|
||
.. ipython:: python | ||
|
||
# integer-valued data: random floats cannot be safely cast to Int64 | ||
s = pd.Series([1, -2, 3, -4, 5], index=[0, 2, 4, 6, 7], | ||
dtype="Int64") | ||
s > 0 | ||
(s > 0).dtype | ||
crit = (s > 0).reindex(list(range(8))) | ||
crit | ||
crit.dtype | ||
|
||
See :ref:`integer_na` for more. |
[Review comment] not important, but the ``.. currentmodule:: pandas`` is also included in the ``{{ header }}`` (we only kept it in the ``api.rst`` because the autosummaries need it before the header is rendered)