Merge remote-tracking branch 'upstream/master' into isort-frame-test
* upstream/master:
  BUG: output formatting with to_html(), index=False and/or index_names=False (pandas-dev#22579, pandas-dev#22747) (pandas-dev#22655)
  MAINT: Port _timelex in codebase (pandas-dev#24520)
  Implement unique+array parts of 24024 (pandas-dev#24527)
  Integer NA docs (pandas-dev#23617)
thoo committed Jan 1, 2019
2 parents fbe5606 + b9284a2 commit 56250fc
Showing 98 changed files with 2,690 additions and 119 deletions.
54 changes: 54 additions & 0 deletions LICENSES/DATEUTIL_LICENSE
@@ -0,0 +1,54 @@
Copyright 2017- Paul Ganssle <paul@ganssle.io>
Copyright 2017- dateutil contributors (see AUTHORS file)

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

The above license applies to all contributions after 2017-12-01, as well as
all contributions that have been re-licensed (see AUTHORS file for the list of
contributors who have re-licensed their code).
--------------------------------------------------------------------------------
dateutil - Extensions to the standard Python datetime module.

Copyright (c) 2003-2011 - Gustavo Niemeyer <gustavo@niemeyer.net>
Copyright (c) 2012-2014 - Tomi Pieviläinen <tomi.pievilainen@iki.fi>
Copyright (c) 2014-2016 - Yaron de Leeuw <me@jarondl.net>
Copyright (c) 2015- - Paul Ganssle <paul@ganssle.io>
Copyright (c) 2015- - dateutil contributors (see AUTHORS file)

All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice,
this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
* Neither the name of the copyright holder nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

The above BSD License Applies to all code, even that also covered by Apache 2.0.
24 changes: 22 additions & 2 deletions doc/source/gotchas.rst
@@ -215,8 +215,28 @@ arrays. For example:
s2.dtype
This trade-off is made largely for memory and performance reasons, and also so
that the resulting ``Series`` continues to be "numeric". One possibility is to
use ``dtype=object`` arrays instead.
that the resulting ``Series`` continues to be "numeric".

If you need to represent integers with possibly missing values, use one of
the nullable-integer extension dtypes provided by pandas:

* :class:`Int8Dtype`
* :class:`Int16Dtype`
* :class:`Int32Dtype`
* :class:`Int64Dtype`

.. ipython:: python

   s_int = pd.Series([1, 2, 3, 4, 5], index=list('abcde'),
                     dtype=pd.Int64Dtype())
   s_int
   s_int.dtype
   s2_int = s_int.reindex(['a', 'b', 'c', 'f', 'u'])
   s2_int
   s2_int.dtype

See :ref:`integer_na` for more.

``NA`` type promotions
~~~~~~~~~~~~~~~~~~~~~~
1 change: 1 addition & 0 deletions doc/source/index.rst.template
@@ -143,6 +143,7 @@ See the package overview for more detail about what's in the library.
timeseries
timedeltas
categorical
integer_na
visualization
style
io
101 changes: 101 additions & 0 deletions doc/source/integer_na.rst
@@ -0,0 +1,101 @@
.. currentmodule:: pandas

{{ header }}

.. _integer_na:

**************************
Nullable Integer Data Type
**************************

.. versionadded:: 0.24.0

In :ref:`missing_data`, we saw that pandas primarily uses ``NaN`` to represent
missing data. Because ``NaN`` is a float, this forces an array of integers with
any missing values to become floating point. In some cases, this may not matter
much. But if your integer column is, say, an identifier, casting to float can
be problematic. Some integers cannot even be represented as floating point
numbers.
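
The last point can be checked without pandas. The following is a minimal sketch, assuming only that Python's built-in ``float`` is an IEEE 754 double (the same representation as NumPy's ``float64``); the name ``big_id`` is hypothetical and stands in for a large integer identifier.

```python
# Illustrative only: a plain Python float is an IEEE 754 double, which has
# 53 bits of precision, so not every integer above 2**53 is representable.
big_id = 2**53 + 1           # hypothetical large identifier

as_float = float(big_id)     # what a cast to floating point would store
round_trip = int(as_float)   # converting back does not recover the original

print(big_id)                # 9007199254740993
print(round_trip)            # 9007199254740992
print(round_trip == big_id)  # False
```

An identifier column that has been silently cast to float can therefore map two distinct identifiers to the same stored value.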

Pandas can represent integer data with possibly missing values using
:class:`arrays.IntegerArray`. This is an :ref:`extension type <extending.extension-types>`
implemented within pandas. It is not the default dtype for integers, and will not be inferred;
you must explicitly pass the dtype into :meth:`array` or :class:`Series`:

.. ipython:: python

   arr = pd.array([1, 2, np.nan], dtype=pd.Int64Dtype())
   arr

Or the string alias ``"Int64"`` (note the capital ``"I"``, to differentiate from
NumPy's ``'int64'`` dtype):

.. ipython:: python

   pd.array([1, 2, np.nan], dtype="Int64")

This array can be stored in a :class:`DataFrame` or :class:`Series` like any
NumPy array.

.. ipython:: python

   pd.Series(arr)

You can also pass the list-like object to the :class:`Series` constructor
with the dtype.

.. ipython:: python

   s = pd.Series([1, 2, np.nan], dtype="Int64")
   s

By default (if you don't specify ``dtype``), NumPy is used, and you'll end
up with a ``float64`` dtype Series:

.. ipython:: python

   pd.Series([1, 2, np.nan])

Operations involving an integer array will behave similarly to NumPy arrays.
Missing values will be propagated, and the data will be coerced to another
dtype if needed.

.. ipython:: python

   # arithmetic
   s + 1
   # comparison
   s == 1
   # indexing
   s.iloc[1:3]
   # operate with other dtypes
   s + s.iloc[1:3].astype('Int8')
   # coerce when needed
   s + 0.01

These dtypes can operate as part of a ``DataFrame``.

.. ipython:: python

   df = pd.DataFrame({'A': s, 'B': [1, 1, 3], 'C': list('aab')})
   df
   df.dtypes

These dtypes can be merged, reshaped, and cast.

.. ipython:: python

   pd.concat([df[['A']], df[['B', 'C']]], axis=1).dtypes
   df['A'].astype(float)

Reduction and groupby operations such as 'sum' work as well.

.. ipython:: python

   df.sum()
   df.groupby('B').A.sum()
69 changes: 43 additions & 26 deletions doc/source/missing_data.rst
@@ -19,32 +19,6 @@ pandas.

See the :ref:`cookbook<cookbook.missing_data>` for some advanced strategies.

Missing data basics
-------------------

When / why does data become missing?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Some might quibble over our usage of *missing*. By "missing" we simply mean
**NA** ("not available") or "not present for whatever reason". Many data sets simply arrive with
missing data, either because it exists and was not collected or it never
existed. For example, in a collection of financial time series, some of the time
series might start on different dates. Thus, values prior to the start date
would generally be marked as missing.

In pandas, one of the most common ways that missing data is **introduced** into
a data set is by reindexing. For example:

.. ipython:: python
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
columns=['one', 'two', 'three'])
df['four'] = 'bar'
df['five'] = df['one'] > 0
df
df2 = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
df2
Values considered "missing"
~~~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -62,6 +36,16 @@ arise and we wish to also consider that "missing" or "not available" or "NA".

.. _missing.isna:

.. ipython:: python

   df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                     columns=['one', 'two', 'three'])
   df['four'] = 'bar'
   df['five'] = df['one'] > 0
   df
   df2 = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
   df2

To make detecting missing values easier (and across different array dtypes),
pandas provides the :func:`isna` and
:func:`notna` functions, which are also methods on
@@ -90,6 +74,23 @@ Series and DataFrame objects:
df2['one'] == np.nan
Integer Dtypes and Missing Data
-------------------------------

Because ``NaN`` is a float, a column of integers with even one missing value
is cast to floating-point dtype (see :ref:`gotchas.intna` for more). Pandas
provides a nullable integer array, which can be used by explicitly requesting
the dtype:

.. ipython:: python

   pd.Series([1, 2, np.nan, 4], dtype=pd.Int64Dtype())

Alternatively, the string alias ``dtype='Int64'`` (note the capital ``"I"``) can be
used.

See :ref:`integer_na` for more.

Datetimes
---------

@@ -751,3 +752,19 @@ However, these can be filled in using :meth:`~DataFrame.fillna` and it will work
reindexed[crit.fillna(False)]
reindexed[crit.fillna(True)]
Pandas provides a nullable integer dtype, but you must explicitly request it
when creating the series or column. Notice that we use a capital "I" in
the ``dtype="Int64"``.

.. ipython:: python

   # integer values: random floats cannot be cast safely to ``Int64``
   s = pd.Series([-2, -1, 0, 1, 2], index=[0, 2, 4, 6, 7],
                 dtype="Int64")
   s > 0
   (s > 0).dtype
   crit = (s > 0).reindex(list(range(8)))
   crit
   crit.dtype

See :ref:`integer_na` for more.
31 changes: 30 additions & 1 deletion doc/source/whatsnew/v0.24.0.rst
@@ -159,7 +159,9 @@ Reduction and groupby operations such as 'sum' work.
.. warning::

The Integer NA support currently uses the captilized dtype version, e.g. ``Int8`` as compared to the traditional ``int8``. This may be changed at a future date.
The Integer NA support currently uses the capitalized dtype version, e.g. ``Int8`` as compared to the traditional ``int8``. This may be changed at a future date.

See :ref:`integer_na` for more.

.. _whatsnew_0240.enhancements.array:

@@ -671,6 +673,31 @@ is the case with :attr:`Period.end_time`, for example
p.end_time
.. _whatsnew_0240.api_breaking.datetime_unique:

The return type of :meth:`Series.unique` for datetime with timezone values has changed
from a :class:`numpy.ndarray` of :class:`Timestamp` objects to an :class:`arrays.DatetimeArray` (:issue:`24024`).

.. ipython:: python

   ser = pd.Series([pd.Timestamp('2000', tz='UTC'),
                    pd.Timestamp('2000', tz='UTC')])

*Previous Behavior*:

.. code-block:: ipython

   In [3]: ser.unique()
   Out[3]: array([Timestamp('2000-01-01 00:00:00+0000', tz='UTC')], dtype=object)

*New Behavior*:

.. ipython:: python

   ser.unique()

.. _whatsnew_0240.api_breaking.sparse_values:

Sparse Data Structure Refactor
@@ -1569,6 +1596,8 @@ Notice how we now instead output ``np.nan`` itself instead of a stringified form
- :func:`read_sas()` will correctly parse sas7bdat files with data page types having also bit 7 set (so page type is 128 + 256 = 384) (:issue:`16615`)
- Bug in :meth:`detect_client_encoding` where potential ``IOError`` goes unhandled when importing in a mod_wsgi process due to restricted access to stdout. (:issue:`21552`)
- Bug in :func:`to_html()` with ``index=False`` misses truncation indicators (...) on truncated DataFrame (:issue:`15019`, :issue:`22783`)
- Bug in :func:`to_html()` with ``index=False`` when both columns and row index are ``MultiIndex`` (:issue:`22579`)
- Bug in :func:`to_html()` with ``index_names=False`` displaying index name (:issue:`22747`)
- Bug in :func:`DataFrame.to_string()` that broke column alignment when ``index=False`` and width of first column's values is greater than the width of first column's header (:issue:`16839`, :issue:`13032`)
- Bug in :func:`DataFrame.to_string()` that caused representations of :class:`DataFrame` to not take up the whole window (:issue:`22984`)
- Bug in :func:`DataFrame.to_csv` where a single level MultiIndex incorrectly wrote a tuple. Now just the value of the index is written (:issue:`19589`).