API: warning to raise KeyError in the future if not all elements of a list are selected via .loc #17295

Merged 1 commit on Oct 3, 2017
2 changes: 1 addition & 1 deletion doc/source/advanced.rst
@@ -1009,7 +1009,7 @@ The different indexing operation can potentially change the dtype of a ``Series`
series1 = pd.Series([1, 2, 3])
series1.dtype
res = series1[[0,4]]
res = series1.reindex([0, 4])
res.dtype
res
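The dtype change mentioned in this hunk can be reproduced with a few lines. This is only a sketch: the name `original_dtype` is introduced here for illustration, and the example assumes a pandas version where `Series.reindex` fills missing labels with `NaN`.

```python
import pandas as pd

series1 = pd.Series([1, 2, 3])
original_dtype = series1.dtype  # an integer dtype

# Label 4 is not in the index, so reindex fills that slot with NaN.
# NaN cannot be stored in an integer Series, so the result is upcast to float.
res = series1.reindex([0, 4])

print(original_dtype, "->", res.dtype)
```

The upcast is the reason the docs call out that indexing operations can change the dtype.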
112 changes: 110 additions & 2 deletions doc/source/indexing.rst
@@ -333,8 +333,15 @@ Selection By Label
dfl.loc['20130102':'20130104']
.. warning::

   Starting in 0.21.0, pandas will show a ``FutureWarning`` if indexing with a list with missing labels. In the future
   this will raise a ``KeyError``. See :ref:`Indexing with a list with missing labels is Deprecated <indexing.deprecate_loc_reindex_listlike>`.

pandas provides a suite of methods in order to have **purely label based indexing**. This is a strict inclusion based protocol.
**At least 1** of the labels for which you ask, must be in the index or a ``KeyError`` will be raised! When slicing, both the start bound **AND** the stop bound are *included*, if present in the index. Integers are valid labels, but they refer to the label **and not the position**.
All of the labels for which you ask, must be in the index or a ``KeyError`` will be raised!
When slicing, both the start bound **AND** the stop bound are *included*, if present in the index.
Integers are valid labels, but they refer to the label **and not the position**.
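The slicing and integer-label rules above can be checked with a short sketch. The Series `s` and `t` are made-up examples, not taken from the pandas docs.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

# Label slices include both the start and the stop label.
sliced = s.loc["b":"c"]

# With an integer index, .loc looks up the label and .iloc the position.
t = pd.Series(["x", "y", "z"], index=[2, 1, 0])
by_label = t.loc[0]      # the row whose label is 0
by_position = t.iloc[0]  # the first row
```

Here `sliced` contains the rows labelled `"b"` and `"c"`, while `by_label` and `by_position` pick different rows because the label `0` sits at the last position.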

The ``.loc`` attribute is the primary access method. The following are valid inputs:

@@ -635,6 +642,107 @@ For getting *multiple* indexers, using ``.get_indexer``
dfd.iloc[[0, 2], dfd.columns.get_indexer(['A', 'B'])]
.. _indexing.deprecate_loc_reindex_listlike:

Indexing with a list with missing labels is Deprecated
------------------------------------------------------

.. warning::

   Starting in 0.21.0, using ``.loc`` or ``[]`` with a list with one or more missing labels is deprecated in favor of ``.reindex``.

In prior versions, using ``.loc[list-of-labels]`` would work as long as *at least 1* of the keys was found (otherwise it
would raise a ``KeyError``). This behavior is deprecated and will show a warning message pointing to this section. The
recommended alternative is to use ``.reindex()``.

For example:

.. ipython:: python

   s = pd.Series([1, 2, 3])
   s

Selection with all keys found is unchanged.

.. ipython:: python

   s.loc[[1, 2]]

Previous Behavior

.. code-block:: ipython

   In [4]: s.loc[[1, 2, 3]]
   Out[4]:
   1    2.0
   2    3.0
   3    NaN
   dtype: float64

Current Behavior

.. code-block:: ipython

   In [4]: s.loc[[1, 2, 3]]
   Passing list-likes to .loc or [] with any missing label will raise
   KeyError in the future, you can use .reindex() as an alternative.

   See the documentation here:
   http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike

   Out[4]:
   1    2.0
   2    3.0
   3    NaN
   dtype: float64

Reindexing
~~~~~~~~~~

The idiomatic way to select labels that may be missing is ``.reindex()``. See also the section on :ref:`reindexing <basics.reindexing>`.

.. ipython:: python

   s.reindex([1, 2, 3])

Alternatively, if you want to select only *valid* keys, the following is idiomatic and efficient; it is guaranteed to preserve the dtype of the selection.

.. ipython:: python

   labels = [1, 2, 3]
   s.loc[s.index.intersection(labels)]

Calling ``.reindex()`` on an object with a duplicated index raises:

.. ipython:: python

   s = pd.Series(np.arange(4), index=['a', 'a', 'b', 'c'])
   labels = ['c', 'd']

.. code-block:: ipython

   In [17]: s.reindex(labels)
   ValueError: cannot reindex from a duplicate axis

Generally, you can intersect the desired labels with the current
axis, and then reindex.

.. ipython:: python

   s.loc[s.index.intersection(labels)].reindex(labels)

However, this would *still* raise if your resulting index is duplicated.

.. code-block:: ipython

   In [41]: labels = ['a', 'd']

   In [42]: s.loc[s.index.intersection(labels)].reindex(labels)
   ValueError: cannot reindex from a duplicate axis

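The intersect-then-reindex pattern and its duplicate-label limitation can be wrapped up as follows. This is a sketch only: `reindex_present_first` is a hypothetical helper name, not a pandas function, and the exact wording of the ``ValueError`` message may differ between pandas versions.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(4), index=["a", "a", "b", "c"])


def reindex_present_first(obj, labels):
    """Hypothetical helper: select the labels that exist, then reindex."""
    present = obj.index.intersection(labels)
    return obj.loc[present].reindex(labels)


# "c" occurs once, so the intermediate selection has a unique index.
ok = reindex_present_first(s, ["c", "d"])

# "a" occurs twice, so the intermediate selection is still duplicated
# and the final reindex raises.
try:
    reindex_present_first(s, ["a", "d"])
    raised = False
except ValueError:
    raised = True
```

In the first call the missing label `"d"` comes back as `NaN`; in the second call `raised` ends up `True`.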
.. _indexing.basics.partial_setting:

Selecting Random Samples
@@ -852,7 +960,7 @@ when you don't know which of the sought labels are in fact present:
s[s.index.isin([2, 4, 6])]
# compare it to the following
s[[2, 4, 6]]
s.reindex([2, 4, 6])
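The difference between the two selections can be made concrete. The Series `s` below is an assumption chosen for illustration (the hunk does not show how `s` is defined); `present_only` and `reindexed` are likewise illustrative names.

```python
import numpy as np
import pandas as pd

# Assumed example data: values 0..4 with a descending integer index.
s = pd.Series(np.arange(5), index=np.arange(5)[::-1], dtype="int64")

# isin builds a boolean mask: only labels that exist are kept,
# in the order they appear in the index, and the dtype is unchanged.
present_only = s[s.index.isin([2, 4, 6])]

# reindex follows the requested order and inserts NaN for label 6,
# which upcasts the result to float.
reindexed = s.reindex([2, 4, 6])
```

So `isin` is the better fit when absent labels should simply be dropped, and `reindex` when the result must line up with the requested labels.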
In addition to that, ``MultiIndex`` allows selecting a separate level to use
in the membership check:
24 changes: 19 additions & 5 deletions doc/source/whatsnew/v0.15.0.txt
@@ -676,10 +676,19 @@ Other notable API changes:

Both will now return a frame reindexed by [1,3]. E.g.

.. ipython:: python
.. code-block:: ipython

df.loc[[1,3]]
df.loc[[1,3],:]
In [3]: df.loc[[1,3]]
Out[3]:
0
1 a
3 NaN

In [4]: df.loc[[1,3],:]
Out[4]:
0
1 a
3 NaN

This can also be seen in multi-axis indexing with a ``Panel``.

@@ -693,9 +702,14 @@ Other notable API changes:

The following would raise ``KeyError`` prior to 0.15.0:

.. ipython:: python
.. code-block:: ipython

p.loc[['ItemA','ItemD'],:,'D']
In [5]:
Out[5]:
ItemA ItemD
1 3 NaN
2 7 NaN
3 11 NaN

Furthermore, ``.loc`` will raise if no values are found in a multi-index with a list-like indexer:

59 changes: 59 additions & 0 deletions doc/source/whatsnew/v0.21.0.txt
@@ -300,6 +300,64 @@ If installed, we now require:
| Bottleneck | 1.0.0 | |
+--------------+-----------------+----------+

.. _whatsnew_0210.api_breaking.loc:

Indexing with a list with missing labels is Deprecated
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Previously, selecting with a list of labels where one or more labels were missing would succeed as long as at least one label was found, returning ``NaN`` for the missing labels.
This now shows a ``FutureWarning``; in the future it will raise a ``KeyError`` (:issue:`15747`).
The warning is triggered on a ``DataFrame`` or a ``Series`` when using ``.loc[]`` or ``[[]]`` with a list of labels in which at least 1 label is missing.
See the :ref:`deprecation docs <indexing.deprecate_loc_reindex_listlike>`.


.. ipython:: python

s = pd.Series([1, 2, 3])
s

Previous Behavior

.. code-block:: ipython

In [4]: s.loc[[1, 2, 3]]
Out[4]:
1 2.0
2 3.0
3 NaN
dtype: float64


Current Behavior

.. code-block:: ipython

In [4]: s.loc[[1, 2, 3]]
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike

Out[4]:
1 2.0
2 3.0
3 NaN
dtype: float64

The idiomatic way to select labels that may be missing is ``.reindex()``:

.. ipython:: python

s.reindex([1, 2, 3])
Review comment (Member): I think we need to mention the duplicate keys case in the whatsnew docs as well.


Selection with all keys found is unchanged.

.. ipython:: python

s.loc[[1, 2]]


.. _whatsnew_0210.api_breaking.pandas_eval:

Improved error handling during item assignment in pd.eval
@@ -607,6 +665,7 @@ Deprecations
- ``pd.TimeGrouper`` is deprecated in favor of :class:`pandas.Grouper` (:issue:`16747`)
- ``cdate_range`` has been deprecated in favor of :func:`bdate_range`, which has gained ``weekmask`` and ``holidays`` parameters for building custom frequency date ranges. See the :ref:`documentation <timeseries.custom-freq-ranges>` for more details (:issue:`17596`)
- passing ``categories`` or ``ordered`` kwargs to :func:`Series.astype` is deprecated, in favor of passing a :ref:`CategoricalDtype <whatsnew_0210.enhancements.categorical_dtype>` (:issue:`17636`)
- Passing a non-existent column in ``.to_excel(..., columns=)`` is deprecated and will raise a ``KeyError`` in the future (:issue:`17295`)

.. _whatsnew_0210.deprecations.argmin_min:

32 changes: 26 additions & 6 deletions pandas/core/indexing.py
@@ -1419,13 +1419,33 @@ def _has_valid_type(self, key, axis):
if isinstance(key, tuple) and isinstance(ax, MultiIndex):
return True

# TODO: don't check the entire key unless necessary
if (not is_iterator(key) and len(key) and
np.all(ax.get_indexer_for(key) < 0)):
if not is_iterator(key) and len(key):

raise KeyError(u"None of [{key}] are in the [{axis}]"
.format(key=key,
axis=self.obj._get_axis_name(axis)))
# True indicates missing values
missing = ax.get_indexer_for(key) < 0

if np.any(missing):
if len(key) == 1 or np.all(missing):
raise KeyError(
u"None of [{key}] are in the [{axis}]".format(
key=key, axis=self.obj._get_axis_name(axis)))
else:

# we skip the warning on Categorical/Interval
# as this check is actually done (check for
# non-missing values), but a bit later in the
# code, so we want to avoid warning & then
# just raising
_missing_key_warning = textwrap.dedent("""
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.
See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike""") # noqa

if not (ax.is_categorical() or ax.is_interval()):
warnings.warn(_missing_key_warning,
FutureWarning, stacklevel=5)

return True
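The three outcomes of the new check (all labels found, some missing, none found) can be modelled without pandas. The sketch below is a simplified stand-in, not the pandas implementation: `check_list_like_key` is a hypothetical name, plain Python containers replace `ax.get_indexer_for`, and the Categorical/Interval exemption is left out.

```python
import warnings


def check_list_like_key(key, axis_labels, axis_name="index"):
    """Hypothetical stand-in for the list-like branch of the check."""
    known = set(axis_labels)
    # True marks a label that is not present on the axis.
    missing = [label not in known for label in key]

    if any(missing):
        if len(key) == 1 or all(missing):
            # Nothing usable was requested: fail right away.
            raise KeyError("None of [{key}] are in the [{axis}]".format(
                key=key, axis=axis_name))
        # Some labels exist and some do not: still allowed, but deprecated.
        warnings.warn(
            "Passing list-likes to .loc or [] with any missing label "
            "will raise KeyError in the future, you can use .reindex() "
            "as an alternative.",
            FutureWarning, stacklevel=2)
    return True
```

With axis labels `[0, 1, 2]`, the key `[1, 2]` passes silently, `[1, 2, 3]` passes with a `FutureWarning`, and `[7, 8]` raises `KeyError`.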

2 changes: 1 addition & 1 deletion pandas/core/series.py
@@ -691,7 +691,7 @@ def _get_with(self, key):

if key_type == 'integer':
if self.index.is_integer() or self.index.is_floating():
return self.reindex(key)
return self.loc[key]
else:
return self._get_values(key)
elif key_type == 'boolean':
16 changes: 15 additions & 1 deletion pandas/io/formats/excel.py
@@ -356,7 +356,21 @@ def __init__(self, df, na_rep='', float_format=None, cols=None,
self.styler = None
self.df = df
if cols is not None:
self.df = df.loc[:, cols]

# all missing, raise
if not len(Index(cols) & df.columns):
raise KeyError(
"passes columns are not ALL present dataframe")

# deprecated in gh-17295
# 1 missing is ok (for now)
if len(Index(cols) & df.columns) != len(cols):
warnings.warn(
"Not all names specified in 'columns' are found; "
"this will raise a KeyError in the future",
FutureWarning)

self.df = df.reindex(columns=cols)
self.columns = self.df.columns
self.float_format = float_format
self.index = index
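The column validation in this hunk follows the same split: raise when nothing matches, warn when only some names match. A pandas-free sketch of that decision is shown here; `check_requested_columns` and its `KeyError` text are hypothetical, and only the warning text is copied from the diff.

```python
import warnings


def check_requested_columns(requested, available):
    """Hypothetical helper: return the requested names that are absent."""
    available = set(available)
    missing = [name for name in requested if name not in available]

    if len(missing) == len(requested):
        # No requested column exists at all: fail immediately.
        raise KeyError("none of the requested columns are present")
    if missing:
        # Partially matched: tolerated for now, but deprecated.
        warnings.warn(
            "Not all names specified in 'columns' are found; "
            "this will raise a KeyError in the future",
            FutureWarning, stacklevel=2)
    return missing
```

Returning the list of missing names is a choice made for this sketch so the caller can report them; the code in the diff instead goes on to call `df.reindex(columns=cols)`.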
3 changes: 2 additions & 1 deletion pandas/tests/indexing/test_categorical.py
@@ -111,7 +111,8 @@ def test_loc_listlike(self):
assert_frame_equal(result, expected, check_index_type=True)

# not all labels in the categories
pytest.raises(KeyError, lambda: self.df2.loc[['a', 'd']])
with pytest.raises(KeyError):
self.df2.loc[['a', 'd']]

def test_loc_listlike_dtypes(self):
# GH 11586
8 changes: 6 additions & 2 deletions pandas/tests/indexing/test_datetime.py
@@ -223,7 +223,9 @@ def test_series_partial_set_datetime(self):
Timestamp('2011-01-03')]
exp = Series([np.nan, 0.2, np.nan],
index=pd.DatetimeIndex(keys, name='idx'), name='s')
tm.assert_series_equal(ser.loc[keys], exp, check_index_type=True)
with tm.assert_produces_warning(FutureWarning,
check_stacklevel=False):
tm.assert_series_equal(ser.loc[keys], exp, check_index_type=True)

def test_series_partial_set_period(self):
# GH 11497
@@ -248,5 +250,7 @@ def test_series_partial_set_period(self):
pd.Period('2011-01-03', freq='D')]
exp = Series([np.nan, 0.2, np.nan],
index=pd.PeriodIndex(keys, name='idx'), name='s')
result = ser.loc[keys]
with tm.assert_produces_warning(FutureWarning,
check_stacklevel=False):
result = ser.loc[keys]
tm.assert_series_equal(result, exp)
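The tests above wrap each deprecated lookup in `tm.assert_produces_warning`. The same idea can be sketched with only the standard library; `expect_warning` and `lookup_with_missing` are hypothetical names used for illustration and are not part of pandas.

```python
import warnings
from contextlib import contextmanager


@contextmanager
def expect_warning(category):
    """Fail unless the wrapped block emits a warning of the given category."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record every warning, even repeats
        yield caught
    assert any(issubclass(w.category, category) for w in caught), (
        "expected a {name}".format(name=category.__name__))


def lookup_with_missing():
    # Stand-in for a deprecated lookup such as ser.loc[keys].
    warnings.warn("missing labels will raise KeyError in the future",
                  FutureWarning)
    return "result"


with expect_warning(FutureWarning):
    result = lookup_with_missing()
```

Keeping the lookup inside the context manager and the comparison outside it, as the period test in the diff does, makes it clear which statement is expected to warn.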
3 changes: 2 additions & 1 deletion pandas/tests/indexing/test_iloc.py
@@ -617,7 +617,8 @@ def test_iloc_non_unique_indexing(self):
expected = DataFrame(new_list)
expected = pd.concat([expected, DataFrame(index=idx[idx > sidx.max()])
])
result = df2.loc[idx]
with tm.assert_produces_warning(FutureWarning, check_stacklevel=False):
result = df2.loc[idx]
tm.assert_frame_equal(result, expected, check_index_type=False)

def test_iloc_empty_list_indexer_is_ok(self):