API: warning to raise KeyError in the future if not all elements of a list are selected via .loc (pandas-dev#17295) closes pandas-dev#15747
reef-technologies · Oct 16, 2017 · 6ac291a
1 parent 6f67404
Showing 17 changed files with 386 additions and 67 deletions.
diff --git a/doc/source/advanced.rst b/doc/source/advanced.rst
@@ -1009,7 +1009,7 @@ The different indexing operation can potentially change the dtype of a ``Series`
 
    series1 = pd.Series([1, 2, 3])
    series1.dtype
-   res = series1[[0,4]]
+   res = series1.reindex([0, 4])
    res.dtype
    res
 

diff --git a/doc/source/indexing.rst b/doc/source/indexing.rst
@@ -333,8 +333,15 @@ Selection By Label
 
      dfl.loc['20130102':'20130104']
 
+.. warning::
+
+   Starting in 0.21.0, pandas will show a ``FutureWarning`` if indexing with a list with missing labels. In the future
+   this will raise a ``KeyError``. See :ref:`Indexing with list with missing labels is Deprecated <indexing.deprecate_loc_reindex_listlike>`.
+
 pandas provides a suite of methods in order to have **purely label based indexing**. This is a strict inclusion based protocol.
-**At least 1** of the labels for which you ask, must be in the index or a ``KeyError`` will be raised! When slicing, both the start bound **AND** the stop bound are *included*, if present in the index. Integers are valid labels, but they refer to the label **and not the position**.
+All of the labels for which you ask must be in the index, or a ``KeyError`` will be raised!
+When slicing, both the start bound **AND** the stop bound are *included*, if present in the index.
+Integers are valid labels, but they refer to the label **and not the position**.
 
 The ``.loc`` attribute is the primary access method. The following are valid inputs:
 
@@ -635,6 +642,107 @@ For getting *multiple* indexers, using ``.get_indexer``
   dfd.iloc[[0, 2], dfd.columns.get_indexer(['A', 'B'])]
 
 
+.. _indexing.deprecate_loc_reindex_listlike:
+
+Indexing with list with missing labels is Deprecated
+----------------------------------------------------
+
+.. warning::
+
+   Starting in 0.21.0, using ``.loc`` or ``[]`` with a list with one or more missing labels is deprecated in favor of ``.reindex``.
+
+In prior versions, using ``.loc[list-of-labels]`` would work as long as *at least 1* of the keys was found (otherwise it
+would raise a ``KeyError``). This behavior is deprecated and will show a warning message pointing to this section. The
+recommended alternative is to use ``.reindex()``.
+
+For example.
+
+.. ipython:: python
+
+   s = pd.Series([1, 2, 3])
+   s
+
+Selection with all keys found is unchanged.
+
+.. ipython:: python
+
+   s.loc[[1, 2]]
+
+Previous Behavior
+
+.. code-block:: ipython
+
+   In [4]: s.loc[[1, 2, 3]]
+   Out[4]:
+   1    2.0
+   2    3.0
+   3    NaN
+   dtype: float64
+
+
+Current Behavior
+
+.. code-block:: ipython
+
+   In [4]: s.loc[[1, 2, 3]]
+   Passing list-likes to .loc with any non-matching elements will raise
+   KeyError in the future, you can use .reindex() as an alternative.
+
+   See the documentation here:
+   http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
+
+   Out[4]:
+   1    2.0
+   2    3.0
+   3    NaN
+   dtype: float64
+
+
+Reindexing
+~~~~~~~~~~
+
+The idiomatic way to select potentially not-found elements is via ``.reindex()``. See also the section on :ref:`reindexing <basics.reindexing>`.
+
+.. ipython:: python
+
+  s.reindex([1, 2, 3])
+
+Alternatively, if you want to select only *valid* keys, the following is idiomatic and efficient; it is guaranteed to preserve the dtype of the selection.
+
+.. ipython:: python
+
+   labels = [1, 2, 3]
+   s.loc[s.index.intersection(labels)]
+
+Calling ``.reindex()`` on an object with a duplicated index will raise:
+
+.. ipython:: python
+
+   s = pd.Series(np.arange(4), index=['a', 'a', 'b', 'c'])
+   labels = ['c', 'd']
+
+.. code-block:: ipython
+
+   In [17]: s.reindex(labels)
+   ValueError: cannot reindex from a duplicate axis
+
+Generally, you can intersect the desired labels with the current
+axis, and then reindex.
+
+.. ipython:: python
+
+   s.loc[s.index.intersection(labels)].reindex(labels)
+
+However, this would *still* raise if your resulting index is duplicated.
+
+.. code-block:: ipython
+
+   In [41]: labels = ['a', 'd']
+
+   In [42]: s.loc[s.index.intersection(labels)].reindex(labels)
+   ValueError: cannot reindex from a duplicate axis
+
+
 .. _indexing.basics.partial_setting:
 
 Selecting Random Samples
@@ -852,7 +960,7 @@ when you don't know which of the sought labels are in fact present:
    s[s.index.isin([2, 4, 6])]
 
    # compare it to the following
-   s[[2, 4, 6]]
+   s.reindex([2, 4, 6])
 
 In addition to that, ``MultiIndex`` allows selecting a separate level to use
 in the membership check:

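The two alternatives recommended in the indexing.rst changes above differ in what they do with absent labels. The following is a toy model only, using a plain dict as a stand-in for a ``Series``; `reindex_like` and `select_present` are hypothetical names and are not pandas functions. It shows that a reindex-style lookup returns an entry for every requested label, while an intersection-style lookup keeps only the labels that exist.

```python
def reindex_like(data, labels, fill=None):
    """Return a value for every requested label, filling absent ones."""
    # pandas would fill with NaN; None is used here to keep the model simple
    return {label: data.get(label, fill) for label in labels}


def select_present(data, labels):
    """Keep only the requested labels that actually exist in `data`."""
    return {label: data[label] for label in labels if label in data}


s = {0: 1, 1: 2, 2: 3}  # stand-in for pd.Series([1, 2, 3])
labels = [1, 2, 3]

print(reindex_like(s, labels))    # {1: 2, 2: 3, 3: None}
print(select_present(s, labels))  # {1: 2, 2: 3}
```

Because the second form never introduces a fill value, the surviving values keep their original type, which mirrors the dtype-preservation point made in the docs.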
diff --git a/doc/source/whatsnew/v0.15.0.txt b/doc/source/whatsnew/v0.15.0.txt
@@ -676,10 +676,19 @@ Other notable API changes:
 
   Both will now return a frame reindex by [1,3]. E.g.
 
-  .. ipython:: python
+  .. code-block:: ipython
 
-     df.loc[[1,3]]
-     df.loc[[1,3],:]
+     In [3]: df.loc[[1,3]]
+     Out[3]:
+          0
+     1    a
+     3  NaN
+
+     In [4]: df.loc[[1,3],:]
+     Out[4]:
+          0
+     1    a
+     3  NaN
 
   This can also be seen in multi-axis indexing with a ``Panel``.
 
@@ -693,9 +702,14 @@ Other notable API changes:
 
   The following would raise ``KeyError`` prior to 0.15.0:
 
-  .. ipython:: python
+  .. code-block:: ipython
 
-     p.loc[['ItemA','ItemD'],:,'D']
+     In [5]: p.loc[['ItemA','ItemD'],:,'D']
+     Out[5]:
+        ItemA  ItemD
+     1      3    NaN
+     2      7    NaN
+     3     11    NaN
 
   Furthermore, ``.loc`` will raise If no values are found in a multi-index with a list-like indexer:
 

diff --git a/doc/source/whatsnew/v0.21.0.txt b/doc/source/whatsnew/v0.21.0.txt
@@ -300,6 +300,64 @@ If installed, we now require:
    | Bottleneck   | 1.0.0           |          |
    +--------------+-----------------+----------+
 
+.. _whatsnew_0210.api_breaking.loc:
+
+Indexing with a list with missing labels is Deprecated
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Previously, selecting with a list of labels where one or more labels were missing would succeed as long as at least one label was found, returning ``NaN`` for missing labels.
+This will now show a ``FutureWarning``; in the future this will raise a ``KeyError`` (:issue:`15747`).
+This warning will trigger on a ``DataFrame`` or a ``Series`` for using ``.loc[]`` or ``[[]]`` when passing a list-of-labels with at least 1 missing label.
+See the :ref:`deprecation docs <indexing.deprecate_loc_reindex_listlike>`.
+
+
+.. ipython:: python
+
+   s = pd.Series([1, 2, 3])
+   s
+
+Previous Behavior
+
+.. code-block:: ipython
+
+   In [4]: s.loc[[1, 2, 3]]
+   Out[4]:
+   1    2.0
+   2    3.0
+   3    NaN
+   dtype: float64
+
+
+Current Behavior
+
+.. code-block:: ipython
+
+   In [4]: s.loc[[1, 2, 3]]
+   Passing list-likes to .loc or [] with any missing label will raise
+   KeyError in the future, you can use .reindex() as an alternative.
+
+   See the documentation here:
+   http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
+
+   Out[4]:
+   1    2.0
+   2    3.0
+   3    NaN
+   dtype: float64
+
+The idiomatic way to select potentially not-found elements is via ``.reindex()``.
+
+.. ipython:: python
+
+  s.reindex([1, 2, 3])
+
+Selection with all keys found is unchanged.
+
+.. ipython:: python
+
+   s.loc[[1, 2]]
+
+
 .. _whatsnew_0210.api_breaking.pandas_eval:
 
 Improved error handling during item assignment in pd.eval
@@ -607,6 +665,7 @@ Deprecations
 - ``pd.TimeGrouper`` is deprecated in favor of :class:`pandas.Grouper` (:issue:`16747`)
 - ``cdate_range`` has been deprecated in favor of :func:`bdate_range`, which has gained ``weekmask`` and ``holidays`` parameters for building custom frequency date ranges. See the :ref:`documentation <timeseries.custom-freq-ranges>` for more details (:issue:`17596`)
 - passing ``categories`` or ``ordered`` kwargs to :func:`Series.astype` is deprecated, in favor of passing a :ref:`CategoricalDtype <whatsnew_0210.enhancements.categorical_dtype>` (:issue:`17636`)
+- Passing a non-existent column in ``.to_excel(..., columns=)`` is deprecated and will raise a ``KeyError`` in the future (:issue:`17295`)
 
 .. _whatsnew_0210.deprecations.argmin_min:
 

diff --git a/pandas/core/indexing.py b/pandas/core/indexing.py
@@ -1419,13 +1419,33 @@ def _has_valid_type(self, key, axis):
             if isinstance(key, tuple) and isinstance(ax, MultiIndex):
                 return True
 
-            # TODO: don't check the entire key unless necessary
-            if (not is_iterator(key) and len(key) and
-                    np.all(ax.get_indexer_for(key) < 0)):
+            if not is_iterator(key) and len(key):
 
-                raise KeyError(u"None of [{key}] are in the [{axis}]"
-                               .format(key=key,
-                                       axis=self.obj._get_axis_name(axis)))
+                # True indicates missing values
+                missing = ax.get_indexer_for(key) < 0
+
+                if np.any(missing):
+                    if len(key) == 1 or np.all(missing):
+                        raise KeyError(
+                            u"None of [{key}] are in the [{axis}]".format(
+                                key=key, axis=self.obj._get_axis_name(axis)))
+                    else:
+
+                        # we skip the warning on Categorical/Interval
+                        # as this check is actually done (check for
+                        # non-missing values), but a bit later in the
+                        # code, so we want to avoid warning & then
+                        # just raising
+                        _missing_key_warning = textwrap.dedent("""
+                        Passing list-likes to .loc or [] with any missing label will raise
+                        KeyError in the future, you can use .reindex() as an alternative.
+
+                        See the documentation here:
+                        http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike""")  # noqa
+
+                        if not (ax.is_categorical() or ax.is_interval()):
+                            warnings.warn(_missing_key_warning,
+                                          FutureWarning, stacklevel=5)
 
             return True
 

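The new branch in `_has_valid_type` distinguishes three outcomes for a list-like key: every label found, some labels missing, and no labels found. A minimal standalone sketch of that decision, assuming plain Python lists in place of an ``Index``, might look as follows. `check_list_key` is a hypothetical helper, not part of pandas, and it leaves out the Categorical/Interval exemption that the real code applies.

```python
import warnings


def check_list_key(index_labels, key):
    """Classify a list-like key against index labels (hypothetical helper)."""
    present = set(index_labels)
    # True marks a requested label that is absent from the index
    missing = [label not in present for label in key]

    if any(missing):
        if len(key) == 1 or all(missing):
            # nothing usable was requested: hard error, as before the change
            raise KeyError("None of {key} are in the index".format(key=key))
        # partially missing: still allowed for now, but deprecated
        warnings.warn(
            "Passing list-likes with any missing label will raise "
            "KeyError in the future; use .reindex() as an alternative.",
            FutureWarning,
            stacklevel=2,
        )
    return True
```

With index labels `[0, 1, 2]`, the key `[1, 2]` passes silently, `[1, 2, 3]` passes with a ``FutureWarning``, and `[7, 8]` raises ``KeyError``.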
diff --git a/pandas/core/series.py b/pandas/core/series.py
@@ -691,7 +691,7 @@ def _get_with(self, key):
 
             if key_type == 'integer':
                 if self.index.is_integer() or self.index.is_floating():
-                    return self.reindex(key)
+                    return self.loc[key]
                 else:
                     return self._get_values(key)
             elif key_type == 'boolean':

diff --git a/pandas/io/formats/excel.py b/pandas/io/formats/excel.py
@@ -356,7 +356,21 @@ def __init__(self, df, na_rep='', float_format=None, cols=None,
             self.styler = None
         self.df = df
         if cols is not None:
-            self.df = df.loc[:, cols]
+
+            # all missing, raise
+            if not len(Index(cols) & df.columns):
+                raise KeyError(
+                    "passed columns are not ALL present in dataframe")
+
+            # deprecated in gh-17295
+            # 1 missing is ok (for now)
+            if len(Index(cols) & df.columns) != len(cols):
+                warnings.warn(
+                    "Not all names specified in 'columns' are found; "
+                    "this will raise a KeyError in the future",
+                    FutureWarning)
+
+            self.df = df.reindex(columns=cols)
         self.columns = self.df.columns
         self.float_format = float_format
         self.index = index

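The column check added to the Excel formatter follows the same pattern: raise when none of the requested columns exist, warn when only some do, then reindex on the full requested list. The sketch below restates that logic with plain lists. `check_columns` is a hypothetical stand-in, and it assumes the requested names are unique, whereas the real code compares against an ``Index`` intersection.

```python
import warnings


def check_columns(requested, available):
    """Validate requested column names (hypothetical stand-in for the check)."""
    available = set(available)
    found = [name for name in requested if name in available]

    if not found:
        # every requested column is missing
        raise KeyError("none of the requested columns are present")
    if len(found) != len(requested):
        # some, but not all, are missing: deprecated, still tolerated
        warnings.warn(
            "Not all names specified in 'columns' are found; "
            "this will raise a KeyError in the future",
            FutureWarning,
            stacklevel=2,
        )
    # the caller would then reindex on the full requested list
    return list(requested)
```

Returning the full requested list reflects the switch from `df.loc[:, cols]` to `df.reindex(columns=cols)` in the diff, which keeps the missing columns in the output for now.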
diff --git a/pandas/tests/indexing/test_categorical.py b/pandas/tests/indexing/test_categorical.py
@@ -108,7 +108,8 @@ def test_loc_listlike(self):
         assert_frame_equal(result, expected, check_index_type=True)
 
         # not all labels in the categories
-        pytest.raises(KeyError, lambda: self.df2.loc[['a', 'd']])
+        with pytest.raises(KeyError):
+            self.df2.loc[['a', 'd']]
 
     def test_loc_listlike_dtypes(self):
         # GH 11586

diff --git a/pandas/tests/indexing/test_datetime.py b/pandas/tests/indexing/test_datetime.py
@@ -223,7 +223,9 @@ def test_series_partial_set_datetime(self):
                 Timestamp('2011-01-03')]
         exp = Series([np.nan, 0.2, np.nan],
                      index=pd.DatetimeIndex(keys, name='idx'), name='s')
-        tm.assert_series_equal(ser.loc[keys], exp, check_index_type=True)
+        with tm.assert_produces_warning(FutureWarning,
+                                        check_stacklevel=False):
+            tm.assert_series_equal(ser.loc[keys], exp, check_index_type=True)
 
     def test_series_partial_set_period(self):
         # GH 11497
@@ -248,5 +250,7 @@ def test_series_partial_set_period(self):
                 pd.Period('2011-01-03', freq='D')]
         exp = Series([np.nan, 0.2, np.nan],
                      index=pd.PeriodIndex(keys, name='idx'), name='s')
-        result = ser.loc[keys]
+        with tm.assert_produces_warning(FutureWarning,
+                                        check_stacklevel=False):
+            result = ser.loc[keys]
         tm.assert_series_equal(result, exp)
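The test changes above wrap each partially-missing lookup in `tm.assert_produces_warning(FutureWarning, ...)`. For readers without the pandas test helpers, a simplified standard-library analogue is sketched here. `expect_warning` is a hypothetical name, and it only checks that a matching warning was emitted, so it should not be read as a description of how the pandas helper is implemented.

```python
import warnings
from contextlib import contextmanager


@contextmanager
def expect_warning(category):
    """Fail unless the wrapped block emits a warning of `category`."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # do not let filters hide repeats
        yield caught
    if not any(issubclass(w.category, category) for w in caught):
        raise AssertionError(
            "expected {name} was not raised".format(name=category.__name__))


# usage: the block passes because a FutureWarning is emitted inside it
with expect_warning(FutureWarning):
    warnings.warn("deprecated path", FutureWarning)
```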
diff --git a/pandas/tests/indexing/test_iloc.py b/pandas/tests/indexing/test_iloc.py
@@ -617,7 +617,8 @@ def test_iloc_non_unique_indexing(self):
         expected = DataFrame(new_list)
         expected = pd.concat([expected, DataFrame(index=idx[idx > sidx.max()])
                               ])
-        result = df2.loc[idx]
+        with tm.assert_produces_warning(FutureWarning, check_stacklevel=False):
+            result = df2.loc[idx]
         tm.assert_frame_equal(result, expected, check_index_type=False)
 
     def test_iloc_empty_list_indexer_is_ok(self):