Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: warning to raise KeyError in the future if not all elements of a list are selected via .loc #17295

Merged
merged 1 commit into from
Oct 3, 2017

Conversation

jreback
Copy link
Contributor

@jreback jreback commented Aug 20, 2017

closes #15747

@jreback jreback added Deprecate Functionality to remove in pandas Indexing Related to indexing on series/frames, not to indexes themselves labels Aug 20, 2017
@jreback
Copy link
Contributor Author

jreback commented Aug 20, 2017

still lots of warnings.

@codecov
Copy link

codecov bot commented Aug 20, 2017

Codecov Report

Merging #17295 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17295      +/-   ##
==========================================
- Coverage   91.25%   91.24%   -0.02%     
==========================================
  Files         163      163              
  Lines       49859    49866       +7     
==========================================
  Hits        45499    45499              
- Misses       4360     4367       +7
Flag Coverage Δ
#multiple 89.04% <100%> (ø) ⬆️
#single 40.24% <21.42%> (-0.07%) ⬇️
Impacted Files Coverage Δ
pandas/core/indexing.py 93.97% <100%> (+0.03%) ⬆️
pandas/core/series.py 95.04% <100%> (ø) ⬆️
pandas/io/formats/excel.py 96.69% <100%> (+0.04%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.73% <0%> (-0.1%) ⬇️
pandas/core/internals.py 94.37% <0%> (-0.01%) ⬇️
pandas/plotting/_style.py 74.28% <0%> (ø) ⬆️
pandas/core/indexes/base.py 96.46% <0%> (+0.11%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2e2093e...44ce09d. Read the comment docs.

@jreback jreback changed the title WIP/API: warning to raise KeyError in the future if not all elements of a list are selected via .loc API: warning to raise KeyError in the future if not all elements of a list are selected via .loc Aug 22, 2017
@jreback jreback added this to the 0.21.0 milestone Aug 22, 2017
@jreback
Copy link
Contributor Author

jreback commented Aug 22, 2017

ok this is ready to go.

@jorisvandenbossche @TomAugspurger
cc @toobaz


s.loc[[1, 2]]

Previously Behavior
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previous Behavior

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No?!

@toobaz
Copy link
Member

toobaz commented Aug 22, 2017

Cool!

Notice however that the warning is not being raised for MultiIndexes.

In [2]: pd.DataFrame([[i] for i in range(3)], index=pd.MultiIndex.from_product([[1], [5,6,7]])).loc[[(1,5), (3,7)]]
Out[2]: 
       0
1 5  0.0
3 7  NaN

I assume/hope that we will want to change the behavior with partial indexing too (despite the current behaviour being different):

In [3]: pd.DataFrame([[i] for i in range(3)], index=pd.MultiIndex.from_product([[1], [5,6,7]])).loc[[1,3]]
Out[3]: 
     0
1 5  0
  6  1
  7  2

So maybe this too should raise a warning (although the whatsnew notes would maybe require some addition). Can go in another PR however.

@jreback
Copy link
Contributor Author

jreback commented Aug 22, 2017

So maybe this too should raise a warning (although the whatsnew notes would maybe require some addition). Can go in another PR however.

yeah this doesn't cover MultiIndex cases at all, but agree it should/could. welcome for PR's on top.

@toobaz
Copy link
Member

toobaz commented Aug 22, 2017

One more thought: users might have in their code (I suspect I do) things like

a.loc[list_like].dropna()

... in which case the suggested replacement is not a.reindex(list_like).dropna() but rather a.loc[a.index.intersection(list_like)] (which avoids a step, and possibly a type casting of ints to floats). Should this second option maybe be mentioned?

@jreback
Copy link
Contributor Author

jreback commented Aug 22, 2017

@toobaz in your comment #17295 (comment)

yes I think the warning could refer to a new section in the docs (like we do for .ix). where a more extended example could be. (and include this effect as well). let me add.

@jreback
Copy link
Contributor Author

jreback commented Aug 22, 2017

updated with nicer message and doc section.

@TomAugspurger TomAugspurger mentioned this pull request Aug 22, 2017
6 tasks

.. warning::

Starting in 0.21.0, using ``.loc`` with missing keys in a list. This is deprecated, in favor of ``.reindex``.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"using .loc with missing keys in a list. This is deprecated" -> "using .loc with a list-like containing missing keys"?

s.reindex([1, 2, 3])


Alternatively if you want to selected only *valid* keys, the following is idiomatic; furthermore this will preserve the dtype of the selection.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

want to select

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe I would also replace "this will preserve the dtype of the selection" with "this is more efficient, and is guaranteed to preserve the dtype of the selection"

@toobaz
Copy link
Member

toobaz commented Aug 23, 2017

I think it is OK, just the two sentences commented above and the "Previously Behavior" sound a bit weird.

@jreback
Copy link
Contributor Author

jreback commented Aug 23, 2017

ok pushed. @jorisvandenbossche @TomAugspurger if you'd have a look.

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. One concern I just realized is we may be stranding users with duplicates in their index.

In [15]: s = pd.Series(1, index=['a', 'a', 'b', 'c'])

In [16]: s.loc[['c', 'd']]
Out[16]:
c    1.0
d    NaN
dtype: float64

In [17]: s.reindex(['c', 'd'])
...
ValueError: cannot reindex from a duplicate axis

I haven't kept up with whether reindexing into a Index with dupes is supposed to work on not...


.. warning::

Starting in 0.21.0, using ``.loc`` with a list-like containing missing keys. This is deprecated, in favor of ``.reindex``.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can remove “this”

s = Series([1, 2, 3])
s

Selection with all keys found is unchanged.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would move this one to the end. Good to have in this section though.

Current Behavior

In [4]: s.loc[[1, 2, 3]]
/Users/jreback/miniconda3/envs/pandas/bin/ipython:1: FutureWarning: passing list-likes to .loc with any non-matching elements will raise
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could remove most of your user path here

3 NaN
dtype: float64

The idiomatic way to achieve selecting potentially not-found elmenents if via ``.reindex()``
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If -> is

@toobaz
Copy link
Member

toobaz commented Aug 24, 2017

I haven't kept up with whether reindexing into a Index with dupes is supposed to work on not...

Indeed, it can't work in general - assume you had done s.reindex(['a', 'b']), the expected behavior would be unclear because 'a' is duplicated.

So maybe the s.loc[s.index.intersection(keys)] option (or does s.loc[s.index & keys] look better?) should be emphasized for solving this case too. To my eyes, it is actually the idiomatic replacement, because it is hard to conceive that one would use s.loc[list_like] and be interested in keys not in s...

@toobaz
Copy link
Member

toobaz commented Aug 24, 2017

(Edited my previous comment, which in that form didn't make any sense)

@jreback
Copy link
Contributor Author

jreback commented Aug 24, 2017

So maybe the s.loc[s.index.intersection(keys)] option (or does s.loc[s.index & keys] look better?) should be emphasized for solving this case too. To my eyes, it is actually the idiomatic replacement, because it is hard to conceive that one would use s.loc[list_like] and be interested in keys not in s...

changed; that is quite idiomatic

@jreback
Copy link
Contributor Author

jreback commented Aug 24, 2017

fixed according to @TomAugspurger and @toobaz comments.

@jorisvandenbossche
Copy link
Member

(Will take a look at this tomorrow evening)

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only did a new pass through the docs, will look at the rest tomorrow.

@@ -333,8 +333,15 @@ Selection By Label

dfl.loc['20130102':'20130104']

.. warning::

Starting in 0.21.0, pandas will show a ``FutureWarning`` if indexing with a list-of-lables and not ALL labels are present. In the future
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lables -> labels

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still al 'lables' -> labels (the first occurrence of the word)

.. warning::

Starting in 0.21.0, pandas will show a ``FutureWarning`` if indexing with a list-of-lables and not ALL labels are present. In the future
this will raise a ``KeyError``. See :ref:`list-like Using loc with missing keys in a list is Deprecated <indexing.deprecate_loc_reindex_listlike>`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: Using -> using (a capital in the middle of the sentence is a bit strange)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the text within the ref does not read as grammatically correct. Maybe the 'list-like' needs to be removed? (list is mentioned later in the sentence)


.. warning::

Starting in 0.21.0, using ``.loc`` with a list-like containing missing key, is deprecated, in favor of ``.reindex``.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not only loc, also []


.. ipython:: python

s = Series([1, 2, 3])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this one


.. ipython:: python

s.loc[s.index & ['c', 'd']]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Repeating my comment:

  • I would use intersection instead of &
  • this is not a correct example (the alternative you provide does not give the same result as the deprecated one. As I say above I am not sure if there is actually an alternative, but if there is not, we should just remove this example)


Previously, selecting at least 1 valid key with a list-like indexer would succeed and return ``NaN`` for non-found elements.
This is exactly the function of ``.reindex()``. This will now show a ``FutureWarning`` message; in the future this will raise ``KeyError`` (:issue:`15747`).
This warning will trigger on a ``DataFrame`` or a ``Series`` for using ``.loc[]`` on the Index, or ``[[]]`` on a ``Series``.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would remove on the Index, as you don't call loc on the index.

Also, maybe we could just say that it is triggered for both .loc[] and [] (without needing to make the distinction of dataframe and series. Could mention after that that [] on dataframe already raises KeyError now)

# code, so we want to avoid warning & then
# just raising
_missing_key_warning = textwrap.dedent("""
Passing list-likes to .loc with any non-matching labels will raise
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not only .loc, also []
(now you get this error message also when using s[], which will be confusing)

@jreback
Copy link
Contributor Author

jreback commented Sep 26, 2017

updated

@jreback
Copy link
Contributor Author

jreback commented Oct 1, 2017

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks good!
Added another bunch of doc comments.

Also, there are still two warnings in the docs (I didn't check whether you just need to add :okwarning: or rather need to update the example):

>>>-------------------------------------------------------------------------

Warning in /tmp/doc/source/advanced.rst at block ending on line 1015

Specify :okwarning: as an option in the ipython:: block to suppress this message

----------------------------------------------------------------------------

/home/travis/build/pandas-dev/pandas/pandas/core/series.py:694: FutureWarning: 

Passing list-likes to .loc or [] with any missing label will raise

KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:

http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike

  return self.loc[key]

<<<-------------------------------------------------------------------------
>>>-------------------------------------------------------------------------

Warning in /tmp/doc/source/indexing.rst at block ending on line 963

Specify :okwarning: as an option in the ipython:: block to suppress this message

----------------------------------------------------------------------------

/home/travis/build/pandas-dev/pandas/pandas/core/series.py:694: FutureWarning: 

Passing list-likes to .loc or [] with any missing label will raise

KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:

http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike

  return self.loc[key]

<<<-------------------------------------------------------------------------

@@ -333,8 +333,15 @@ Selection By Label

dfl.loc['20130102':'20130104']

.. warning::

Starting in 0.21.0, pandas will show a ``FutureWarning`` if indexing with a list-of-lables and not ALL labels are present. In the future
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still al 'lables' -> labels (the first occurrence of the word)

.. warning::

Starting in 0.21.0, pandas will show a ``FutureWarning`` if indexing with a list-of-lables and not ALL labels are present. In the future
this will raise a ``KeyError``. See :ref:`list-like Using loc with missing keys in a list is Deprecated <indexing.deprecate_loc_reindex_listlike>`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the text within the ref does not read as grammatically correct. Maybe the 'list-like' needs to be removed? (list is mentioned later in the sentence)


Current Behavior

In [4]: s.loc[[1, 2, 3]]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing .. code-block:: ipython

s = pd.Series(np.arange(4), index=['a', 'a', 'b', 'c'])
labels = ['c', 'd']

.. code-block:: python
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

python -> ipython


However, this would *still* raise if your resulting index is duplicated.

.. code-block:: python
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

python -> ipython

@@ -268,6 +268,63 @@ We have updated our minimum supported versions of dependencies (:issue:`15206`,
| Bottleneck | 1.0.0 | |
+--------------+-----------------+----------+

.. _whatsnew_0210.api_breaking.loc:

Indexing with missing list-of-labels is Deprecated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe "Indexing with list with missing labels .. " ? (I find the 'missing list-of-labels' a bit confusing)


Current Behavior

In [4]: s.loc[[1, 2, 3]]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing .. code-block:: ipython


# all missing, raise
if not len(Index(cols) & df.columns):
raise KeyError
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you provide an error message?

# 1 missing is ok
# TODO(jreback) this should raise
# on *any* missing columns
self.df = df.reindex(columns=cols)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This indeed currently works, but I think it is certainly OK to change that. Or maybe also raise futurewarning as with loc?

@jreback
Copy link
Contributor Author

jreback commented Oct 1, 2017

push to fix all @jorisvandenbossche comments. I deprecated passing 1 or more missing .to_excel(..., columns=)

@jreback jreback modified the milestone: 0.21.0 Oct 2, 2017
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comment about wording in error message. For the rest good to merge!

warnings.warn(
"columns must be a subset of the "
"dataframe columns; this will raise "
"a KeyError in the future",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When you only see that error message, it is not directly clear that on the the names you passed in columns is missing (you directly say the result). Can you clarify that?
Maybe like "Not all names specified in 'columns' are found(or present); this will raise a KeyError in the future"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

@jreback
Copy link
Contributor Author

jreback commented Oct 2, 2017

ok pushed

@jorisvandenbossche
Copy link
Member

This is really nice, thanks @jreback !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Deprecate Functionality to remove in pandas Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Proposal to change behaviour with .loc and missing keys
4 participants