DOC: update DF.set_index #24762

h-vetinari · 2019-01-14T07:01:51Z

Split off from #24697 by request of @jorisvandenbossche & @jreback

I kept the change to the whatsnew entry for #22486, so that it at least does not advertise the ambiguous list-likes now accepted by DataFrame.set_index (these have not been in a release yet and would be removed again by #24697). That should make moving forward on this a bit easier. @toobaz

codecov · 2019-01-14T07:24:22Z

Codecov Report

Merging #24762 into master will decrease coverage by 49.47%.
The diff coverage is n/a.

@@             Coverage Diff             @@
##           master   #24762       +/-   ##
===========================================
- Coverage   92.38%   42.91%   -49.48%     
===========================================
  Files         166      166               
  Lines       52363    52363               
===========================================
- Hits        48376    22471    -25905     
- Misses       3987    29892    +25905

Flag	Coverage Δ
#multiple	`?`
#single	`42.91% <ø> (-0.01%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/frame.py	`35.74% <ø> (-61.19%)`	⬇️
pandas/io/formats/latex.py	`0% <0%> (-100%)`	⬇️
pandas/core/categorical.py	`0% <0%> (-100%)`	⬇️
pandas/io/sas/sas_constants.py	`0% <0%> (-100%)`	⬇️
pandas/tseries/plotting.py	`0% <0%> (-100%)`	⬇️
pandas/tseries/converter.py	`0% <0%> (-100%)`	⬇️
pandas/io/formats/html.py	`0% <0%> (-99.35%)`	⬇️
pandas/core/groupby/categorical.py	`0% <0%> (-95.46%)`	⬇️
pandas/io/sas/sas7bdat.py	`0% <0%> (-91.17%)`	⬇️
pandas/io/sas/sas_xport.py	`0% <0%> (-90.15%)`	⬇️
... and 124 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 453fa85...faf8bcc. Read the comment docs.

codecov · 2019-01-14T07:24:22Z

Codecov Report

Merging #24762 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #24762      +/-   ##
==========================================
- Coverage   92.38%   92.38%   -0.01%     
==========================================
  Files         166      166              
  Lines       52379    52377       -2     
==========================================
- Hits        48392    48389       -3     
- Misses       3987     3988       +1

Flag	Coverage Δ
#multiple	`90.81% <100%> (-0.01%)`	⬇️
#single	`42.92% <55.55%> (ø)`	⬆️

Impacted Files	Coverage Δ
pandas/core/frame.py	`96.92% <100%> (-0.01%)`	⬇️
pandas/util/testing.py	`88.04% <0%> (-0.1%)`	⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 512830b...613ebed. Read the comment docs.

doc/source/whatsnew/v0.24.0.rst

h-vetinari · 2019-01-16T07:45:57Z

@jreback: is this a change in current behavior? or just original text?

Before #22486, df.set_index only allowed explicitly enumerated types (keys, Series, Index, MultiIndex, np.ndarray and list). You required me to change that to is_list_like - I cautioned that this was changing behaviour (see here), but complied with your review.

Strictly speaking this was orthogonal to the goal of fixing #22484, which is likely part of the reason why other devs like @toobaz missed it and now object.

#24697 would rectify this, but since that is on hold, the least one can do (as noted in the OP) is to not advertise the addition of these ambiguous (and contested) list-likes. Hence, I'm changing the whatsnew-note added by #22486 to only reflect what it was supposed to fix: the three points from #22484.

jreback · 2019-01-16T13:59:28Z

@TomAugspurger can you have a look here, I think we may need to revert #22486 to avoid any changes

TomAugspurger · 2019-01-16T14:22:30Z

Sorry I can't keep this straight. What behavior changed? Something to do with set_index and tuples?

jreback · 2019-01-16T15:12:49Z

I thought there were no changes, but I guess there are in #22486. @h-vetinari can you show what changed? There was a lot of back and forth on that PR. I don't think anything should have changed, but it seems it did.

h-vetinari · 2019-01-16T15:53:39Z

@jreback: @h-vetinari can you show what changed?

For the sake of discussion, let's assume we have a list keys with elements col.

- Before #22486: an elif-chain of instance checks for col (specifically: MultiIndex, Index, Series, np.ndarray, list), ending in frame[col] (hence KeyErrors for types that were not tested, see #22484).
- First approach for #22486 (no change in the allowed types, except for using the ABC versions, as you had asked):

    if (is_scalar(col) or isinstance(col, tuple)) and col in frame:
        pass  # all good here
    elif isinstance(col, (ABCSeries, ABCIndexClass, np.ndarray, list)):
        pass  # all good here as well (ABCIndexClass includes MultiIndex)
    else:
        raise TypeError  # placeholder: the exact exception is elided here

- You required me to use is_list_like instead of instance checks in this thread. I warned about the tuple case and the change in behaviour, but eventually wanted to move forward and complied.
- Then I used if is_list_like(col) and not isinstance(col, set) to avoid sets (which would be broken from the start), which you also struck down, requiring me to implement #23065.
- Final form used in #22486, clearly mentioning the tuple ambiguity, i.e. how a tuple is checked first as a key and then as a list-like:

        for col in keys:
            if (is_scalar(col) or isinstance(col, tuple)) and col in self:
                # tuples can be both column keys or list-likes
                # if they are valid column keys, everything is fine
                continue
            elif is_scalar(col) and col not in self:
                # tuples that are not column keys are considered list-like,
                # not considered missing
                missing.append(col)
            elif (not is_list_like(col, allow_sets=False)
                  or getattr(col, 'ndim', 1) > 1):
                raise TypeError('The parameter "keys" may only contain a '
                                'combination of valid column keys and '
                                'one-dimensional list-likes')
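
To make the difference between the two kinds of checks concrete, here is a minimal pure-Python sketch. The helpers passes_instance_check and passes_list_like_check are hypothetical stand-ins written for this illustration, not the pandas functions; in particular, the whitelist only contains list because Index, Series and np.ndarray are not available in the standard library.

```python
from collections.abc import Iterable

# Hypothetical stand-ins for the two kinds of checks discussed above.
# These are NOT the pandas implementations.
ARRAY_TYPES = (list,)  # pandas also whitelisted Index, Series, np.ndarray


def passes_instance_check(col):
    """Explicit whitelist of types, in the spirit of the elif-chain."""
    return isinstance(col, ARRAY_TYPES)


def passes_list_like_check(col, allow_sets=True):
    """Duck-typed check: any non-string iterable counts as list-like."""
    if isinstance(col, (str, bytes)):
        return False
    if not allow_sets and isinstance(col, (set, frozenset)):
        return False
    return isinstance(col, Iterable)


candidates = {
    "list": [1, 2, 3],
    "tuple": (1, 2, 3),
    "iterator": iter([1, 2, 3]),
    "frozenset": frozenset("A"),
    "str": "A",
}

# Map each candidate to (instance-check result, list-like result).
results = {
    name: (passes_instance_check(obj),
           passes_list_like_check(obj, allow_sets=False))
    for name, obj in candidates.items()
}

for name, (by_type, by_duck) in results.items():
    print(f"{name:10s} instance-check={by_type!s:5s} list-like={by_duck}")
```

Under these assumptions, tuples and iterators are the inputs on which the two checks disagree, which is where the behaviour change discussed in this thread comes from.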

The source of the back-and-forth in #22486 was your orthogonal review requirement. I warned about this and went to great lengths (i.e. #23065) to avoid introducing something fundamentally broken (i.e. sets), but you were crystal clear about wanting to avoid the (pre-existing) instance checks. But OK, you're reviewing basically everything here, so oversights and misunderstandings can happen.

To me, the right approach would be merging #24697, which is a small change, and also paves the way to (eventually) solve #24046 and #22225.

TomAugspurger · 2019-01-16T16:17:29Z

@h-vetinari do you have a short example that behaves differently between 0.23.4 and master?

h-vetinari · 2019-01-16T19:03:39Z

@TomAugspurger

Let's start with:

>>> df = pd.DataFrame(np.arange(9).reshape((3, 3)), columns=['A', 'B', ('t', 'p', 'l')])
>>> df
   A  B  (t, p, l)
0  0  1          2
1  3  4          5
2  6  7          8

call	0.23.4	master	comment
`df.set_index(['A', 'A'], drop=False)`	works	works
`df.set_index(['A', 'A'], drop=True)`	`KeyError`	works	#22484
`df.set_index(['C', 'D', 'E'])`	`KeyError: 'C'` (big stacktrace)	`KeyError: 'C', 'D', 'E'`	#22484
`df.set_index(frozenset('A'))`	cryptic `KeyError`	reasonable `TypeError`	#22484 (any input type outside of a handful is tested as a key in 0.23.4)
`df.set_index(iter([1, 2, 3]))`	cryptic `KeyError`	sets `Index([1, 2, 3])`	#22484; list-likes (of correct length) now pass
`df.set_index(('t', 'p', 'l'))`	sets index to column `('t', 'p', 'l')`	sets index to column `('t', 'p', 'l')`	here and below, it doesn't matter if tuple is wrapped in list
`df.set_index(('t', 'p', 'm'))`	`KeyError: ('t', 'p', 'm')`	sets `Index(['t', 'p', 'm'])`	on master, tuples get tried first as key, then as list-likes

It's the tuples that are now ambiguous (although with well-defined precedence; length must match of course for the list-like case). #24697 would take the fixes of #22484, keep the master-behaviour for tuples, and solve the list-ambiguity:

call	0.23.4	#24697	comment
`df.set_index(['A', 'A'], drop=False)`	works	works
`df.set_index(['A', 'A'], drop=True)`	`KeyError`	works	#22484
`df.set_index(['C', 'D', 'E'])`	`KeyError: 'C'` (big stacktrace)	`KeyError: 'C', 'D', 'E'`	#22484
`df.set_index(frozenset('A'))`	cryptic `KeyError`	reasonable `ValueError`	#22484
`df.set_index(iter([1, 2, 3]))`	cryptic `KeyError`	reasonable `ValueError`	#22484
`df.set_index(('t', 'p', 'l'))`	sets index to column `('t', 'p', 'l')`	same
`df.set_index(('t', 'p', 'm'))`	`KeyError: ('t', 'p', 'm')`	same
`df.set_index(['A', 'B', 'A'])`	sets to MultiIndex of columns `['A', 'B', 'A']`	same
`df.set_index([['A', 'B', 'A']])`	sets to `Index(['A', 'B', 'A'])`	deprecated
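
The tuple rows in the tables can be illustrated with a small pure-Python sketch. Both resolver functions are hypothetical and only model the precedence rule described in the comments ("first as key, then as list-like" versus "key only"); they do not call pandas, and the length check against the frame is left out.

```python
def resolve_key_only(col, columns):
    """Model of the 0.23.4 rows: a tuple is only ever a column key."""
    if col in columns:
        return "column key"
    raise KeyError(col)


def resolve_key_then_list_like(col, columns):
    """Model of the master rows: try as key first, then as list-like."""
    if col in columns:
        return "column key"
    if isinstance(col, tuple):
        # The length check against the frame is omitted in this sketch.
        return "array of values"
    raise KeyError(col)


# Same column labels as in the example frame above.
columns = ["A", "B", ("t", "p", "l")]

for col in [("t", "p", "l"), ("t", "p", "m")]:
    try:
        old = resolve_key_only(col, columns)
    except KeyError as err:
        old = f"KeyError: {err}"
    new = resolve_key_then_list_like(col, columns)
    print(col, "| key only:", old, "| key, then list-like:", new)
```

The point of the sketch is that the same tuple changes meaning depending on whether it happens to match a column label, which is the ambiguity being objected to.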

TomAugspurger · 2019-01-16T20:24:02Z

It's still a bit hard to follow, but I don't see anything in there that needs to be reverted.

toobaz · 2019-01-16T21:46:03Z

@TomAugspurger: It's still a bit hard to follow, but I don't see anything in there that needs to be reverted.

I also didn't follow everything, but

In [3]: pd.DataFrame([[1, 2], [3, 4]]).set_index([('a', 'b')])
Out[3]: 
   0  1
a  1  2
b  3  4

whereas until the last release this raised KeyError: ('a', 'b'), which is the desired behavior, because we don't want to encourage using tuples as list-likes.

h-vetinari · 2019-01-16T22:09:13Z

@TomAugspurger: It's still a bit hard to follow, but I don't see anything in there that needs to be reverted.

@toobaz: I also didn't follow everything

Not sure how I can make it easier. I gave single line examples with behaviour before/after. Another way to look at it is that:

- Before #22486, the list items were applied as the list was unpacked (within an elif-chain based on type).
- With #22486, an inspection step for the (outer) list comes first and raises if columns are missing or the wrong types are used. That check does not in itself cause the ambiguity of tuples (or lists); the ambiguity came from the request to change from roughly isinstance(col, (ABCIndexClass, ABCSeries, np.ndarray, list)) to is_list_like(col, allow_sets=False).

I'd suggest keeping that inspection step, but changing the requirements that the list elements have to satisfy.

I agree with you that there's nothing fundamentally broken (behaviour has well-defined rules, plus DFs are usually way longer than tuples), just that now there's more ambiguity instead of less (there's already list_of_scalars as keys vs [list_of_scalars] as an array; plus now tuple_as_key vs tuple_as_array). #24697 would get rid of both those ambiguities.
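
As a rough illustration of the inspection step and of the list ambiguity mentioned above, the following pure-Python sketch checks the whole outer list up front and reports all missing keys at once. The function inspect_keys and its role labels are hypothetical and are not taken from pandas.

```python
def inspect_keys(keys, columns):
    """Inspect the outer list before doing any work.

    A nested list is treated as an array of index values, anything else
    as a column key. All missing keys are reported in a single KeyError.
    """
    roles = []
    missing = []
    for col in keys:
        if isinstance(col, list):
            roles.append("array")
        elif col in columns:
            roles.append("key")
        else:
            missing.append(col)
    if missing:
        raise KeyError(missing)
    return roles


columns = ["A", "B"]

# A flat list of scalars is read as column keys ...
print(inspect_keys(["A", "B", "A"], columns))

# ... while the same list nested once is read as a single array.
print(inspect_keys([["A", "B", "A"]], columns))

# Missing keys are collected and reported together.
try:
    inspect_keys(["C", "D", "E"], columns)
except KeyError as err:
    print("missing:", err.args[0])
```

Collecting the missing keys before raising is what allows one error that names every missing column, instead of failing on the first one.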

toobaz · 2019-01-16T22:25:56Z

@toobaz: I also didn't follow everything

@h-vetinari: Not sure how I can make it easier.

Sorry, I should have more honestly written "I didn't have time to read everything". Will do it before the end of the week.

jreback · 2019-01-17T12:29:53Z

@h-vetinari OK, so it looks like this change slipped through. Please update to switch back to the original inspection code, i.e. from is_list_like(col, allow_sets=False) back to isinstance(col, (ABCIndexClass, ABCSeries, np.ndarray, list)), and make only this change.

h-vetinari · 2019-01-17T22:55:51Z

@jreback: @h-vetinari OK, so it looks like this change slipped through. Please update to switch back to the original inspection code, i.e. from is_list_like(col, allow_sets=False) back to isinstance(col, (ABCIndexClass, ABCSeries, np.ndarray, list)), and make only this change.

I pondered opening a new PR for this, but in any case, this would have to be coupled with doc changes, so why not do it here.

I removed the types that were added by #22486 and reinstated the instance checks. This needs to happen in two places, though: first where Index/Series/np.ndarray are passed bare (they get wrapped in a list so that we can iterate over the container), and then when inspecting the container.

The test changes just remove the iter/tuple cases from the tests that are no longer supposed to pass and move them down to the tests that should fail.

h-vetinari · 2019-01-17T23:01:33Z

pandas/core/frame.py

+        if (is_scalar(keys) or isinstance(keys, tuple)
+                or isinstance(keys, (ABCIndexClass, ABCSeries, np.ndarray))):
+            # make sure we have a container of keys/arrays we can iterate over
+            # tuples can appear as valid column keys!
            keys = [keys]


Strictly speaking, it would be possible to just keep wrapping everything that's not a list into a list and raise in the for-loop below. But that's a bit hard to grok, and explicit is better than implicit, no?
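
The wrapping step in the diff above can be sketched in plain Python as follows. FakeArray, SCALAR_TYPES and as_key_container are hypothetical names introduced only for this illustration: FakeArray stands in for Index / Series / np.ndarray, and the scalar test is a simplified replacement for is_scalar.

```python
class FakeArray:
    """Hypothetical stand-in for Index / Series / np.ndarray."""

    def __init__(self, values):
        self.values = list(values)


# Simplified replacement for is_scalar; an assumption of this sketch.
SCALAR_TYPES = (str, bytes, int, float, bool)


def as_key_container(keys):
    """Wrap a bare key, tuple or array in a list so it can be iterated.

    Tuples are wrapped rather than unpacked because they can be valid
    column keys. Anything else is passed through and is expected to be
    validated element by element afterwards.
    """
    if (isinstance(keys, SCALAR_TYPES) or isinstance(keys, tuple)
            or isinstance(keys, FakeArray)):
        return [keys]
    return keys


arr = FakeArray([1, 2, 3])
for keys in ("A", ("t", "p", "l"), arr, ["A", "B"]):
    container = as_key_container(keys)
    print(type(keys).__name__, "->", len(container), "element(s)")
```

Listing the wrapped types explicitly, rather than wrapping everything that is not a list, keeps unsupported inputs visible so the later loop can reject them with a clear error.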

h-vetinari · 2019-01-18T17:27:56Z

@jreback
I added the requested changes. PTAL.

jreback · 2019-01-19T21:18:56Z

thanks @h-vetinari

h-vetinari · 2019-01-20T02:39:47Z

Glad we could fix this before the release. Thanks.

DOC: update DF.set_index

894a080

h-vetinari mentioned this pull request Jan 14, 2019

DEPR/API: disallow lists within list for set_index #24697

Closed

5 tasks

oversights

faf8bcc

jorisvandenbossche added the Docs label Jan 14, 2019

jorisvandenbossche added this to the 0.24.0 milestone Jan 14, 2019

jreback requested changes Jan 16, 2019

h-vetinari added 2 commits January 17, 2019 22:58

Merge remote-tracking branch 'upstream/master' into set_index_docs

8401fad

Revert addition of list-likes to df.set_index

18597e2

Remove dead code

613ebed

h-vetinari commented Jan 17, 2019

jreback approved these changes Jan 19, 2019

jreback merged commit e984947 into pandas-dev:master Jan 19, 2019

h-vetinari deleted the set_index_docs branch January 20, 2019 02:31

jorisvandenbossche mentioned this pull request Jan 28, 2019

Regression in DataFrame.set_index with class instance column keys #24969

Closed

h-vetinari mentioned this pull request Feb 1, 2019

API/ERR: allow iterators in df.set_index & improve errors #24984

Merged

3 tasks

h-vetinari added a commit to h-vetinari/pandas that referenced this pull request Feb 1, 2019

Revert "DOC: update DF.set_index (pandas-dev#24762)"

c397839

This reverts commit e984947.

h-vetinari mentioned this pull request Feb 1, 2019

Revert set_index inspection/error handling for 0.24.1 #25085

Merged

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019

DOC: update DF.set_index (pandas-dev#24762)

287a5d7

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019

DOC: update DF.set_index (pandas-dev#24762)

33d7915
