DOC: update DF.set_index #24762

Merged
merged 5 commits into from
Jan 19, 2019


h-vetinari
Contributor

Split off from #24697 by request of @jorisvandenbossche & @jreback

I kept the change to the whatsnew entry for #22486, so that it at least does not emphasize that ambiguous list-likes are now accepted by DataFrame.set_index (they haven't seen a release yet and would be removed again by #24697). That should make moving forward on this a bit easier. @toobaz

@codecov

codecov bot commented Jan 14, 2019

Codecov Report

Merging #24762 into master will decrease coverage by 49.47%.
The diff coverage is n/a.


@@             Coverage Diff             @@
##           master   #24762       +/-   ##
===========================================
- Coverage   92.38%   42.91%   -49.48%     
===========================================
  Files         166      166               
  Lines       52363    52363               
===========================================
- Hits        48376    22471    -25905     
- Misses       3987    29892    +25905
| Flag | Coverage Δ |
|---|---|
| #multiple | ? |
| #single | 42.91% <ø> (-0.01%) ⬇️ |

| Impacted Files | Coverage Δ |
|---|---|
| pandas/core/frame.py | 35.74% <ø> (-61.19%) ⬇️ |
| pandas/io/formats/latex.py | 0% <0%> (-100%) ⬇️ |
| pandas/core/categorical.py | 0% <0%> (-100%) ⬇️ |
| pandas/io/sas/sas_constants.py | 0% <0%> (-100%) ⬇️ |
| pandas/tseries/plotting.py | 0% <0%> (-100%) ⬇️ |
| pandas/tseries/converter.py | 0% <0%> (-100%) ⬇️ |
| pandas/io/formats/html.py | 0% <0%> (-99.35%) ⬇️ |
| pandas/core/groupby/categorical.py | 0% <0%> (-95.46%) ⬇️ |
| pandas/io/sas/sas7bdat.py | 0% <0%> (-91.17%) ⬇️ |
| pandas/io/sas/sas_xport.py | 0% <0%> (-90.15%) ⬇️ |

... and 124 more


Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 453fa85...faf8bcc. Read the comment docs.

@codecov

codecov bot commented Jan 14, 2019

Codecov Report

Merging #24762 into master will decrease coverage by <.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master   #24762      +/-   ##
==========================================
- Coverage   92.38%   92.38%   -0.01%     
==========================================
  Files         166      166              
  Lines       52379    52377       -2     
==========================================
- Hits        48392    48389       -3     
- Misses       3987     3988       +1
| Flag | Coverage Δ |
|---|---|
| #multiple | 90.81% <100%> (-0.01%) ⬇️ |
| #single | 42.92% <55.55%> (ø) ⬆️ |

| Impacted Files | Coverage Δ |
|---|---|
| pandas/core/frame.py | 96.92% <100%> (-0.01%) ⬇️ |
| pandas/util/testing.py | 88.04% <0%> (-0.1%) ⬇️ |


Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 512830b...613ebed. Read the comment docs.

@jorisvandenbossche jorisvandenbossche added this to the 0.24.0 milestone Jan 14, 2019
@h-vetinari
Contributor Author

h-vetinari commented Jan 16, 2019

@jreback: is this a change in current behavior? or just original text?

Before #22486, df.set_index only allowed explicitly enumerated types (keys, Series, Index, MultiIndex, np.ndarray and list). You required me to change that to is_list_like - I cautioned that this was changing behaviour (see here), but complied with your review.

Strictly speaking this was orthogonal to the goal of fixing #22484, which is likely part of the reason why other devs like @toobaz missed it and now object.

#24697 would rectify this, but since that is on hold, the least one can do (as noted in the OP) is to not advertise the addition of these ambiguous (and contested) list-likes. Hence, I'm changing the whatsnew-note added by #22486 to only reflect what it was supposed to fix: the three points from #22484.

@jreback
Contributor

jreback commented Jan 16, 2019

@TomAugspurger can you have a look here, I think we may need to revert #22486 to avoid any changes

@TomAugspurger
Contributor

Sorry I can't keep this straight. What behavior changed? Something to do with set_index and tuples?

@jreback
Contributor

jreback commented Jan 16, 2019

I thought there were no changes, but I guess there are in #22486. @h-vetinari can you show what changed? There was a lot of back and forth on that PR. I don't think anything should have changed, but it seems it did.

@h-vetinari
Contributor Author

h-vetinari commented Jan 16, 2019

@jreback: @h-vetinari can you show what changed

For the sake of discussion, let's assume we have a list keys with elements col. Before #22486, the check on each col was roughly:

if (is_scalar(col) or isinstance(col, tuple)) and col in frame:
    # all good here
elif isinstance(col, (ABCSeries, ABCIndexClass, np.ndarray, list)):
    # all good here as well (ABCIndexClass includes MultiIndex)
else:
    # raise
  • you required me to use is_list_like instead of instance-checks in this thread. I warned about the tuple case and the change in behaviour, but eventually wanted to move forward and complied.
  • then I used if is_list_like(col) and not isinstance(col, set) to avoid sets (which would be broken from the start), which you also struck down, requiring me to implement #23065 (Add allow_sets-kwarg to is_list_like).
  • final form used in #22486 (API: better error-handling for df.set_index), clearly mentioning the tuple-ambiguity, i.e. that a tuple is checked first as a key and then as a list-like:
        for col in keys:
            if (is_scalar(col) or isinstance(col, tuple)) and col in self:
                # tuples can be both column keys or list-likes
                # if they are valid column keys, everything is fine
                continue
            elif is_scalar(col) and col not in self:
                # tuples that are not column keys are considered list-like,
                # not considered missing
                missing.append(col)
            elif (not is_list_like(col, allow_sets=False)
                  or getattr(col, 'ndim', 1) > 1):
                raise TypeError('The parameter "keys" may only contain a '
                                'combination of valid column keys and '
                                'one-dimensional list-likes')
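
To make the precedence in the loop above easier to follow, here is a minimal standalone sketch that uses only the standard library. The helpers `is_scalar_like`, `is_list_like_no_sets` and `classify_key` are hypothetical stand-ins written for this illustration (they only approximate pandas' `is_scalar` and `is_list_like(..., allow_sets=False)`), and the columns are modelled as a plain list rather than a DataFrame.

```python
from collections.abc import Iterable, Set


def is_scalar_like(obj):
    """Hypothetical, rough stand-in for pandas' is_scalar."""
    return obj is None or isinstance(obj, (str, bytes, int, float, complex))


def is_list_like_no_sets(obj):
    """Hypothetical, rough stand-in for is_list_like(obj, allow_sets=False)."""
    return (isinstance(obj, Iterable)
            and not isinstance(obj, (str, bytes))
            and not isinstance(obj, Set))


def classify_key(col, columns):
    """Mirror the branch order of the loop quoted above for one element."""
    if (is_scalar_like(col) or isinstance(col, tuple)) and col in columns:
        # scalars and tuples that are existing column labels are keys
        return "key"
    if is_scalar_like(col):
        # scalars that are not column labels are reported as missing
        return "missing"
    if not is_list_like_no_sets(col) or getattr(col, "ndim", 1) > 1:
        raise TypeError("only column keys and one-dimensional list-likes")
    # everything else, including tuples that are not column labels,
    # falls through and is treated as an array of index values
    return "array"


columns = ["A", "B", ("t", "p", "l")]
for candidate in ["A", "C", ("t", "p", "l"), ("t", "p", "m")]:
    print(repr(candidate), "->", classify_key(candidate, columns))
```

The last two candidates show the ambiguity under discussion: the same kind of object (a tuple) ends up as a key or as an array depending only on whether it happens to be a column label.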

The source of the back-and-forth in #22486 was your orthogonal review requirement. I warned about this and went to great length (i.e. #23065) to avoid introducing something fundamentally broken (i.e. sets), but you were crystal-clear about wanting to avoid (the pre-existing) instance checks. But ok, you're reviewing basically everything here - oversights and misunderstandings can happen.

To me, the right approach would be merging #24697, which is a small change, and also paves the way to (eventually) solve #24046 and #22225.

@TomAugspurger
Contributor

@h-vetinari do you have a short example that behaves differently between 0.23.4 and master?

@h-vetinari
Contributor Author

@TomAugspurger

Let's start with:

>>> df = pd.DataFrame(np.arange(9).reshape((3, 3)), columns=['A', 'B', ('t', 'p', 'l')])
>>> df
   A  B  (t, p, l)
0  0  1          2
1  3  4          5
2  6  7          8
| call | 0.23.4 | master | comment |
|---|---|---|---|
| df.set_index(['A', 'A'], drop=False) | works | works | |
| df.set_index(['A', 'A'], drop=True) | KeyError | works | #22484 |
| df.set_index(['C', 'D', 'E']) | KeyError: 'C' (big stacktrace) | KeyError: 'C', 'D', 'E' | #22484 |
| df.set_index(frozenset('A')) | cryptic KeyError | reasonable TypeError | #22484 (any input type outside of a handful is tested as a key in 0.23.4) |
| df.set_index(iter([1, 2, 3])) | cryptic KeyError | sets Index([1, 2, 3]) | #22484; list-likes (of correct length) now pass |
| df.set_index(('t', 'p', 'l')) | sets index to column ('t', 'p', 'l') | sets index to column ('t', 'p', 'l') | here and below, it doesn't matter if tuple is wrapped in list |
| df.set_index(('t', 'p', 'm')) | KeyError: ('t', 'p', 'm') | sets Index(['t', 'p', 'm']) | on master, tuples get tried first as key, then as list-likes |

It's the tuples that are now ambiguous (although with well-defined precedence; length must match of course for the list-like case). #24697 would take the fixes of #22484, keep the master-behaviour for tuples, and solve the list-ambiguity:

| call | 0.23.4 | #24697 | comment |
|---|---|---|---|
| df.set_index(['A', 'A'], drop=False) | works | works | |
| df.set_index(['A', 'A'], drop=True) | KeyError | works | #22484 |
| df.set_index(['C', 'D', 'E']) | KeyError: 'C' (big stacktrace) | KeyError: 'C', 'D', 'E' | #22484 |
| df.set_index(frozenset('A')) | cryptic KeyError | reasonable ValueError | #22484 |
| df.set_index(iter([1, 2, 3])) | cryptic KeyError | reasonable ValueError | #22484 |
| df.set_index(('t', 'p', 'l')) | sets index to column ('t', 'p', 'l') | same | |
| df.set_index(('t', 'p', 'm')) | KeyError: ('t', 'p', 'm') | same | |
| df.set_index(['A', 'B', 'A']) | sets to MultiIndex of columns ['A', 'B', 'A'] | same | |
| df.set_index([['A', 'B', 'A']]) | sets to Index(['A', 'B', 'A']) | deprecated | |

@TomAugspurger
Contributor

It's still a bit hard to follow, but I don't see anything in there that needs to be reverted.

@toobaz
Member

toobaz commented Jan 16, 2019

It's still a bit hard to follow, but I don't see anything in there that needs to be reverted.

I also didn't follow everything, but

In [3]: pd.DataFrame([[1, 2], [3, 4]]).set_index([('a', 'b')])
Out[3]: 
   0  1
a  1  2
b  3  4

whereas until the last release this raised KeyError: ('a', 'b'), which is the desired behavior because we don't want to encourage using tuples as list-likes.

@h-vetinari
Contributor Author

@TomAugspurger: It's still a bit hard to follow, but I don't see anything in there that needs to be reverted.

@toobaz: I also didn't follow everything

Not sure how I can make it easier. I gave single line examples with behaviour before/after. Another way to look at it is that:

  • before #22486 (API: better error-handling for df.set_index), the list items were applied as the list was unpacked (within an elif-chain based on type)
  • with #22486, there's an inspection step for the (outer) list that comes first, and which raises in case columns are missing or the wrong types are used. That check does not fundamentally lead to the ambiguity of tuples (or lists); the ambiguity was due to the request to change from roughly isinstance(col, (ABCIndexClass, ABCSeries, np.ndarray, list)) to is_list_like(col, allow_sets=False).

I'd suggest keeping that inspection step, but changing the requirements that the list elements have to satisfy.

I agree with you that there's nothing fundamentally broken (behaviour has well-defined rules, plus DFs are usually way longer than tuples), just that now there's more ambiguity instead of less (there's already list_of_scalars as keys vs [list_of_scalars] as an array; plus now tuple_as_key vs tuple_as_array). #24697 would get rid of both those ambiguities.
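
The difference between the two kinds of element check can be illustrated without pandas. The sketch below is an approximation under stated assumptions: `passes_instance_check` is a hypothetical stand-in that only knows about `list` (the real check also covers Index, Series and np.ndarray), and `passes_list_like_check` is a hypothetical approximation of `is_list_like(col, allow_sets=False)`.

```python
from collections.abc import Iterable, Set


def passes_instance_check(col):
    # hypothetical stand-in for
    # isinstance(col, (ABCIndexClass, ABCSeries, np.ndarray, list));
    # only `list` is available without pandas/numpy
    return isinstance(col, list)


def passes_list_like_check(col):
    # hypothetical approximation of is_list_like(col, allow_sets=False)
    return (isinstance(col, Iterable)
            and not isinstance(col, (str, bytes))
            and not isinstance(col, Set))


candidates = {
    "list": [1, 2, 3],
    "tuple": (1, 2, 3),
    "iterator": iter([1, 2, 3]),
    "frozenset": frozenset({1, 2, 3}),
    "str": "abc",
}

# names of the inputs on which the two checks disagree
differs = [name for name, col in candidates.items()
           if passes_instance_check(col) != passes_list_like_check(col)]
print(differs)  # ['tuple', 'iterator']
```

Under these assumptions, tuples and iterators are exactly the inputs that the duck-typed check newly admits as arrays, which matches the rows of the comparison that changed between 0.23.4 and master.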

@toobaz
Member

toobaz commented Jan 16, 2019

@toobaz: I also didn't follow everything

Not sure how I can make it easier.

Sorry, I should have more honestly written "I didn't have time to read everything". Will do it before the end of the week.

@jreback
Contributor

jreback commented Jan 17, 2019

@h-vetinari ok so this looks like this change slipped thru. Pls update to switch back to the original inspection code: isinstance(col, (ABCIndexClass, ABCSeries, np.ndarray, list)) to is_list_like(col, allow_sets=False), and only this change.

@h-vetinari
Contributor Author

@jreback: @h-vetinari ok so this looks like this change slipped thru. Pls update to switch back to the original inspection code: isinstance(col, (ABCIndexClass, ABCSeries, np.ndarray, list)) to is_list_like(col, allow_sets=False), and only this change.

I pondered opening a new PR for this, but in any case, this would have to be coupled with doc changes, so why not do it here.

I removed the types that were added by #22486, and reinstated the instance-checks. This needs to happen in two places though: first for the case that Index/Series/np.ndarray are passed bare (wrapping them in a list to be able to iterate over the container), and then when inspecting the container.

The changes in the tests just remove the iter/tuple cases from the tests that aren't supposed to pass anymore, and move them down to the tests that should fail.

if (is_scalar(keys) or isinstance(keys, tuple)
        or isinstance(keys, (ABCIndexClass, ABCSeries, np.ndarray))):
    # make sure we have a container of keys/arrays we can iterate over
    # tuples can appear as valid column keys!
    keys = [keys]
Contributor Author

@h-vetinari h-vetinari Jan 17, 2019

strictly speaking, it would be possible to just keep wrapping everything that's not a list into a list, and raise in the for-loop below. But that's a bit hard to grok, and explicit is better than implicit, no?
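
As an illustration of the explicit variant, here is a minimal standard-library sketch of the wrapping step. Everything in it is hypothetical: `as_key_container` and `is_scalar_like` are names invented for this example, `array.array` merely stands in for Index/Series/np.ndarray, and rejecting any other non-list container at this point is an assumption rather than something stated in this thread.

```python
from array import array


def is_scalar_like(obj):
    """Hypothetical, rough stand-in for pandas' is_scalar."""
    return obj is None or isinstance(obj, (str, bytes, int, float, complex))


def as_key_container(keys, array_types=(array,)):
    """Return a list of keys/arrays that can be iterated over.

    `array_types` is a stand-in for (ABCIndexClass, ABCSeries, np.ndarray).
    """
    if (is_scalar_like(keys) or isinstance(keys, tuple)
            or isinstance(keys, array_types)):
        # a bare key (tuples can be column keys) or a bare array
        return [keys]
    if not isinstance(keys, list):
        # assumption: anything else is rejected explicitly here instead of
        # being wrapped and left to fail in the later inspection loop
        raise TypeError("keys must be a column key, an array, or a list")
    return keys


print(as_key_container("A"))
print(as_key_container(("t", "p", "l")))
print(as_key_container(["A", "B"]))
```

The alternative mentioned in the comment above (wrap everything that is not a list and let the inspection loop raise) would drop the second `if`; the sketch keeps it to show where the explicit version makes the decision.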

@h-vetinari
Contributor Author

@jreback
I added the requested changes. PTAL.

@jreback jreback merged commit e984947 into pandas-dev:master Jan 19, 2019
@jreback
Contributor

jreback commented Jan 19, 2019

thanks @h-vetinari

@h-vetinari h-vetinari deleted the set_index_docs branch January 20, 2019 02:31
@h-vetinari
Contributor Author

Glad we could fix this before the release. Thanks.

h-vetinari added a commit to h-vetinari/pandas that referenced this pull request Feb 1, 2019
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019

5 participants