DEPR/API: disallow lists within list for set_index #24697

h-vetinari · 2019-01-10T08:14:24Z

one way to close API: capabilities of df.set_index #24046
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry
add deprecation warning & test it

I wanted to add this before the RC gets cut for two reasons.
On the one hand, the docs have silently deprecated arrays in df.set_index compare master:

keys : label or list of label

Name or names of the columns that will be used as the index.

and 0.23.4:

keys : column label or list of column labels / arrays

This is IMO a breaking change. I'd wait for the outcome of the discussion of #24046, but I feel that could easily fall under the table before the RC, so I wanted to provide a worked-out implementation.

Equally importantly, #22486 added capabilities for lots of list-likes within a list to df.set_index, and has not seen a release yet. Therefore, deprecation would be much easier now, than after 0.24.0rc.

@jreback @TomAugspurger @jorisvandenbossche @toobaz @WillAyd

codecov · 2019-01-10T08:40:30Z

Codecov Report

Merging #24697 into master will decrease coverage by 49.31%.
The diff coverage is 50%.

@@             Coverage Diff             @@
##           master   #24697       +/-   ##
===========================================
- Coverage   92.38%   43.06%   -49.32%     
===========================================
  Files         166      166               
  Lines       52310    52309        -1     
===========================================
- Hits        48326    22527    -25799     
- Misses       3984    29782    +25798

Flag	Coverage Δ
#multiple	`?`
#single	`43.06% <50%> (-0.01%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/frame.py	`35.82% <50%> (-61.11%)`	⬇️
pandas/io/formats/latex.py	`0% <0%> (-100%)`	⬇️
pandas/core/categorical.py	`0% <0%> (-100%)`	⬇️
pandas/io/sas/sas_constants.py	`0% <0%> (-100%)`	⬇️
pandas/tseries/plotting.py	`0% <0%> (-100%)`	⬇️
pandas/tseries/converter.py	`0% <0%> (-100%)`	⬇️
pandas/io/formats/html.py	`0% <0%> (-99.35%)`	⬇️
pandas/core/groupby/categorical.py	`0% <0%> (-95.46%)`	⬇️
pandas/io/sas/sas7bdat.py	`0% <0%> (-91.17%)`	⬇️
pandas/io/sas/sas_xport.py	`0% <0%> (-90.15%)`	⬇️
... and 124 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d56db9d...ed0de1f. Read the comment docs.

codecov · 2019-01-10T08:40:30Z

Codecov Report

Merging #24697 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #24697      +/-   ##
==========================================
+ Coverage   91.75%   91.75%   +<.01%     
==========================================
  Files         173      173              
  Lines       52960    52966       +6     
==========================================
+ Hits        48595    48601       +6     
  Misses       4365     4365

Flag	Coverage Δ
#multiple	`90.33% <100%> (ø)`	⬆️
#single	`41.71% <50%> (ø)`	⬆️

Impacted Files	Coverage Δ
pandas/core/frame.py	`96.86% <100%> (+0.01%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 42b4c97...0c65876. Read the comment docs.

jorisvandenbossche · 2019-01-10T08:45:16Z

On the one hand, the docs have silently deprecated arrays in df.set_index compare master:

I don't think we silently deprecated anything. There was a DOC PR (#22775) updating the docstring, and I think it was rather an oversight of a contributor that didn't know that behaviour and which was not catched during review.

And that is a good catch! So I think the documentation change in this PR is certainly welcome, and the code changes require some discussion. Would you want to do the docstring changes in a separate PR that can be merged more quickly?

jorisvandenbossche · 2019-01-10T08:47:04Z

@h-vetinari and to give you some expectations: I think most of us are rather swamped with trying to get the rc out, so I am not sure we will have time to finalize the set_index related discussions before the RC

h-vetinari · 2019-01-10T08:51:55Z

@jorisvandenbossche

I don't think we silently deprecated anything. There was a DOC PR (#22775) updating the docstring, and I think it was rather an oversight of a contributor that didn't know that behaviour and which was not catched during review.

There was not accusation, I used "silent deprecation" to mean that this functionality is suddenly not mentioned anymore.

@h-vetinari and to give you some expectations: I think most of us are rather swamped with trying to get the rc out, so I am not sure we will have time to finalize the set_index related discussions before the RC

Which is fair enough. That's why I put up this PR (even though I don't really have the time), to let you judge the trade-off (e.g. currently adding more list-likes due to #22486, vs. the ambiguity we're talking about in #24046).

toobaz · 2019-01-10T09:20:22Z

While I understand @jorisvandenbossche 's concern...

On the one hand, the docs have silently deprecated arrays in df.set_index compare master:

... I think on aligning the docs to the functionalities we can all agree, and

#22486 added capabilities for lots of list-likes within a list to df.set_index, and has not seen a release yet

I think this is an important argument even leaving aside #24046 , because I see (sorry for not noticing before) that tuples are among these "list-likes", which I think is just wrong. We discussed in several other issues the fact that in pandas we want to distinguish tuples - which are either keys of a MultiIndex, or pairs of keys for the two axes of a DataFrame - from lists - which are collections.

So I think this PR really deserves to get considered for the release. Will review in detail later today.

h-vetinari · 2019-01-10T12:26:08Z

@toobaz: because I see (sorry for not noticing before) that tuples are among these "list-likes", which I think is just wrong.

I cautioned about the tuple case twice in this thread, but it was an explicit review request to use is_list_like and not instance-checks for specific classes (to the point that I had to implement #23065, because I strictly didn't want to allow sets, which would have been broken from the start).

In any case, I agree with you, hence this PR (as well as #24688).

toobaz · 2019-01-10T13:33:30Z

it was an explicit review request to use is_list_like and not instance-checks for specific classes

I see... in a similar case I resolved to

32ee973#diff-1e79abbbdd150d4771b91ea60a4e1cc7R2701

... but I agree this is annoying, so see #24702 .

h-vetinari · 2019-01-10T23:31:47Z

I saw @TomAugspurger mentioned cutting the RC soon. In the unlikely case that someone still wants to consider this on short notice, feel free to commit into this PR - I'm offline the next 6-7h.

h-vetinari · 2019-01-10T23:34:43Z

doc/source/whatsnew/v0.24.0.rst

- :meth:`DataFrame.set_index` now allows all one-dimensional list-likes, raises a ``TypeError`` for incorrect types,
-  has an improved ``KeyError`` message, and will not fail on duplicate column names with ``drop=True``. (:issue:`22484`)
+- :meth:`DataFrame.set_index` now gives a better (and less frequent) KeyError, raises a ``ValueError`` for incorrect types,
+  and will not fail on duplicate column names with ``drop=True``. (:issue:`22484`)


After reverting the added list-likes from #22486, what's left are the three points from #22484 that the former was meant to address.

h-vetinari · 2019-01-13T23:22:30Z

Well done everyone on getting the RC ready and out the door!

Was wondering if this PR should still be considered for 0.24 nevertheless? Seems unlikely, but not impossible, considering e.g. what @jreback wrote in #24060:

ok the RC, yeah small things are ok. If we have big then maybe need RC2

@TomAugspurger @jorisvandenbossche @jreback @toobaz @WillAyd

jreback · 2019-01-14T00:00:09Z

@h-vetinari would not be averse to: 1) improving the doc-string / examples 2) raising better errors on existing behavior, but the impl here does too many things.

Basically the non-controversial things would be fine.

h-vetinari · 2019-01-14T07:15:20Z

@jreback: would not be averse to: 1) improving the doc-string / examples [...]

Split off #24762.

[...] 2) raising better errors on existing behavior, but the impl here does too many things.

The chances here are as advertised, and really minimal. The only things worth mentioning are:

-         if not isinstance(keys, list):
-             keys = [keys]
+         if (is_scalar(keys) or isinstance(keys, tuple)
+                 or isinstance(keys, (ABCIndexClass, ABCSeries, np.ndarray))):
+             # make sure we have a container of keys/arrays we can iterate over
+             # tuples can appear as valid column keys!
+             keys = [keys]
+         elif not isinstance(keys, list):
+             raise ValueError(err_msg)

which replaces indiscriminate list-wrapping with a very reasonable list of allowed types, and

+         depr_warn = False
          for col in keys:
              if (is_scalar(col) or isinstance(col, tuple)) and col in self:
                  # if col is a valid column key, everything is fine
                  continue
              elif is_scalar(col) and col not in self:
                  # tuples that are not keys are not considered missing,
                  # but illegal (see below)
                  missing.append(col)
-             elif (not is_list_like(col, allow_sets=False)
-                   or getattr(col, 'ndim', 1) > 1):
-                 raise TypeError(...)
+             elif isinstance(col, list):
+                 depr_warn = True
+             elif (not isinstance(col, (ABCIndexClass, ABCSeries, np.ndarray))
+                   or getattr(col, 'ndim', 1) > 1):
+                 raise ValueError(err_msg)

          if missing:
              raise KeyError('{}'.format(missing))
+         if depr_warn:
+             msg = ('passing lists within a list to the parameter "keys" is '
+                    'deprecated and will be removed in a future version.')
+             warnings.warn(msg, FutureWarning, stacklevel=2)

which does the actual deprecation and restricts to the allowed types (the TypeError we added for #22486 is I think inconsistent with how this is usually handled via ValueError elsewhere).

The rest is just doc changes, removing some test cases added by #22486 from the test parametrization, and catching the deprecation warnings in the tests (without changing the tests themselves).

TomAugspurger · 2019-01-14T13:03:49Z

Sorry, I haven't been able to keep up. Are either of the following being deprecated?

In [8]: df.set_index([['a', 'b', 'c']])
Out[8]:
   A  B
a  1  4
b  2  5
c  3  6

In [9]: df.set_index(['A', ['a', 'b', 'c']])
Out[9]:
     B
A
1 a  4
2 b  5
3 c  6

what's our recommendation to users doing this (the message should tell them).

h-vetinari · 2019-01-14T16:35:47Z

@TomAugspurger
Both those cases are indeed being deprecated in this PR. As mentioned above lists are hardly ever used as an array (outside of toy example), as in most real-world use cases, the required data is already residing in Series/Index/ndarrays anyway. Furthermore, deprecating lists inside lists removed the ambiguity of list_of_scalar (keys vs values), and sacrifices no essential functionality as it's easy to wrap a list in an array.

I've adapted the warning message to give a clearer example of that.

h-vetinari · 2019-01-15T21:47:11Z

@TomAugspurger @jorisvandenbossche @toobaz @jreback
friendly ping

jreback · 2019-01-16T01:54:42Z

not for 0.24.0

h-vetinari · 2019-01-20T02:44:31Z

@jreback @TomAugspurger @toobaz
Not in a rush with this one in particular, but the diff after #24762 is now much smaller/simpler.

h-vetinari · 2019-02-24T10:28:07Z

@TomAugspurger @jorisvandenbossche @toobaz @jreback
Coming back to this after #25085 and #24984 are in.

jorisvandenbossche · 2019-02-25T17:34:57Z

@h-vetinari To make sure there is a common understanding: this PR deprecates the usage of lists for the case of array-like of values (so not for the list of labels/array-likes).
So meaning: df.set_index(['A', 'B']) (list of labels) or df.set_index([np.array(['a', 'b', 'c']]) (list of array-likes) are still allowed, but df.set_index([['a', 'b', 'c']]) (list of array-likes where the array like is a list) would be deprecated?

Is there any ambiguous case this is solving? Or is it only with the intent to make it less complex/confusing for users?

h-vetinari · 2019-02-26T16:29:08Z

@jorisvandenbossche: @h-vetinari To make sure there is a common understanding: this PR deprecates the usage of lists for the case of array-like of values (so not for the list of labels/array-likes).

That's correct.

Is there any ambiguous case this is solving? Or is it only with the intent to make it less complex/confusing for users?

One of the main points of discussion about the capabilities of df.set_index (e.g. in #24046) is that @jreback @WillAyd are strongly opposed to the fundamental ambiguity of df.set_index(['a', 'b', 'c']) (see e.g. here or OP of #24046), because lists are overloaded as an array-like and a container for combining several index columns.

This PR tries to break the stalemate between longstanding capabilities (including mixing column labels and arrays since before v0.12.) and this ambiguity, by deprecating the array-like role of lists, which can functionally be easily replaced by Series/Index/np.ndarray.

This also improves the situation with the confusingly-similar-yet-totally-different
df.set_index(['a', 'b', 'c']) vs. df.set_index([['a', 'b', 'c']]).

jreback · 2019-02-27T13:54:26Z

doc/source/whatsnew/v0.25.0.rst

@@ -79,6 +79,7 @@ Deprecations
 ~~~~~~~~~~~~

 - Deprecated the `M (months)` and `Y (year)` `units` parameter of :func: `pandas.to_timedelta`, :func: `pandas.Timedelta` and :func: `pandas.TimedeltaIndex` (:issue:`16344`)
+- :meth:`DataFrame.set_index` has deprecated using lists of values *within* lists. It remains possible to pass array-likes, both directly and within a list.


what would help here is a sub-section note that shows a good example of this.

h-vetinari · 2019-03-01T18:20:00Z

@jreback: what would help here is a sub-section note that shows a good example of this.

I'm assuming you'll find the section I added too long, but it's easier to shorten than to add.

h-vetinari · 2019-03-04T06:55:07Z

@jreback
Now green after the doc changes.

h-vetinari · 2019-03-07T07:02:15Z

@jreback
PTAL at the doc section you requested

@jorisvandenbossche @TomAugspurger @gfyoung
Feel free to chime in as well. :)

jreback · 2019-03-10T15:27:25Z

after some discussion in real space. This is introducing more of a special case. This is not desired as adds user complexitly w/o a corresponding improvement. Thanks for the try here. This is a pretty tricky area of api and don't want to make it more confusing.

h-vetinari · 2019-03-10T16:53:50Z

@jreback @jorisvandenbossche @TomAugspurger
This is unfortunate - would have liked to make the case for it at the dev-chat. This point was also the principal (only?) complaint about introducing Series.set_index (#22225) with the same capabilities. How do we proceed there? (reminder: you closed that PR unilaterally, but the need of such a method for Series was not in question.)

PS. Lists might seem like a special case here (although #22264 did exactly the same for .str.cat), but it's IMO very important to start deprecating lists-as-arrays (as opposed to lists-as-collections) throughout the code base. That ambiguity is at the heart of a lot of issues.

DEPR/API: disallow lists within list for set_index

bf0eee0

h-vetinari mentioned this pull request Jan 10, 2019

API: capabilities of df.set_index #24046

Open

7 tasks

Add deprecation and whatsnew

ed0de1f

restore test for list-of-scalars interpreted as keys

dc274e3

Small doc fixes

623fc9a

toobaz mentioned this pull request Jan 10, 2019

is_list_like should return false for tuples #24702

Open

h-vetinari added 5 commits January 10, 2019 20:05

Improve docstring; small fixes

5f6e303

Merge remote-tracking branch 'upstream/master' into depr_LL_set_index

13d4e40

Remove last mention of "list-like"

813b4fc

rephrase "illegal"

4c130ee

Merge remote-tracking branch 'upstream/master' into depr_LL_set_index

8731834

h-vetinari commented Jan 10, 2019

View reviewed changes

gfyoung added Indexing Related to indexing on series/frames, not to indexes themselves Deprecate Functionality to remove in pandas DataFrame DataFrame data structure labels Jan 11, 2019

Merge remote-tracking branch 'upstream/master' into depr_LL_set_index

29fbc6a

h-vetinari mentioned this pull request Jan 14, 2019

DOC: update DF.set_index #24762

Merged

h-vetinari added 2 commits January 14, 2019 17:31

Merge remote-tracking branch 'upstream/master' into depr_LL_set_index

cc04a64

Improve warning message (review TomAugspurger)

e1d999b

typo

726ef1c

h-vetinari added 3 commits January 16, 2019 20:15

Merge remote-tracking branch 'upstream/master' into depr_LL_set_index

0e1f709

Tuples always considered keys; KeyError, not ValueError if missing

b0b326f

Merge remote-tracking branch 'upstream/master' into depr_LL_set_index

6cbcc47

h-vetinari mentioned this pull request Feb 1, 2019

API/ERR: allow iterators in df.set_index & improve errors #24984

Merged

3 tasks

h-vetinari added 3 commits February 24, 2019 11:16

Merge remote-tracking branch 'upstream/master' into depr_LL_set_index

c881aaa

Actually commit fix for conflict, duh

0214801

Move whatsnew to 0.25

61c511d

jreback requested changes Feb 27, 2019

View reviewed changes

h-vetinari added 2 commits March 1, 2019 08:35

Merge remote-tracking branch 'upstream/master' into depr_LL_set_index

7381245

Add deprecation-section (review jreback)

5fa544c

h-vetinari added 2 commits March 3, 2019 21:06

Merge remote-tracking branch 'upstream/master' into depr_LL_set_index

a016bf0

Fix doc fails

0c65876

jreback closed this Mar 10, 2019

h-vetinari mentioned this pull request May 31, 2019

ENH: Add set_index to Series #22225

Closed

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DEPR/API: disallow lists within list for set_index #24697

DEPR/API: disallow lists within list for set_index #24697

h-vetinari commented Jan 10, 2019 •

edited

Loading

codecov bot commented Jan 10, 2019

codecov bot commented Jan 10, 2019 •

edited

Loading

jorisvandenbossche commented Jan 10, 2019

jorisvandenbossche commented Jan 10, 2019

h-vetinari commented Jan 10, 2019

toobaz commented Jan 10, 2019

h-vetinari commented Jan 10, 2019 •

edited

Loading

toobaz commented Jan 10, 2019

h-vetinari commented Jan 10, 2019

h-vetinari Jan 10, 2019

h-vetinari commented Jan 13, 2019

jreback commented Jan 14, 2019

h-vetinari commented Jan 14, 2019 •

edited

Loading

TomAugspurger commented Jan 14, 2019

h-vetinari commented Jan 14, 2019 •

edited

Loading

h-vetinari commented Jan 15, 2019

jreback commented Jan 16, 2019

h-vetinari commented Jan 20, 2019 •

edited

Loading

h-vetinari commented Feb 24, 2019

jorisvandenbossche commented Feb 25, 2019

h-vetinari commented Feb 26, 2019

jreback Feb 27, 2019

h-vetinari commented Mar 1, 2019

h-vetinari commented Mar 4, 2019

h-vetinari commented Mar 7, 2019

jreback commented Mar 10, 2019

h-vetinari commented Mar 10, 2019

DEPR/API: disallow lists within list for set_index #24697

DEPR/API: disallow lists within list for set_index #24697

Conversation

h-vetinari commented Jan 10, 2019 • edited Loading

codecov bot commented Jan 10, 2019

Codecov Report

codecov bot commented Jan 10, 2019 • edited Loading

Codecov Report

jorisvandenbossche commented Jan 10, 2019

jorisvandenbossche commented Jan 10, 2019

h-vetinari commented Jan 10, 2019

toobaz commented Jan 10, 2019

h-vetinari commented Jan 10, 2019 • edited Loading

toobaz commented Jan 10, 2019

h-vetinari commented Jan 10, 2019

h-vetinari Jan 10, 2019

Choose a reason for hiding this comment

h-vetinari commented Jan 13, 2019

jreback commented Jan 14, 2019

h-vetinari commented Jan 14, 2019 • edited Loading

TomAugspurger commented Jan 14, 2019

h-vetinari commented Jan 14, 2019 • edited Loading

h-vetinari commented Jan 15, 2019

jreback commented Jan 16, 2019

h-vetinari commented Jan 20, 2019 • edited Loading

h-vetinari commented Feb 24, 2019

jorisvandenbossche commented Feb 25, 2019

h-vetinari commented Feb 26, 2019

jreback Feb 27, 2019

Choose a reason for hiding this comment

h-vetinari commented Mar 1, 2019

h-vetinari commented Mar 4, 2019

h-vetinari commented Mar 7, 2019

jreback commented Mar 10, 2019

h-vetinari commented Mar 10, 2019

h-vetinari commented Jan 10, 2019 •

edited

Loading

codecov bot commented Jan 10, 2019 •

edited

Loading

h-vetinari commented Jan 10, 2019 •

edited

Loading

h-vetinari commented Jan 14, 2019 •

edited

Loading

h-vetinari commented Jan 14, 2019 •

edited

Loading

h-vetinari commented Jan 20, 2019 •

edited

Loading