Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API/ERR: allow iterators in df.set_index & improve errors #24984

Merged
merged 36 commits into from
Feb 24, 2019
Merged
Show file tree
Hide file tree
Changes from 12 commits
Commits
Show all changes
36 commits
Select commit Hold shift + click to select a range
3e01681
API: re-enable custom label types in set_index
h-vetinari Jan 28, 2019
1b71e68
Fix doc pattern?
h-vetinari Jan 28, 2019
caeb125
Review (TomAugspurger)
h-vetinari Jan 29, 2019
8bd5340
Review (jreback)
h-vetinari Jan 29, 2019
3c8b69a
Merge remote-tracking branch 'upstream/master' into set_index_custom
h-vetinari Jan 29, 2019
cdfd86a
Merge remote-tracking branch 'upstream/master' into set_index_custom
h-vetinari Jan 30, 2019
d76ecfb
Review (jreback & jorisvandenbossche)
h-vetinari Jan 30, 2019
d2ffb81
Revert last two commits
h-vetinari Jan 30, 2019
5863678
Review (jorisvandenbossche)
h-vetinari Jan 31, 2019
0a7d783
Fix hashable listlikes (review jorisvandenbossche)
h-vetinari Jan 31, 2019
087d4f1
Merge remote-tracking branch 'upstream/master' into set_index_custom
h-vetinari Jan 31, 2019
794f61d
Stabilize repr of frozenset
h-vetinari Feb 1, 2019
0761633
Merge remote-tracking branch 'upstream/master' into set_index_custom
h-vetinari Feb 1, 2019
c58e8b6
Review (WillAyd)
h-vetinari Feb 1, 2019
29fa8c0
Unambiguous KeyError message
h-vetinari Feb 1, 2019
b5c8fa8
Merge remote-tracking branch 'upstream/master' into set_index_custom
h-vetinari Feb 3, 2019
5590433
Remove redundant whatsnew
h-vetinari Feb 3, 2019
7767ff7
Merge remote-tracking branch 'upstream/master' into set_index_custom
h-vetinari Feb 7, 2019
37c12d0
Merge remote-tracking branch 'upstream/master' into set_index_custom
h-vetinari Feb 9, 2019
2c4eaea
Review (jorisvandenbossche)
h-vetinari Feb 9, 2019
6c78816
Merge remote-tracking branch 'upstream/master' into set_index_custom
h-vetinari Feb 10, 2019
2ccd9a9
Merge remote-tracking branch 'upstream/master' into set_index_custom
h-vetinari Feb 14, 2019
b03c43b
Merge remote-tracking branch 'upstream/master' into set_index_custom
h-vetinari Feb 17, 2019
ea10359
Review (jreback)
h-vetinari Feb 17, 2019
ca17895
Retrigger after connectivity issues
h-vetinari Feb 17, 2019
a401eea
Merge remote-tracking branch 'upstream/master' into set_index_custom
h-vetinari Feb 17, 2019
f4deacc
Review (jorisvandenbossche)
h-vetinari Feb 18, 2019
125b0ca
Merge remote-tracking branch 'upstream/master' into set_index_custom
h-vetinari Feb 18, 2019
9bfcfde
move test for easier diff
h-vetinari Feb 18, 2019
6838613
Merge remote-tracking branch 'upstream/master' into set_index_custom
h-vetinari Feb 19, 2019
ca2ac60
Review (jreback)
h-vetinari Feb 19, 2019
87bd0a6
Add 'conda list' for azure/posix after activate 'pandas-dev'
h-vetinari Feb 19, 2019
759b369
Revert "Add 'conda list' for azure/posix after activate 'pandas-dev'"
h-vetinari Feb 20, 2019
40f1aaa
Merge remote-tracking branch 'upstream/master' into set_index_custom
h-vetinari Feb 20, 2019
ecc7d03
Reflect change in docstring
h-vetinari Feb 21, 2019
5f99b15
Merge remote-tracking branch 'upstream/master' into set_index_custom
h-vetinari Feb 24, 2019
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.24.1.rst
Original file line number Diff line number Diff line change
Expand Up @@ -76,6 +76,7 @@ Bug Fixes
**Reshaping**

- Bug in :meth:`DataFrame.groupby` with :class:`Grouper` when there is a time change (DST) and grouping frequency is ``'1d'`` (:issue:`24972`)
- Fixed regression where custom hashable types could not be used as column keys in :meth:`DataFrame.set_index` (:issue:`24969`)

**Visualization**

Expand Down
34 changes: 24 additions & 10 deletions pandas/core/frame.py
Original file line number Diff line number Diff line change
Expand Up @@ -4143,13 +4143,8 @@ def set_index(self, keys, drop=True, append=False, inplace=False,
'array, or a list containing only valid column keys and '
'one-dimensional arrays.')

if (is_scalar(keys) or isinstance(keys, tuple)
or isinstance(keys, (ABCIndexClass, ABCSeries, np.ndarray))):
# make sure we have a container of keys/arrays we can iterate over
# tuples can appear as valid column keys!
if not isinstance(keys, list):
keys = [keys]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If supporting custom types, we need to go back to the state pre-#22486 that just puts everything that's not listlike in list. Otherwise we can't guarantee that iteration below will work.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that is best, let's just revert entirely to pre 22486

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pre-#22486 had some other problems we do not need to re-instate (e.g. weird KeyErrors if it's an iter).

I agree that the code complexity must stay maintainable, and so the most reasonable thing (IMO), ist to just not bother trying to "fix" unhashable custom types being used as keys. At that point, the user is so far gone off the reservation that it's really not our job to give them a good error message (again, even the repr of such a frame would crash).

elif not isinstance(keys, list):
raise ValueError(err_msg)

missing = []
for col in keys:
Expand All @@ -4158,10 +4153,29 @@ def set_index(self, keys, drop=True, append=False, inplace=False,
# tuples are always considered keys, never as list-likes
if col not in self:
missing.append(col)
elif (not isinstance(col, (ABCIndexClass, ABCSeries,
np.ndarray, list))
or getattr(col, 'ndim', 1) > 1):
raise ValueError(err_msg)
elif isinstance(col, (ABCIndexClass, ABCSeries,
np.ndarray, list)):
# arrays are fine as long as they are one-dimensional
if getattr(col, 'ndim', 1) > 1:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

!= 1, is a 0-dim numpy scalar valid?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

raise ValueError(err_msg)
elif is_list_like(col) and not hasattr(col, '__len__'):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An alterative (in light of the general comment I made), an alternative might be to only specifically check for isinstance(col, collections.abc.Iterator) ? That should catch the common cases you mentioned like map, iter, generator, without the risk of catching things we didn't mean to catch.

BTW, one example that still passes here is a range object (so df.set_index(range(n)) raises a "KeyError: "None of [Int64Index([0, 1, 2], dtype='int64')] are in the [columns]"", because it is seen as a missing key.
That might be another case for which to add a specific check here (or isinstance(col, range)). We could also consider actually allowing range objects (converting it to an array), but that discussion is maybe better for another issue :-)

# various iterators/generators are hashable, but should not
# raise a KeyError; we identify them by their lack of __len__.
# hashable listlikes with __len__ get tested as keys below.
tipo = type(col)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just put type(col) directly rather than adding another line here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

raise TypeError(err_msg + ' Received column of '
'type {}'.format(tipo))
else:
# everything else gets tried as a key; see GH 24969
try:
found = col in self.columns
except TypeError:
tipo = type(col)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

raise TypeError(err_msg + ' Received column of '
'type {}'.format(tipo))
else:
if not found:
WillAyd marked this conversation as resolved.
Show resolved Hide resolved
missing.append(col)
jreback marked this conversation as resolved.
Show resolved Hide resolved

if missing:
raise KeyError('{}'.format(missing))
Expand Down
115 changes: 110 additions & 5 deletions pandas/tests/frame/test_alter_axes.py
Original file line number Diff line number Diff line change
Expand Up @@ -255,21 +255,126 @@ def test_set_index_raise_keys(self, frame_of_index_cols, drop, append):

@pytest.mark.parametrize('append', [True, False])
@pytest.mark.parametrize('drop', [True, False])
@pytest.mark.parametrize('box', [set, iter])
@pytest.mark.parametrize('box', [set, iter, lambda x: (y for y in x)],
ids=['set', 'iter', 'generator'])
def test_set_index_raise_on_type(self, frame_of_index_cols, box,
drop, append):
df = frame_of_index_cols

msg = 'The parameter "keys" may be a column key, .*'
# forbidden type, e.g. set/tuple/iter
with pytest.raises(ValueError, match=msg):
# forbidden type, e.g. set/iter/generator
with pytest.raises(TypeError, match=msg):
df.set_index(box(df['A']), drop=drop, append=append)

# forbidden type in list, e.g. set/tuple/iter
with pytest.raises(ValueError, match=msg):
# forbidden type in list, e.g. set/iter/generator
with pytest.raises(TypeError, match=msg):
df.set_index(['A', df['A'], box(df['A'])],
drop=drop, append=append)

def test_set_index_custom_label_type(self):
# GH 24969

class Thing(object):
def __init__(self, name, color):
self.name = name
self.color = color

def __str__(self):
TomAugspurger marked this conversation as resolved.
Show resolved Hide resolved
return "<Thing %r>" % (self.name,)

# necessary for pretty KeyError
__repr__ = __str__

thing1 = Thing('One', 'red')
thing2 = Thing('Two', 'blue')
df = DataFrame({thing1: [0, 1], thing2: [2, 3]})
expected = DataFrame({thing1: [0, 1]},
index=Index([2, 3], name=thing2))
h-vetinari marked this conversation as resolved.
Show resolved Hide resolved

# use custom label directly
result = df.set_index(thing2)
tm.assert_frame_equal(result, expected)

# custom label wrapped in list
result = df.set_index([thing2])
tm.assert_frame_equal(result, expected)

TomAugspurger marked this conversation as resolved.
Show resolved Hide resolved
# missing key
thing3 = Thing('Three', 'pink')
msg = "<Thing 'Three'>"
with pytest.raises(KeyError, match=msg):
# missing label directly
df.set_index(thing3)

with pytest.raises(KeyError, match=msg):
# missing label in list
df.set_index([thing3])

def test_set_index_custom_label_hashable_iterable(self):
# GH 24969

# actual example discussed in GH 24984 was e.g. for shapely.geometry
# objects (e.g. a collection of Points) that can be both hashable and
# iterable; using frozenset as a stand-in for testing here

class Thing(frozenset):
# need to stabilize repr for KeyError (due to random order in sets)
def __repr__(self):
tmp = sorted(list(self))
# double curly brace prints one brace in format string
return "frozenset({{{}}})".format(', '.join(map(repr, tmp)))

thing1 = Thing(['One', 'red'])
thing2 = Thing(['Two', 'blue'])
df = DataFrame({thing1: [0, 1], thing2: [2, 3]})
expected = DataFrame({thing1: [0, 1]},
index=Index([2, 3], name=thing2))

# use custom label directly
result = df.set_index(thing2)
tm.assert_frame_equal(result, expected)

# custom label wrapped in list
result = df.set_index([thing2])
tm.assert_frame_equal(result, expected)

# missing key
thing3 = Thing(['Three', 'pink'])
msg = r"frozenset\(\{'Three', 'pink'\}\)"
with pytest.raises(KeyError, match=msg):
# missing label directly
df.set_index(thing3)

with pytest.raises(KeyError, match=msg):
# missing label in list
df.set_index([thing3])

def test_set_index_custom_label_type_raises(self):
# GH 24969

# purposefully inherit from something unhashable
class Thing(set):
def __init__(self, name, color):
self.name = name
self.color = color

def __str__(self):
return "<Thing %r>" % (self.name,)

thing1 = Thing('One', 'red')
thing2 = Thing('Two', 'blue')
df = DataFrame([[0, 2], [1, 3]], columns=[thing1, thing2])

msg = 'The parameter "keys" may be a column key, .*'

with pytest.raises(TypeError, match=msg):
# use custom label directly
df.set_index(thing2)

with pytest.raises(TypeError, match=msg):
# custom label wrapped in list
df.set_index([thing2])

def test_construction_with_categorical_index(self):
ci = tm.makeCategoricalIndex(10)
ci.name = 'B'
Expand Down