-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TST/CLN: break up & parametrize tests for df.set_index #22236
Conversation
Codecov Report
@@ Coverage Diff @@
## master #22236 +/- ##
=======================================
Coverage 92.17% 92.17%
=======================================
Files 169 169
Lines 50708 50708
=======================================
Hits 46740 46740
Misses 3968 3968
Continue to review full report at Codecov.
|
pandas/tests/frame/common.py
Outdated
@@ -103,6 +103,15 @@ def simple(self): | |||
return pd.DataFrame(arr, columns=['one', 'two', 'three'], | |||
index=['a', 'b', 'c']) | |||
|
|||
@cache_readonly |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is this? pls use a fixture
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not 100% sure, but don't the attributes of TestData
have the roles of fixtures for class-based tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no that is a very bad pattern. we don't want to do that. pls use fixtures instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it's fair to blame me for following an existing pattern - just look at
@cache_readonly
def simple(self):
arr = np.array([[1., 2., 3.],
[4., 5., 6.],
[7., 8., 9.]])
return pd.DataFrame(arr, columns=['one', 'two', 'three'],
index=['a', 'b', 'c'])
directly above.
Do you want me to change all of test_alter_axes
from classes to functions? That's the only (clean) way I can think of.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not blaming you, asking you to use fixtures.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that is orthogonal to switching from classes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, got it
|
||
import pandas.util.testing as tm | ||
|
||
from pandas.tests.frame.common import TestData | ||
|
||
key = lambda x: x.name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what are these?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
they are "container"-types for testing, in the sense that -- applied to df['A']
-- they give the correct type of container. They are to be thought of like Series
or Index
, but since the bare MultiIndex()
- constructor does not take a vector, I wrote it as a lambda. Makes testing all allowed cases nicely parametrisable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is obtuse, I can't tell what you are doing. if you need something create it where you need it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refactored. Hope you like it better like this - the advantage is that it stays local and that the generated test names are understandable
|
||
class TestDataFrameAlterAxes(TestData): | ||
|
||
def test_set_index_manually(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you find a different name than manually
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really think df.index = idx
deserves the name "manual". It should also be discouraged, that's why there's set_index
in the first place. df.set_index(idx)
is then the API-supported (i.e. "non-manual") way of doing it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nevermind, how do you like "directly"? ;-)
|
||
class TestDataFrameAlterAxes(TestData): | ||
|
||
def test_set_index_manually(self): | ||
df = self.mixed_frame.copy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so I would rather make mixed_frame a fixture to avoid lots of boilerplate here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a class-based test module. mixed_frame
already existed before, and has - for all intents and purposes - the role of a fixture.
@pytest.mark.parametrize('inplace', [True, False]) | ||
@pytest.mark.parametrize('drop', [True, False]) | ||
def test_set_index_drop_inplace(self, drop, inplace, keys): | ||
df = self.dummy.copy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make dummy a fixture (and rename it to something else)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see above as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refactored. Hope you like it better like this - the advantage is that it stays local and that the generated test names are understandable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes this is better (still pls make dummy a fixture)
result = df2.set_index('key') | ||
tm.assert_frame_equal(result, expected) | ||
|
||
@pytest.mark.parametrize('container', [Series, Index, np.array, mi]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
call this box
put comments before the test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that "box" is less clear than "container", but OK
|
||
tm.assert_frame_equal(result, expected) | ||
|
||
@pytest.mark.parametrize('container', [Series, Index, np.array, list, mi]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
|
||
tm.assert_frame_equal(result, expected) | ||
|
||
@pytest.mark.parametrize('elem2', [key, Series, Index, np.array, list, mi]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
elem is not very descriptive
|
||
keys = [elem1(df['A']), elem2(df['A'])] | ||
|
||
# == gives ambiguous Boolean for Series |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
huh? was this here before?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I said
strongly parametrized test_set_index_pass_arrays and test_set_index_duplicate_names, including several combinations and cases that were not tested before
This is (essentially) a very beefed-up version of test_set_index_duplicate_names
. It test appending duplicate arrays in various forms (and with various kwargs, e.g. against drop
), and tests for the error of passing duplicate column keys directly.
pandas/core/frame.py
Outdated
if not isinstance(keys, list): | ||
keys = [keys] | ||
|
||
col_labels = [x for x in keys |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is all this for? if this is just reorging tests, why are you changing code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wrote that:
I also added better warnings in case there are duplicate column labels in keys, and some corresponding tests.
Currently, df.set_index(['A', 'A'])
yields something like:
ValueError: Duplicated level name: "A", assigned to level 1, is already used for level 0.
which was
- untested
- not a very clear warning
- bad for perf, because all arrays get processed already and only upon creation of the
MultiIndex
is the error raised.
I fixed that with the code block above ('better warnings" is also in the title).
f773b94
to
ee569b3
Compare
pandas/tests/frame/common.py
Outdated
@@ -103,6 +103,15 @@ def simple(self): | |||
return pd.DataFrame(arr, columns=['one', 'two', 'three'], | |||
index=['a', 'b', 'c']) | |||
|
|||
@cache_readonly |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no that is a very bad pattern. we don't want to do that. pls use fixtures instead
|
||
import pandas.util.testing as tm | ||
|
||
from pandas.tests.frame.common import TestData | ||
|
||
key = lambda x: x.name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is obtuse, I can't tell what you are doing. if you need something create it where you need it.
@pytest.mark.parametrize('inplace', [True, False]) | ||
@pytest.mark.parametrize('drop', [True, False]) | ||
def test_set_index_drop_inplace(self, drop, inplace, keys): | ||
df = self.dummy.copy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see above as well
Hello @h-vetinari! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on August 31, 2018 at 20:35 Hours UTC |
7aa1af4
to
f46c1e4
Compare
@jreback Don't why my comments aren't showing up here - copying them to the thread.
I don't think it's fair to blame me for following an existing pattern - just look at
directly above (https://github.com/pandas-dev/pandas/blob/v0.23.4/pandas/tests/frame/common.py#L97-104) Do you want me to change all of test_alter_axes from classes to functions? That's the only other (clean) way I can think of.
Refactored. Hope you like it better like this - the advantage is that it stays local and that the generated test names are understandable |
pandas/core/frame.py
Outdated
if not isinstance(x, (Series, Index, MultiIndex, | ||
list, np.ndarray))] | ||
if any(x not in self for x in col_labels): | ||
missing = [x for x in col_labels if x not in self] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are there explict tests for each of the branches. pls add a comment to each one indicating what you are checking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yest there are tests for each branch, see test_set_index_pass_arrays_duplicate
and test_set_index_raise
. Added comments
pandas/tests/frame/common.py
Outdated
@@ -103,6 +103,15 @@ def simple(self): | |||
return pd.DataFrame(arr, columns=['one', 'two', 'three'], | |||
index=['a', 'b', 'c']) | |||
|
|||
@cache_readonly |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not blaming you, asking you to use fixtures.
pandas/tests/frame/common.py
Outdated
@@ -103,6 +103,15 @@ def simple(self): | |||
return pd.DataFrame(arr, columns=['one', 'two', 'three'], | |||
index=['a', 'b', 'c']) | |||
|
|||
@cache_readonly |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that is orthogonal to switching from classes.
@pytest.mark.parametrize('inplace', [True, False]) | ||
@pytest.mark.parametrize('drop', [True, False]) | ||
def test_set_index_drop_inplace(self, drop, inplace, keys): | ||
df = self.dummy.copy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes this is better (still pls make dummy a fixture)
tm.assert_frame_equal(result, expected) | ||
|
||
# also test index name if append=True (name is duplicate here for B) | ||
@pytest.mark.parametrize('box', [Series, Index, np.array, 'MultiIndex']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dont' use a string, just directly put the lambda here. ideally we have NO if/else in test functions, sometimes its unavoidable, but making tests as simple as possible is the key
df = self.dummy.copy() | ||
df.index.name = index_name | ||
|
||
# update constructor in case of MultiIndex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment
df.index.name = index_name | ||
|
||
# transform strings to correct box constructor | ||
def rebox(x): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not good at all, pls don't do this inside the test function, just use a lambda in the box itself
# keep the timezone | ||
result = i.to_series(keep_tz=True) | ||
assert_series_equal(result.reset_index(drop=True), expected) | ||
# convert to series while keeping the timezone |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why did this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The whole test is about converting a DatetimeIndex to a Series, so I renamed the test as such.
I renamed i
to idx
for better readability.
Finally, idx.to_series(keep_tz=True)
yields:
B
2013-01-01 13:00:00-08:00 2013-01-01 13:00:00-08:00
2013-01-02 14:00:00-08:00 2013-01-02 14:00:00-08:00
Name: B, dtype: datetime64[ns, US/Pacific]
so needs an index change to fit with expected
. I just found that using the index
-kwarg of to_series
is cleaner and more understandable than using reset_index
- and since it's the conversion that's being tested and not the manner of the index reset, I changed it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the detailed review. Should be getting close now.
pandas/core/frame.py
Outdated
if not isinstance(x, (Series, Index, MultiIndex, | ||
list, np.ndarray))] | ||
if any(x not in self for x in col_labels): | ||
missing = [x for x in col_labels if x not in self] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yest there are tests for each branch, see test_set_index_pass_arrays_duplicate
and test_set_index_raise
. Added comments
pandas/tests/frame/common.py
Outdated
@@ -103,6 +103,15 @@ def simple(self): | |||
return pd.DataFrame(arr, columns=['one', 'two', 'three'], | |||
index=['a', 'b', 'c']) | |||
|
|||
@cache_readonly |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, got it
# keep the timezone | ||
result = i.to_series(keep_tz=True) | ||
assert_series_equal(result.reset_index(drop=True), expected) | ||
# convert to series while keeping the timezone |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The whole test is about converting a DatetimeIndex to a Series, so I renamed the test as such.
I renamed i
to idx
for better readability.
Finally, idx.to_series(keep_tz=True)
yields:
B
2013-01-01 13:00:00-08:00 2013-01-01 13:00:00-08:00
2013-01-02 14:00:00-08:00 2013-01-02 14:00:00-08:00
Name: B, dtype: datetime64[ns, US/Pacific]
so needs an index change to fit with expected
. I just found that using the index
-kwarg of to_series
is cleaner and more understandable than using reset_index
- and since it's the conversion that's being tested and not the manner of the index reset, I changed it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok looks closer a few more comments
pandas/core/frame.py
Outdated
raise KeyError('{}'.format(missing)) | ||
elif len(set(col_labels)) < len(col_labels): | ||
# if all are valid labels, but there are duplicates | ||
dup = Series(col_labels) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rather use a set difference operation here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe I'm not seeing it, but IMO set difference isn't gonna show duplicates (because as a set, they'll be the same)...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
subtract the sets of all - others
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not about the set of all inputs - it's strictly about column labels. E.g.
df.set_index(['A', 'A', 'B', df.A, some_series, some_index])
[...]
col_labels = ['A', 'A', 'B'] # at the line of your comment (in this case)
No set operation that I can think of would yield A
, which is what I want to raise. (multisets would work, but that's hardly more intuitive than duplicated
, IMO)
|
||
import pandas.util.testing as tm | ||
|
||
from pandas.tests.frame.common import TestData | ||
|
||
|
||
@pytest.fixture | ||
def frame_of_index_cols(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls move to pandas/tests/frame/conftest.py
need to start this.
class TestDataFrameAlterAxes(TestData): | ||
|
||
def test_set_index_directly(self): | ||
df = self.mixed_frame.copy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add this to the conftest as well as a fixture
I translated all the "fixture"-attributes from |
a1da79c
to
32b618f
Compare
@jreback All green, and should be good to go. |
@jreback Another ping :) |
i will look soon |
@jreback I know you're busy, but you already said "ok looks closer a few more comments", and the changes since then are easy: just the fixtures you asked for. Would like to get this through as it's blocking two other PRs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, pls open a new issue that refs this, to remove use of TestData in favor of fixtures
pandas/core/frame.py
Outdated
if not isinstance(keys, list): | ||
keys = [keys] | ||
|
||
# collect elements from "keys" that are not allowed array types | ||
col_labels = [x for x in keys | ||
if not isinstance(x, (Series, Index, MultiIndex, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use the ABC forms here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why isn't this just isinstance(x, tuple) or is_scalar(x)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are the cases (except valid column keys) that currently don't raise a KeyError. I've kept the error reporting exactly the same as before:
pd.DataFrame([[0,1], [2,3]]).set_index(map(str,[1,2]))
# KeyError: <map object at 0x000001D380115550>
Could be changed, but then I need to know the desired allowed signature / error reporting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tried the ABC Versions, but getting errors that indexes are not hashable. Left it as is for the moment.
____________________ TestDataFrameAlterAxes.test_set_index ____________________
self = <pandas.tests.frame.test_alter_axes.TestDataFrameAlterAxes object at 0x0000018FC05ED400>
mixed_frame = A B C D foo
lem22AYh2f -0.599136 -0.811715 -0.135242 0.764100 bar
iBogSI...05 bar
vDdWVTLtoj -0.433490 -1.416721 0.460720 -0.539860 bar
Pw5bUHt4sR 0.180016 1.651358 -1.041539 -0.832112 bar
def test_set_index(self, mixed_frame):
df = mixed_frame
idx = Index(np.arange(len(df))[::-1])
> df = df.set_index(idx)
pandas\tests\frame\test_alter_axes.py:41:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas\core\frame.py:3874: in set_index
if any(x not in self for x in col_labels):
pandas\core\frame.py:3874: in <genexpr>
if any(x not in self for x in col_labels):
pandas\core\generic.py:1654: in __contains__
return key in self._info_axis
pandas\core\indexes\base.py:1994: in __contains__
hash(key)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = Int64Index([29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13,
12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
dtype='int64')
def __hash__(self):
> raise TypeError("unhashable type: %r" % type(self).__name__)
E TypeError: unhashable type: 'Int64Index'
pandas\core\indexes\base.py:2021: TypeError
_________________ TestDataFrameAlterAxes.test_set_index_cast __________________
self = <pandas.tests.frame.test_alter_axes.TestDataFrameAlterAxes object at 0x0000018FCA8EE668>
def test_set_index_cast(self):
# issue casting an index then set_index
df = DataFrame({'A': [1.1, 2.2, 3.3], 'B': [5.0, 6.1, 7.2]},
index=[2010, 2011, 2012])
> df2 = df.set_index(df.index.astype(np.int32))
pandas\tests\frame\test_alter_axes.py:50:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas\core\frame.py:3874: in set_index
if any(x not in self for x in col_labels):
pandas\core\frame.py:3874: in <genexpr>
if any(x not in self for x in col_labels):
pandas\core\generic.py:1654: in __contains__
return key in self._info_axis
pandas\core\indexes\base.py:1994: in __contains__
hash(key)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = Int64Index([2010, 2011, 2012], dtype='int64')
def __hash__(self):
> raise TypeError("unhashable type: %r" % type(self).__name__)
E TypeError: unhashable type: 'Int64Index'
pandas\core\indexes\base.py:2021: TypeError
_______________ TestDataFrameAlterAxes.test_set_index_timezone ________________
self = <pandas.tests.frame.test_alter_axes.TestDataFrameAlterAxes object at 0x0000018FCAD1AEF0>
def test_set_index_timezone(self):
# GH 12358
# tz-aware Series should retain the tz
idx = to_datetime(["2014-01-01 10:10:10"],
utc=True).tz_convert('Europe/Rome')
df = DataFrame({'A': idx})
> assert df.set_index(idx).index[0].hour == 11
pandas\tests\frame\test_alter_axes.py:343:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas\core\frame.py:3874: in set_index
if any(x not in self for x in col_labels):
pandas\core\frame.py:3874: in <genexpr>
if any(x not in self for x in col_labels):
pandas\core\generic.py:1654: in __contains__
return key in self._info_axis
pandas\core\indexes\base.py:1994: in __contains__
hash(key)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = DatetimeIndex(['2014-01-01 11:10:10+01:00'], dtype='datetime64[ns, Europe/Rome]', freq=None)
def __hash__(self):
> raise TypeError("unhashable type: %r" % type(self).__name__)
E TypeError: unhashable type: 'DatetimeIndex'
pandas\core\indexes\base.py:2021: TypeError
______________ TestDataFrameAlterAxes.test_dti_set_index_reindex ______________
self = <pandas.tests.frame.test_alter_axes.TestDataFrameAlterAxes object at 0x0000018FCA7FDD68>
def test_dti_set_index_reindex(self):
# GH 6631
df = DataFrame(np.random.random(6))
idx1 = date_range('2011/01/01', periods=6, freq='M', tz='US/Eastern')
idx2 = date_range('2013', periods=6, freq='A', tz='Asia/Tokyo')
> df = df.set_index(idx1)
pandas\tests\frame\test_alter_axes.py:413:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas\core\frame.py:3874: in set_index
if any(x not in self for x in col_labels):
pandas\core\frame.py:3874: in <genexpr>
if any(x not in self for x in col_labels):
pandas\core\generic.py:1654: in __contains__
return key in self._info_axis
pandas\core\indexes\base.py:1994: in __contains__
hash(key)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = DatetimeIndex(['2011-01-31 00:00:00-05:00', '2011-02-28 00:00:00-05:00',
'2011-03-31 00:00:00-04:00', '... '2011-05-31 00:00:00-04:00', '2011-06-30 00:00:00-04:00'],
dtype='datetime64[ns, US/Eastern]', freq='M')
def __hash__(self):
> raise TypeError("unhashable type: %r" % type(self).__name__)
E TypeError: unhashable type: 'DatetimeIndex'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
huh? pls use the ABC version, we do this everywhere else. To be very honest I would split this PR up into 2 parts as I think you are actually changing something here but its very very hard to tell.
pandas/core/frame.py
Outdated
# if there are any labels that are invalid, we raise a KeyError | ||
missing = [x for x in col_labels if x not in self] | ||
raise KeyError('{}'.format(missing)) | ||
elif len(set(col_labels)) < len(col_labels): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blank line here
pandas/core/frame.py
Outdated
raise KeyError('{}'.format(missing)) | ||
elif len(set(col_labels)) < len(col_labels): | ||
# if all are valid labels, but there are duplicates | ||
dup = Series(col_labels) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
subtract the sets of all - others
@@ -28,244 +25,284 @@ | |||
|
|||
class TestDataFrameAlterAxes(TestData): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you shouldn't need TestData any longer
32b618f
to
d2f1e78
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review; opened #22471 as requested
pandas/core/frame.py
Outdated
if not isinstance(keys, list): | ||
keys = [keys] | ||
|
||
# collect elements from "keys" that are not allowed array types | ||
col_labels = [x for x in keys | ||
if not isinstance(x, (Series, Index, MultiIndex, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are the cases (except valid column keys) that currently don't raise a KeyError. I've kept the error reporting exactly the same as before:
pd.DataFrame([[0,1], [2,3]]).set_index(map(str,[1,2]))
# KeyError: <map object at 0x000001D380115550>
Could be changed, but then I need to know the desired allowed signature / error reporting
pandas/core/frame.py
Outdated
raise KeyError('{}'.format(missing)) | ||
elif len(set(col_labels)) < len(col_labels): | ||
# if all are valid labels, but there are duplicates | ||
dup = Series(col_labels) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not about the set of all inputs - it's strictly about column labels. E.g.
df.set_index(['A', 'A', 'B', df.A, some_series, some_index])
[...]
col_labels = ['A', 'A', 'B'] # at the line of your comment (in this case)
No set operation that I can think of would yield A
, which is what I want to raise. (multisets would work, but that's hardly more intuitive than duplicated
, IMO)
pandas/core/frame.py
Outdated
if not isinstance(keys, list): | ||
keys = [keys] | ||
|
||
# collect elements from "keys" that are not allowed array types | ||
col_labels = [x for x in keys | ||
if not isinstance(x, (Series, Index, MultiIndex, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tried the ABC Versions, but getting errors that indexes are not hashable. Left it as is for the moment.
____________________ TestDataFrameAlterAxes.test_set_index ____________________
self = <pandas.tests.frame.test_alter_axes.TestDataFrameAlterAxes object at 0x0000018FC05ED400>
mixed_frame = A B C D foo
lem22AYh2f -0.599136 -0.811715 -0.135242 0.764100 bar
iBogSI...05 bar
vDdWVTLtoj -0.433490 -1.416721 0.460720 -0.539860 bar
Pw5bUHt4sR 0.180016 1.651358 -1.041539 -0.832112 bar
def test_set_index(self, mixed_frame):
df = mixed_frame
idx = Index(np.arange(len(df))[::-1])
> df = df.set_index(idx)
pandas\tests\frame\test_alter_axes.py:41:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas\core\frame.py:3874: in set_index
if any(x not in self for x in col_labels):
pandas\core\frame.py:3874: in <genexpr>
if any(x not in self for x in col_labels):
pandas\core\generic.py:1654: in __contains__
return key in self._info_axis
pandas\core\indexes\base.py:1994: in __contains__
hash(key)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = Int64Index([29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13,
12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
dtype='int64')
def __hash__(self):
> raise TypeError("unhashable type: %r" % type(self).__name__)
E TypeError: unhashable type: 'Int64Index'
pandas\core\indexes\base.py:2021: TypeError
_________________ TestDataFrameAlterAxes.test_set_index_cast __________________
self = <pandas.tests.frame.test_alter_axes.TestDataFrameAlterAxes object at 0x0000018FCA8EE668>
def test_set_index_cast(self):
# issue casting an index then set_index
df = DataFrame({'A': [1.1, 2.2, 3.3], 'B': [5.0, 6.1, 7.2]},
index=[2010, 2011, 2012])
> df2 = df.set_index(df.index.astype(np.int32))
pandas\tests\frame\test_alter_axes.py:50:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas\core\frame.py:3874: in set_index
if any(x not in self for x in col_labels):
pandas\core\frame.py:3874: in <genexpr>
if any(x not in self for x in col_labels):
pandas\core\generic.py:1654: in __contains__
return key in self._info_axis
pandas\core\indexes\base.py:1994: in __contains__
hash(key)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = Int64Index([2010, 2011, 2012], dtype='int64')
def __hash__(self):
> raise TypeError("unhashable type: %r" % type(self).__name__)
E TypeError: unhashable type: 'Int64Index'
pandas\core\indexes\base.py:2021: TypeError
_______________ TestDataFrameAlterAxes.test_set_index_timezone ________________
self = <pandas.tests.frame.test_alter_axes.TestDataFrameAlterAxes object at 0x0000018FCAD1AEF0>
def test_set_index_timezone(self):
# GH 12358
# tz-aware Series should retain the tz
idx = to_datetime(["2014-01-01 10:10:10"],
utc=True).tz_convert('Europe/Rome')
df = DataFrame({'A': idx})
> assert df.set_index(idx).index[0].hour == 11
pandas\tests\frame\test_alter_axes.py:343:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas\core\frame.py:3874: in set_index
if any(x not in self for x in col_labels):
pandas\core\frame.py:3874: in <genexpr>
if any(x not in self for x in col_labels):
pandas\core\generic.py:1654: in __contains__
return key in self._info_axis
pandas\core\indexes\base.py:1994: in __contains__
hash(key)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = DatetimeIndex(['2014-01-01 11:10:10+01:00'], dtype='datetime64[ns, Europe/Rome]', freq=None)
def __hash__(self):
> raise TypeError("unhashable type: %r" % type(self).__name__)
E TypeError: unhashable type: 'DatetimeIndex'
pandas\core\indexes\base.py:2021: TypeError
______________ TestDataFrameAlterAxes.test_dti_set_index_reindex ______________
self = <pandas.tests.frame.test_alter_axes.TestDataFrameAlterAxes object at 0x0000018FCA7FDD68>
def test_dti_set_index_reindex(self):
# GH 6631
df = DataFrame(np.random.random(6))
idx1 = date_range('2011/01/01', periods=6, freq='M', tz='US/Eastern')
idx2 = date_range('2013', periods=6, freq='A', tz='Asia/Tokyo')
> df = df.set_index(idx1)
pandas\tests\frame\test_alter_axes.py:413:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas\core\frame.py:3874: in set_index
if any(x not in self for x in col_labels):
pandas\core\frame.py:3874: in <genexpr>
if any(x not in self for x in col_labels):
pandas\core\generic.py:1654: in __contains__
return key in self._info_axis
pandas\core\indexes\base.py:1994: in __contains__
hash(key)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = DatetimeIndex(['2011-01-31 00:00:00-05:00', '2011-02-28 00:00:00-05:00',
'2011-03-31 00:00:00-04:00', '... '2011-05-31 00:00:00-04:00', '2011-06-30 00:00:00-04:00'],
dtype='datetime64[ns, US/Eastern]', freq='M')
def __hash__(self):
> raise TypeError("unhashable type: %r" % type(self).__name__)
E TypeError: unhashable type: 'DatetimeIndex'
pandas/core/frame.py
Outdated
if not isinstance(keys, list): | ||
keys = [keys] | ||
|
||
# collect elements from "keys" that are not allowed array types | ||
col_labels = [x for x in keys | ||
if not isinstance(x, (Series, Index, MultiIndex, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
huh? pls use the ABC version, we do this everywhere else. To be very honest I would split this PR up into 2 parts as I think you are actually changing something here but its very very hard to tell.
Fair enough, this was actually a good intuition. I split out all the new things into a new issue/PR (#22484 #22486), and reverted all the new warnings here. |
@@ -0,0 +1,121 @@ | |||
import pytest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of things about this file:
- Do you use all of these fixtures in your changes?
- I would prefer if the naming is a little more consistent e.g.:
- You have the word "frame" is some of your fixture names but not othres
- You have underscores between words in some names but not others
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this PR, I turned an often-used DF into a fixture - together with the other attributes of TestData
. @jreback then told me to start a conftest.py
and put it there, together with the other TestData
-attributes used in this module (frame
, mixed_frame
).
I translated all the attributes into conftest.py
without renaming them, so that they can be replaced on a per-module-basis as laid out in #22471. The names of the fixtures are clearly suboptimal, but following up on fixturizing the other modules would be much harder if I start renaming now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fair. Let's save for a follow-up then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the names now changed after all, I thought I'd ask for your opinion on the naming/consistency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@h-vetinari : This looks great!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gfyoung
Thanks. I just added a few more consistency fixes (but squashed to simplify reviewing), but only the first three fixture names were affected.
c175d85
to
befb356
Compare
79541b8
to
d050112
Compare
I've had a |
d050112
to
f42793a
Compare
f42793a
to
4ac9633
Compare
@jreback Green (after 5 retriggers...) |
thanks. in the future, not necessary to push multiple times, just needed a single one, if the failure is unrelated. |
While working on #22225 I had the strong urge to clean up the tests for
df.set_index
(to be able to work off of them forseries.set_index
)Since the diff is pretty busted, here's a description. In
tests/frame/test_alter_axes.py
, I did:test_set_index2
into several piecestest_set_index_bug
totest_set_index_after_mutation
(following corresponding GH issue)test_set_index_nonuniq
totest_set_index_verify_integrity
(also added a MI-case)test_set_index_pass_arrays
andtest_set_index_duplicate_names
, including several combinations and cases that were not tested beforepd.
with direct imports (best practice according to review in TST/CLN: series.duplicated; parametrisation; fix warning #21899):assert_series_equal
etc. withtm.assert_series_equal
(best practice according to review in TST/CLN: series.duplicated; parametrisation; fix warning #21899):check_names=False
in several cases.I also added better warnings in case there are duplicate column labels in
keys
, and some corresponding tests.