Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TST/CLN: break up & parametrize tests for df.set_index #22236

Merged
merged 13 commits into from
Sep 15, 2018

Conversation

h-vetinari
Copy link
Contributor

@h-vetinari h-vetinari commented Aug 7, 2018

While working on #22225 I had the strong urge to clean up the tests for df.set_index (to be able to work off of them for series.set_index)

Since the diff is pretty busted, here's a description. In tests/frame/test_alter_axes.py, I did:

  • externalised an often-used df as a fixture
  • broke up test_set_index2 into several pieces
  • renamed test_set_index_bug to test_set_index_after_mutation (following corresponding GH issue)
  • renamed test_set_index_nonuniq to test_set_index_verify_integrity (also added a MI-case)
  • strongly parametrized test_set_index_pass_arrays and test_set_index_duplicate_names, including several combinations and cases that were not tested before
  • replaced everything pd. with direct imports (best practice according to review in TST/CLN: series.duplicated; parametrisation; fix warning #21899):

    don’t import pd, directly import instead

  • replaced assert_series_equal etc. with tm.assert_series_equal (best practice according to review in TST/CLN: series.duplicated; parametrisation; fix warning #21899):

    use tm; don’t import assert_series_equal

  • cleaned up the other tests a bit too, including several open TODOs and getting rid of the check_names=False in several cases.

I also added better warnings in case there are duplicate column labels in keys, and some corresponding tests.

@h-vetinari h-vetinari mentioned this pull request Aug 7, 2018
5 tasks
@codecov
Copy link

codecov bot commented Aug 8, 2018

Codecov Report

Merging #22236 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #22236   +/-   ##
=======================================
  Coverage   92.17%   92.17%           
=======================================
  Files         169      169           
  Lines       50708    50708           
=======================================
  Hits        46740    46740           
  Misses       3968     3968
Flag Coverage Δ
#multiple 90.58% <ø> (ø) ⬆️
#single 42.35% <ø> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0976e12...81ff9c2. Read the comment docs.

@@ -103,6 +103,15 @@ def simple(self):
return pd.DataFrame(arr, columns=['one', 'two', 'three'],
index=['a', 'b', 'c'])

@cache_readonly
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this? pls use a fixture

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not 100% sure, but don't the attributes of TestData have the roles of fixtures for class-based tests?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no that is a very bad pattern. we don't want to do that. pls use fixtures instead

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's fair to blame me for following an existing pattern - just look at

    @cache_readonly
    def simple(self):
        arr = np.array([[1., 2., 3.],
                        [4., 5., 6.],
                        [7., 8., 9.]])

        return pd.DataFrame(arr, columns=['one', 'two', 'three'],
                            index=['a', 'b', 'c'])

directly above.

Do you want me to change all of test_alter_axes from classes to functions? That's the only (clean) way I can think of.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not blaming you, asking you to use fixtures.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that is orthogonal to switching from classes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, got it


import pandas.util.testing as tm

from pandas.tests.frame.common import TestData

key = lambda x: x.name
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what are these?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

they are "container"-types for testing, in the sense that -- applied to df['A'] -- they give the correct type of container. They are to be thought of like Series or Index, but since the bare MultiIndex()- constructor does not take a vector, I wrote it as a lambda. Makes testing all allowed cases nicely parametrisable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is obtuse, I can't tell what you are doing. if you need something create it where you need it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored. Hope you like it better like this - the advantage is that it stays local and that the generated test names are understandable


class TestDataFrameAlterAxes(TestData):

def test_set_index_manually(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you find a different name than manually

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really think df.index = idx deserves the name "manual". It should also be discouraged, that's why there's set_index in the first place. df.set_index(idx) is then the API-supported (i.e. "non-manual") way of doing it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nevermind, how do you like "directly"? ;-)


class TestDataFrameAlterAxes(TestData):

def test_set_index_manually(self):
df = self.mixed_frame.copy()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so I would rather make mixed_frame a fixture to avoid lots of boilerplate here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a class-based test module. mixed_frame already existed before, and has - for all intents and purposes - the role of a fixture.

@pytest.mark.parametrize('inplace', [True, False])
@pytest.mark.parametrize('drop', [True, False])
def test_set_index_drop_inplace(self, drop, inplace, keys):
df = self.dummy.copy()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make dummy a fixture (and rename it to something else)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see above as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored. Hope you like it better like this - the advantage is that it stays local and that the generated test names are understandable

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes this is better (still pls make dummy a fixture)

result = df2.set_index('key')
tm.assert_frame_equal(result, expected)

@pytest.mark.parametrize('container', [Series, Index, np.array, mi])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

call this box
put comments before the test

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that "box" is less clear than "container", but OK


tm.assert_frame_equal(result, expected)

@pytest.mark.parametrize('container', [Series, Index, np.array, list, mi])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same


tm.assert_frame_equal(result, expected)

@pytest.mark.parametrize('elem2', [key, Series, Index, np.array, list, mi])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

elem is not very descriptive


keys = [elem1(df['A']), elem2(df['A'])]

# == gives ambiguous Boolean for Series
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

huh? was this here before?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I said

strongly parametrized test_set_index_pass_arrays and test_set_index_duplicate_names, including several combinations and cases that were not tested before

This is (essentially) a very beefed-up version of test_set_index_duplicate_names. It test appending duplicate arrays in various forms (and with various kwargs, e.g. against drop), and tests for the error of passing duplicate column keys directly.

if not isinstance(keys, list):
keys = [keys]

col_labels = [x for x in keys
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is all this for? if this is just reorging tests, why are you changing code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wrote that:

I also added better warnings in case there are duplicate column labels in keys, and some corresponding tests.

Currently, df.set_index(['A', 'A']) yields something like:
ValueError: Duplicated level name: "A", assigned to level 1, is already used for level 0.
which was

  • untested
  • not a very clear warning
  • bad for perf, because all arrays get processed already and only upon creation of the MultiIndex is the error raised.

I fixed that with the code block above ('better warnings" is also in the title).

@jreback jreback added Testing pandas testing functions or related to the test suite Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Aug 8, 2018
@@ -103,6 +103,15 @@ def simple(self):
return pd.DataFrame(arr, columns=['one', 'two', 'three'],
index=['a', 'b', 'c'])

@cache_readonly
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no that is a very bad pattern. we don't want to do that. pls use fixtures instead


import pandas.util.testing as tm

from pandas.tests.frame.common import TestData

key = lambda x: x.name
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is obtuse, I can't tell what you are doing. if you need something create it where you need it.

@pytest.mark.parametrize('inplace', [True, False])
@pytest.mark.parametrize('drop', [True, False])
def test_set_index_drop_inplace(self, drop, inplace, keys):
df = self.dummy.copy()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see above as well

@pep8speaks
Copy link

pep8speaks commented Aug 9, 2018

Hello @h-vetinari! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on August 31, 2018 at 20:35 Hours UTC

@h-vetinari
Copy link
Contributor Author

h-vetinari commented Aug 9, 2018

@jreback Don't why my comments aren't showing up here - copying them to the thread.

no that is a very bad pattern. we don't want to do that. pls use fixtures instead

I don't think it's fair to blame me for following an existing pattern - just look at

    @cache_readonly
    def simple(self):
        arr = np.array([[1., 2., 3.],
                        [4., 5., 6.],
                        [7., 8., 9.]])

        return pd.DataFrame(arr, columns=['one', 'two', 'three'],
                            index=['a', 'b', 'c'])

directly above (https://github.com/pandas-dev/pandas/blob/v0.23.4/pandas/tests/frame/common.py#L97-104)

Do you want me to change all of test_alter_axes from classes to functions? That's the only other (clean) way I can think of.

this is obtuse, I can't tell what you are doing. if you need something create it where you need it.

Refactored. Hope you like it better like this - the advantage is that it stays local and that the generated test names are understandable

if not isinstance(x, (Series, Index, MultiIndex,
list, np.ndarray))]
if any(x not in self for x in col_labels):
missing = [x for x in col_labels if x not in self]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are there explict tests for each of the branches. pls add a comment to each one indicating what you are checking.

Copy link
Contributor Author

@h-vetinari h-vetinari Aug 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yest there are tests for each branch, see test_set_index_pass_arrays_duplicate and test_set_index_raise. Added comments

@@ -103,6 +103,15 @@ def simple(self):
return pd.DataFrame(arr, columns=['one', 'two', 'three'],
index=['a', 'b', 'c'])

@cache_readonly
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not blaming you, asking you to use fixtures.

@@ -103,6 +103,15 @@ def simple(self):
return pd.DataFrame(arr, columns=['one', 'two', 'three'],
index=['a', 'b', 'c'])

@cache_readonly
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that is orthogonal to switching from classes.

@pytest.mark.parametrize('inplace', [True, False])
@pytest.mark.parametrize('drop', [True, False])
def test_set_index_drop_inplace(self, drop, inplace, keys):
df = self.dummy.copy()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes this is better (still pls make dummy a fixture)

tm.assert_frame_equal(result, expected)

# also test index name if append=True (name is duplicate here for B)
@pytest.mark.parametrize('box', [Series, Index, np.array, 'MultiIndex'])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dont' use a string, just directly put the lambda here. ideally we have NO if/else in test functions, sometimes its unavoidable, but making tests as simple as possible is the key

df = self.dummy.copy()
df.index.name = index_name

# update constructor in case of MultiIndex
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment

df.index.name = index_name

# transform strings to correct box constructor
def rebox(x):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not good at all, pls don't do this inside the test function, just use a lambda in the box itself

# keep the timezone
result = i.to_series(keep_tz=True)
assert_series_equal(result.reset_index(drop=True), expected)
# convert to series while keeping the timezone
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole test is about converting a DatetimeIndex to a Series, so I renamed the test as such.
I renamed i to idx for better readability.

Finally, idx.to_series(keep_tz=True) yields:

B
2013-01-01 13:00:00-08:00   2013-01-01 13:00:00-08:00
2013-01-02 14:00:00-08:00   2013-01-02 14:00:00-08:00
Name: B, dtype: datetime64[ns, US/Pacific]

so needs an index change to fit with expected. I just found that using the index-kwarg of to_series is cleaner and more understandable than using reset_index - and since it's the conversion that's being tested and not the manner of the index reset, I changed it.

Copy link
Contributor Author

@h-vetinari h-vetinari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed review. Should be getting close now.

if not isinstance(x, (Series, Index, MultiIndex,
list, np.ndarray))]
if any(x not in self for x in col_labels):
missing = [x for x in col_labels if x not in self]
Copy link
Contributor Author

@h-vetinari h-vetinari Aug 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yest there are tests for each branch, see test_set_index_pass_arrays_duplicate and test_set_index_raise. Added comments

@@ -103,6 +103,15 @@ def simple(self):
return pd.DataFrame(arr, columns=['one', 'two', 'three'],
index=['a', 'b', 'c'])

@cache_readonly
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, got it

# keep the timezone
result = i.to_series(keep_tz=True)
assert_series_equal(result.reset_index(drop=True), expected)
# convert to series while keeping the timezone
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole test is about converting a DatetimeIndex to a Series, so I renamed the test as such.
I renamed i to idx for better readability.

Finally, idx.to_series(keep_tz=True) yields:

B
2013-01-01 13:00:00-08:00   2013-01-01 13:00:00-08:00
2013-01-02 14:00:00-08:00   2013-01-02 14:00:00-08:00
Name: B, dtype: datetime64[ns, US/Pacific]

so needs an index change to fit with expected. I just found that using the index-kwarg of to_series is cleaner and more understandable than using reset_index - and since it's the conversion that's being tested and not the manner of the index reset, I changed it.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok looks closer a few more comments

raise KeyError('{}'.format(missing))
elif len(set(col_labels)) < len(col_labels):
# if all are valid labels, but there are duplicates
dup = Series(col_labels)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rather use a set difference operation here

Copy link
Contributor Author

@h-vetinari h-vetinari Aug 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe I'm not seeing it, but IMO set difference isn't gonna show duplicates (because as a set, they'll be the same)...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

subtract the sets of all - others

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not about the set of all inputs - it's strictly about column labels. E.g.

df.set_index(['A', 'A', 'B', df.A, some_series, some_index])
[...]
col_labels = ['A', 'A', 'B']  # at the line of your comment (in this case)

No set operation that I can think of would yield A, which is what I want to raise. (multisets would work, but that's hardly more intuitive than duplicated, IMO)


import pandas.util.testing as tm

from pandas.tests.frame.common import TestData


@pytest.fixture
def frame_of_index_cols():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls move to pandas/tests/frame/conftest.py need to start this.

class TestDataFrameAlterAxes(TestData):

def test_set_index_directly(self):
df = self.mixed_frame.copy()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add this to the conftest as well as a fixture

@h-vetinari
Copy link
Contributor Author

@jreback

I translated all the "fixture"-attributes from TestData to conftest.py, but I only replaced them in test_alter_axes, because they occur just too frequently to do it in one go. ;-)

@h-vetinari
Copy link
Contributor Author

@jreback All green, and should be good to go.

@h-vetinari
Copy link
Contributor Author

@jreback Another ping :)

@jreback
Copy link
Contributor

jreback commented Aug 16, 2018

i will look soon
we have gotten a ton of PRs lately

@h-vetinari
Copy link
Contributor Author

@jreback I know you're busy, but you already said "ok looks closer a few more comments", and the changes since then are easy: just the fixtures you asked for. Would like to get this through as it's blocking two other PRs.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, pls open a new issue that refs this, to remove use of TestData in favor of fixtures

if not isinstance(keys, list):
keys = [keys]

# collect elements from "keys" that are not allowed array types
col_labels = [x for x in keys
if not isinstance(x, (Series, Index, MultiIndex,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use the ABC forms here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why isn't this just isinstance(x, tuple) or is_scalar(x)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are the cases (except valid column keys) that currently don't raise a KeyError. I've kept the error reporting exactly the same as before:

pd.DataFrame([[0,1], [2,3]]).set_index(map(str,[1,2]))
# KeyError: <map object at 0x000001D380115550>

Could be changed, but then I need to know the desired allowed signature / error reporting

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried the ABC Versions, but getting errors that indexes are not hashable. Left it as is for the moment.

____________________ TestDataFrameAlterAxes.test_set_index ____________________

self = <pandas.tests.frame.test_alter_axes.TestDataFrameAlterAxes object at 0x0000018FC05ED400>
mixed_frame =                    A         B         C         D  foo
lem22AYh2f -0.599136 -0.811715 -0.135242  0.764100  bar
iBogSI...05  bar
vDdWVTLtoj -0.433490 -1.416721  0.460720 -0.539860  bar
Pw5bUHt4sR  0.180016  1.651358 -1.041539 -0.832112  bar

    def test_set_index(self, mixed_frame):
        df = mixed_frame
        idx = Index(np.arange(len(df))[::-1])

>       df = df.set_index(idx)

pandas\tests\frame\test_alter_axes.py:41:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas\core\frame.py:3874: in set_index
    if any(x not in self for x in col_labels):
pandas\core\frame.py:3874: in <genexpr>
    if any(x not in self for x in col_labels):
pandas\core\generic.py:1654: in __contains__
    return key in self._info_axis
pandas\core\indexes\base.py:1994: in __contains__
    hash(key)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = Int64Index([29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13,
            12, 11, 10,  9,  8,  7,  6,  5,  4,  3,  2,  1,  0],
           dtype='int64')

    def __hash__(self):
>       raise TypeError("unhashable type: %r" % type(self).__name__)
E       TypeError: unhashable type: 'Int64Index'

pandas\core\indexes\base.py:2021: TypeError
_________________ TestDataFrameAlterAxes.test_set_index_cast __________________

self = <pandas.tests.frame.test_alter_axes.TestDataFrameAlterAxes object at 0x0000018FCA8EE668>

    def test_set_index_cast(self):
        # issue casting an index then set_index
        df = DataFrame({'A': [1.1, 2.2, 3.3], 'B': [5.0, 6.1, 7.2]},
                       index=[2010, 2011, 2012])
>       df2 = df.set_index(df.index.astype(np.int32))

pandas\tests\frame\test_alter_axes.py:50:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas\core\frame.py:3874: in set_index
    if any(x not in self for x in col_labels):
pandas\core\frame.py:3874: in <genexpr>
    if any(x not in self for x in col_labels):
pandas\core\generic.py:1654: in __contains__
    return key in self._info_axis
pandas\core\indexes\base.py:1994: in __contains__
    hash(key)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = Int64Index([2010, 2011, 2012], dtype='int64')

    def __hash__(self):
>       raise TypeError("unhashable type: %r" % type(self).__name__)
E       TypeError: unhashable type: 'Int64Index'

pandas\core\indexes\base.py:2021: TypeError
_______________ TestDataFrameAlterAxes.test_set_index_timezone ________________

self = <pandas.tests.frame.test_alter_axes.TestDataFrameAlterAxes object at 0x0000018FCAD1AEF0>

    def test_set_index_timezone(self):
        # GH 12358
        # tz-aware Series should retain the tz
        idx = to_datetime(["2014-01-01 10:10:10"],
                          utc=True).tz_convert('Europe/Rome')
        df = DataFrame({'A': idx})
>       assert df.set_index(idx).index[0].hour == 11

pandas\tests\frame\test_alter_axes.py:343:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas\core\frame.py:3874: in set_index
    if any(x not in self for x in col_labels):
pandas\core\frame.py:3874: in <genexpr>
    if any(x not in self for x in col_labels):
pandas\core\generic.py:1654: in __contains__
    return key in self._info_axis
pandas\core\indexes\base.py:1994: in __contains__
    hash(key)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = DatetimeIndex(['2014-01-01 11:10:10+01:00'], dtype='datetime64[ns, Europe/Rome]', freq=None)

    def __hash__(self):
>       raise TypeError("unhashable type: %r" % type(self).__name__)
E       TypeError: unhashable type: 'DatetimeIndex'

pandas\core\indexes\base.py:2021: TypeError
______________ TestDataFrameAlterAxes.test_dti_set_index_reindex ______________

self = <pandas.tests.frame.test_alter_axes.TestDataFrameAlterAxes object at 0x0000018FCA7FDD68>

    def test_dti_set_index_reindex(self):
        # GH 6631
        df = DataFrame(np.random.random(6))
        idx1 = date_range('2011/01/01', periods=6, freq='M', tz='US/Eastern')
        idx2 = date_range('2013', periods=6, freq='A', tz='Asia/Tokyo')

>       df = df.set_index(idx1)

pandas\tests\frame\test_alter_axes.py:413:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas\core\frame.py:3874: in set_index
    if any(x not in self for x in col_labels):
pandas\core\frame.py:3874: in <genexpr>
    if any(x not in self for x in col_labels):
pandas\core\generic.py:1654: in __contains__
    return key in self._info_axis
pandas\core\indexes\base.py:1994: in __contains__
    hash(key)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = DatetimeIndex(['2011-01-31 00:00:00-05:00', '2011-02-28 00:00:00-05:00',
               '2011-03-31 00:00:00-04:00', '... '2011-05-31 00:00:00-04:00', '2011-06-30 00:00:00-04:00'],
              dtype='datetime64[ns, US/Eastern]', freq='M')

    def __hash__(self):
>       raise TypeError("unhashable type: %r" % type(self).__name__)
E       TypeError: unhashable type: 'DatetimeIndex'

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

huh? pls use the ABC version, we do this everywhere else. To be very honest I would split this PR up into 2 parts as I think you are actually changing something here but its very very hard to tell.

# if there are any labels that are invalid, we raise a KeyError
missing = [x for x in col_labels if x not in self]
raise KeyError('{}'.format(missing))
elif len(set(col_labels)) < len(col_labels):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blank line here

raise KeyError('{}'.format(missing))
elif len(set(col_labels)) < len(col_labels):
# if all are valid labels, but there are duplicates
dup = Series(col_labels)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

subtract the sets of all - others

@@ -28,244 +25,284 @@

class TestDataFrameAlterAxes(TestData):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you shouldn't need TestData any longer

Copy link
Contributor Author

@h-vetinari h-vetinari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review; opened #22471 as requested

if not isinstance(keys, list):
keys = [keys]

# collect elements from "keys" that are not allowed array types
col_labels = [x for x in keys
if not isinstance(x, (Series, Index, MultiIndex,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are the cases (except valid column keys) that currently don't raise a KeyError. I've kept the error reporting exactly the same as before:

pd.DataFrame([[0,1], [2,3]]).set_index(map(str,[1,2]))
# KeyError: <map object at 0x000001D380115550>

Could be changed, but then I need to know the desired allowed signature / error reporting

raise KeyError('{}'.format(missing))
elif len(set(col_labels)) < len(col_labels):
# if all are valid labels, but there are duplicates
dup = Series(col_labels)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not about the set of all inputs - it's strictly about column labels. E.g.

df.set_index(['A', 'A', 'B', df.A, some_series, some_index])
[...]
col_labels = ['A', 'A', 'B']  # at the line of your comment (in this case)

No set operation that I can think of would yield A, which is what I want to raise. (multisets would work, but that's hardly more intuitive than duplicated, IMO)

if not isinstance(keys, list):
keys = [keys]

# collect elements from "keys" that are not allowed array types
col_labels = [x for x in keys
if not isinstance(x, (Series, Index, MultiIndex,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried the ABC Versions, but getting errors that indexes are not hashable. Left it as is for the moment.

____________________ TestDataFrameAlterAxes.test_set_index ____________________

self = <pandas.tests.frame.test_alter_axes.TestDataFrameAlterAxes object at 0x0000018FC05ED400>
mixed_frame =                    A         B         C         D  foo
lem22AYh2f -0.599136 -0.811715 -0.135242  0.764100  bar
iBogSI...05  bar
vDdWVTLtoj -0.433490 -1.416721  0.460720 -0.539860  bar
Pw5bUHt4sR  0.180016  1.651358 -1.041539 -0.832112  bar

    def test_set_index(self, mixed_frame):
        df = mixed_frame
        idx = Index(np.arange(len(df))[::-1])

>       df = df.set_index(idx)

pandas\tests\frame\test_alter_axes.py:41:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas\core\frame.py:3874: in set_index
    if any(x not in self for x in col_labels):
pandas\core\frame.py:3874: in <genexpr>
    if any(x not in self for x in col_labels):
pandas\core\generic.py:1654: in __contains__
    return key in self._info_axis
pandas\core\indexes\base.py:1994: in __contains__
    hash(key)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = Int64Index([29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13,
            12, 11, 10,  9,  8,  7,  6,  5,  4,  3,  2,  1,  0],
           dtype='int64')

    def __hash__(self):
>       raise TypeError("unhashable type: %r" % type(self).__name__)
E       TypeError: unhashable type: 'Int64Index'

pandas\core\indexes\base.py:2021: TypeError
_________________ TestDataFrameAlterAxes.test_set_index_cast __________________

self = <pandas.tests.frame.test_alter_axes.TestDataFrameAlterAxes object at 0x0000018FCA8EE668>

    def test_set_index_cast(self):
        # issue casting an index then set_index
        df = DataFrame({'A': [1.1, 2.2, 3.3], 'B': [5.0, 6.1, 7.2]},
                       index=[2010, 2011, 2012])
>       df2 = df.set_index(df.index.astype(np.int32))

pandas\tests\frame\test_alter_axes.py:50:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas\core\frame.py:3874: in set_index
    if any(x not in self for x in col_labels):
pandas\core\frame.py:3874: in <genexpr>
    if any(x not in self for x in col_labels):
pandas\core\generic.py:1654: in __contains__
    return key in self._info_axis
pandas\core\indexes\base.py:1994: in __contains__
    hash(key)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = Int64Index([2010, 2011, 2012], dtype='int64')

    def __hash__(self):
>       raise TypeError("unhashable type: %r" % type(self).__name__)
E       TypeError: unhashable type: 'Int64Index'

pandas\core\indexes\base.py:2021: TypeError
_______________ TestDataFrameAlterAxes.test_set_index_timezone ________________

self = <pandas.tests.frame.test_alter_axes.TestDataFrameAlterAxes object at 0x0000018FCAD1AEF0>

    def test_set_index_timezone(self):
        # GH 12358
        # tz-aware Series should retain the tz
        idx = to_datetime(["2014-01-01 10:10:10"],
                          utc=True).tz_convert('Europe/Rome')
        df = DataFrame({'A': idx})
>       assert df.set_index(idx).index[0].hour == 11

pandas\tests\frame\test_alter_axes.py:343:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas\core\frame.py:3874: in set_index
    if any(x not in self for x in col_labels):
pandas\core\frame.py:3874: in <genexpr>
    if any(x not in self for x in col_labels):
pandas\core\generic.py:1654: in __contains__
    return key in self._info_axis
pandas\core\indexes\base.py:1994: in __contains__
    hash(key)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = DatetimeIndex(['2014-01-01 11:10:10+01:00'], dtype='datetime64[ns, Europe/Rome]', freq=None)

    def __hash__(self):
>       raise TypeError("unhashable type: %r" % type(self).__name__)
E       TypeError: unhashable type: 'DatetimeIndex'

pandas\core\indexes\base.py:2021: TypeError
______________ TestDataFrameAlterAxes.test_dti_set_index_reindex ______________

self = <pandas.tests.frame.test_alter_axes.TestDataFrameAlterAxes object at 0x0000018FCA7FDD68>

    def test_dti_set_index_reindex(self):
        # GH 6631
        df = DataFrame(np.random.random(6))
        idx1 = date_range('2011/01/01', periods=6, freq='M', tz='US/Eastern')
        idx2 = date_range('2013', periods=6, freq='A', tz='Asia/Tokyo')

>       df = df.set_index(idx1)

pandas\tests\frame\test_alter_axes.py:413:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas\core\frame.py:3874: in set_index
    if any(x not in self for x in col_labels):
pandas\core\frame.py:3874: in <genexpr>
    if any(x not in self for x in col_labels):
pandas\core\generic.py:1654: in __contains__
    return key in self._info_axis
pandas\core\indexes\base.py:1994: in __contains__
    hash(key)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = DatetimeIndex(['2011-01-31 00:00:00-05:00', '2011-02-28 00:00:00-05:00',
               '2011-03-31 00:00:00-04:00', '... '2011-05-31 00:00:00-04:00', '2011-06-30 00:00:00-04:00'],
              dtype='datetime64[ns, US/Eastern]', freq='M')

    def __hash__(self):
>       raise TypeError("unhashable type: %r" % type(self).__name__)
E       TypeError: unhashable type: 'DatetimeIndex'

if not isinstance(keys, list):
keys = [keys]

# collect elements from "keys" that are not allowed array types
col_labels = [x for x in keys
if not isinstance(x, (Series, Index, MultiIndex,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

huh? pls use the ABC version, we do this everywhere else. To be very honest I would split this PR up into 2 parts as I think you are actually changing something here but its very very hard to tell.

@h-vetinari
Copy link
Contributor Author

h-vetinari commented Aug 23, 2018

@jreback

huh? pls use the ABC version, we do this everywhere else. To be very honest I would split this PR up into 2 parts as I think you are actually changing something here but its very very hard to tell.

Fair enough, this was actually a good intuition. I split out all the new things into a new issue/PR (#22484 #22486), and reverted all the new warnings here.

@h-vetinari h-vetinari changed the title TST/CLN: break up & parametrize tests for df.set_index; better warnings TST/CLN: break up & parametrize tests for df.set_index Aug 23, 2018
@@ -0,0 +1,121 @@
import pytest
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of things about this file:

  • Do you use all of these fixtures in your changes?
  • I would prefer if the naming is a little more consistent e.g.:
  • You have the word "frame" is some of your fixture names but not othres
  • You have underscores between words in some names but not others

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this PR, I turned an often-used DF into a fixture - together with the other attributes of TestData. @jreback then told me to start a conftest.py and put it there, together with the other TestData-attributes used in this module (frame, mixed_frame).

I translated all the attributes into conftest.py without renaming them, so that they can be replaced on a per-module-basis as laid out in #22471. The names of the fixtures are clearly suboptimal, but following up on fixturizing the other modules would be much harder if I start renaming now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fair. Let's save for a follow-up then.

Copy link
Contributor Author

@h-vetinari h-vetinari Sep 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gfyoung:

Since the names now changed after all, I thought I'd ask for your opinion on the naming/consistency.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@h-vetinari : This looks great!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gfyoung
Thanks. I just added a few more consistency fixes (but squashed to simplify reviewing), but only the first three fixture names were affected.

@h-vetinari h-vetinari force-pushed the tst_df_set_index branch 4 times, most recently from 79541b8 to d050112 Compare September 15, 2018 16:37
@h-vetinari
Copy link
Contributor Author

h-vetinari commented Sep 15, 2018

I've had a ResourceWarning twice thrice, and then restarted a third time because AppVeyor isn't triggering for some reason. It didn't start a second time, so I've left it like that ATM.

@h-vetinari
Copy link
Contributor Author

@jreback Green (after 5 retriggers...)

@jreback jreback merged commit 1c500fb into pandas-dev:master Sep 15, 2018
@jreback
Copy link
Contributor

jreback commented Sep 15, 2018

thanks. in the future, not necessary to push multiple times, just needed a single one, if the failure is unrelated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Reshaping Concat, Merge/Join, Stack/Unstack, Explode Testing pandas testing functions or related to the test suite
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants