TST/CLN: break up & parametrize tests for df.set_index #22236

h-vetinari · 2018-08-07T22:41:48Z

While working on #22225 I had the strong urge to clean up the tests for df.set_index (to be able to work off of them for series.set_index)

Since the diff is pretty busted, here's a description. In tests/frame/test_alter_axes.py, I did:

externalised an often-used df as a fixture
broke up test_set_index2 into several pieces
renamed test_set_index_bug to test_set_index_after_mutation (following corresponding GH issue)
renamed test_set_index_nonuniq to test_set_index_verify_integrity (also added a MI-case)
strongly parametrized test_set_index_pass_arrays and test_set_index_duplicate_names, including several combinations and cases that were not tested before
replaced everything pd. with direct imports (best practice according to review in TST/CLN: series.duplicated; parametrisation; fix warning #21899):

don’t import pd, directly import instead
replaced assert_series_equal etc. with tm.assert_series_equal (best practice according to review in TST/CLN: series.duplicated; parametrisation; fix warning #21899):

use tm; don’t import assert_series_equal
cleaned up the other tests a bit too, including several open TODOs and getting rid of the check_names=False in several cases.

I also added better warnings in case there are duplicate column labels in keys, and some corresponding tests.
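
As a rough illustration of the parametrisation style listed above, here is a minimal sketch. The helper `make_frame` and its columns are hypothetical stand-ins (not the frame used in the PR), and `pandas.testing` is imported as `tm` on the assumption of a recent pandas release, in place of the `pandas.util.testing` module that the thread refers to.

```python
import pytest
from pandas import DataFrame, Index
import pandas.testing as tm


def make_frame():
    # hypothetical stand-in for the often-used frame mentioned above
    return DataFrame({'A': ['foo', 'foo', 'bar'],
                      'B': ['one', 'two', 'three'],
                      'C': [1, 2, 3]})


# stacked decorators generate the cross-product: 2 x 2 = 4 test cases
@pytest.mark.parametrize('inplace', [True, False])
@pytest.mark.parametrize('drop', [True, False])
def test_set_index_drop_inplace(drop, inplace):
    df = make_frame()

    if inplace:
        result = df.copy()
        # with inplace=True the call returns None and mutates `result`
        assert result.set_index('A', drop=drop, inplace=True) is None
    else:
        result = df.set_index('A', drop=drop)

    # column 'A' only survives as a column when drop=False
    cols = ['B', 'C'] if drop else ['A', 'B', 'C']
    expected = make_frame()[cols]
    expected.index = Index(['foo', 'foo', 'bar'], name='A')

    tm.assert_frame_equal(result, expected)
```

Each generated case gets its own readable test id (for example `True-False`), which is the main gain over one long test body.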

codecov · 2018-08-08T07:58:35Z

Codecov Report

Merging #22236 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master   #22236   +/-   ##
=======================================
  Coverage   92.17%   92.17%           
=======================================
  Files         169      169           
  Lines       50708    50708           
=======================================
  Hits        46740    46740           
  Misses       3968     3968

Flag	Coverage Δ
#multiple	`90.58% <ø> (ø)`	⬆️
#single	`42.35% <ø> (ø)`	⬆️


jreback · 2018-08-08T10:19:09Z

pandas/tests/frame/common.py

@@ -103,6 +103,15 @@ def simple(self):
        return pd.DataFrame(arr, columns=['one', 'two', 'three'],
                            index=['a', 'b', 'c'])

+    @cache_readonly


what is this? pls use a fixture

Not 100% sure, but don't the attributes of TestData have the roles of fixtures for class-based tests?

no that is a very bad pattern. we don't want to do that. pls use fixtures instead

I don't think it's fair to blame me for following an existing pattern - just look at

    @cache_readonly
    def simple(self):
        arr = np.array([[1., 2., 3.],
                        [4., 5., 6.],
                        [7., 8., 9.]])

        return pd.DataFrame(arr, columns=['one', 'two', 'three'],
                            index=['a', 'b', 'c'])

directly above.

Do you want me to change all of test_alter_axes from classes to functions? That's the only (clean) way I can think of.

not blaming you, asking you to use fixtures.

that is orthogonal to switching from classes.

jreback · 2018-08-08T10:19:47Z

pandas/tests/frame/test_alter_axes.py


 import pandas.util.testing as tm

 from pandas.tests.frame.common import TestData

+key = lambda x: x.name


what are these?

They are "container" types for testing, in the sense that, applied to df['A'], they give the correct type of container. They are to be thought of like Series or Index, but since the bare MultiIndex() constructor does not take a vector, I wrote it as a lambda. This makes testing all allowed cases nicely parametrisable.

this is obtuse, I can't tell what you are doing. if you need something create it where you need it.

Refactored. Hope you like it better like this - the advantage is that it stays local and that the generated test names are understandable

jreback · 2018-08-08T10:20:27Z

pandas/tests/frame/test_alter_axes.py


 class TestDataFrameAlterAxes(TestData):

+    def test_set_index_manually(self):


can you find a different name than manually

I really think df.index = idx deserves the name "manual". It should also be discouraged, that's why there's set_index in the first place. df.set_index(idx) is then the API-supported (i.e. "non-manual") way of doing it

Nevermind, how do you like "directly"? ;-)
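
For readers unfamiliar with the two ways of setting an index being contrasted here, a small sketch (the frame and index values are made up for illustration):

```python
import numpy as np
from pandas import DataFrame, Index

df = DataFrame({'A': [1.1, 2.2, 3.3], 'B': [5.0, 6.1, 7.2]})
idx = Index(np.arange(len(df))[::-1])

# set_index returns a new frame and leaves the original untouched
result = df.set_index(idx)

# direct assignment mutates the frame it is applied to
direct = df.copy()
direct.index = idx
```

Both routes end up with the same index here; the difference is that direct assignment changes the object in place.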

jreback · 2018-08-08T10:21:06Z

pandas/tests/frame/test_alter_axes.py


 class TestDataFrameAlterAxes(TestData):

+    def test_set_index_manually(self):
+        df = self.mixed_frame.copy()


so I would rather make mixed_frame a fixture to avoid lots of boilerplate here

This is a class-based test module. mixed_frame already existed before, and has - for all intents and purposes - the role of a fixture.

jreback · 2018-08-08T10:21:30Z

pandas/tests/frame/test_alter_axes.py

+    @pytest.mark.parametrize('inplace', [True, False])
+    @pytest.mark.parametrize('drop', [True, False])
+    def test_set_index_drop_inplace(self, drop, inplace, keys):
+        df = self.dummy.copy()


make dummy a fixture (and rename it to something else)

see above as well

Refactored. Hope you like it better like this - the advantage is that it stays local and that the generated test names are understandable

yes this is better (still pls make dummy a fixture)

jreback · 2018-08-08T10:22:40Z

pandas/tests/frame/test_alter_axes.py

+        result = df2.set_index('key')
+        tm.assert_frame_equal(result, expected)
+
+    @pytest.mark.parametrize('container', [Series, Index, np.array, mi])


call this box
put comments before the test

I think that "box" is less clear than "container", but OK

jreback · 2018-08-08T10:22:51Z

pandas/tests/frame/test_alter_axes.py

+
+        tm.assert_frame_equal(result, expected)
+
+    @pytest.mark.parametrize('container', [Series, Index, np.array, list, mi])


jreback · 2018-08-08T10:23:09Z

pandas/tests/frame/test_alter_axes.py

+
+        tm.assert_frame_equal(result, expected)
+
+    @pytest.mark.parametrize('elem2', [key, Series, Index, np.array, list, mi])


elem is not very descriptive

jreback · 2018-08-08T10:23:39Z

pandas/tests/frame/test_alter_axes.py

+
+        keys = [elem1(df['A']), elem2(df['A'])]
+
+        # == gives ambiguous Boolean for Series


huh? was this here before?

I said

strongly parametrized test_set_index_pass_arrays and test_set_index_duplicate_names, including several combinations and cases that were not tested before

This is (essentially) a very beefed-up version of test_set_index_duplicate_names. It tests appending duplicate arrays in various forms (and with various kwargs, e.g. against drop), and tests for the error of passing duplicate column keys directly.

jreback · 2018-08-08T10:24:41Z

pandas/core/frame.py

        if not isinstance(keys, list):
            keys = [keys]

+        col_labels = [x for x in keys


what is all this for? if this is just reorging tests, why are you changing code?

I wrote that:

I also added better warnings in case there are duplicate column labels in keys, and some corresponding tests.

Currently, df.set_index(['A', 'A']) yields something like:
ValueError: Duplicated level name: "A", assigned to level 1, is already used for level 0.
which was

untested

not a very clear warning

bad for perf, because all arrays get processed already and only upon creation of the MultiIndex is the error raised.

I fixed that with the code block above ("better warnings" is also in the title).
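
The up-front check described in this reply can be sketched as a standalone function. This is an assumption-laden illustration, not the code from the PR: `check_column_labels` is a hypothetical helper and the error messages are invented for the example.

```python
import numpy as np
from pandas import DataFrame, Index, MultiIndex, Series


def check_column_labels(frame, keys):
    """Hypothetical helper: validate label keys before any array is built."""
    if not isinstance(keys, list):
        keys = [keys]

    # everything that is not an array-like is treated as a column label
    col_labels = [x for x in keys
                  if not isinstance(x, (Series, Index, MultiIndex,
                                        list, np.ndarray))]

    # labels that are not columns of the frame
    missing = [x for x in col_labels if x not in frame]
    if missing:
        raise KeyError('{}'.format(missing))

    # all labels are valid, but some are passed more than once
    if len(set(col_labels)) < len(col_labels):
        dup = Series(col_labels)
        dup = dup[dup.duplicated()].unique().tolist()
        raise ValueError('duplicate column labels in keys: {}'.format(dup))

    return col_labels
```

Because the check only looks at the labels, it fails before any of the passed arrays is touched, which is the performance point made above.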

jreback · 2018-08-09T00:36:53Z

pandas/tests/frame/common.py

@@ -103,6 +103,15 @@ def simple(self):
        return pd.DataFrame(arr, columns=['one', 'two', 'three'],
                            index=['a', 'b', 'c'])

+    @cache_readonly


no that is a very bad pattern. we don't want to do that. pls use fixtures instead

jreback · 2018-08-09T00:37:22Z

pandas/tests/frame/test_alter_axes.py


 import pandas.util.testing as tm

 from pandas.tests.frame.common import TestData

+key = lambda x: x.name


this is obtuse, I can't tell what you are doing. if you need something create it where you need it.

jreback · 2018-08-09T00:37:42Z

pandas/tests/frame/test_alter_axes.py

+    @pytest.mark.parametrize('inplace', [True, False])
+    @pytest.mark.parametrize('drop', [True, False])
+    def test_set_index_drop_inplace(self, drop, inplace, keys):
+        df = self.dummy.copy()


see above as well

pep8speaks · 2018-08-09T08:32:59Z

Hello @h-vetinari! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on August 31, 2018 at 20:35 Hours UTC

h-vetinari · 2018-08-09T11:43:04Z

@jreback Don't know why my comments aren't showing up here - copying them to the thread.

no that is a very bad pattern. we don't want to do that. pls use fixtures instead

I don't think it's fair to blame me for following an existing pattern - just look at

    @cache_readonly
    def simple(self):
        arr = np.array([[1., 2., 3.],
                        [4., 5., 6.],
                        [7., 8., 9.]])

        return pd.DataFrame(arr, columns=['one', 'two', 'three'],
                            index=['a', 'b', 'c'])

directly above (https://github.com/pandas-dev/pandas/blob/v0.23.4/pandas/tests/frame/common.py#L97-104)

Do you want me to change all of test_alter_axes from classes to functions? That's the only other (clean) way I can think of.

this is obtuse, I can't tell what you are doing. if you need something create it where you need it.

Refactored. Hope you like it better like this - the advantage is that it stays local and that the generated test names are understandable

jreback · 2018-08-09T22:49:30Z

pandas/core/frame.py

+                      if not isinstance(x, (Series, Index, MultiIndex,
+                                            list, np.ndarray))]
+        if any(x not in self for x in col_labels):
+            missing = [x for x in col_labels if x not in self]


are there explict tests for each of the branches. pls add a comment to each one indicating what you are checking.

Yes, there are tests for each branch, see test_set_index_pass_arrays_duplicate and test_set_index_raise. Added comments

jreback · 2018-08-09T22:49:37Z

pandas/tests/frame/common.py

@@ -103,6 +103,15 @@ def simple(self):
        return pd.DataFrame(arr, columns=['one', 'two', 'three'],
                            index=['a', 'b', 'c'])

+    @cache_readonly


not blaming you, asking you to use fixtures.

jreback · 2018-08-09T22:49:46Z

pandas/tests/frame/common.py

@@ -103,6 +103,15 @@ def simple(self):
        return pd.DataFrame(arr, columns=['one', 'two', 'three'],
                            index=['a', 'b', 'c'])

+    @cache_readonly


that is orthogonal to switching from classes.

jreback · 2018-08-09T22:50:31Z

pandas/tests/frame/test_alter_axes.py

+    @pytest.mark.parametrize('inplace', [True, False])
+    @pytest.mark.parametrize('drop', [True, False])
+    def test_set_index_drop_inplace(self, drop, inplace, keys):
+        df = self.dummy.copy()


yes this is better (still pls make dummy a fixture)

jreback · 2018-08-09T22:52:12Z

pandas/tests/frame/test_alter_axes.py

+        tm.assert_frame_equal(result, expected)
+
+    # also test index name if append=True (name is duplicate here for B)
+    @pytest.mark.parametrize('box', [Series, Index, np.array, 'MultiIndex'])


don't use a string, just directly put the lambda here. ideally we have NO if/else in test functions, sometimes it's unavoidable, but making tests as simple as possible is the key
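
A minimal sketch of what this suggestion could look like, with the MultiIndex case written as a lambda directly in the parameter list. The test name, frame and assertions are hypothetical and much simpler than the tests in the PR.

```python
import numpy as np
import pytest
from pandas import DataFrame, Index, MultiIndex, Series


# every entry is a callable that "boxes" a column into a container type;
# MultiIndex needs a lambda because its constructor does not take a vector
@pytest.mark.parametrize('box', [Series, Index, np.array,
                                 lambda x: MultiIndex.from_arrays([x])])
def test_set_index_pass_single_array(box):
    df = DataFrame({'A': ['foo', 'bar', 'baz'], 'B': [1, 2, 3]})
    key = box(df['A'])

    result = df.set_index(key)

    # the values of the boxed column become the index ...
    assert result.index.get_level_values(0).tolist() == ['foo', 'bar', 'baz']
    # ... and, since an array rather than a label was passed, no column is dropped
    assert list(result.columns) == ['A', 'B']
```

With every parameter being a callable, the test body needs no branching on the parameter type.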

jreback · 2018-08-09T22:52:38Z

pandas/tests/frame/test_alter_axes.py

+        df = self.dummy.copy()
+        df.index.name = index_name
+
+        # update constructor in case of MultiIndex


same comment

jreback · 2018-08-09T22:53:23Z

pandas/tests/frame/test_alter_axes.py

+        df.index.name = index_name
+
+        # transform strings to correct box constructor
+        def rebox(x):


this is not good at all, pls don't do this inside the test function, just use a lambda in the box itself

jreback · 2018-08-09T22:54:22Z

pandas/tests/frame/test_alter_axes.py

-        # keep the timezone
-        result = i.to_series(keep_tz=True)
-        assert_series_equal(result.reset_index(drop=True), expected)
+        # convert to series while keeping the timezone


why did this change?

The whole test is about converting a DatetimeIndex to a Series, so I renamed the test as such.
I renamed i to idx for better readability.

Finally, idx.to_series(keep_tz=True) yields:

    B
    2013-01-01 13:00:00-08:00   2013-01-01 13:00:00-08:00
    2013-01-02 14:00:00-08:00   2013-01-02 14:00:00-08:00
    Name: B, dtype: datetime64[ns, US/Pacific]

so needs an index change to fit with expected. I just found that using the index-kwarg of to_series is cleaner and more understandable than using reset_index - and since it's the conversion that's being tested and not the manner of the index reset, I changed it.
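
The two ways of resetting the index mentioned here can be compared side by side. This is a sketch with made-up data; the `keep_tz` keyword from the thread is omitted on the assumption of a recent pandas release, where the time zone is kept by default.

```python
from pandas import DatetimeIndex

idx = DatetimeIndex(['2013-01-01 13:00', '2013-01-02 14:00'],
                    name='B', tz='US/Pacific')

# original approach: convert, then throw away the datetime index
via_reset = idx.to_series().reset_index(drop=True)

# approach after the change: pass the desired index to to_series directly
via_kwarg = idx.to_series(index=[0, 1])
```

Both give a tz-aware Series named B with the labels 0 and 1; the second form states the intended index in the conversion itself.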

h-vetinari

Thanks for the detailed review. Should be getting close now.

h-vetinari · 2018-08-10T05:58:03Z

pandas/core/frame.py

+                      if not isinstance(x, (Series, Index, MultiIndex,
+                                            list, np.ndarray))]
+        if any(x not in self for x in col_labels):
+            missing = [x for x in col_labels if x not in self]


Yes, there are tests for each branch, see test_set_index_pass_arrays_duplicate and test_set_index_raise. Added comments

h-vetinari · 2018-08-10T06:07:43Z

pandas/tests/frame/common.py

@@ -103,6 +103,15 @@ def simple(self):
        return pd.DataFrame(arr, columns=['one', 'two', 'three'],
                            index=['a', 'b', 'c'])

+    @cache_readonly


h-vetinari · 2018-08-10T06:12:14Z

pandas/tests/frame/test_alter_axes.py

-        # keep the timezone
-        result = i.to_series(keep_tz=True)
-        assert_series_equal(result.reset_index(drop=True), expected)
+        # convert to series while keeping the timezone


The whole test is about converting a DatetimeIndex to a Series, so I renamed the test as such.
I renamed i to idx for better readability.

Finally, idx.to_series(keep_tz=True) yields:

    B
    2013-01-01 13:00:00-08:00   2013-01-01 13:00:00-08:00
    2013-01-02 14:00:00-08:00   2013-01-02 14:00:00-08:00
    Name: B, dtype: datetime64[ns, US/Pacific]

so needs an index change to fit with expected. I just found that using the index-kwarg of to_series is cleaner and more understandable than using reset_index - and since it's the conversion that's being tested and not the manner of the index reset, I changed it.

jreback

ok looks closer a few more comments

jreback · 2018-08-10T10:14:33Z

pandas/core/frame.py

+            raise KeyError('{}'.format(missing))
+        elif len(set(col_labels)) < len(col_labels):
+            # if all are valid labels, but there are duplicates
+            dup = Series(col_labels)


rather use a set difference operation here

maybe I'm not seeing it, but IMO set difference isn't gonna show duplicates (because as a set, they'll be the same)...

subtract the sets of all - others

this is not about the set of all inputs - it's strictly about column labels. E.g.

    df.set_index(['A', 'A', 'B', df.A, some_series, some_index])
    [...]
    col_labels = ['A', 'A', 'B']  # at the line of your comment (in this case)

No set operation that I can think of would yield A, which is what I want to raise. (multisets would work, but that's hardly more intuitive than duplicated, IMO)
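
The point about sets versus multisets can be shown with the standard library alone. In this sketch `collections.Counter` plays the role of the multiset; it is an illustration, not the approach taken in the PR, which uses `duplicated`.

```python
from collections import Counter

col_labels = ['A', 'A', 'B']

# converting to a set discards multiplicity, so the repeated 'A' can no
# longer be told apart from a single 'A'
as_set = set(col_labels)

# a multiset keeps the counts and can report which labels are repeated
duplicates = [label for label, n in Counter(col_labels).items() if n > 1]
```

Any set difference computed from `as_set` starts from `{'A', 'B'}`, so the information needed to report the duplicate is already gone at that point.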

jreback · 2018-08-10T10:16:11Z

pandas/tests/frame/test_alter_axes.py


 import pandas.util.testing as tm

 from pandas.tests.frame.common import TestData


+@pytest.fixture
+def frame_of_index_cols():


pls move to pandas/tests/frame/conftest.py need to start this.

jreback · 2018-08-10T10:16:25Z

pandas/tests/frame/test_alter_axes.py

 class TestDataFrameAlterAxes(TestData):

+    def test_set_index_directly(self):
+        df = self.mixed_frame.copy()


add this to the conftest as well as a fixture

h-vetinari · 2018-08-10T16:15:51Z

@jreback

I translated all the "fixture"-attributes from TestData to conftest.py, but I only replaced them in test_alter_axes, because they occur just too frequently to do it in one go. ;-)

h-vetinari · 2018-08-14T06:45:39Z

@jreback All green, and should be good to go.

h-vetinari · 2018-08-16T15:42:40Z

@jreback Another ping :)

jreback · 2018-08-16T15:46:42Z

i will look soon
we have gotten a ton of PRs lately

h-vetinari · 2018-08-20T13:22:33Z

@jreback I know you're busy, but you already said "ok looks closer a few more comments", and the changes since then are easy: just the fixtures you asked for. Would like to get this through as it's blocking two other PRs.

jreback

ok, pls open a new issue that refs this, to remove use of TestData in favor of fixtures

jreback · 2018-08-22T13:13:50Z

pandas/core/frame.py

        if not isinstance(keys, list):
            keys = [keys]

+        # collect elements from "keys" that are not allowed array types
+        col_labels = [x for x in keys
+                      if not isinstance(x, (Series, Index, MultiIndex,


use the ABC forms here

why isn't this just isinstance(x, tuple) or is_scalar(x)?

These are the cases (except valid column keys) that currently don't raise a KeyError. I've kept the error reporting exactly the same as before:

    pd.DataFrame([[0,1], [2,3]]).set_index(map(str,[1,2]))
    # KeyError: <map object at 0x000001D380115550>

Could be changed, but then I need to know the desired allowed signature / error reporting

Tried the ABC Versions, but getting errors that indexes are not hashable. Left it as is for the moment.

____________________ TestDataFrameAlterAxes.test_set_index ____________________

self = <pandas.tests.frame.test_alter_axes.TestDataFrameAlterAxes object at 0x0000018FC05ED400>
mixed_frame = A B C D foo lem22AYh2f -0.599136 -0.811715 -0.135242 0.764100 bar iBogSI...05 bar vDdWVTLtoj -0.433490 -1.416721 0.460720 -0.539860 bar Pw5bUHt4sR 0.180016 1.651358 -1.041539 -0.832112 bar

    def test_set_index(self, mixed_frame):
        df = mixed_frame
        idx = Index(np.arange(len(df))[::-1])

>       df = df.set_index(idx)

pandas\tests\frame\test_alter_axes.py:41:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas\core\frame.py:3874: in set_index
    if any(x not in self for x in col_labels):
pandas\core\frame.py:3874: in <genexpr>
    if any(x not in self for x in col_labels):
pandas\core\generic.py:1654: in __contains__
    return key in self._info_axis
pandas\core\indexes\base.py:1994: in __contains__
    hash(key)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = Int64Index([29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0], dtype='int64')

    def __hash__(self):
>       raise TypeError("unhashable type: %r" % type(self).__name__)
E       TypeError: unhashable type: 'Int64Index'

pandas\core\indexes\base.py:2021: TypeError
_________________ TestDataFrameAlterAxes.test_set_index_cast __________________

self = <pandas.tests.frame.test_alter_axes.TestDataFrameAlterAxes object at 0x0000018FCA8EE668>

    def test_set_index_cast(self):
        # issue casting an index then set_index
        df = DataFrame({'A': [1.1, 2.2, 3.3], 'B': [5.0, 6.1, 7.2]},
                       index=[2010, 2011, 2012])
>       df2 = df.set_index(df.index.astype(np.int32))

pandas\tests\frame\test_alter_axes.py:50:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas\core\frame.py:3874: in set_index
    if any(x not in self for x in col_labels):
pandas\core\frame.py:3874: in <genexpr>
    if any(x not in self for x in col_labels):
pandas\core\generic.py:1654: in __contains__
    return key in self._info_axis
pandas\core\indexes\base.py:1994: in __contains__
    hash(key)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = Int64Index([2010, 2011, 2012], dtype='int64')

    def __hash__(self):
>       raise TypeError("unhashable type: %r" % type(self).__name__)
E       TypeError: unhashable type: 'Int64Index'

pandas\core\indexes\base.py:2021: TypeError
_______________ TestDataFrameAlterAxes.test_set_index_timezone ________________

self = <pandas.tests.frame.test_alter_axes.TestDataFrameAlterAxes object at 0x0000018FCAD1AEF0>

    def test_set_index_timezone(self):
        # GH 12358
        # tz-aware Series should retain the tz
        idx = to_datetime(["2014-01-01 10:10:10"],
                          utc=True).tz_convert('Europe/Rome')
        df = DataFrame({'A': idx})
>       assert df.set_index(idx).index[0].hour == 11

pandas\tests\frame\test_alter_axes.py:343:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas\core\frame.py:3874: in set_index
    if any(x not in self for x in col_labels):
pandas\core\frame.py:3874: in <genexpr>
    if any(x not in self for x in col_labels):
pandas\core\generic.py:1654: in __contains__
    return key in self._info_axis
pandas\core\indexes\base.py:1994: in __contains__
    hash(key)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = DatetimeIndex(['2014-01-01 11:10:10+01:00'], dtype='datetime64[ns, Europe/Rome]', freq=None)

    def __hash__(self):
>       raise TypeError("unhashable type: %r" % type(self).__name__)
E       TypeError: unhashable type: 'DatetimeIndex'

pandas\core\indexes\base.py:2021: TypeError
______________ TestDataFrameAlterAxes.test_dti_set_index_reindex ______________

self = <pandas.tests.frame.test_alter_axes.TestDataFrameAlterAxes object at 0x0000018FCA7FDD68>

    def test_dti_set_index_reindex(self):
        # GH 6631
        df = DataFrame(np.random.random(6))
        idx1 = date_range('2011/01/01', periods=6, freq='M', tz='US/Eastern')
        idx2 = date_range('2013', periods=6, freq='A', tz='Asia/Tokyo')

>       df = df.set_index(idx1)

pandas\tests\frame\test_alter_axes.py:413:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas\core\frame.py:3874: in set_index
    if any(x not in self for x in col_labels):
pandas\core\frame.py:3874: in <genexpr>
    if any(x not in self for x in col_labels):
pandas\core\generic.py:1654: in __contains__
    return key in self._info_axis
pandas\core\indexes\base.py:1994: in __contains__
    hash(key)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = DatetimeIndex(['2011-01-31 00:00:00-05:00', '2011-02-28 00:00:00-05:00', '2011-03-31 00:00:00-04:00', '... '2011-05-31 00:00:00-04:00', '2011-06-30 00:00:00-04:00'], dtype='datetime64[ns, US/Eastern]', freq='M')

    def __hash__(self):
>       raise TypeError("unhashable type: %r" % type(self).__name__)
E       TypeError: unhashable type: 'DatetimeIndex'

huh? pls use the ABC version, we do this everywhere else. To be very honest I would split this PR up into 2 parts as I think you are actually changing something here but its very very hard to tell.
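
The tracebacks in this thread all end in `hash(key)` inside `__contains__`: a membership test against the frame hashes the key, and an Index cannot be hashed. The sketch below reproduces just that mechanism; `is_unhashable` and `label_keys` are hypothetical helpers, and the type tuple is a simplification of the check in the diff.

```python
import numpy as np
from pandas import DataFrame, Index, Series

df = DataFrame({'A': [1.1, 2.2, 3.3], 'B': [5.0, 6.1, 7.2]})
idx = Index([2, 1, 0])


def is_unhashable(obj):
    # True if hash(obj) raises TypeError, as it does for an Index
    try:
        hash(obj)
    except TypeError:
        return True
    return False


def label_keys(keys):
    # keep only the keys that are not array-likes; only these are safe
    # to use in a membership test such as `key in df`
    return [k for k in keys
            if not isinstance(k, (Series, Index, list, np.ndarray))]
```

If the type check lets an Index through to the membership test, the failure shown in the tracebacks follows, so the filter has to match every Index subclass.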

jreback · 2018-08-22T13:14:44Z

pandas/core/frame.py

+            # if there are any labels that are invalid, we raise a KeyError
+            missing = [x for x in col_labels if x not in self]
+            raise KeyError('{}'.format(missing))
+        elif len(set(col_labels)) < len(col_labels):


blank line here

jreback · 2018-08-22T13:15:12Z

pandas/core/frame.py

+            raise KeyError('{}'.format(missing))
+        elif len(set(col_labels)) < len(col_labels):
+            # if all are valid labels, but there are duplicates
+            dup = Series(col_labels)


subtract the sets of all - others

jreback · 2018-08-22T13:16:32Z

pandas/tests/frame/test_alter_axes.py

@@ -28,244 +25,284 @@

 class TestDataFrameAlterAxes(TestData):


you shouldn't need TestData any longer

h-vetinari

Thanks for the review; opened #22471 as requested

h-vetinari · 2018-08-22T21:40:48Z

pandas/core/frame.py

        if not isinstance(keys, list):
            keys = [keys]

+        # collect elements from "keys" that are not allowed array types
+        col_labels = [x for x in keys
+                      if not isinstance(x, (Series, Index, MultiIndex,


These are the cases (except valid column keys) that currently don't raise a KeyError. I've kept the error reporting exactly the same as before:

    pd.DataFrame([[0,1], [2,3]]).set_index(map(str,[1,2]))
    # KeyError: <map object at 0x000001D380115550>

Could be changed, but then I need to know the desired allowed signature / error reporting

h-vetinari · 2018-08-22T21:45:59Z

pandas/core/frame.py

+            raise KeyError('{}'.format(missing))
+        elif len(set(col_labels)) < len(col_labels):
+            # if all are valid labels, but there are duplicates
+            dup = Series(col_labels)


this is not about the set of all inputs - it's strictly about column labels. E.g.

    df.set_index(['A', 'A', 'B', df.A, some_series, some_index])
    [...]
    col_labels = ['A', 'A', 'B']  # at the line of your comment (in this case)

No set operation that I can think of would yield A, which is what I want to raise. (multisets would work, but that's hardly more intuitive than duplicated, IMO)

h-vetinari · 2018-08-22T21:54:23Z

pandas/core/frame.py

        if not isinstance(keys, list):
            keys = [keys]

+        # collect elements from "keys" that are not allowed array types
+        col_labels = [x for x in keys
+                      if not isinstance(x, (Series, Index, MultiIndex,


Tried the ABC Versions, but getting errors that indexes are not hashable. Left it as is for the moment.


jreback · 2018-08-23T10:17:48Z

pandas/core/frame.py

        if not isinstance(keys, list):
            keys = [keys]

+        # collect elements from "keys" that are not allowed array types
+        col_labels = [x for x in keys
+                      if not isinstance(x, (Series, Index, MultiIndex,


huh? pls use the ABC version, we do this everywhere else. To be very honest I would split this PR up into 2 parts as I think you are actually changing something here but its very very hard to tell.

h-vetinari · 2018-08-23T14:59:48Z

@jreback

huh? pls use the ABC version, we do this everywhere else. To be very honest I would split this PR up into 2 parts as I think you are actually changing something here but its very very hard to tell.

Fair enough, this was actually a good intuition. I split out all the new things into a new issue/PR (#22484 #22486), and reverted all the new warnings here.

gfyoung · 2018-08-25T10:36:44Z

pandas/tests/frame/conftest.py

@@ -0,0 +1,121 @@
+import pytest


A couple of things about this file:

Do you use all of these fixtures in your changes?

I would prefer if the naming is a little more consistent e.g.:

You have the word "frame" in some of your fixture names but not others

You have underscores between words in some names but not others

For this PR, I turned an often-used DF into a fixture - together with the other attributes of TestData. @jreback then told me to start a conftest.py and put it there, together with the other TestData-attributes used in this module (frame, mixed_frame).

I translated all the attributes into conftest.py without renaming them, so that they can be replaced on a per-module-basis as laid out in #22471. The names of the fixtures are clearly suboptimal, but following up on fixturizing the other modules would be much harder if I start renaming now.

That's fair. Let's save for a follow-up then.

@gfyoung:

Since the names now changed after all, I thought I'd ask for your opinion on the naming/consistency.

@h-vetinari : This looks great!

@gfyoung
Thanks. I just added a few more consistency fixes (but squashed to simplify reviewing), but only the first three fixture names were affected.

h-vetinari · 2018-09-15T17:03:49Z

I've had a ResourceWarning ~~twice~~ thrice, and then restarted a third time because AppVeyor isn't triggering for some reason. It didn't start a second time, so I've left it like that ATM.

h-vetinari · 2018-09-15T22:04:14Z

@jreback Green (after 5 retriggers...)

jreback · 2018-09-15T22:51:20Z

thanks. in the future, not necessary to push multiple times, just needed a single one, if the failure is unrelated.

h-vetinari mentioned this pull request Aug 7, 2018

ENH: Add set_index to Series #22225

Closed

5 tasks

jreback requested changes Aug 8, 2018

jreback added the Testing and Reshaping labels Aug 8, 2018

h-vetinari force-pushed the tst_df_set_index branch from f773b94 to ee569b3 Compare August 8, 2018 21:09

jreback requested changes Aug 9, 2018

h-vetinari force-pushed the tst_df_set_index branch from 7aa1af4 to f46c1e4 Compare August 9, 2018 08:34

jreback requested changes Aug 9, 2018

h-vetinari commented Aug 10, 2018

jreback requested changes Aug 10, 2018

h-vetinari force-pushed the tst_df_set_index branch from a1da79c to 32b618f Compare August 10, 2018 16:19

jreback requested changes Aug 22, 2018

h-vetinari mentioned this pull request Aug 22, 2018

TST/CLN: remove TestData from frame-tests; replace with fixtures #22471

Closed

34 tasks

h-vetinari force-pushed the tst_df_set_index branch from 32b618f to d2f1e78 Compare August 22, 2018 21:55

h-vetinari commented Aug 22, 2018

jreback requested changes Aug 23, 2018

This was referenced Aug 23, 2018

API: better error-handling for df.set_index #22484

Closed

API: better error-handling for df.set_index #22486

Merged

h-vetinari changed the title ~~TST/CLN: break up & parametrize tests for df.set_index; better warnings~~ TST/CLN: break up & parametrize tests for df.set_index Aug 23, 2018

gfyoung reviewed Aug 25, 2018

h-vetinari force-pushed the tst_df_set_index branch from c175d85 to befb356 Compare August 27, 2018 07:24

h-vetinari added 7 commits September 15, 2018 15:59

Review (jreback)

12d999d

Typo; retrigger CI after circle-timeout

3554dd9

Add fixtures

f4c51ff

Review (jreback)

bcaab67

Revert new warnings

2430273

Add comment in conftest.py

de9e91d

Add docstrings to fixtures

61b252d

h-vetinari force-pushed the tst_df_set_index branch 4 times, most recently from 79541b8 to d050112 Compare September 15, 2018 16:37

h-vetinari force-pushed the tst_df_set_index branch from d050112 to f42793a Compare September 15, 2018 17:45

Review (jreback) fixtures

4ac9633

h-vetinari force-pushed the tst_df_set_index branch from f42793a to 4ac9633 Compare September 15, 2018 18:31

jreback merged commit 1c500fb into pandas-dev:master Sep 15, 2018

h-vetinari deleted the tst_df_set_index branch September 16, 2018 21:12

This was referenced Sep 17, 2018

CLN: res/exp and GH references in frame tests #22730

Merged

TST/CLN: Fixturize frame/test_analytics #22733

Merged

Fixturize tests/frame/test_arithmetic #22736

Merged

aeltanawy pushed a commit to aeltanawy/pandas that referenced this pull request Sep 20, 2018

TST/CLN: break up & parametrize tests for df.set_index (pandas-dev#22236)

2ac80c4

Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018

TST/CLN: break up & parametrize tests for df.set_index (pandas-dev#22236)

550642b

h-vetinari mentioned this pull request Jan 9, 2019

API: capabilities of df.set_index #24046

Open

7 tasks

This was referenced Jan 22, 2019

REF/TST: Stop using singleton fixtures #24769

Closed

TST: Remove subset of singleton fixtures #24873

Closed

h-vetinari mentioned this pull request Feb 3, 2019

CLN: Use ABCs in set_index #25128

Merged

This was referenced Mar 10, 2019

Fixturize tests/frame/test_axis_select_reindex.py #25627

Merged

Fixturize tests/frame/test_constructors.py #25635

Merged


