Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix inconsistent index naming with union/intersect #35847 #36413

Merged
merged 1 commit into from
Oct 7, 2020

Conversation

iamlemec
Copy link
Contributor

@iamlemec iamlemec commented Sep 17, 2020

This takes care of some inconsistency in how names are handled by Index functions union and intersection, as discussed in #35847. I believe this covers all index types, either through the base class or in the subclass when necessary.

What I've implemented here actually uses the unanimous convention, wherein all input names must match to get assigned to the output. Originally, I was thinking consensus would be better (assign if there is only one non-None input name), but looking through the existing tests, it seems that unanimous was usually expected. I also had some worries about whether index names would become too "contagious" with consensus. Anyway, it's easy to change between the two if people have strong opinions on this.


def _wrap_setop_result(self, other, result):
name = get_op_result_name(self, other)
return self._shallow_copy(result, name=name)
if isinstance(result, ABCIndexClass):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this check will include Index objects but exclude subclasses. is that intentional? if so, can you add a comment about why

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll admit I don't actually understand this whole ABC business 100%, but I'm getting:

isinstance(pd.MultiIndex, pd.Index) -> False
isinstance(pd.MultiIndex, ABCIndexClass) -> True
issubclass(pd.MultiIndex, pd.Index) -> True

Should I be using issubclass instead? Wasn't trying to do anything too fancy here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yah its not intuitive. basically isinstance(obj, ABCIndexClass) is equivalent to type(obj) is Index, so what you probably want is isinstance(obj, Index), which behaves like normal

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, but wouldn't we get isinstance(multiindex, ABCIndexClass) == True and type(multiindex) != pd.Index here? Either way, agreed that isinstance will get us what we want.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my bad, i confused ABCIndexClass with ABCIndex, never mind.

@@ -5926,3 +5931,22 @@ def _maybe_asobject(dtype, klass, data, copy: bool, name: Label, **kwargs):
return index.astype(object)

return klass(data, dtype=dtype, copy=copy, name=name, **kwargs)


def get_unanimous_names(*indexes):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we ever get more than two here?

Would indexes instead of *indexes work>?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, we use this in union_indexes which operates on a list of indices. but most uses are explicitly two, so I figured this would make the typical syntax easier.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you type the input and output

@@ -3415,7 +3420,7 @@ def union(self, other, sort=None):
other, result_names = self._convert_can_do_setop(other)

if len(other) == 0 or self.equals(other):
return self
return self._shallow_copy(names=result_names)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.rename?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, makes sense. Certainly more legible, and seems like equivalent.

@@ -180,7 +180,8 @@ def _union(self, other, sort):
if needs_cast:
first = self.astype("float")
second = other.astype("float")
return first._union(second, sort)
result = first._union(second, sort)
return Float64Index(result)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should already be a Float64Index. are there cases where it isnt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, you're right. At some point I was allowing _union to return just an array, so I ended up needing that. But now there's a _shallow_copy at return, so it's redundant. Will revert.


the_union = idx.union(idx[:0], sort=sort)
assert the_union is idx
assert tm.equalContents(the_union, idx)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you use tm.assert_index_equal

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@jreback jreback added the Index Related to the Index class or subclasses label Sep 17, 2020
pandas/core/indexes/api.py Show resolved Hide resolved
@@ -5926,3 +5931,22 @@ def _maybe_asobject(dtype, klass, data, copy: bool, name: Label, **kwargs):
return index.astype(object)

return klass(data, dtype=dtype, copy=copy, name=name, **kwargs)


def get_unanimous_names(*indexes):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you type the input and output

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks really good @iamlemec

only small concern is that we are likely not fully validating whether we are returning a copy for the 0 len cases or where indexes are equal.

but certainly can do as a followup.

@@ -293,6 +293,7 @@ Indexing

- Bug in :meth:`PeriodIndex.get_loc` incorrectly raising ``ValueError`` on non-datelike strings instead of ``KeyError``, causing similar errors in :meth:`Series.__geitem__`, :meth:`Series.__contains__`, and :meth:`Series.loc.__getitem__` (:issue:`34240`)
- Bug in :meth:`Index.sort_values` where, when empty values were passed, the method would break by trying to compare missing values instead of pushing them to the end of the sort order. (:issue:`35584`)
- Harmonize resulting index names from :meth:`Index.union` and :meth:`Index.intersection` across various index types (:issue:`35847`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if there are significant changes that a user would want to know can you add that detail

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also migth warrant a sub-section if this is something easier to see via code-examples

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup, added in a bit about name preservation

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right add the issue number above and remove this note

@@ -124,6 +124,65 @@ def test_corner_union(self, index, fname, sname, expected_name):
expected = index.drop(index).set_names(expected_name)
tm.assert_index_equal(union, expected)

# test copy.union(subset) - need sort for unicode and string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you make this another test

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@iamlemec
Copy link
Contributor Author

Ok, added in a code example to whatsnew showing the new/consistent naming behavior when aggregating. Also split off those new tests.

Regarding copying in the various empty/equality cases, copy here means a deep copy including values, right? Seems like some of the tests are using "is" which breaks when you add in copying. I can look into that and the nan issues in a subsequent PR if that works.

@jreback jreback added this to the 1.2 milestone Sep 19, 2020
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doc comment, ping on green.

Index/column name preservation when aggregating
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When aggregating using :meth:`concat` or the :class:`DataFrame` constructor, Pandas
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add the issue reference here.

Also I would add the same text here in the docs, maybe in the reshaping section, in a note

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, i ended up adding it as a note in the "merge, join, concatenate, and compare" section.

@@ -293,6 +293,7 @@ Indexing

- Bug in :meth:`PeriodIndex.get_loc` incorrectly raising ``ValueError`` on non-datelike strings instead of ``KeyError``, causing similar errors in :meth:`Series.__geitem__`, :meth:`Series.__contains__`, and :meth:`Series.loc.__getitem__` (:issue:`34240`)
- Bug in :meth:`Index.sort_values` where, when empty values were passed, the method would break by trying to compare missing values instead of pushing them to the end of the sort order. (:issue:`35584`)
- Harmonize resulting index names from :meth:`Index.union` and :meth:`Index.intersection` across various index types (:issue:`35847`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right add the issue number above and remove this note


res_name = get_unanimous_names(self, *others)[0]
if this.name != res_name:
return this._shallow_copy(name=res_name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename instead of shallow_copy?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup, indeed. i actually went through and made the same change in base.py and api.py. seems like it works fine all around.

@iamlemec
Copy link
Contributor Author

@jreback fixed the docs indentation issue, that's green now. i'm having trouble figuring out went wrong with some of the other tests. is it some kind of internal CI error?

@jbrockmendel
Copy link
Member

CI failures were unrelated, should be fixed if you merge master

@@ -5935,3 +5941,22 @@ def _maybe_asobject(dtype, klass, data, copy: bool, name: Label, **kwargs):
return index.astype(object)

return klass(data, dtype=dtype, copy=copy, name=name, **kwargs)


def get_unanimous_names(*indexes: Type[Index]) -> Tuple[Any, ...]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Type[Index] seems really weird here. Shouldnt it be Tuple[Index]? And i think the return type should use Label instead of Any

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I was wondering if there was a Label notion! And now I see there's a _typing.py, will keep that handy in the future.

Yeah, I don't know why that extraneous ended up Type there. But actually with an *args situation, you just give the type of the individual elements, so I think it should just be Index.


if self.equals(other):
return self._get_reconciled_name_object(other)

if len(self) == 0:
return self.copy()
return self.copy()._get_reconciled_name_object(other)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the base class you got rid of the _get_reconciled_name_object usage. why going the other direction here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still keeping this around for the rare case where you want to return the same values but with a possible name change. In base.py it only shows up in the one intersection return clause (and not in union because of the union/_union split). It is interesting though that the version of intersection in datetimelike.py has two more quick return clauses than the base class.

@iamlemec
Copy link
Contributor Author

I think we're mostly good here, but the travis-ci build for ARM just kinda bailed without comment.

get_objs_combined_axis,
)
import pandas.core.indexes.base as ibase
from pandas.core.indexes.base import get_unanimous_names
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_unanimous_names is in indexes.api, should get it from there

@@ -471,7 +471,7 @@ def test_intersection_bug(self):
def test_intersection_list(self):
# GH#35876
values = [pd.Timestamp("2020-01-01"), pd.Timestamp("2020-02-01")]
idx = pd.DatetimeIndex(values, name="a")
idx = pd.DatetimeIndex(values)
res = idx.intersection(values)
tm.assert_index_equal(res, idx)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could keep the name in idx and test for tm.assert_index_equal(res, idx.rename(None))?

@@ -46,7 +46,7 @@ def test_join_level_corner_case(idx):

def test_join_self(idx, join_type):
joined = idx.join(idx, how=join_type)
assert idx is joined
assert tm.equalContents(joined, idx)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does assert_index_equal not work here?

@@ -278,7 +278,7 @@ def test_intersection(idx, sort):

# corner case, pass self
the_int = idx.intersection(idx, sort=sort)
assert the_int is idx
assert tm.equalContents(the_int, idx)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert_index_equal?

@iamlemec
Copy link
Contributor Author

Thanks @jbrockmendel! Pushed all your suggested changes.

@jbrockmendel
Copy link
Member

@iamlemec can you merge master; IIRC this was pretty close to ready

@iamlemec
Copy link
Contributor Author

iamlemec commented Oct 7, 2020

@jbrockmendel sure thing, just pushed a rebase onto master

@jreback jreback merged commit 1f67100 into pandas-dev:master Oct 7, 2020
@jreback
Copy link
Contributor

jreback commented Oct 7, 2020

thanks @iamlemec very nice!

@iamlemec
Copy link
Contributor Author

iamlemec commented Oct 7, 2020

thanks for all the help @jreback and @jbrockmendel!

kesmit13 pushed a commit to kesmit13/pandas that referenced this pull request Nov 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Index Related to the Index class or subclasses
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: inconsistent naming when combining indices of various types
3 participants