Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: inconsistent naming when combining indices of various types #35847

Closed
iamlemec opened this issue Aug 22, 2020 · 3 comments · Fixed by #36413
Closed

BUG: inconsistent naming when combining indices of various types #35847

iamlemec opened this issue Aug 22, 2020 · 3 comments · Fixed by #36413
Labels
API - Consistency Internal Consistency of API/Behavior Needs Discussion Requires discussion from core team before further action
Milestone

Comments

@iamlemec
Copy link
Contributor

iamlemec commented Aug 22, 2020

This is building off of some issues seen in #13475 and subsequent PR. Essentially, the resulting index name when combining existing named indices with union_indexes is not consistent across index types. This primarily affects concat and the DataFrame constructor, which call union_indexes.

When combining named indices, there are three main name resolution rules I can think of:

  • ignore: assign no name to output
  • unanimous: assign name if all names agree
  • consensus: assign name only if one unique non-null name

With some testing, below is my best understanding of what resolution rule the various index types use. Note that the behavior may differ depending on whether the indices are numerically equal or not, as with RangeIndex. For the not numerically equal case:

  • Index: consensus
  • RangeIndex: ignore
  • Int64Index: unanimous
  • Float64Index: unanimous
  • DateTimeIndex: consensus
  • TimeDeltaIndex: consensus
  • PeriodIndex: unanimous
  • CategoricalIndex: unanimous
  • MultiIndex: unanimous (over all levels)

I'm not really taking a stand on the correct name resolution rule, but I think they should at least be consistent across index type! And of course MultiIndex is a bit more complicated. Seems possible things could be implemented in common for non-multi-indices in the higher level union function? I'm not really sure, but I'm happy to put some work into it.

The test is slightly different for each index type, but for the RangeIndex case, here's an MWE:

idx1 = pd.RangeIndex(0, 5, name='idx')
idx2 = pd.RangeIndex(2, 7, name='idx')
pd.core.indexes.api.union_indexes([idx1, idx2])

and the result will have no name (checked on master).

@iamlemec iamlemec added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 22, 2020
@dsaxton dsaxton added API - Consistency Internal Consistency of API/Behavior Needs Discussion Requires discussion from core team before further action and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 22, 2020
@jreback
Copy link
Contributor

jreback commented Aug 23, 2020

all index combinations should have the consensus name or None (if different)

am pretty sure we test this

but it's possible it's not fully tested on - happy to have a PR to do this and isolate and edge cases

@iamlemec
Copy link
Contributor Author

Ok, sounds good. Just looking at the RangeIndex case, there doesn't appear to be testing of union or naming, so I can go through and add those in as needed.

As for the implementation, would it be okay to add common naming logic in Index.union rather than reimplement it in _union for each subclass?

The other thing is that these same issue arise with other operations such as intersection. These tend to use get_op_result_name for name resolution. However, that appears to use the unanimous rule. It sort of tries to be consensus, but since it uses hasattr(obj, name) instead of obj.name is not None, it doesn't quite get there.

@iamlemec
Copy link
Contributor Author

Ok, seems like these changes are feasible. Working through getting testing 100% now. Just to double check, we're saying that all of union, intersection, difference, and symmetric_difference should use consensus naming? It seems that indexes/test_base.py::TestIndex::test_intersection_name_preservation currently expects naming to be unanimous for intersection, so I just want to make sure.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API - Consistency Internal Consistency of API/Behavior Needs Discussion Requires discussion from core team before further action
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants