Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: loc dropping levels when df has only one row #38150

Merged
merged 13 commits into from
Dec 30, 2020
Merged

Conversation

phofl
Copy link
Member

@phofl phofl commented Nov 29, 2020

Originally this case was dispatched to xs, which dispatched to _get_loc_level, which raised if a slice is not slice(None, None) and went back to our original function. If the DataFrame contains only one row, _get_loc_level does not raise and hence we drop levels.

I added len(self.obj) for performance reasons becasue I am not sure if the any check for slices is faster than supressing the IndexingError in general.

@phofl phofl added Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex labels Nov 29, 2020
@jbrockmendel
Copy link
Member

If the DataFrame contains only one row, _get_loc_level does not raise and hence we drop levels.

Should the _get_loc_level behavior correct?

@phofl
Copy link
Member Author

phofl commented Nov 30, 2020

From reading the code I think this is the intended behavior, but not quite sure.

@@ -838,8 +838,10 @@ def _getitem_nested_tuple(self, tup: Tuple):
if self.name != "loc":
# This should never be reached, but lets be explicit about it
raise ValueError("Too many indices")
with suppress(IndexingError):
return self._handle_lowerdim_multi_index_axis0(tup)
if len(self.obj) > 1 or not any(isinstance(x, slice) for x in tup):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm can we just handle this properly in _handle_lower_multi_index_axis0 ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could raise an indexingError in there. We can not call xs.

            axis = self.axis or 0
            return self._getitem_axis(tup, axis=axis)

Would this be preferably to the if condition outside?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is the len(self.obj) needed here? that smells a bit

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Performance reasons, not necessary from a technical standpoint. Will remove it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok pls add a comment then

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@jbrockmendel
Copy link
Member

It looks like jreback and I had different suggestions, but the theme to both is "can this be handled any earlier?" have you ruled out the lower-level options?

@phofl
Copy link
Member Author

phofl commented Dec 3, 2020

Sorry tough week at work. We can not call xs, because this would return wrong results. Do you mean mit earlier before reaching this point? Because I understood the suggestions as doing this deeper down in the calling stack

@jbrockmendel
Copy link
Member

Do you mean mit earlier before reaching this point? Because I understood the suggestions as doing this deeper down in the calling stack

Deeper in the call stack is ideal, yes.

@jbrockmendel
Copy link
Member

@phofl are you of the opinion this is the best available approach? you're pretty well versed in this part of the code and im inclined to trust your opinion.

@phofl
Copy link
Member Author

phofl commented Dec 22, 2020

Forgot that one unfortunately, had 3 pretty busy weeks at work, but will be better now.

I am not 100% sure, but I tried this out a bit locally. We have to do this if condition somewhere before dispatching to xs, _get_label itself is not really an option, would be too ugly. Doing it in _handle_lowerdim_multi_index_axis0 is possible, but would be quite ugly, because the if condition is the same case as raising TypeError or InvalidIndexError. This would be quite ugly, also I think this is just unnecessary convolution for the other case where _handle_lowerdim_multi_index_axis0 is called.

@@ -838,8 +838,10 @@ def _getitem_nested_tuple(self, tup: Tuple):
if self.name != "loc":
# This should never be reached, but lets be explicit about it
raise ValueError("Too many indices")
with suppress(IndexingError):
return self._handle_lowerdim_multi_index_axis0(tup)
if len(self.obj) > 1 or not any(isinstance(x, slice) for x in tup):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is the len(self.obj) needed here? that smells a bit

@jreback jreback added this to the 1.3 milestone Dec 24, 2020
@phofl
Copy link
Member Author

phofl commented Dec 24, 2020

Was a bit hasty. The problem only occurs for Objects with only one row and slices in indexer. Objects with more than one row have to go in there, no matter if the indexer has slices.

@jreback
Copy link
Contributor

jreback commented Dec 24, 2020

Was a bit hasty. The problem only occurs for Objects with only one row and slices in indexer. Objects with more than one row have to go in there, no matter if the indexer has slices.

this is very strange then, these must be hitting a different path that other types, no? maybe something different in the indexing engine itself? its pretty odd to special case like this.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment and can you merge master

@@ -838,8 +838,10 @@ def _getitem_nested_tuple(self, tup: Tuple):
if self.name != "loc":
# This should never be reached, but lets be explicit about it
raise ValueError("Too many indices")
with suppress(IndexingError):
return self._handle_lowerdim_multi_index_axis0(tup)
if len(self.obj) > 1 or not any(isinstance(x, slice) for x in tup):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok pls add a comment then

@phofl
Copy link
Member Author

phofl commented Dec 29, 2020

Looked into this again, the issue is with Series, the dimension of the MultiIndex should be reduced there, #6022

@jreback
Copy link
Contributor

jreback commented Dec 29, 2020

cc @jbrockmendel if any commetns.

@@ -842,8 +842,14 @@ def _getitem_nested_tuple(self, tup: Tuple):
if self.name != "loc":
# This should never be reached, but lets be explicit about it
raise ValueError("Too many indices")
with suppress(IndexingError):
return self._handle_lowerdim_multi_index_axis0(tup)
if isinstance(self.obj, ABCSeries) or not any(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.ndim == 1 is faster than this isinstance check (41 ns vs 297 ns, so not a huge deal either way)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is better to read too :)

ser = Series([0], index=mi)
result = ser.loc["x", :, "z"]
expected = Series([0], index=Index(["y"], name="b"))
tm.assert_series_equal(result, expected)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

big picture, why isnt ser_expected = frame_expected["d"]?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Series drops levels of MultiIndex (for example #6022), while DataFrame should keep them. The current behavior is, that DataFrame keeps them all the times except when it has one row.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is #38150 (comment) saying that you plant to change the Series behavior to match DataFrame?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the comment was meant to clarify why we have to check for ndim==1

@jreback jreback merged commit bbd0f66 into pandas-dev:master Dec 30, 2020
@jreback
Copy link
Contributor

jreback commented Dec 30, 2020

thanks @phofl

@phofl phofl deleted the 10521 branch December 30, 2020 13:51
luckyvs1 pushed a commit to luckyvs1/pandas that referenced this pull request Jan 20, 2021
@@ -842,8 +842,12 @@ def _getitem_nested_tuple(self, tup: Tuple):
if self.name != "loc":
# This should never be reached, but lets be explicit about it
raise ValueError("Too many indices")
with suppress(IndexingError):
return self._handle_lowerdim_multi_index_axis0(tup)
if self.ndim == 1 or not any(isinstance(x, slice) for x in tup):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@phofl the more i look at this the weirder it seems to be treating Series/DataFrame differently. Are you clear on if there's a reason for this vs just historical coincidence?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel

sorry for the slow reply. I am pretty sure that this is for historical reasons, but would have to take a closer look.

I am currently on vacation until mid July. Can get back to you afterwards, will also have more time for pandas then again.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no worries, enjoy your vacation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: inconsisten multi-level indexing when levels are dropped
3 participants