Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX-#5186: set_index case with multiindex #5190

Merged
merged 7 commits into from
Nov 14, 2022

Conversation

anmyachev
Copy link
Collaborator

@anmyachev anmyachev commented Nov 3, 2022

Signed-off-by: Myachev anatoly.myachev@intel.com

What do these changes do?

The main problem is that the size of the key (column name) does not match the number of multiindex levels. The simplest solution is to complete the key to the desired dimension.

  • first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves BUG: set_index works wrong with multiindex #5186
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date

@anmyachev anmyachev force-pushed the issue5186 branch 2 times, most recently from 87b1955 to fd75856 Compare November 4, 2022 12:15
@anmyachev anmyachev marked this pull request as ready for review November 4, 2022 13:21
@anmyachev anmyachev requested a review from a team as a code owner November 4, 2022 13:21
@dchigarev
Copy link
Collaborator

This PR doesn't seem to fix the root cause of the problem but just adds a hack to make .set_index() work properly.

The actual bug is that the .get_indexer_for() in .take_2d_... method seems to not working as expected for MultiIndex columns

if col_labels is not None:
# Get numpy array of positions of values from `col_labels`
col_positions = self.columns.get_indexer_for(col_labels)

Indeed, passing a single level value to locate postion of the multi-level index fails and returns an invalid index:

>>> df = pd.DataFrame(np.ones([2,4]),columns = [['a','b','c','d'],['','','x','y']])
>>> df.columns.get_indexer_for(["a"])
array([-1], dtype=int64)

There's though another method .get_locs() that works exactly as expected in that case:

>>> df.columns.get_locs(["a"])
array([0], dtype=int64)

I would propose to use .get_locs() method in one way or another to resolve the issue.

I also don't think we should put the fix-logic inside low-level .from_labels() as the PandasDataframe layer shouldn't care about arguments preprocessing. The user's input handling should be located at the modin.pandas level.

So, I have two thoughs from the top of my head of how to fix the issue:

  1. Fix the very root-cause by if-branching in .take_2d_labels_or_positionsl() and use .get_indexer_for() or .get_locs() depending on the index object's type.
  2. Add the logic of completing multi-level indexing inside high-level DataFrame.set_index() that would use .get_locs(), for example:
    if isinstance(columns, MultiIndex):
        keys = columns[columns.get_locs(keys)]

@anmyachev
Copy link
Collaborator Author

I would propose to use .get_locs() method in one way or another to resolve the issue.

get_locs works unexpectedly with another case we are testing. That's why I didn't use it.
Also, I didn't find a multiindex function that would work without my hacks in all test cases.

(Pdb) df.columns.get_locs(['a', ('b', "")])    
FutureWarning: The behavior of indexing on a MultiIndex with a nested sequence of labels is deprecated and will change in a future version. `series.loc[label, sequence]` will raise if any members of 'sequence' or not present in the index's second level. 
To retain the old behavior, use `series.index.isin(sequence, level=1)`
array([0], dtype=int64)

@dchigarev
Copy link
Collaborator

dchigarev commented Nov 7, 2022

get_locs works unexpectedly with another case we are testing

Okay, I see.

It seems that it works inappropriately only in cases when multiple keys with a different number of levels are passed. Thus we still could try to apply .get_locs() for the 2nd option from the comment.

Let's say we would put the following for-loop here in the DataFrame.set_index():

# keys = ["a", ("b", "")]
proper_keys = []
if isinstance(self.columns, MultiIndex):
    for col in keys:
        proper_keys.append(self.columns[self.columns.get_locs(col)])
# proper_keys = [("a", ""), ("b", "")]
self._query_compiler.set_index_from_columns(keys=proper_keys, ...)

How do you think, would that work?


p.s.

just found out that there's also a .get_loc() method that returns location for a single key only. It should fit better when used in grabbing positions in a for-loop one by one.

@anmyachev anmyachev force-pushed the issue5186 branch 3 times, most recently from 1fb581e to 2aef99f Compare November 8, 2022 08:56
Signed-off-by: Myachev <anatoly.myachev@intel.com>
Signed-off-by: Myachev <anatoly.myachev@intel.com>
Signed-off-by: Myachev <anatoly.myachev@intel.com>
Signed-off-by: Myachev <anatoly.myachev@intel.com>
@anmyachev
Copy link
Collaborator Author

@dchigarev ready for review

Signed-off-by: Myachev <anatoly.myachev@intel.com>
Signed-off-by: Myachev <anatoly.myachev@intel.com>
Comment on lines +658 to +659
if isinstance(label, str):
label = [label]
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like get_loc starts iterating over the string, resulting in incorrect results.

Signed-off-by: Myachev <anatoly.myachev@intel.com>
Copy link
Collaborator

@dchigarev dchigarev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Waiting for CI to pass

@anmyachev
Copy link
Collaborator Author

@dchigarev ready for merge

@dchigarev dchigarev merged commit 269ce44 into modin-project:master Nov 14, 2022
@anmyachev anmyachev deleted the issue5186 branch November 14, 2022 20:41
dchigarev pushed a commit that referenced this pull request Nov 25, 2022
Signed-off-by: Myachev <anatoly.myachev@intel.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: set_index works wrong with multiindex
2 participants