FIX-#5186: set_index case with multiindex #5190
This PR doesn't seem to fix the root cause of the problem but just adds a hack to make
The actual bug is that the
modin/modin/core/dataframe/pandas/dataframe/dataframe.py Lines 657 to 659 in c30ab4c
Indeed, passing a single level value to locate a position in the multi-level index fails and returns an invalid index:
>>> df = pd.DataFrame(np.ones([2, 4]), columns=[['a', 'b', 'c', 'd'], ['', '', 'x', 'y']])
>>> df.columns.get_indexer_for(["a"])
array([-1], dtype=int64)
There is, though, another method
>>> df.columns.get_locs(["a"])
array([0], dtype=int64)
I would propose to use
I also don't think we should put the fix logic inside low-level
So, I have two thoughts off the top of my head on how to fix the issue:
(Pdb) df.columns.get_locs(['a', ('b', "")])
FutureWarning: The behavior of indexing on a MultiIndex with a nested sequence of labels is deprecated and will change in a future version. `series.loc[label, sequence]` will raise if any members of 'sequence' or not present in the index's second level.
To retain the old behavior, use `series.index.isin(sequence, level=1)`
array([0], dtype=int64)
Okay, I see. It seems that it misbehaves only when multiple keys with different numbers of levels are passed. Thus we still could try to apply
Let's say we would put the following for-loop here in the
# keys = ["a", ("b", "")]
proper_keys = []
if isinstance(self.columns, MultiIndex):
    for col in keys:
        proper_keys.append(self.columns[self.columns.get_locs(col)])
# proper_keys = [("a", ""), ("b", "")]
self._query_compiler.set_index_from_columns(keys=proper_keys, ...)
What do you think, would that work?
P.S. I just found out that there's also a
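The loop proposed above can be tried outside Modin on a plain pandas frame. The following is only a sketch of the idea, not the code that was merged: `resolve_keys` is a hypothetical helper name, and it assumes every key matches at least one column. Scalar keys are wrapped in a list before calling `get_locs`, and matches are collected as plain tuples.

```python
import numpy as np
import pandas as pd


def resolve_keys(columns, keys):
    """Hypothetical helper: expand partial keys into full column tuples."""
    if not isinstance(columns, pd.MultiIndex):
        # Flat index: keys are already complete labels.
        return list(keys)
    resolved = []
    for key in keys:
        # get_locs expects one selector per level, so a scalar key
        # becomes a one-element sequence and a tuple becomes a list.
        seq = list(key) if isinstance(key, (list, tuple)) else [key]
        locs = columns.get_locs(seq)
        # Indexing the MultiIndex by position yields the full tuples.
        resolved.extend(columns[locs].tolist())
    return resolved


df = pd.DataFrame(
    np.ones([2, 4]),
    columns=[["a", "b", "c", "d"], ["", "", "x", "y"]],
)
proper_keys = resolve_keys(df.columns, ["a", ("b", "")])
print(proper_keys)
```

Note that a partial key matching several columns would expand into several tuples here; whether that is desirable for `set_index` is a separate design question.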
Signed-off-by: Myachev <anatoly.myachev@intel.com>
@dchigarev ready for review
if isinstance(label, str):
    label = [label]
It looks like get_loc starts iterating over the string, resulting in incorrect results.
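The pitfall described in this comment is a general Python one: a string is itself an iterable of characters, so any code that loops over "a sequence of labels" will split a bare string apart. The sketch below is a pure-Python illustration and does not reproduce pandas internals; `lookups_performed` and `normalize_label` are hypothetical names.

```python
def lookups_performed(labels):
    """Stand-in for code that iterates over a sequence of labels."""
    return [item for item in labels]


def normalize_label(label):
    """Wrap a bare string so it is treated as one label, not as characters."""
    if isinstance(label, str):
        return [label]
    return list(label)


# Passing the string directly makes the loop see three separate "labels".
naive = lookups_performed("abc")

# Wrapping it first keeps the label intact.
fixed = lookups_performed(normalize_label("abc"))

print(naive, fixed)
```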
LGTM! Waiting for CI to pass
@dchigarev ready for merge
Signed-off-by: Myachev anatoly.myachev@intel.com
What do these changes do?
The main problem is that the length of the key (column name) does not match the number of multiindex levels. The simplest solution is to pad the key up to the required number of levels.
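A minimal sketch of what padding a key could look like, assuming the missing levels are filled with empty strings as in the example columns from the discussion. `pad_key` and its `fill` parameter are hypothetical names for illustration; this is not necessarily how the PR implements it.

```python
def pad_key(key, nlevels, fill=""):
    """Hypothetical helper: extend a partial key to a full nlevels-tuple."""
    parts = key if isinstance(key, tuple) else (key,)
    if len(parts) > nlevels:
        raise ValueError(
            f"key {key!r} has {len(parts)} levels, expected at most {nlevels}"
        )
    # Append the filler once for every missing level.
    return parts + (fill,) * (nlevels - len(parts))


# Column labels of a two-level index, written out as plain tuples.
columns = [("a", ""), ("b", ""), ("c", "x"), ("d", "y")]
nlevels = 2

padded = [pad_key(k, nlevels) for k in ["a", ("b", "")]]
print(padded)
print(all(k in columns for k in padded))
```

With a real frame, `nlevels` would come from `df.columns.nlevels`. Padding with a fixed filler only finds a column when its remaining levels actually hold that filler value.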
flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
git commit -s
set_index works wrong with multiindex #5186
docs/development/architecture.rst is up-to-date