-
Notifications
You must be signed in to change notification settings - Fork 651
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FIX-#3764: Ensure df.loc with a scalar out of bounds appends to df #3765
FIX-#3764: Ensure df.loc with a scalar out of bounds appends to df #3765
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @RehanSD, this looks like it would fix the issue. Can you add a test?
Codecov Report
@@ Coverage Diff @@
## master #3765 +/- ##
==========================================
+ Coverage 84.98% 89.67% +4.68%
==========================================
Files 253 257 +4
Lines 19124 19674 +550
==========================================
+ Hits 16253 17642 +1389
+ Misses 2871 2032 -839
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
Thank you @devin-petersohn! I'm running into an issue where the dtypes get changed doing this, so once I figure that out, I'll be sure to add some tests! |
@RehanSD when you get back to converting this from draft to ready-to-review, please also update the PR title so it doesn't have an ellipsis but is ready to be included in the changelog. |
This pull request introduces 1 alert when merging 997eaa9ef0aabcb90505d05cd96361ef6fff19d4 into d590de0 - view on LGTM.com new alerts:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a huge fan of special casing like this, but it might be necessary in this case. @YarShev, @vnlitvinov @dchigarev any thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Me neither, but pandas is its own world of sometimes twisted logic, so... |
modin/pandas/indexing.py
Outdated
if ( | ||
isinstance(row_loc, list) | ||
and len(row_loc) == 1 | ||
and row_loc[0] not in self.qc.index | ||
) or ( | ||
not (is_list_like(row_loc) or isinstance(row_loc, slice)) | ||
and row_loc not in self.qc.index | ||
): | ||
row_loc = row_loc[0] if isinstance(row_loc, list) else row_loc | ||
index = self.qc.index.insert(len(self.qc.index), row_loc) | ||
self.qc = self.qc.reindex(labels=index, axis=0) | ||
self.df._update_inplace(new_query_compiler=self.qc) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if ( | |
isinstance(row_loc, list) | |
and len(row_loc) == 1 | |
and row_loc[0] not in self.qc.index | |
) or ( | |
not (is_list_like(row_loc) or isinstance(row_loc, slice)) | |
and row_loc not in self.qc.index | |
): | |
row_loc = row_loc[0] if isinstance(row_loc, list) else row_loc | |
index = self.qc.index.insert(len(self.qc.index), row_loc) | |
self.qc = self.qc.reindex(labels=index, axis=0) | |
self.df._update_inplace(new_query_compiler=self.qc) | |
self._maybe_compute_enlarge_labels(row_loc, col_loc) |
I guess we actually have a method in the Locator class that was brought to solve exactly the same problem but is unused for some reason: _compute_enlarge_labels
.
We may try to replace the current _compute_enlarge_labels
logic with the introduced one and then replace this if-statement with a single call of self._maybe_compute_enlarge_labels
. Does it make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't that require me to compute a new index with the labels from the loc and then call this function? Because the input to the function you linked takes two Indexes and computes the differences, and in this case we already have the difference?
Also should I hold off on merging and change this, or should this be another PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The initial idea of my suggestion was to hide this scary if-construction into some function. We already have a function that is supposed to compute enlargest labels, but it seems that it doesn't do exactly what we want. As far as I understand, it mimics some outdated pandas logic and isn't used anywhere in the code. That's why I propose to remove _compute_enlarge_labels
at all and replace it with some _maybe_enlarge_labels
function that would take row and column locators and perform the same logic that this PR is introducing (check whether row_loc
contains a label that's not presented in the index and insert the missing label into the index if needed).
I would like _maybe_enlarge_labels
to be added in this PR, as for the removal of _compute_enlarge_labels
- I think we should do it in a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense to me! From a naming perspective do we want to name it something like _enlarge_labels
? Happy either way though!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also this special case only applies if we are indexing a single value that doesn't exist. Should this be the case, or should we also be checking if lists with multiple indexers have missing labels and adding them?
e.g. if a df has a range index from 0 to 2, and we do df[3, 4, 5] = df[1] should we add index elements 3, 4, and 5? Because this would error in the current implementation @devin-petersohn @dchigarev
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what pandas does in this case? I guess it's what we have to imitate anyway
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I ran some tests:
Pandas:
>>> import pandas as pd
>>> df = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'])
>>> df.loc[2, 3] = df.loc[1]
ValueError: Incompatible indexer with Series
>>> df
A B 3
0 1.0 2.0 NaN
1 3.0 4.0 NaN
2 NaN NaN NaN
# Reset df
>>> df.loc[[2, 3]] = df.loc[1]
KeyError: "None of [Int64Index([2, 3], dtype='int64')] are in the [index]"
>>> df.loc[[2, 3], :] = df.loc[1]
KeyError: "None of [Int64Index([2, 3], dtype='int64')] are in the [index]"
>>> df.loc[:, 'C', 'D'] = df.loc[:, 'A']
TypeError: unhashable type: 'slice'
>>> df.loc[:, ['C', 'D']] = df.loc[:, 'A']
>>> df
A B C D
0 1 2 1 3
1 3 4 1 3
Modin (without this fix):
>>> import modin.pandas as pd
>>> df = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'])
>>> df.loc[2, 3] = df.loc[1]
KeyError: array([2])
>>> df
A B
0 1 2
1 3 4
>>> df.loc[[2, 3]] = df.loc[1]
KeyError: array([2, 3])
>>> df.loc[[2, 3], :] = df.loc[1]
KeyError: array([2, 3])
>>> df.loc[:, 'C', 'D'] = df.loc[:, 'A']
IndexingError: Too many indexers
>>> df.loc[:, ['C', 'D']] = df.loc[:, 'A']
KeyError: array(['C', 'D'], dtype=object)
>>> df
A B
0 1 2
1 3 4
so there are a lot of places where we are differing from pandas
or where we also error but have different (less transparent) errors.
Modin (with this fix so far):
>>> import modin.pandas as pd
>>> df = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'])
>>> df.loc[2, 3] = df.loc[1]
KeyError: array([3])
>>> df
A B
0 1.0 2.0
1 3.0 4.0
2 NaN NaN
# Reset df
>>> df.loc[[2, 3]] = df.loc[1]
KeyError: array([2, 3])
>>> df.loc[[2, 3], :] = df.loc[1]
KeyError: array([2, 3])
>>> df.loc[:, 'C', 'D'] = df.loc[:, 'A']
IndexingError: Too many indexers
>>> df.loc[:, ['C', 'D']] = df.loc[:, 'A']
KeyError: array(['C', 'D'], dtype=object)
>>> df
A B
0 1 2
1 3 4
I'm leaving this here to document the differences we have so we can start fixing them!
I noticed when experimenting in an interactive python shell that the dtypes of the df change - I think its maybe because of this line here in the query compiler -
_LocIndexer to preserve (and copy for new columns) dtypes?
|
You linked an Example>>> df
a b c
1 1 4 4
2 1 2 3
3 1 2 3
>>> df.dtypes
a int32
b int32
c int32
dtype: object
>>> res = df.reindex([1, 2, 3, 4])
>>> res
a b c
1 1.0 4.0 4.0
2 1.0 2.0 3.0
3 1.0 2.0 3.0
4 NaN NaN NaN
>>> res.dtypes
a float64
b float64
c float64
dtype: object What we could do here is try to detect cases of appearing missing labels after reindex, and if it's not the case pass "copy" as a new Anyway, I think it's a discussion that should be done in a separate issue. |
pandas_df.loc[2] = pandas_df.loc[1] | ||
modin_df.loc[2] = modin_df.loc[1] | ||
df_equals(modin_df, pandas_df) | ||
|
||
pandas_df.loc[6] = pandas_df.loc[1] | ||
modin_df.loc[6] = modin_df.loc[1] | ||
df_equals(modin_df, pandas_df) | ||
|
||
pandas_df.loc[lambda df: 70] = pandas_df.loc[1] | ||
modin_df.loc[lambda df: 70] = modin_df.loc[1] | ||
df_equals(modin_df, pandas_df) | ||
|
||
pandas_df.loc[90] = pandas_df.loc[70] | ||
modin_df.loc[90] = modin_df.loc[70] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about the same case, but when a column is appended?
>>> df
a b c
1 1 2 3
2 1 2 3
3 1 2 3
>>> df.loc[:, "d"] = 1
>>> df
a b c d
1 1 2 3 1
2 1 2 3 1
3 1 2 3 1
This PR only changes the logic of inserting a new row, but not the column. Isn't column inserting is affected by the same bug that was fixed there for rows?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this comment handled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added tests for out of bounds columns in the most recent commit to cover this case.
@RehanSD what is the state of this PR? What makes it a draft? |
@vnlitvinov this PR was put on hold for some more pressing items - its still incomplete, but there's the start of a new implementation for setitem. |
How do folks think we should handle cases like these, where the same thing (one with a Series, and one with a DF) is invalid, but hits different errors (since it seems to go to diff codepaths)? In [32]: pdf.loc[4, 5] = pdf
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [32], in <cell line: 1>()
----> 1 pdf.loc[4, 5] = pdf
File ~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/indexing.py:716, in _LocationIndexer.__setitem__(self, key, value)
713 self._has_valid_setitem_indexer(key)
715 iloc = self if self.name == "iloc" else self.obj.iloc
--> 716 iloc._setitem_with_indexer(indexer, value, self.name)
File ~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/indexing.py:1647, in _iLocIndexer._setitem_with_indexer(self, indexer, value, name)
1642 self.obj[key] = infer_fill_value(value)
1644 new_indexer = convert_from_missing_indexer_tuple(
1645 indexer, self.obj.axes
1646 )
-> 1647 self._setitem_with_indexer(new_indexer, value, name)
1649 return
1651 # reindex the axis
1652 # make sure to clear the cache because we are
1653 # just replacing the block manager here
1654 # so the object is the same
File ~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/indexing.py:1688, in _iLocIndexer._setitem_with_indexer(self, indexer, value, name)
1685 # align and set the values
1686 if take_split_path:
1687 # We have to operate column-wise
-> 1688 self._setitem_with_indexer_split_path(indexer, value, name)
1689 else:
1690 self._setitem_single_block(indexer, value, name)
File ~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/indexing.py:1724, in _iLocIndexer._setitem_with_indexer_split_path(self, indexer, value, name)
1721 if is_list_like_indexer(value) and getattr(value, "ndim", 1) > 0:
1723 if isinstance(value, ABCDataFrame):
-> 1724 self._setitem_with_indexer_frame_value(indexer, value, name)
1726 elif np.ndim(value) == 2:
1727 self._setitem_with_indexer_2d_value(indexer, value)
File ~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/indexing.py:1840, in _iLocIndexer._setitem_with_indexer_frame_value(self, indexer, value, name)
1838 if item in value:
1839 sub_indexer[1] = item
-> 1840 val = self._align_series(
1841 tuple(sub_indexer), value[item], multiindex_indexer
1842 )
1843 else:
1844 val = np.nan
File ~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/indexing.py:2148, in _iLocIndexer._align_series(self, indexer, ser, multiindex_indexer)
2144 return ser._values.copy()
2146 return ser.reindex(ax)._values
-> 2148 raise ValueError("Incompatible indexer with Series")
ValueError: Incompatible indexer with Series
In [33]: pdf.loc[4, 5] = pandas.Series(pdf)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [33], in <cell line: 1>()
----> 1 pdf.loc[4, 5] = pandas.Series(pdf)
File ~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/series.py:367, in Series.__init__(self, data, index, dtype, name, copy, fastpath)
363 else:
365 name = ibase.maybe_extract_name(name, data, type(self))
--> 367 if is_empty_data(data) and dtype is None:
368 # gh-17261
369 warnings.warn(
370 "The default dtype for empty Series will be 'object' instead "
371 "of 'float64' in a future version. Specify a dtype explicitly "
(...)
374 stacklevel=find_stack_level(),
375 )
376 # uncomment the line below when removing the FutureWarning
377 # dtype = np.dtype(object)
File ~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/construction.py:821, in is_empty_data(data)
819 is_none = data is None
820 is_list_like_without_dtype = is_list_like(data) and not hasattr(data, "dtype")
--> 821 is_simple_empty = is_list_like_without_dtype and not data
822 return is_none or is_simple_empty
File ~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/generic.py:1527, in NDFrame.__nonzero__(self)
1525 @final
1526 def __nonzero__(self):
-> 1527 raise ValueError(
1528 f"The truth value of a {type(self).__name__} is ambiguous. "
1529 "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
1530 )
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
```? |
modin/pandas/indexing.py
Outdated
item = item.rename(index=col_loc[0]) | ||
else: | ||
item = item.rename(columns=lambda x: col_loc[0]) | ||
new_qc = self.qc.concat(1, item._query_compiler) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably reuse qc.insert_item
for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
…-like index is out of bounds Signed-off-by: Rehan Durrani <rehan@ponder.io>
Signed-off-by: Rehan Durrani <rehan@ponder.io>
Signed-off-by: Rehan Durrani <rehan@ponder.io>
Co-authored-by: Devin Petersohn <devin-petersohn@users.noreply.github.com>
Signed-off-by: Rehan Durrani <rehan@ponder.io>
Signed-off-by: Rehan Durrani <rehan@ponder.io>
Signed-off-by: Rehan Durrani <rehan@ponder.io>
Signed-off-by: Rehan Durrani <rehan@ponder.io>
Signed-off-by: Rehan Durrani <rehan@ponder.io>
…ix of existing and new columns Signed-off-by: Rehan Durrani <rehan@ponder.io>
Signed-off-by: Rehan Durrani <rehan@ponder.io>
64126e7
to
8918e86
Compare
…to df Signed-off-by: Bill Wang <billiam@ponder.io>
pandas_df.loc[2] = pandas_df.loc[1] | ||
modin_df.loc[2] = modin_df.loc[1] | ||
df_equals(modin_df, pandas_df) | ||
|
||
pandas_df.loc[6] = pandas_df.loc[1] | ||
modin_df.loc[6] = modin_df.loc[1] | ||
df_equals(modin_df, pandas_df) | ||
|
||
pandas_df.loc[lambda df: 70] = pandas_df.loc[1] | ||
modin_df.loc[lambda df: 70] = modin_df.loc[1] | ||
df_equals(modin_df, pandas_df) | ||
|
||
pandas_df.loc[90] = pandas_df.loc[70] | ||
modin_df.loc[90] = modin_df.loc[70] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this comment handled?
@RehanSD @Billy2551 what makes this a draft still? |
Signed-off-by: Bill Wang <billiam@ponder.io>
0bb0465
to
395b7e2
Compare
Signed-off-by: Bill Wang <billiam@ponder.io>
Signed-off-by: Bill Wang <billiam@ponder.io>
modin/pandas/indexing.py
Outdated
"Must have equal len keys and value when setting with an iterable" | ||
) | ||
else: | ||
if item.shape[0] != len(self.qc.index) or item.shape[1] != len(col_loc): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if item.shape[0] != len(self.qc.index) or item.shape[1] != len(col_loc): | |
if item.shape != (len(self.qc.index, len(col_loc)): |
maybe that way it would look cleaner? or are we expecting to receive a non-2D .shape
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
base_index_type = type(base_index) | ||
locator_as_index = base_index_type(locator) | ||
base_as_index = pandas.Index(list(base_index)) | ||
locator_as_index = pandas.Index(list(locator)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why? this seems to be breaking the index type...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had issues with this when the Index type of the base_index was incompatible with the locator Index type.
For example, when base_index
has type pandas.RangeIndex
and locator
has type pandas.Int64Index
, the original code errors with TypeError: Value needs to be a scalar value, was type Int64Index
.
Casting to list extracts all of the actual values and then casts it back to pandas.Index
so that we can find the intersection of the two indexes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay, I see that it isn't returned outside, should be fine then
Signed-off-by: Bill Wang <billiam@ponder.io>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
base_index_type = type(base_index) | ||
locator_as_index = base_index_type(locator) | ||
base_as_index = pandas.Index(list(base_index)) | ||
locator_as_index = pandas.Index(list(locator)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay, I see that it isn't returned outside, should be fine then
In pandas,
df.loc[df.index + 1] = df.loc[index]
copies the last row and increases thedf
's len by 1. When doing the same operation in Modin, we get an error since we're trying to index out of bounds.Signed-off-by: Rehan Durrani rehan@ponder.io
What do these changes do?
flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
git commit -s
docs/developer/architecture.rst
is up-to-date