-
Notifications
You must be signed in to change notification settings - Fork 653
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FIX-#4407: Align insert
function with pandas in case of numpy array with several columns
#4408
Conversation
Signed-off-by: Anatoly Myachev <anatoliimyachev@mail.com>
Codecov Report
@@ Coverage Diff @@
## master #4408 +/- ##
==========================================
+ Coverage 86.75% 89.47% +2.71%
==========================================
Files 222 222
Lines 18042 18038 -4
==========================================
+ Hits 15653 16139 +486
+ Misses 2389 1899 -490
📣 Codecov can now indicate which changes are the most critical in Pull Requests. Learn more |
Signed-off-by: Anatoly Myachev <anatoliimyachev@mail.com>
Signed-off-by: Anatoly Myachev <anatoliimyachev@mail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some questions + think we should update the tests, but other than that, looks good to me!
@@ -1161,15 +1168,11 @@ def insert(self, loc, column, value, allow_duplicates=False): # noqa: PR01, D20 | |||
raise ValueError("Length of values does not match length of index") | |||
if not allow_duplicates and column in self.columns: | |||
raise ValueError(f"cannot insert {column}, already exists") | |||
if loc > len(self.columns): | |||
if not -len(self.columns) <= loc <= len(self.columns): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pandas.insert requires 0 <= loc <= len(df.columns)
. Why do we have support for negative indices here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of lines below another exception is thrown - raise ValueError("unbounded slice")
if loc < 0. That is, two different exceptions in the case when the loc
is negative.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked in pandas, and the case where loc is negative but greater than -len(self.columns)
should result in the second error, so I think this code should just say this:
if not -len(self.columns) <= loc <= len(self.columns): | |
if not loc <= len(self.columns): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What version did you check on? On the latest release, the behavior is different.
>>> import pandas as pd
>>> pd.__version__
'1.4.2'
>>> data = {"col1": [1, 2, 3], "col2": [2, 3, 4]}
>>> df = pd.DataFrame(data)
>>> df
col1 col2
0 1 2
1 2 3
2 3 4
>>> df.insert(-1, "col3", df["col2"])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "...\Miniconda3\envs\modin\lib\site-packages\pandas\core\frame.py", line 4445, in insert
self._mgr.insert(loc, column, value)
File "...\Miniconda3\envs\modin\lib\site-packages\pandas\core\internals\managers.py", line 1241, in insert
bp = BlockPlacement(slice(loc, loc + 1))
File "pandas\_libs\internals.pyx", line 56, in pandas._libs.internals.BlockPlacement.__cinit__
File "pandas\_libs\internals.pyx", line 318, in pandas._libs.internals.slice_canonize
ValueError: unbounded slice
>>> df.insert(-3, "col3", df["col2"])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "...\Miniconda3\envs\modin\lib\site-packages\pandas\core\frame.py", line 4445, in insert
self._mgr.insert(loc, column, value)
File "...\Miniconda3\envs\modin\lib\site-packages\pandas\core\internals\managers.py", line 1230, in insert
new_axis = self.items.insert(loc, item)
File "...\Miniconda3\envs\modin\lib\site-packages\pandas\core\indexes\base.py", line 6602, in insert
new_values = np.insert(arr, loc, casted)
File "<__array_function__ internals>", line 180, in insert
File "...\Miniconda3\envs\modin\lib\site-packages\numpy\lib\function_base.py", line 5280, in insert
raise IndexError(f"index {obj} is out of bounds for axis {axis} "
IndexError: index -3 is out of bounds for axis 0 with size 2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, df.insert()
raises different exception types starting from 1.4.0, since that version this line (which is basicaly an Index.insert
) started to fail with an IndexError if the location is out abs(idx_len)
bounds.
BTW, this overall if-elsing that's based on pandas internals that changes from version to version seems very unstable (we would again need to change this if pandas change something else there), I don't have though a good solution for how to make our code stable and preserve type-exceptions compatibility with pandas at the same time. For now the approach is ok, but we should start thinking in the direction of solving this exceptions-type compatibility problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried the exact same code snippet on 1.4.2 and I got a different error:
Python 3.8.12 (default, Oct 12 2021, 06:23:56)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.30.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import pandas as pd
pd.__version__
In [2]: pd.__version__
Out[2]: '1.4.2'
In [3]: data = {"col1": [1, 2, 3], "col2": [2, 3, 4]}
In [4]: df = pd.DataFrame(data)
In [5]: df.insert(-1, "col3", df["col2"])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-5-ecea5cb625c7> in <module>
----> 1 df.insert(-1, "col3", df["col2"])
~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/frame.py in insert(self, loc, column, value, allow_duplicates)
4443
4444 value = self._sanitize_column(value)
-> 4445 self._mgr.insert(loc, column, value)
4446
4447 def assign(self, **kwargs) -> DataFrame:
~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/internals/managers.py in insert(self, loc, item, value)
1239 value = ensure_block_shape(value, ndim=self.ndim)
1240
-> 1241 bp = BlockPlacement(slice(loc, loc + 1))
1242 block = new_block_2d(values=value, placement=bp)
1243
pandas/_libs/internals.pyx in pandas._libs.internals.BlockPlacement.__cinit__()
pandas/_libs/internals.pyx in pandas._libs.internals.slice_canonize()
ValueError: unbounded slice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@RehanSD with loc=-1
, I also have this exception (unbounded slice
). Another exception (IndexError: index -3 is out of bounds for axis 0 with size 2
) I get when loc=-3
(my results above are for two runs).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right,
df.insert()
raises different exception types starting from 1.4.0, since that version this line (which is basicaly anIndex.insert
) started to fail with an IndexError if the location is outabs(idx_len)
bounds.BTW, this overall if-elsing that's based on pandas internals that changes from version to version seems very unstable (we would again need to change this if pandas change something else there), I don't have though a good solution for how to make our code stable and preserve type-exceptions compatibility with pandas at the same time. For now the approach is ok, but we should start thinking in the direction of solving this exceptions-type compatibility problem.
I think before finding a solution to this problem, we should explicitly compare messages in exceptions during testing.
Co-authored-by: Rehan Sohail Durrani <rdurrani@berkeley.edu>
Signed-off-by: Anatoly Myachev <anatoliimyachev@mail.com>
Signed-off-by: Anatoly Myachev <anatoliimyachev@mail.com>
Signed-off-by: Anatoly Myachev <anatoliimyachev@mail.com>
raise ValueError( | ||
f"Wrong number of items passed {len(value.columns)}, placement implies 1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deprecated message.
@RehanSD ready for review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some comments, but I think we're really close, thank you!
@@ -1161,15 +1168,11 @@ def insert(self, loc, column, value, allow_duplicates=False): # noqa: PR01, D20 | |||
raise ValueError("Length of values does not match length of index") | |||
if not allow_duplicates and column in self.columns: | |||
raise ValueError(f"cannot insert {column}, already exists") | |||
if loc > len(self.columns): | |||
if not -len(self.columns) <= loc <= len(self.columns): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked in pandas, and the case where loc is negative but greater than -len(self.columns)
should result in the second error, so I think this code should just say this:
if not -len(self.columns) <= loc <= len(self.columns): | |
if not loc <= len(self.columns): |
Co-authored-by: Rehan Sohail Durrani <rdurrani@berkeley.edu>
Signed-off-by: Anatoly Myachev <anatoliimyachev@mail.com>
@RehanSD ready for review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For some reason the bug we're getting seems to be different even when we're on the same pandas versions - perhaps its a python version thing? I think we should probably resolve this before merging, but other than that, lgtm!
Signed-off-by: Anatoly Myachev <anatoliimyachev@mail.com>
@RehanSD please take a look at #4408 (comment) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a single comment regarding testing, otherwise LGTM
Hi @RehanSD! Hope you are well. Is there anything else that needs to be fixed? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Great work!
Signed-off-by: Anatoly Myachev anatoliimyachev@mail.com
What do these changes do?
flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
git commit -s
insert
function is different from pandas in case of numpy array with several columns #4407docs/development/architecture.rst
is up-to-date