-
-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: validate Index data is 1D + deprecate multi-dim indexing #30588
Conversation
There is a feather test in which we do
and in this PR we are now raising on the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. ping on green. i think should add a whatsnew note
@jbrockmendel that test setup looks bad. We're assigning a length-1 tuple whose only element is a MultiIndex to be the single key? In [7]: df = pd.DataFrame({"A": [1, 2, 3]})
In [8]: df.columns = pd.MultiIndex.from_tuples([("a", 1), ("a", 2), ("a", 3)]),
In [9]: df
Out[9]: ---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
~/sandbox/pandas/pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.map_locations()
1637 raise KeyError(key)
1638
-> 1639 def map_locations(self, ndarray[object] values):
1640 cdef:
1641 Py_ssize_t i, n = len(values)
ValueError: Buffer has wrong number of dimensions (expected 1, got 3)
Exception ignored in: 'pandas._libs.index.IndexEngine._call_map_locations'
Traceback (most recent call last):
File "pandas/_libs/hashtable_class_helper.pxi", line 1639, in pandas._libs.hashtable.PyObjectHashTable.map_locations
def map_locations(self, ndarray[object] values):
ValueError: Buffer has wrong number of dimensions (expected 1, got 3) I think we'd rather have a length-1 MultiIndex diff --git a/pandas/tests/io/test_feather.py b/pandas/tests/io/test_feather.py
index e06f2c31a2..3500470035 100644
--- a/pandas/tests/io/test_feather.py
+++ b/pandas/tests/io/test_feather.py
@@ -136,7 +136,7 @@ class TestFeather:
# column multi-index
df.index = [0, 1, 2]
- df.columns = (pd.MultiIndex.from_tuples([("a", 1), ("a", 2), ("b", 1)]),)
+ df.columns = pd.MultiIndex.from_tuples([("a", 1)])
self.check_error_on_write(df, ValueError)
def test_path_pathlib(self): |
That seems to do the trick, thanks. |
ping, whatsnew added |
Do we have tests that ensure |
It's definitely been decided on for DatetimeIndex, which does that ATM. For the rest, I think this is #27837. |
It seems like in #27837 the preference of most was to deprecate indexing with 2-D indexers. I'm concerned about changing the behavior here to return an ndarray, only to deprecate it in the future. I suppose that validating we don't have 2D data necessitates the change to |
I guess we _could_ call simple_new from getitem, but I’d prefer to just
return ndarray. Could also deprecate saying it will raise instead of
returning ndarray in the future.
…On Thu, Jan 2, 2020 at 2:11 PM Tom Augspurger ***@***.***> wrote:
It seems like in #27837
<#27837> the preference of
most was to deprecate indexing with 2-D indexers. I'm concerned about
changing the behavior here to return an ndarray, only to deprecate it in
the future.
I suppose that validating we don't have 2D data necessitates the change to
Index.__getitem__ to return an ndarray?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#30588?email_source=notifications&email_token=AB5UM6HJ66T5NLKXZDNLJ6TQ3ZRA7A5CNFSM4KBX27QKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEH7TDLA#issuecomment-570372524>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AB5UM6D6ELAB6XUXV4N2JSDQ3ZRA7ANCNFSM4KBX27QA>
.
|
That's also an option. @jreback @jorisvandenbossche thoughts on that? I'm slightly OK with breaking API to return an ndarray rather than an Index here, since the behavior on master is so clearly broken. But I also don't want to establish |
re-reading the original, I think we are all in favor of deprecate & raise. This PR now raises for construction with a 2D. I don't see any behavior change on getitem, so that still returns an ndarray? |
This does change |
ok and tests are added for this? or changed ones? |
Yes, in test_indexing tests are updated to check for the expected exception when indexing with non-1D. |
I am personally fine with the 2D Index -> ndarray change in getitem now, if that makes it easier to fix the Index construction from 2D data bug (in #27837 (comment), I also mentioned such a change probably has not too much impact, at least not for the matplotlib use case). |
DeprecationWarning or FutureWarning? I don't have a preference. We'll need a doc note for the API change and the deprecation. |
Let's start with a DeprecationWarning, as this is typically something that will be used by libraries I think. We have plenty of time to change it to FutureWarning before 2.0. |
sgtm |
pandas/core/indexes/base.py
Outdated
# Deprecation GH#30588 | ||
warnings.warn( | ||
"Support for Index[:, None] is deprecated and will be " | ||
"removed in a future version.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment here about Index[:, None]
and recommended alternative as I posted above for the whatsnew
Co-Authored-By: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Planning to do the RC in ~5 hours. |
@@ -500,8 +500,11 @@ def __getitem__(self, value): | |||
|
|||
# scalar | |||
if not isinstance(left, ABCIndexClass): | |||
if isna(left): | |||
if is_scalar(left) and isna(left): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My intention with this block of code was to only handle scalars here. Since self.left
and self.right
are always indexes, and we're using __getitem__
on them to get left
/right
, my assumption at the time was that left
/right
would either always be a scalar or an Index
(1d), so not isinstance(left, ABCIndexClass)
would imply scalar (I guess could use is_scalar
instead of the isinstance
).
With this PR it looks like Index.__getitem__
can return a scalar, Index
, or ndarray
with ndim
> 1? With the last case being temporary until we remove this behavior? Or am I omitting a case where something else could be returned? If these are the only three cases, could we handle the ndim
> 1 case separately before this if
block?
Something like:
left = self.left[value]
right = self.right[value]
# TODO: remove this block when Index.__getitem__ returning ndim > 1 is deprecated
if np.ndim(left) > 1:
# GH#30588 multi-dimensional indexer disallowed
raise ValueError(...)
# scalar
if not isinstance(left, ABCIndexClass):
....
Makes the logic less nested and easier to remove when we go through with the deprecation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated so that this raises ValueError for >1dim indexing on IntervalArray
@TomAugspurger I have to head out in about half an hour, which IIRC is about when you wanted to cut the RC. Anything else to do here? (assuming Travis finished green) |
I think we're good. Will let it sit until just before I tag the rc, in case @jreback has a chance to look. But I think we can do followups after the RC. |
Taking a look |
) | ||
result = self._codes[key] | ||
if result.ndim > 1: | ||
return result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should also raise a warning?
(but adding the warning is not a blocker for the RC, since that's a future deprecation, as long as the behaviour change is already there)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, we are returning plain integer codes here, which doesn't look good.
And checked with 0.25, and there multi-dim indexing on categorical/categoricalindex basically fails (it "returns", but whenever you do something it gives an error; even converting to the result to a numpy array (the original matplotlib use case for the 2d indexing) fails).
So I would also raise an error here, like you did for some other arrays.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed this should raise to match Interval. Will push an update.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not straightforward, as actually this already gets catched before in the boolean checking ..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I would be fine with doing a clean-up for my comments in a follow-up PR)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this is a bit tricky. Will look a bit longer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this raised an error in 0.25, it raises a different (but wrong) error now, so I think fine to do in a follow-up
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm very confused by what's going on with with Series.__getitem__
here. Are we OK with changing this for categorical after the RC?
def test_getitem_2d_deprecated(self): | ||
# GH#30588 multi-dim indexing is deprecated, but raising is also acceptable | ||
idx = self.create_index() | ||
with pytest.raises(ValueError, match="cannot mask with array containing NA"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't test wrong error messages ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may be a bug in how we're handling tuples. We convert the tuple (slice(None), None)
to an ndarray
array([slice(None, None, None), None], dtype=object)
which gets interpreteded as a boolean mask with missing values, and so raise.
I'm going to update the test to have a 2d ndarray as the indexer, nad open an issue for the bug.
idx = self.create_index() | ||
with pytest.raises(ValueError, match="multi-dimensional indexing not allowed"): | ||
with tm.assert_produces_warning(DeprecationWarning, check_stacklevel=False): | ||
idx[:, None] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this only raise the error? (while trying out this branch, for me it also does)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keep in mind, it's a deprecation warning and may not be visible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, that's it
In [11]: warnings.simplefilter("always", DeprecationWarning)
In [12]: idx[np.array([[0, 1]])]
/Users/taugspurger/sandbox/pandas/pandas/core/arrays/interval.py:498: DeprecationWarning: Support for multi-dimensional indexing (e.g. `index[:, None]`) on an Index is deprecated and will be removed in a future version. Convert to a numpy array before indexing instead.
left = self.left[value]
/Users/taugspurger/sandbox/pandas/pandas/core/arrays/interval.py:499: DeprecationWarning: Support for multi-dimensional indexing (e.g. `index[:, None]`) on an Index is deprecated and will be removed in a future version. Convert to a numpy array before indexing instead.
right = self.right[value]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-12-bd6337c200ee> in <module>
----> 1 idx[np.array([[0, 1]])]
~/sandbox/pandas/pandas/core/indexes/interval.py in __getitem__(self, value)
990
991 def __getitem__(self, value):
--> 992 result = self._data[value]
993 if isinstance(result, IntervalArray):
994 return self._shallow_copy(result)
~/sandbox/pandas/pandas/core/arrays/interval.py in __getitem__(self, value)
505 if np.ndim(left) > 1:
506 # GH#30588 multi-dimensional indexer disallowed
--> 507 raise ValueError("multi-dimensional indexing not allowed")
508 return Interval(left, right, self.closed)
509
ValueError: multi-dimensional indexing not allowed
I think that's OK for now, but can fix up later if desired.
idxr[nd3] | ||
else: | ||
with pytest.raises(ValueError, match=msg): | ||
with tm.assert_produces_warning(DeprecationWarning): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here, is it needed to have both contexts?
@@ -66,11 +66,10 @@ def test_registering_no_warning(self): | |||
|
|||
# Set to the "warn" state, in case this isn't the first test run | |||
register_matplotlib_converters() | |||
with tm.assert_produces_warning(None) as w: | |||
with tm.assert_produces_warning(DeprecationWarning, check_stacklevel=False): | |||
# GH#30588 DeprecationWarning from 2D indexing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we shouldn't assert here, but just ignore warnings. When matplotlib fixes this (hopefully they will do soon), this test would otherwise fail
(can be fixed later)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Working on a fix on the mpl side now.
off topic: running the tests has started leaving behind a bunch of gibberish.pickle files. is that just me? |
So I think the only potential behavior change is for categorical at #30588 (comment). We might want that to raise for non 1-d indexers, but I can't figure things out. I'm comfortable changing that after the RC, so I'm planning to merge this in a bit (30 minutes or so). |
I’m AFK for a few more hours, feel free to push edits.
…On Thu, Jan 9, 2020 at 1:02 PM Tom Augspurger ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In pandas/core/arrays/categorical.py
<#30588 (comment)>:
> @@ -2007,9 +2007,10 @@ def __getitem__(self, key):
if com.is_bool_indexer(key):
key = check_bool_array_indexer(self, key)
- return self._constructor(
- values=self._codes[key], dtype=self.dtype, fastpath=True
- )
+ result = self._codes[key]
+ if result.ndim > 1:
+ return result
I'm very confused by what's going on with with Series.__getitem__ here.
Are we OK with changing this for categorical after the RC?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#30588?email_source=notifications&email_token=AB5UM6GIS5HB7GM2GGFB3XLQ46GFBA5CNFSM4KBX27QKYY3PNVWWK3TUL52HS4DFWFIHK3DMKJSXC5LFON2FEZLWNFSXPKTDN5WW2ZLOORPWSZGOCRIF7EA#discussion_r364958045>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AB5UM6GAQOIUPYZCZKAEIF3Q46GFBANCNFSM4KBX27QA>
.
|
To be clear, the 2D indexing case on categorical is already completely broken in pandas 0.25.0 as well, so I think the (wrong) error now is fine. We can fix the error message after the RC. |
Thanks! Opening followups now. |
…ndexing-1row-df * upstream/master: (284 commits) CLN: leftover ix checks (pandas-dev#30951) CI: numpydev changed double to single quote (pandas-dev#30952) DOC: Fix whatsnew contributors section (pandas-dev#30926) STY: wrong placed space in strings (pandas-dev#30940) TYP: type up parts of series.py (pandas-dev#30761) DOC: Fix SS03 docstring error (pandas-dev#30939) CLN: remove unnecesary _date_check_type (pandas-dev#30932) DOC: Fixture docs in pandas/conftest.py (pandas-dev#30917) CLN: F-strings (pandas-dev#30916) replace syntax with f-string (pandas-dev#30919) BUG: pickle files left behind by tm.round_trip_pickle (pandas-dev#30906) TYP: offsets (pandas-dev#30897) TYP: typing annotations (pandas-dev#30901) WEB: Remove from roadmap moving the docstring script (pandas-dev#30893) WEB: Removing Discourse links (pandas-dev#30890) DOC: Encourage use of pre-commit in the docs (pandas-dev#30864) DEPR: fix missing stacklevel in pandas.core.index deprecation (pandas-dev#30878) CLN: remove unnecessary overriding in subclasses (pandas-dev#30875) RLS: 1.0.0rc0 BUG: validate Index data is 1D + deprecate multi-dim indexing (pandas-dev#30588) ... # Conflicts: # doc/source/whatsnew/v1.0.0.rst
Hi @jbrockmendel @TomAugspurger @jreback , I have a pandas dataframe with the following 4 columns
This is the result of predicted dataframe (called forecast) using Prophet time series model. |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
This changes the behavior of
idx[:, None]
to return an ndarray instead of an invalid Index, needed to keep matplotlib tests working.See also #20285 which this does not entirely close. That becomes pretty easy to address though once a decision is made on whether to treat
[[0, 1], [2, 3]]
like[(0, 1), (2, 3)]
(the latter becomes a MultiIndex, the former currently becomes an invalid Index)