Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: validate Index data is 1D + deprecate multi-dim indexing #30588

Merged
merged 16 commits into from
Jan 9, 2020

Conversation

jbrockmendel
Copy link
Member

This changes the behavior of idx[:, None] to return an ndarray instead of an invalid Index, needed to keep matplotlib tests working.

See also #20285 which this does not entirely close. That becomes pretty easy to address though once a decision is made on whether to treat [[0, 1], [2, 3]] like [(0, 1), (2, 3)] (the latter becomes a MultiIndex, the former currently becomes an invalid Index)

@jbrockmendel
Copy link
Member Author

There is a feather test in which we do

         df.columns = (pd.MultiIndex.from_tuples([("a", 1), ("a", 2), ("b", 1)]),)
         self.check_error_on_write(df, ValueError)

and in this PR we are now raising on the df.columns = ... instead of getting through to fail on the next line. Is there a different invalid index we can pass to restore this test? cc @jorisvandenbossche

@jreback jreback added Constructors Series/DataFrame/Index/pd.array Constructors Index Related to the Index class or subclasses labels Jan 1, 2020
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. ping on green. i think should add a whatsnew note

@TomAugspurger
Copy link
Contributor

There is a feather test in which we do

@jbrockmendel that test setup looks bad. We're assigning a length-1 tuple whose only element is a MultiIndex to be the single key?

In [7]: df = pd.DataFrame({"A": [1, 2, 3]})

In [8]: df.columns = pd.MultiIndex.from_tuples([("a", 1), ("a", 2), ("a", 3)]),

In [9]: df
Out[9]: ---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~/sandbox/pandas/pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.map_locations()
   1637             raise KeyError(key)
   1638
-> 1639     def map_locations(self, ndarray[object] values):
   1640         cdef:
   1641             Py_ssize_t i, n = len(values)

ValueError: Buffer has wrong number of dimensions (expected 1, got 3)
Exception ignored in: 'pandas._libs.index.IndexEngine._call_map_locations'
Traceback (most recent call last):
  File "pandas/_libs/hashtable_class_helper.pxi", line 1639, in pandas._libs.hashtable.PyObjectHashTable.map_locations
    def map_locations(self, ndarray[object] values):
ValueError: Buffer has wrong number of dimensions (expected 1, got 3)

I think we'd rather have a length-1 MultiIndex

diff --git a/pandas/tests/io/test_feather.py b/pandas/tests/io/test_feather.py
index e06f2c31a2..3500470035 100644
--- a/pandas/tests/io/test_feather.py
+++ b/pandas/tests/io/test_feather.py
@@ -136,7 +136,7 @@ class TestFeather:
 
         # column multi-index
         df.index = [0, 1, 2]
-        df.columns = (pd.MultiIndex.from_tuples([("a", 1), ("a", 2), ("b", 1)]),)
+        df.columns = pd.MultiIndex.from_tuples([("a", 1)])
         self.check_error_on_write(df, ValueError)
 
     def test_path_pathlib(self):

@jbrockmendel
Copy link
Member Author

I think we'd rather have a length-1 MultiIndex

That seems to do the trick, thanks.

@jbrockmendel
Copy link
Member Author

ping, whatsnew added

@TomAugspurger
Copy link
Contributor

This changes the behavior of idx[:, None] to return an ndarray instead of an invalid Index, needed to keep matplotlib tests working.

Do we have tests that ensure index[2darray] and idx[:, None] returns an ndarray? Did we decide on that as a desirable API, or is that still under discussion?

@jbrockmendel
Copy link
Member Author

Did we decide on that as a desirable API, or is that still under discussion?

It's definitely been decided on for DatetimeIndex, which does that ATM. For the rest, I think this is #27837.

@TomAugspurger
Copy link
Contributor

It seems like in #27837 the preference of most was to deprecate indexing with 2-D indexers. I'm concerned about changing the behavior here to return an ndarray, only to deprecate it in the future.

I suppose that validating we don't have 2D data necessitates the change to Index.__getitem__ to return an ndarray?

@jbrockmendel
Copy link
Member Author

jbrockmendel commented Jan 2, 2020 via email

@TomAugspurger
Copy link
Contributor

That's also an option. @jreback @jorisvandenbossche thoughts on that?

I'm slightly OK with breaking API to return an ndarray rather than an Index here, since the behavior on master is so clearly broken. But I also don't want to establish Index.__getitem__ returning an ndarray as a documented / supported behavior.

@jreback
Copy link
Contributor

jreback commented Jan 5, 2020

re-reading the original, I think we are all in favor of deprecate & raise. This PR now raises for construction with a 2D. I don't see any behavior change on getitem, so that still returns an ndarray?

@jbrockmendel
Copy link
Member Author

I don't see any behavior change on getitem, so that still returns an ndarray?

This does change Index.__getitem__ to return ndarray[ndim=2] instead of constructing an invalid Index.

@jreback
Copy link
Contributor

jreback commented Jan 5, 2020

I don't see any behavior change on getitem, so that still returns an ndarray?

This does change Index.__getitem__ to return ndarray[ndim=2] instead of constructing an invalid Index.

ok and tests are added for this? or changed ones?

@jbrockmendel
Copy link
Member Author

ok and tests are added for this? or changed ones?

Yes, in test_indexing tests are updated to check for the expected exception when indexing with non-1D.

@jorisvandenbossche
Copy link
Member

I am personally fine with the 2D Index -> ndarray change in getitem now, if that makes it easier to fix the Index construction from 2D data bug (in #27837 (comment), I also mentioned such a change probably has not too much impact, at least not for the matplotlib use case).
But, since we agree we want to deprecate it, I would directly add such a DeprecationWarning as well while changing it.

@TomAugspurger
Copy link
Contributor

DeprecationWarning or FutureWarning? I don't have a preference.

We'll need a doc note for the API change and the deprecation.

@jorisvandenbossche
Copy link
Member

Let's start with a DeprecationWarning, as this is typically something that will be used by libraries I think. We have plenty of time to change it to FutureWarning before 2.0.

@jreback
Copy link
Contributor

jreback commented Jan 6, 2020

Let's start with a DeprecationWarning, as this is typically something that will be used by libraries I think. We have plenty of time to change it to FutureWarning before 2.0.

sgtm

doc/source/whatsnew/v1.0.0.rst Outdated Show resolved Hide resolved
# Deprecation GH#30588
warnings.warn(
"Support for Index[:, None] is deprecated and will be "
"removed in a future version.",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here about Index[:, None] and recommended alternative as I posted above for the whatsnew

@TomAugspurger
Copy link
Contributor

Planning to do the RC in ~5 hours.

pandas/core/arrays/interval.py Show resolved Hide resolved
@@ -500,8 +500,11 @@ def __getitem__(self, value):

# scalar
if not isinstance(left, ABCIndexClass):
if isna(left):
if is_scalar(left) and isna(left):
Copy link
Member

@jschendel jschendel Jan 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My intention with this block of code was to only handle scalars here. Since self.left and self.right are always indexes, and we're using __getitem__ on them to get left/right, my assumption at the time was that left/right would either always be a scalar or an Index (1d), so not isinstance(left, ABCIndexClass) would imply scalar (I guess could use is_scalar instead of the isinstance).

With this PR it looks like Index.__getitem__ can return a scalar, Index, or ndarray with ndim > 1? With the last case being temporary until we remove this behavior? Or am I omitting a case where something else could be returned? If these are the only three cases, could we handle the ndim > 1 case separately before this if block?

Something like:

left = self.left[value]
right = self.right[value]

# TODO: remove this block when Index.__getitem__ returning ndim > 1 is deprecated
if np.ndim(left) > 1:
    # GH#30588 multi-dimensional indexer disallowed
    raise ValueError(...)

# scalar
if not isinstance(left, ABCIndexClass):
    ....

Makes the logic less nested and easier to remove when we go through with the deprecation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated so that this raises ValueError for >1dim indexing on IntervalArray

@jbrockmendel
Copy link
Member Author

@TomAugspurger I have to head out in about half an hour, which IIRC is about when you wanted to cut the RC. Anything else to do here? (assuming Travis finished green)

@TomAugspurger
Copy link
Contributor

I think we're good. Will let it sit until just before I tag the rc, in case @jreback has a chance to look. But I think we can do followups after the RC.

@jorisvandenbossche
Copy link
Member

Taking a look

)
result = self._codes[key]
if result.ndim > 1:
return result
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should also raise a warning?

(but adding the warning is not a blocker for the RC, since that's a future deprecation, as long as the behaviour change is already there)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, we are returning plain integer codes here, which doesn't look good.

And checked with 0.25, and there multi-dim indexing on categorical/categoricalindex basically fails (it "returns", but whenever you do something it gives an error; even converting to the result to a numpy array (the original matplotlib use case for the 2d indexing) fails).
So I would also raise an error here, like you did for some other arrays.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed this should raise to match Interval. Will push an update.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not straightforward, as actually this already gets catched before in the boolean checking ..

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I would be fine with doing a clean-up for my comments in a follow-up PR)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is a bit tricky. Will look a bit longer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this raised an error in 0.25, it raises a different (but wrong) error now, so I think fine to do in a follow-up

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm very confused by what's going on with with Series.__getitem__ here. Are we OK with changing this for categorical after the RC?

def test_getitem_2d_deprecated(self):
# GH#30588 multi-dim indexing is deprecated, but raising is also acceptable
idx = self.create_index()
with pytest.raises(ValueError, match="cannot mask with array containing NA"):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't test wrong error messages ...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may be a bug in how we're handling tuples. We convert the tuple (slice(None), None) to an ndarray

array([slice(None, None, None), None], dtype=object)

which gets interpreteded as a boolean mask with missing values, and so raise.

I'm going to update the test to have a 2d ndarray as the indexer, nad open an issue for the bug.

idx = self.create_index()
with pytest.raises(ValueError, match="multi-dimensional indexing not allowed"):
with tm.assert_produces_warning(DeprecationWarning, check_stacklevel=False):
idx[:, None]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this only raise the error? (while trying out this branch, for me it also does)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep in mind, it's a deprecation warning and may not be visible.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, that's it

In [11]: warnings.simplefilter("always", DeprecationWarning)

In [12]: idx[np.array([[0, 1]])]
/Users/taugspurger/sandbox/pandas/pandas/core/arrays/interval.py:498: DeprecationWarning: Support for multi-dimensional indexing (e.g. `index[:, None]`) on an Index is deprecated and will be removed in a future version.  Convert to a numpy array before indexing instead.
  left = self.left[value]
/Users/taugspurger/sandbox/pandas/pandas/core/arrays/interval.py:499: DeprecationWarning: Support for multi-dimensional indexing (e.g. `index[:, None]`) on an Index is deprecated and will be removed in a future version.  Convert to a numpy array before indexing instead.
  right = self.right[value]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-12-bd6337c200ee> in <module>
----> 1 idx[np.array([[0, 1]])]

~/sandbox/pandas/pandas/core/indexes/interval.py in __getitem__(self, value)
    990
    991     def __getitem__(self, value):
--> 992         result = self._data[value]
    993         if isinstance(result, IntervalArray):
    994             return self._shallow_copy(result)

~/sandbox/pandas/pandas/core/arrays/interval.py in __getitem__(self, value)
    505             if np.ndim(left) > 1:
    506                 # GH#30588 multi-dimensional indexer disallowed
--> 507                 raise ValueError("multi-dimensional indexing not allowed")
    508             return Interval(left, right, self.closed)
    509

ValueError: multi-dimensional indexing not allowed

I think that's OK for now, but can fix up later if desired.

idxr[nd3]
else:
with pytest.raises(ValueError, match=msg):
with tm.assert_produces_warning(DeprecationWarning):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, is it needed to have both contexts?

@@ -66,11 +66,10 @@ def test_registering_no_warning(self):

# Set to the "warn" state, in case this isn't the first test run
register_matplotlib_converters()
with tm.assert_produces_warning(None) as w:
with tm.assert_produces_warning(DeprecationWarning, check_stacklevel=False):
# GH#30588 DeprecationWarning from 2D indexing
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we shouldn't assert here, but just ignore warnings. When matplotlib fixes this (hopefully they will do soon), this test would otherwise fail

(can be fixed later)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Working on a fix on the mpl side now.

@jorisvandenbossche jorisvandenbossche changed the title BUG: validate Index data is 1D BUG: validate Index data is 1D + deprecate multi-dim indexing Jan 9, 2020
@jbrockmendel
Copy link
Member Author

off topic: running the tests has started leaving behind a bunch of gibberish.pickle files. is that just me?

@TomAugspurger
Copy link
Contributor

So I think the only potential behavior change is for categorical at #30588 (comment). We might want that to raise for non 1-d indexers, but I can't figure things out. I'm comfortable changing that after the RC, so I'm planning to merge this in a bit (30 minutes or so).

@jbrockmendel
Copy link
Member Author

jbrockmendel commented Jan 9, 2020 via email

@jorisvandenbossche
Copy link
Member

To be clear, the 2D indexing case on categorical is already completely broken in pandas 0.25.0 as well, so I think the (wrong) error now is fine. We can fix the error message after the RC.

@TomAugspurger TomAugspurger merged commit 13858f6 into pandas-dev:master Jan 9, 2020
@TomAugspurger
Copy link
Contributor

Thanks! Opening followups now.

@jbrockmendel jbrockmendel deleted the bug-idx-ndim branch January 10, 2020 03:53
keechongtan added a commit to keechongtan/pandas that referenced this pull request Jan 13, 2020
…ndexing-1row-df

* upstream/master: (284 commits)
  CLN: leftover ix checks (pandas-dev#30951)
  CI: numpydev changed double to single quote (pandas-dev#30952)
  DOC: Fix whatsnew contributors section (pandas-dev#30926)
  STY: wrong placed space in strings (pandas-dev#30940)
  TYP: type up parts of series.py (pandas-dev#30761)
  DOC: Fix SS03 docstring error (pandas-dev#30939)
  CLN: remove unnecesary _date_check_type (pandas-dev#30932)
  DOC: Fixture docs in pandas/conftest.py (pandas-dev#30917)
  CLN: F-strings (pandas-dev#30916)
  replace syntax with f-string (pandas-dev#30919)
  BUG: pickle files left behind by tm.round_trip_pickle (pandas-dev#30906)
  TYP: offsets (pandas-dev#30897)
  TYP: typing annotations (pandas-dev#30901)
  WEB: Remove from roadmap moving the docstring script (pandas-dev#30893)
  WEB: Removing Discourse links (pandas-dev#30890)
  DOC: Encourage use of pre-commit in the docs (pandas-dev#30864)
  DEPR: fix missing stacklevel in pandas.core.index deprecation (pandas-dev#30878)
  CLN: remove unnecessary overriding in subclasses (pandas-dev#30875)
  RLS: 1.0.0rc0
  BUG: validate Index data is 1D + deprecate multi-dim indexing (pandas-dev#30588)
  ...

# Conflicts:
#	doc/source/whatsnew/v1.0.0.rst
@cishwarya
Copy link

cishwarya commented Apr 5, 2023

Hi @jbrockmendel @TomAugspurger @jreback , I have a pandas dataframe with the following 4 columns
ds | yhat | yhat_lower | yhat_upper

2023-01-28 | 144.54255 | 130.96604 | 157.49898

This is the result of predicted dataframe (called forecast) using Prophet time series model.
I am trying to plot the dataframe with the following command
fig1 = model.plot(forecast)
This doesn't work anymore and throws the error Multi-dimensional indexing (e.g. obj[:, None]) is no longer supported. Convert to a numpy array before indexing instead.
I converted my dataframe to numpy array
forecast_np =forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].to_numpy()
but still get the error
IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices
Can you please help here?
Thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Constructors Series/DataFrame/Index/pd.array Constructors Index Related to the Index class or subclasses
Projects
None yet
7 participants