Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: DataFrame.where does not handle Series slice correctly (#10218) #10283

Merged
merged 1 commit into from
Aug 26, 2015

Conversation

mortada
Copy link
Contributor

@mortada mortada commented Jun 5, 2015

closes #10218

@jreback jreback changed the title BUG: DataFrame.where does not handle Series slice correctly (fixes #1… BUG: DataFrame.where does not handle Series slice correctly (#10218) Jun 5, 2015
@jreback jreback added Bug Indexing Related to indexing on series/frames, not to indexes themselves labels Jun 5, 2015
@jreback jreback added this to the 0.16.2 milestone Jun 5, 2015
@@ -3439,6 +3439,8 @@ def where(self, cond, other=np.nan, inplace=False, axis=None, level=None,
try_cast=False, raise_on_error=True):

if isinstance(cond, NDFrame):
if isinstance(cond, pd.Series) and isinstance(self, pd.DataFrame):
cond = self._getitem_array(cond)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think would be better to make these align properly, e.g.

cond, _ = cond.align(self)

(which currently will raise)
and it will do the .reindex

then inside of align you can so somethign like that

@jreback
Copy link
Contributor

jreback commented Jun 9, 2015

@mortada if you can have look w.r.t to my comments

@jreback jreback modified the milestones: 0.17.0, 0.16.2 Jun 9, 2015
@mortada
Copy link
Contributor Author

mortada commented Jun 12, 2015

@jreback I dug into align() but I'm not quite sure how to get that to work

do you mean changing align() so that in this particular case, cond, _ = cond.align(self) would turn cond from a Series into a DataFrame?

@jreback
Copy link
Contributor

jreback commented Jun 12, 2015

@mortada I think we need to add a parameter to .align, maybe broadcast=False (so default is the same as before). if broadcast==True, then normalize to the higher of the ndims of the returned objects. kind of like this

@mortada mortada force-pushed the df_where branch 3 times, most recently from d406f7b to 0ee610f Compare June 26, 2015 00:26
@mortada
Copy link
Contributor Author

mortada commented Jun 26, 2015

@jreback added broadcast keyword for align(), please take a look.

@shoyer
Copy link
Member

shoyer commented Jun 26, 2015

Woah, not sure this is a good idea. Pandas currently does broadcasting the other way (like numpy).

I do think this other broadcasting is more useful, but it is an API change that should be discussed more holistically.

@jorisvandenbossche jorisvandenbossche added the Needs Discussion Requires discussion from core team before further action label Jun 26, 2015
@jreback
Copy link
Contributor

jreback commented Jun 26, 2015

@shoyer not sure if u saw my prior comment
I asked @mortada to do it this way as it is rather general
this doesn't change the API at as this is turned off by default
only in very specific circumstances would you need/want this

@shoyer
Copy link
Member

shoyer commented Jun 27, 2015

My concern is that pandas (like NumPy) currently has a Series broadcast against the rows of a DataFrame, similar to the situation for 1D/2D numpy arrays. So I don't think this is a good argument name, because here broadcast=True would imply broadcasting in a special custom way that isn't consistent with the remainder of pandas.

I don't really have a good alternative parameter name here. But I think it would be better simply to add support for df.where(series) then to call the parameter broadcast. Currently where doesn't support any type of broadcasting:

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: df = pd.DataFrame({'x': [0, 1], 'y': [2, 3]})

In [4]: s = pd.Series([True, False])

In [5]: df[s]
Out[5]:
   x  y
0  0  2

In [6]: df.where(s)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-6-537e7e30ec86> in <module>()
----> 1 df.where(s)

/Users/shoyer/miniconda/envs/santa-maria/lib/python2.7/site-packages/pandas/core/generic.pyc in where(self, cond, other, inplace, axis, level, try_cast, raise_on_error)
   3509
   3510         if isinstance(cond, NDFrame):
-> 3511             cond = cond.reindex(**self._construct_axes_dict())
   3512         else:
   3513             if not hasattr(cond, 'shape'):

/Users/shoyer/miniconda/envs/santa-maria/lib/python2.7/site-packages/pandas/core/series.pyc in reindex(self, index, **kwargs)
   2149     @Appender(generic._shared_docs['reindex'] % _shared_doc_kwargs)
   2150     def reindex(self, index=None, **kwargs):
-> 2151         return super(Series, self).reindex(index=index, **kwargs)
   2152
   2153     @Appender(generic._shared_docs['fillna'] % _shared_doc_kwargs)

/Users/shoyer/miniconda/envs/santa-maria/lib/python2.7/site-packages/pandas/core/generic.pyc in reindex(self, *args, **kwargs)
   1750         if kwargs:
   1751             raise TypeError('reindex() got an unexpected keyword '
-> 1752                     'argument "{0}"'.format(list(kwargs.keys())[0]))
   1753
   1754         self._consolidate_inplace()

TypeError: reindex() got an unexpected keyword argument "columns"

@jreback
Copy link
Contributor

jreback commented Jun 27, 2015

@shoyer if you see the above issue. This is exactly what this is try to address. But rather than a specific fix, I think a more general approach is warranted, maybe broadcast_axis might be a better name. I think this kind of behavior (manual broadcasting) should live in .align as that is the purpose; all other routines (including .where simply need call .align to get a compatible shape/axes).

@shoyer
Copy link
Member

shoyer commented Jun 27, 2015

I agree -- broadcast_axis in align seems like a reasonable way to handle this. Not entirely clear how that generalizes to interactions with panels but perhaps that's less important.

@jreback
Copy link
Contributor

jreback commented Jun 27, 2015

@shoyer oh I agree, just wanted to make a starting point and not bury align logic.

@mortada can you rename to broadcast_axis (and you will have to pass it explicity in the .where)

@mortada
Copy link
Contributor Author

mortada commented Jun 28, 2015

@shoyer @jreback renamed to broadcast_axis please take a look

@shoyer
Copy link
Member

shoyer commented Jun 28, 2015

@mortada before you implement this, could we work out the desired semantics for broadcast_axis?

For example, I thought @jreback was suggesting broadcast_axis as an integer (since it has the name axis).

@mortada
Copy link
Contributor Author

mortada commented Jun 28, 2015

@shoyer that's a good point, the name does seem to suggest an int or axis name

@jreback
Copy link
Contributor

jreback commented Jun 28, 2015

yes I meant it should be None by default and be an axis (int/string)

@jreback
Copy link
Contributor

jreback commented Aug 5, 2015

@mortada can you update broadcast_axis to what we talked about above? e.g. an actual axis (int/string)

@max-sixty
Copy link
Contributor

I think this also closes #9558

@jreback
Copy link
Contributor

jreback commented Aug 6, 2015

that looks right, @mortada can you add tests for that as well.

@mortada
Copy link
Contributor Author

mortada commented Aug 6, 2015

@jreback sure will do

@mortada
Copy link
Contributor Author

mortada commented Aug 20, 2015

@jreback one problem with using _shared_docs here is that align() is implemented in generic.py but the code and docstring for it are very DataFrame specific. So we wouldn't really be able to use the keyword substitutions

@jreback
Copy link
Contributor

jreback commented Aug 20, 2015

@mortada ok, then I think what we need to do is make the doc-string in generic.py for > 2 dims
and put one in Series/Frame (if you are really fancy would do this with subsitutions, but whatever works).

@mortada
Copy link
Contributor Author

mortada commented Aug 20, 2015

@jreback sure, I've updated this to use _shared_docs

Filling axis, method and limit
broadcast_axis : %(axes_single_arg)s, default None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a versionadded

@jreback
Copy link
Contributor

jreback commented Aug 21, 2015

pls rebase, ping when green.

@mortada
Copy link
Contributor Author

mortada commented Aug 21, 2015

@jreback rebased and added versionadded to docstring

as for the docstring for Panel it is a bit tricky, because the implementation doesn't actually work with Panels yet.

@jreback
Copy link
Contributor

jreback commented Aug 21, 2015

right, Panel does some other things. Ok, let's just raise a NotImplementedError in Panel then.
I don't think this should break anything.

@shoyer
Copy link
Member

shoyer commented Aug 21, 2015

It seems like broadcast_axis only makes sense for dataframes. Maybe it should be broadcast_axes instead? That would make sense for generalizing to panels.

On Fri, Aug 21, 2015 at 3:31 PM, Jeff Reback notifications@github.com
wrote:

right, Panel does some other things. Ok, let's just raise a NotImplementedError in Panel then.

I don't think this should break anything.

Reply to this email directly or view it on GitHub:
#10283 (comment)

@mortada
Copy link
Contributor Author

mortada commented Aug 22, 2015

@shoyer indeed this is really only for DataFrame. With the broadcast_axes you are proposing, do you mean it is a bool or a list of multiple axes like [1, 2]?

@shoyer
Copy link
Member

shoyer commented Aug 22, 2015

Broadcast_axes would take a pair of tuples (or ints) indicating which axis from each argument to broadcast against the other.

On Sat, Aug 22, 2015 at 3:10 AM, Mortada Mehyar notifications@github.com
wrote:

@shoyer indeed this is really only for DataFrame. With the broadcast_axes you are proposing, do you mean it is a bool or a list of multiple axes like [1, 2]?

Reply to this email directly or view it on GitHub:
#10283 (comment)

@jreback
Copy link
Contributor

jreback commented Aug 22, 2015

@mortada actually I think broadcast_axis is descriptive enough. and evenutally could be made to accept a list of axes (as we do with methods like .apply), but this is not necessary for now. When the panel fix is in, ping.

@shoyer
Copy link
Member

shoyer commented Aug 22, 2015

What does the API look like in the more general form? As much as I like this flexibility, it's not clear to me that broadcast_axis=1 will make sense in the broader context of panels. It's particularly ambiguous about which argument it refers to.

On Sat, Aug 22, 2015 at 5:40 PM, Jeff Reback notifications@github.com
wrote:

@mortada actually I think broadcast_axis is descriptive enough. and evenutally could be made to accept a list of axes (as we do with methods like .apply), but this is not necessary for now. When the panel fix is in, ping.

Reply to this email directly or view it on GitHub:
#10283 (comment)

@jreback
Copy link
Contributor

jreback commented Aug 22, 2015

@mortada I forgot why we are not using axis in this case? is that ambiguous (to have it auto-broadcast)?

@mortada
Copy link
Contributor Author

mortada commented Aug 23, 2015

@jreback we can't use axis in this case because as far as I understand axis is the axis to align the two objects on.

the docstring for it actually sounds weird and it perhaps meant to say "aligned" not "allowed":

axis : allowed axis of the other object, default None
    Align on index (0), columns (1), or both (None)

as a quick example

In [2]: df1 = pd.DataFrame(np.random.randn(2, 2), index=[1,2], columns=['a', 'b'])

In [3]: df2 = pd.DataFrame(np.random.randn(2, 2), index=[2,3], columns=['b', 'c'])

In [4]: df1
Out[4]:
          a         b
1 -0.864765 -0.293754
2  0.932535 -0.399291

In [5] df2
Out[5]:
          b         c
2 -0.481231 -0.941903
3  1.169895 -0.116977

if axis=None it aligns both axes

In [7]: left, right = df1.align(df2)

In [8]: left
Out[8]:
          a         b   c
1 -0.864765 -0.293754 NaN
2  0.932535 -0.399291 NaN
3       NaN       NaN NaN

In [9]: right
Out[9]:
    a         b         c
1 NaN       NaN       NaN
2 NaN -0.481231 -0.941903
3 NaN  1.169895 -0.116977

otherwise if axis=0

In [10]: left, right = df1.align(df2, axis=0)

In [11]: left
Out[11]:
          a         b
1 -0.864765 -0.293754
2  0.932535 -0.399291
3       NaN       NaN

In [12]: right
Out[12]:
          b         c
1       NaN       NaN
2 -0.481231 -0.941903
3  1.169895 -0.116977

if axis=1

In [13]: left, right = df1.align(df2, axis=1)

In [14]: left
Out[14]:
          a         b   c
1 -0.864765 -0.293754 NaN
2  0.932535 -0.399291 NaN

In [15]: right
Out[15]:
    a         b         c
2 NaN -0.481231 -0.941903
3 NaN  1.169895 -0.116977

@mortada
Copy link
Contributor Author

mortada commented Aug 23, 2015

And the new parameter, whether we call it broadcast_axis/axes or some other name, is really about whether to broadcast / expand the lower-dimensional object into larger dimensions when we are trying to align a lower-dimensional object with a higher-dimensional object.

Since the highest dimension align() currently supports is 2, this new parameter is effectively a bool for whether to broadcast or not. IMHO maybe it'd be a good idea to have a bool even when align() starts supporting Panel and higher, the benefits would be

  1. API would be cleaner, there's already a bunch of axis related parameters
  2. In theory it's less flexible because align() would either broadcast all remaining axes or not at all. But one can easily slice out the axes from the other object before aligning. E.g. if one is aligning a Series with a Panel and somehow wants a DataFrame back, he/she can always slice out the desired two dimensions from the Panel first.

@mortada
Copy link
Contributor Author

mortada commented Aug 23, 2015

by the way just updated to raise NotImplementedError for Panel

@@ -628,6 +628,9 @@ def _needs_reindex_multi(self, axes, method, level):
""" don't allow a multi reindex on Panel or above ndim """
return False

def align(self, other):
raise NotImplementedError

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have this take **kwargs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure will update

jreback added a commit that referenced this pull request Aug 26, 2015
BUG: DataFrame.where does not handle Series slice correctly (#10218)
@jreback jreback merged commit b55ca5c into pandas-dev:master Aug 26, 2015
@jreback
Copy link
Contributor

jreback commented Aug 26, 2015

thanks @mortada

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Indexing Related to indexing on series/frames, not to indexes themselves Needs Discussion Requires discussion from core team before further action
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DataFrame#where broken in latest 0.16.1 release
5 participants