Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: three argument version of where #1496

Merged
merged 5 commits into from
Aug 8, 2017
Merged

Conversation

shoyer
Copy link
Member

@shoyer shoyer commented Jul 29, 2017

Example usage:

>>> a.where(a.x + a.y < 5, -1)
<xarray.DataArray (x: 5, y: 5)>
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8, -1],
       [10, 11, 12, -1, -1],
       [15, 16, -1, -1, -1],
       [20, -1, -1, -1, -1]])
Dimensions without coordinates: x, y
  • Closes define fill value for where #576
  • Tests added / passed
  • Passes git diff upstream/master | flake8 --diff
  • Fully documented, including whats-new.rst for all changes and api.rst for new API

CC @MaximilianR

@shoyer
Copy link
Member Author

shoyer commented Jul 30, 2017

One design decision here is how to handle alignment. The current version of where does an inner join between self and cond, but that wasn't carefully thought through -- more a side effect of making using of the binary arithmetic machinery.

I don't like this behavior for the three argument version of where, because it means that the indexes of other could remove non-NA values from the result, even if the condition is always True. This goes against one of our general principles, which is not aligning away non-missing values.

We can't use an outer join because we don't know how to handle NaNs in cond. We have a couple of other options:

  1. We could require exact alignment for the three argument signature. This is simple and we already have the machinery.
  2. We could require exact alignment to cond, but do some sort of more flexible alignment (e.g., an outer join) for self and other. This would require (at least) two calls to align() when other is provided, e.g., possibly
self, other = align(self, other, join='outer')
self, other, cond = align(self, other, cond, join='inner')
# still need to handle aligning data variables for Dataset objects

I am inclined to require exact alignment (i.e., join='exact') when three arguments are provided, because I know how to implement it correctly, and it still preserves the option of switching to more flexible alignment in the future.

@spencerahill
Copy link
Contributor

How difficult would it be to include np.where's option to provide values for both where the condition is met and where it isn't? From their docstring:

If both x and y are specified, the output array contains elements of x where condition is True, and elements from y elsewhere.

From your example above (haven't gone through the code), what you have implemented in this PR is a special case, namely the xarray analog to np.where(a.x + a.y < 5, a, -1).

I recently had a usecase where this would be handy.

@shoyer
Copy link
Member Author

shoyer commented Jul 30, 2017

How difficult would it be to include np.where's option to provide values for both where the condition is met and where it isn't?

That works, too, e.g.,

In [6]: a.where(a.x + a.y < 5, -a)
Out[6]:
<xarray.DataArray (x: 5, y: 5)>
array([[  0,   1,   2,   3,   4],
       [  5,   6,   7,   8,  -9],
       [ 10,  11,  12, -13, -14],
       [ 15,  16, -17, -18, -19],
       [ 20, -21, -22, -23, -24]])
Dimensions without coordinates: x, y

You can even put Dataset objects in any of the arguments and it should broadcast across variables.

One annoyance is that instead of where(cond, x, y), you need to write this in the slightly counter-intuitive form x.where(cond, y). This is consistent with pandas, but we could also consider exposing the where(cond, x, y) version publicly as a function.

@spencerahill
Copy link
Contributor

Thanks @shoyer. I guess what I had in mind is the case where both x and y are scalars, while cond is still a condition on a. In that case you couldn't do x.where(cond, y); it would require either a.where(cond, x, y) or where(cond, x, y) being supported. Am I understanding that correctly? (If I'm not being clear, consider a concrete case by plugging in e.g. x=-2, y=0, and cond=(a.x + a.y < 5).)

a.where(cond, x, y) might seem odd, since it doesn't actually retain any of a's values, but then it could retain coordinates and attributes, so it might still be useful. And this differs from where(cond, x, y), which it seems would retain cond's coords and attrs.

@shoyer
Copy link
Member Author

shoyer commented Jul 31, 2017

@spencerahill It sounds like we should just add the function, too, so you can write xr.where(cond, -2, 0). It's a only a few more lines of code.

@max-sixty
Copy link
Collaborator

This is excellent! Thank you @shoyer .

I agree with forcing exact alignment for the 3 arg version. Does exact allow a scalar on a dimension? That's then simple and unambiguous to broadcast.

@shoyer
Copy link
Member Author

shoyer commented Aug 1, 2017

Does exact allow a scalar on a dimension? That's then simple and unambiguous to broadcast.

I'm not quite sure what you mean by this, can you explain?

I think the answer is probably no (but perhaps we could change that)

@max-sixty
Copy link
Collaborator

Not a clear comment. I had meant for this to work, which it does:

In [1]: import xarray as xr

In [2]: import numpy as np

In [3]: a = xr.DataArray(np.random.rand(3,4), dims=['x','y'])

In [4]: b = xr.DataArray(np.random.rand(3,1), dims=['x','y'])

In [5]: a+b.squeeze()
Out[5]:
<xarray.DataArray (x: 3, y: 4)>
array([[ 1.394345,  0.916842,  0.806284,  1.604707],
       [ 1.011313,  0.736347,  0.677679,  0.970856],
       [ 0.477433,  0.825672,  1.014959,  0.495829]])
Dimensions without coordinates: x, y

In [6]: a.where(a>0.5, b.squeeze())
Out[6]:
<xarray.DataArray (x: 3, y: 4)>
array([[ 0.676462,  0.717883,  0.717883,  0.886823],
       [ 0.568829,  0.442484,  0.442484,  0.528372],
       [ 0.100403,  0.725269,  0.914556,  0.100403]])
Dimensions without coordinates: x, y

I had meant: 'exact' includes other having a subset of dimensions, where the missing dimensions are easy to broadcast out

@shoyer
Copy link
Member Author

shoyer commented Aug 1, 2017

I had meant: 'exact' includes other having a subset of dimensions, where the missing dimensions are easy to broadcast out

Yes, that works :). Exact only requires that common dimensions have the same lengths/labels, not that all arguments have the same dimensions.

Copy link
Member

@jhamman jhamman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shoyer - This is looking very clean. I actually had a branch working on the same thing but this is a far better way to do this. My only comment is that I think we do want the where function to be exposed as part of this PR so let's do that before merging.

@@ -31,6 +31,9 @@ Enhancements

- Speed-up (x 100) of :py:func:`~xarray.conventions.decode_cf_datetime`.
By `Christian Chwala <https://github.com/cchwala>`_.
- :py:meth:`~xarray.Dataset.where` now supports the ``other`` argument, for
filling with a value other than ``NaN`` (:issue:`576`).
By `Stephan Hoyer <https://github.com/shoyer>`_.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also comment about xr.where function.

@jhamman jhamman modified the milestone: 0.10 Aug 4, 2017
Copy link
Member

@jhamman jhamman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MaximilianR and @spencerahill - any final comments. This is looking pretty good to me.

def raises_regex(error, pattern):
with pytest.raises(error) as excinfo:
yield
excinfo.match(pattern)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recall now that the match method was only added pytest 3.0. I think there's a simple work around for earlier versions of pytest, so I'm inclined to switch to that rather than require newer pytest for the test suite.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to play devils advocate, 3.0 has been out for a year now. For us developers, its trivial to get the current release via conda/pip... so I'm not sure we really need to worry about maintaining backward compatibility.

Obviously, I won't fight you if you think we should handle version 2 as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'm not opposed to requiring pytest 3 when we have something that really needs it. But the work around here is barely an inconvenience.

@spencerahill
Copy link
Contributor

any final comments

@jhamman none; thanks @shoyer for including the function-version of where

@jhamman
Copy link
Member

jhamman commented Aug 8, 2017

LGTM. Merge when ready.

@shoyer shoyer merged commit f9464fd into pydata:master Aug 8, 2017
@shoyer shoyer deleted the where-other branch August 8, 2017 17:00
fujiisoup pushed a commit to fujiisoup/xarray that referenced this pull request Aug 22, 2017
* ENH: three argument version of where

Fixes GH576

* Docstring fixup

* Use join=exact for three argument where

* Add where function

* Don't require pytest 3 for raises_regex
@jhamman jhamman mentioned this pull request Sep 7, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants