
Boolean indexing with multi-dimensional key arrays #1887

Open
shoyer opened this issue Feb 4, 2018 · 13 comments

@shoyer
Member

shoyer commented Feb 4, 2018

Originally from #974

For boolean indexing:

  • da[key], where key is a boolean labelled array (with any number of dimensions), is made equivalent to da.where(key.reindex_like(da), drop=True). This matches the existing behavior when key is a 1D boolean array. For multi-dimensional keys, even though the result is now multi-dimensional, this coupled with automatic skipping of NaNs means that da[key].mean() gives the same result as in NumPy.
  • da[key] = value where key is a boolean labelled array can be made equivalent to da = da.where(*align(key.reindex_like(da), value.reindex_like(da))) (that is, the three argument form of where).
  • da[key_0, ..., key_n] where all of key_i are boolean arrays gets handled in the usual way. It is an IndexingError to supply multiple labelled keys if any of them are not already aligned with the corresponding index coordinates (and do not share the same dimension name). If users want alignment, we suggest they simply write da[key_0 & ... & key_n].
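Sketched with today's API, the first bullet's equivalence would look like the following (a hypothetical sketch; da and key are made-up toy data, not from the proposal itself):

```python
import numpy as np
import xarray as xr

# Toy data standing in for "da" and a multi-dimensional boolean "key".
da = xr.DataArray(np.arange(12).reshape(3, 4), dims=["a", "b"])
key = da % 3 == 0

# The proposed meaning of da[key]: mask, dropping all-False slices.
result = da.where(key, drop=True)

# Because NaNs are skipped automatically, aggregations match NumPy's
# boolean indexing even though the xarray result is multi-dimensional.
numpy_mean = da.values[key.values].mean()
xarray_mean = result.mean().item()
```

Here nothing is actually dropped (every row and column contains at least one True), so result keeps the (3, 4) shape with NaN where key is False, and both means agree.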
@Hoeze

Hoeze commented Oct 21, 2019

Since #3206 has been implemented now:
Maybe fancy boolean indexing (da[boolean_mask]) could return a sparse array as well.

@shaprann

shaprann commented Dec 14, 2020

Just wanted to confirm that boolean indexing is indeed highly relevant, especially for assigning values rather than just selecting them. Here is a use case I encounter very often:

I'm working with very sparse data (e.g. a satellite image of some islands surrounded by water), and I want to modify it using some_vectorized_function(). Of course I could apply some_vectorized_function() to the whole image, but boolean masking lets me save a lot of computation.

Here is how I would achieve this in numpy:

import numpy as np
import some_vectorized_function

image = np.array(                                          # image.shape == (3, 7, 7)
    [[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
      [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
      [0.0, 454, 454, 0.0, 0.0, 0.0, 0.0],
      [0.0, 0.0, 565, 0.0, 0.0, 0.0, 0.0],
      [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
      [0.0, 0.0, 0.0, 0.0, 0.0, 343, 0.0],
      [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]],
    
     [[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
      [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
      [0.0, 454, 565, 0.0, 0.0, 0.0, 0.0],
      [0.0, 0.0, 667, 0.0, 0.0, 0.0, 0.0],
      [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
      [0.0, 0.0, 0.0, 0.0, 0.0, 878, 0.0],
      [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]],
    
     [[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
      [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
      [0.0, 565, 676, 0.0, 0.0, 0.0, 0.0],
      [0.0, 0.0, 323, 0.0, 0.0, 0.0, 0.0],
      [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
      [0.0, 0.0, 0.0, 0.0, 0.0, 545, 0.0],
      [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]]]
)
image = np.moveaxis(image, 0, -1)                          # image.shape == (7, 7, 3)


# "image" is a standard RGB image
# with shape == (height, width, channel)
# but only 4 pixels contain relevant data!


mask = np.all(image > 0, axis=-1)                          # mask.shape == (7, 7)
                                                           # mask.dtype == bool
                                                           # mask.sum() == 4

image[mask] = some_vectorized_function(image[mask])        # len(image[mask]) == 4
                                                           # image[mask].shape == (4, 3)

The most important fact here is that image[mask] is just a list of 4 pixels, which I can process and then assign back into their original places. And as you can see, this boolean masking also plays very nicely with broadcasting, which allows me to mask a 3D array with a 2D mask.

Unfortunately, nothing like this is currently possible with xarray. If implemented, it would enable some crazy speedups for operations like spatial interpolation, where we don't want to interpolate the whole image, only the pixels we care about.
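For comparison, the closest idiom available in xarray today applies the function to the whole array and then merges with xr.where, which is exactly the wasted computation described above (double is a made-up stand-in for some_vectorized_function):

```python
import numpy as np
import xarray as xr

def double(x):
    # Stand-in for some_vectorized_function.
    return x * 2

image = xr.DataArray(np.arange(12.0).reshape(3, 4), dims=["y", "x"])
mask = image > 5  # boolean DataArray aligned with image

# Workaround: compute everywhere, then keep new values only under the mask.
updated = xr.where(mask, double(image), image)
```

This keeps the full (3, 4) shape; values above 5 are doubled and everything else passes through unchanged, but double ran on all 12 pixels rather than just the masked ones.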

@max-sixty
Collaborator

I've added the "good first issue" label — at least the first two bullets of the proposal would be relatively simple to implement, given they're mostly syntactic sugar.

@shoyer
Member Author

shoyer commented Apr 20, 2021

It's worth noting that there is at least one other way boolean indexing could work:

  • ds[key] could work like ds.stack({key.name: key.dims}).isel({key.name: np.flatnonzero(key.data)}), except without creating a MultiIndex. Arguably this might be more useful and also more consistent with NumPy itself. It's also more similar to the operation @Hoeze wants in #5179 (N-dimensional boolean indexing).
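Spelled out with existing methods (and a named key, which becomes relevant later in the thread), the stack-based semantics would be:

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(12).reshape(3, 4), dims=["a", "b"])
key = (da % 3 == 0).rename("key")  # the key needs a name for the new dim

# The stack-based reading of da[key]: flatten, then pick the True positions.
result = da.stack({key.name: key.dims}).isel({key.name: np.flatnonzero(key.data)})
```

This returns a flat, length-4 array along a new "key" dimension, much like NumPy's boolean indexing (the sketch still creates a MultiIndex, which the proposal would skip).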

We can't support both with the same syntax, so we have to make a choice here :).

See also the discussion about what drop_duplicates/unique should do over in #5089.

@max-sixty
Collaborator

I've been trying to conceptualize why I think the where equivalence (the original proposal) is better than the stack proposal. I think it's mostly:

  • It's simpler
  • I'm not sure how the setitem would work; da[key] = value?
  • If someone wants the stack result, it's less work to do original -> where result -> stack result relative to original -> stack result -> where result; which suggests they're more composable?

But I don't do much pointwise indexing, so maybe we do want to prioritize that.

@shoyer
Member Author

shoyer commented Apr 21, 2021

I've been trying to conceptualize why I think the where equivalence (the original proposal) is better than the stack proposal.

Here are two reasons why I like the stack version:

  1. It's more NumPy-like -- boolean indexing in NumPy returns a flat array in the same way.
  2. It doesn't need dtype promotion to handle possibly missing values, so it will have more predictable semantics.
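Point 2 is easy to demonstrate: the where-style result must hold NaN for the masked-out positions, so integer data gets promoted (a minimal sketch):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(4, dtype=np.int64), dims="x")

# The where-style result needs NaN where the condition is False,
# which forces promotion from int64 to float64.
masked = da.where(da > 1)
```

The stack-style result contains only the selected values, so it could keep the original integer dtype.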

As a side note: one nice feature of using isel() for stacking is that it does not create a MultiIndex, which can be expensive. But there's no reason we necessarily need to create one in stack() either. I'll open a new issue to discuss adding an optional parameter.

  • I'm not sure how the setitem would work; da[key] = value?

To match the semantics of NumPy, value would need to have matching dims/coords to those of da[key]. In other words, it would also need to be stacked.

  • If someone wants the stack result, it's less work to do original -> where result -> stack result relative to original -> stack result -> where result; which suggests they're more composable?

I'm not quite sure this is true -- it's the difference between needing to call stack() vs unstack().

@max-sixty
Collaborator

max-sixty commented Apr 22, 2021

OK great. To confirm, this is what it would look like:

Context:

In [81]: da = xr.DataArray(np.arange(12).reshape(3,4), dims=list('ab'))

In [82]: da
Out[82]:
<xarray.DataArray (a: 3, b: 4)>
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
Dimensions without coordinates: a, b

In [84]: key = da % 3 == 0

In [83]: key
Out[83]:
<xarray.DataArray (a: 3, b: 4)>
array([[ True, False, False,  True],
       [False, False,  True, False],
       [False,  True, False, False]])
Dimensions without coordinates: a, b

Currently

In [85]: da[key]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-85-7fd83c907cb6> in <module>
----> 1 da[key]
...
~/.asdf/installs/python/3.8.8/lib/python3.8/site-packages/xarray/core/variable.py in _validate_indexers(self, key)
    697                         )
    698                     if k.ndim > 1:
--> 699                         raise IndexError(
    700                             "{}-dimensional boolean indexing is "
    701                             "not supported. ".format(k.ndim)

IndexError: 2-dimensional boolean indexing is not supported.

Current proposal ("stack"), for da[key], with a new dimension named after key (and probably no MultiIndex):

In [86]: da.values[key.values]
Out[86]: array([0, 3, 6, 9])   # But the xarray version

Previous suggestion ("where"), for the result of da[key]:

In [87]: da.where(key)
Out[87]:
<xarray.DataArray (a: 3, b: 4)>
array([[ 0., nan, nan,  3.],
       [nan, nan,  6., nan],
       [nan,  9., nan, nan]])
Dimensions without coordinates: a, b

(small follow up I'll put in another message, for clarity)

@max-sixty
Collaborator

I'm not quite sure this is true -- it's the difference between needing to call stack() vs unstack().

This was a tiny point so it's fine to discard. I had meant that producing the where result via the stack result requires a stack and an unstack, while producing the stack result via the where result requires only one stack -- the where result is very cheap.

@shoyer
Member Author

shoyer commented Apr 22, 2021

OK great. To confirm, this is what it would look like:

Yes, this looks right to me.

@shoyer
Member Author

shoyer commented Apr 22, 2021

Current proposal ("stack"), for da[key], with a new dimension named after key (and probably no MultiIndex):

In [86]: da.values[key.values]
Out[86]: array([0, 3, 6, 9])   # But the xarray version

The most annoying part of this new proposal is that the key needs a name, which we can use to name the new dimension. That's not too hard to do, but it is a little annoying -- in practice you would often have to write something like da[key.rename('key_name')] to make this work.

@shoyer shoyer mentioned this issue Apr 22, 2021
@max-sixty
Collaborator

max-sixty commented Apr 22, 2021

I'm still working through this. Using this to jot down my notes, no need to respond.

One property that seems to be lacking is that if key changes from n-1 to n dimensions, the behavior changes (also outlined here):

In [171]: a
Out[171]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [172]: mask
Out[172]: array([ True, False,  True])

In [173]: a[mask]
Out[173]:
array([[ 0,  1,  2,  3],
       [ 8,  9, 10, 11]])

...as expected, but now let's make a 2D mask...

In [174]: full_mask = np.broadcast_to(mask[:, np.newaxis], (3,4))

In [175]: full_mask
Out[175]:
array([[ True,  True,  True,  True],
       [False, False, False, False],
       [ True,  True,  True,  True]])

In [176]: a[full_mask]
Out[176]: array([ 0,  1,  2,  3,  8,  9, 10, 11])    # flattened!

@Hoeze

Hoeze commented Apr 22, 2021

Current proposal ("stack"), for da[key], with a new dimension named after key (and probably no MultiIndex):

In [86]: da.values[key.values]
Out[86]: array([0, 3, 6, 9])   # But the xarray version

The most annoying part of this new proposal is that the key needs a name, which we can use to name the new dimension. That's not too hard to do, but it is a little annoying -- in practice you would often have to write something like da[key.rename('key_name')] to make this work.

IMO, the perfect solution would be masking support.
That is, da[key] would return the same array with an additional variable da.mask == key:

In [87]: da[key]
Out[87]:
<xarray.DataArray (a: 3, b: 4)>
array([[   0, <NA>, <NA>,    3],
       [<NA>, <NA>,    6, <NA>],
       [<NA>,    9, <NA>, <NA>]])
dtype: int
Dimensions without coordinates: a, b

Then we could have something like da[key].stack(new_dim=["a", "b"], dropna=True):

In [87]: da[key].stack(new_dim=["a", "b"], dropna=True)
Out[87]:
<xarray.DataArray (new_dim: 4)>
array([0, 3, 6, 9])
Coordinates:
    a        (new_dim) int64 0 0 1 2
    b        (new_dim) int64 0 3 2 1

Here, dropna=True would avoid creating the full cross-product of a and b.

Also, that would avoid all those unnecessary float casts for free.
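The closest spelling today goes through where plus stack plus dropna, and it shows both costs this suggestion would avoid: the full a-b cross-product is materialized, and the data is cast to float (a sketch; new_dim is an arbitrary name):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(12).reshape(3, 4), dims=["a", "b"])
key = da % 3 == 0

# Today: mask (casting int -> float), stack all 12 positions, drop the NaNs.
flat = da.where(key).stack(new_dim=["a", "b"]).dropna("new_dim")
```

The values come back as floats even though da is integer-valued, which is the unnecessary cast being discussed.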

@max-sixty
Collaborator

max-sixty commented Apr 22, 2021

stack(new_dim=["a", "b"], dropna=True)

This could be useful (potentially we can open a separate issue). While someone can call .dropna, that coerces to floats (or some other type that supports missing values) and can allocate more than is needed. Potentially this can be considered along with the issues around sparse, e.g. #3245 and #4143.
