
N-dimensional boolean indexing #5179

Open

Hoeze opened this issue Apr 17, 2021 · 6 comments

Hoeze commented Apr 17, 2021

Currently, the docs state that boolean indexing is only possible with 1-dimensional arrays:
http://xarray.pydata.org/en/stable/indexing.html

However, I often need to convert a subset of an xarray Dataset to a DataFrame.
Usually, I would do something like:

data = xrds.stack(observations=["dim1", "dim2", "dim3"])
data = data.isel(observations=~data.missing)
df = data.to_dataframe()

However, this approach is very slow and memory-hungry, since stack() builds a MultiIndex over every possible coordinate combination in the array.

Describe the solution you'd like
A better approach would be to directly allow index selection with the boolean array:

data = xrds.isel(~ xrds.missing, dim="observations")
df = data.to_dataframe()

This way, it is possible to

  1. identify the resulting coordinates with np.argwhere(), and
  2. directly use the underlying array for fancy indexing: variable.data[mask] (see the sketch below)
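For illustration, here is a minimal sketch of that idea (not the proposed xarray API; the dataset, dims, and the "observations" name are made up), using np.argwhere() plus xarray's existing vectorized (pointwise) indexing to flatten the True positions of an N-dimensional mask into one new dimension:

import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"value": (("dim1", "dim2"), np.arange(6).reshape(2, 3))},
    coords={"dim1": ["a", "b"], "dim2": [10, 20, 30]},
)
mask = ds["value"] % 2 == 0  # boolean condition over both dims

# 1. identify the resulting coordinates with np.argwhere()
locs = np.argwhere(mask.values)  # shape: (n_true, mask.ndim)

# 2. pointwise indexing along a new "observations" dimension
indexers = {
    dim: xr.DataArray(locs[:, axis], dims="observations")
    for axis, dim in enumerate(mask.dims)
}
flat = ds.isel(**indexers)  # one entry per True element of the mask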

Additional context
I created a proof-of-concept that works for my projects:
https://gist.github.com/Hoeze/c746ea1e5fef40d99997f765c48d3c0d
The most important lines are these:

def core_dim_locs_from_cond(cond, new_dim_name, core_dims=None) -> List[Tuple[str, xr.DataArray]]:
    [...]
    core_dim_locs = np.argwhere(cond.data)
    if isinstance(core_dim_locs, dask.array.core.Array):
        core_dim_locs = core_dim_locs.persist().compute_chunk_sizes()

def subset_variable(variable, core_dim_locs, new_dim_name, mask=None):
    [...]
    subset = dask.array.asanyarray(variable.data)[mask]
    # force-set chunk size from known chunks
    chunk_sizes = core_dim_locs[0][1].chunks[0]
    subset._chunks = (chunk_sizes, *subset._chunks[1:])
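The compute_chunk_sizes() call and the forced chunk sizes matter because, with dask, boolean fancy indexing produces an array whose chunk sizes are unknown (NaN). A small self-contained illustration of that behavior (not taken from the gist; the array and values are made up):

import dask.array as da
import numpy as np

arr = da.from_array(np.arange(12).reshape(3, 4), chunks=(1, 4))
mask = arr % 3 == 0

subset = arr[mask]                     # flattened 1-D result, chunk sizes unknown (nan)
subset = subset.compute_chunk_sizes()  # resolve chunk sizes so downstream ops can use them
print(subset.compute())                # [0 3 6 9]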

As a result, I would expect something like this:
[screenshot of the expected flattened result]

@max-sixty (Collaborator)

Thanks for the issue @Hoeze. Multi-dimensional boolean indexing is definitely something we'd like to add.

How does your code differ from the proposals in #1887? From a brief look through the code (thanks for supplying it), I couldn't immediately see why we need a new dimension.

Hoeze (Author) commented Apr 17, 2021

@max-sixty The reason is that my method is basically a special case of point-wise indexing:
http://xarray.pydata.org/en/stable/indexing.html#more-advanced-indexing
You can get the same result by calling:

core_dim_locs = dict(core_dim_locs_from_cond(mask, new_dim_name="newdim"))

# pointwise selection
data.sel(
    dim_0=core_dim_locs["dim_0"],
    dim_1=core_dim_locs["dim_1"],
    dim_2=core_dim_locs["dim_2"],
)

(Note that you lose the chunk information with this method, which is why it is less efficient.)

When you want to select arbitrary items from an N-dimensional array, you can model the result either as some kind of sparse array or by stacking the dimensions.
(Granted, stacking the dimensions effectively amounts to a sparse COO encoding as well...)

@max-sixty (Collaborator)

Ah right, I see now, thanks for explaining.

Allowing pointwise indexing with boolean indexers would also be welcome.

shoyer (Member) commented Apr 20, 2021

I wonder if this is just a better proposal than making N-dimensional boolean indexing an alias for where: #1887 (comment)

shoyer (Member) commented Apr 22, 2021

@max-sixty and I have been having some more discussion about whether this is what ds[key] should do for N-dimensional boolean indexing over in #1887.

But regardless of what we want boolean indexing with [] to do, this would certainly be welcome functionality and should exist in a dedicated method. ds[key] is already very heavily overloaded in Xarray, so a more explicit option is nice to have, e.g., for the benefit of readability and static type checking. For the same reason, I would rather not put it inside isel(), which already does integer-based indexing with a different call signature. My tentative suggestion is to call this new method sel_mask(), since that's what it does: selection like sel/isel, except based on a mask.
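For concreteness, a rough standalone sketch of what such a mask-based selection could look like (sel_mask() does not exist in xarray; this is just an illustrative helper built on the pointwise-indexing approach shown earlier in the thread):

import numpy as np
import xarray as xr

def sel_mask(ds: xr.Dataset, mask: xr.DataArray, new_dim: str = "observations") -> xr.Dataset:
    # select every element where mask is True, flattening the masked
    # dimensions into a single new dimension
    locs = np.argwhere(np.asarray(mask))
    indexers = {
        dim: xr.DataArray(locs[:, axis], dims=new_dim)
        for axis, dim in enumerate(mask.dims)
    }
    return ds.isel(**indexers)

With a helper along these lines, the original example would read roughly df = sel_mask(xrds, ~xrds.missing).to_dataframe().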

Hoeze (Author) commented May 19, 2021

FYI, I updated the boolean indexing gist to support additional or missing dimensions:
https://gist.github.com/Hoeze/96616ef9d179180b0b7de97c97e00a27
I'm using this on a 4D array of >300 GB to flatten three of the four dimensions, and it works even with 64 GB of RAM.
