
N-dimensional boolean indexing #5179

Open

Hoeze opened this issue Apr 17, 2021 · 6 comments

Hoeze commented Apr 17, 2021

Currently, the docs state that boolean indexing is only possible with 1-dimensional arrays:
http://xarray.pydata.org/en/stable/indexing.html

However, I often need to convert a subset of an xarray Dataset to a DataFrame.
Usually, I would do something like:

data = xrds.stack(observations=["dim1", "dim2", "dim3"])
data = data.isel(observations=~data.missing)
df = data.to_dataframe()

However, this approach is very slow and memory-hungry, since stack() builds a MultiIndex over every possible coordinate combination in the array.

Describe the solution you'd like
A better approach would be to directly allow index selection with the boolean array:

data = xrds.isel(~ xrds.missing, dim="observations")
df = data.to_dataframe()

This way, it is possible to

  1. identify the resulting coordinates with np.argwhere(), and
  2. directly use the underlying array for fancy indexing: variable.data[mask] (see the sketch below)
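For illustration, here is a minimal sketch of that idea (not the proposed xarray API; the dataset, dims, and the "observations" name are made up), using np.argwhere() plus xarray's existing vectorized (pointwise) indexing to flatten the True positions of an N-dimensional mask into one new dimension:

import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"value": (("dim1", "dim2"), np.arange(6).reshape(2, 3))},
    coords={"dim1": ["a", "b"], "dim2": [10, 20, 30]},
)
mask = ds["value"] % 2 == 0  # boolean condition over both dims

# 1. identify the resulting coordinates with np.argwhere()
locs = np.argwhere(mask.values)  # shape: (n_true, mask.ndim)

# 2. pointwise indexing along a new "observations" dimension
indexers = {
    dim: xr.DataArray(locs[:, axis], dims="observations")
    for axis, dim in enumerate(mask.dims)
}
flat = ds.isel(**indexers)  # one entry per True element of the mask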

Additional context
I created a proof-of-concept that works for my projects:
https://gist.github.com/Hoeze/c746ea1e5fef40d99997f765c48d3c0d
The most important lines are these:

def core_dim_locs_from_cond(cond, new_dim_name, core_dims=None) -> List[Tuple[str, xr.DataArray]]:
    [...]
    core_dim_locs = np.argwhere(cond.data)
    if isinstance(core_dim_locs, dask.array.core.Array):
        core_dim_locs = core_dim_locs.persist().compute_chunk_sizes()

def subset_variable(variable, core_dim_locs, new_dim_name, mask=None):
    [...]
    subset = dask.array.asanyarray(variable.data)[mask]
    # force-set chunk size from known chunks
    chunk_sizes = core_dim_locs[0][1].chunks[0]
    subset._chunks = (chunk_sizes, *subset._chunks[1:])
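The compute_chunk_sizes() call and the forced chunk sizes matter because, with dask, boolean fancy indexing produces an array whose chunk sizes are unknown (NaN). A small self-contained illustration of that behavior (not taken from the gist; the array and values are made up):

import dask.array as da
import numpy as np

arr = da.from_array(np.arange(12).reshape(3, 4), chunks=(1, 4))
mask = arr % 3 == 0

subset = arr[mask]                     # flattened 1-D result, chunk sizes unknown (nan)
subset = subset.compute_chunk_sizes()  # resolve chunk sizes so downstream ops can use them
print(subset.compute())                # [0 3 6 9]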

As a result, I would expect something like this:
[screenshot of the expected flattened result]

@max-sixty (Collaborator)

Thanks for the issue @Hoeze. Multi-dimensional boolean indexing is definitely something we'd like to add.

How does your code differ from the proposals in #1887? From a brief look through the code (thanks for supplying it), I couldn't immediately see why we need a new dimension.

Hoeze (Author) commented Apr 17, 2021

@max-sixty The reason is that my method is basically a special case of point-wise indexing:
http://xarray.pydata.org/en/stable/indexing.html#more-advanced-indexing
You can get the same result by calling:

core_dim_locs = dict(core_dim_locs_from_cond(mask, new_dim_name="newdim"))

# pointwise selection
data.sel(
    dim_0=core_dim_locs["dim_0"],
    dim_1=core_dim_locs["dim_1"],
    dim_2=core_dim_locs["dim_2"],
)

(Note that you lose the chunk information with this method, which is why it is less efficient.)

When you want to select arbitrary items from an N-dimensional array, you can model the result either as some kind of sparse array or by stacking the dimensions.
(Granted, stacking the dimensions effectively amounts to a sparse COO encoding as well...)

@max-sixty (Collaborator)

Ah right, I see now, thanks for explaining.

Allowing pointwise indexing with boolean indexers would also be welcome.

shoyer (Member) commented Apr 20, 2021

I wonder if this is just a better proposal than making N-dimensional boolean indexing an alias for where: #1887 (comment)

shoyer (Member) commented Apr 22, 2021

@max-sixty and I have been having some more discussion about whether this is what ds[key] should do for N-dimensional boolean indexing over in #1887.

But regardless of what we want boolean indexing with [] to do, this would certainly be welcome functionality and should exist in a dedicated method. ds[key] is already very heavily overloaded in Xarray, so a more explicit option is nice to have, e.g., for the benefit of readability and static type checking. For the same reason, I would rather not put it inside isel(), which already does integer-based indexing with a different call signature. My tentative suggestion is to call this new method sel_mask(), since that's what it does: selection like sel/isel, except based on a mask.
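For concreteness, a rough standalone sketch of what such a mask-based selection could look like (sel_mask() does not exist in xarray; this is just an illustrative helper built on the pointwise-indexing approach shown earlier in the thread):

import numpy as np
import xarray as xr

def sel_mask(ds: xr.Dataset, mask: xr.DataArray, new_dim: str = "observations") -> xr.Dataset:
    # select every element where mask is True, flattening the masked
    # dimensions into a single new dimension
    locs = np.argwhere(np.asarray(mask))
    indexers = {
        dim: xr.DataArray(locs[:, axis], dims=new_dim)
        for axis, dim in enumerate(mask.dims)
    }
    return ds.isel(**indexers)

With a helper along these lines, the original example would read roughly df = sel_mask(xrds, ~xrds.missing).to_dataframe().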

Hoeze (Author) commented May 19, 2021

FYI, I updated the boolean indexing gist to support additional or missing dimensions:
https://gist.github.com/Hoeze/96616ef9d179180b0b7de97c97e00a27
I'm using this on a 4D array of >300 GB to flatten three of the four dimensions, and it works even with 64 GB of RAM.
