Indexing with alignment and broadcasting #974

shoyer · 2016-08-18T06:39:27Z

I think we can bring all of NumPy's advanced indexing to xarray in a very consistent way, with only very minor breaks in backwards compatibility.

For boolean indexing:

- da[key] where key is a boolean labelled array (with any number of dimensions) is made equivalent to da.where(key.reindex_like(da), drop=True). This matches the existing behavior if key is a 1D boolean array. For multi-dimensional arrays, even though the result is now multi-dimensional, this coupled with automatic skipping of NaNs means that da[key].mean() gives the same result as in NumPy.
- da[key] = value where key is a boolean labelled array can be made equivalent to da = da.where(*align(key.reindex_like(da), value.reindex_like(da))) (that is, the three argument form of where).
- da[key_0, ..., key_n] where all of key_i are boolean arrays gets handled in the usual way. It is an IndexingError to supply multiple labelled keys if any of them are not already aligned with the corresponding index coordinates (and share the same dimension name). If users want alignment, we suggest they simply write da[key_0 & ... & key_n].
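The first bullet can be sketched with calls that already exist in xarray. This is only an illustration under assumptions: the array, its coordinates, and the threshold are made up, and the masked assignment at the end uses xr.where as one possible spelling rather than the exact form proposed above.

```python
import numpy as np
import xarray as xr

# Hypothetical 3x4 array with labelled dimensions.
da = xr.DataArray(
    np.arange(12.0).reshape(3, 4),
    dims=("x", "y"),
    coords={"x": [10, 20, 30], "y": [0, 1, 2, 3]},
)
key = da > 4  # 2D boolean labelled array

# Proposed meaning of da[key]: mask, then drop labels that are entirely masked.
selected = da.where(key.reindex_like(da), drop=True)

# The result stays 2D, but NaN-skipping makes the reduction agree with NumPy.
xarray_mean = float(selected.mean())
numpy_mean = float(da.values[key.values].mean())

# One way to express a masked assignment today (an assumption, not the
# proposal's exact formula): take -1.0 where key is True, da elsewhere.
assigned = xr.where(key, -1.0, da)
```

Here the row x=10 is dropped because no element in it satisfies the condition, while the remaining masked element is kept as NaN.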

For vectorized indexing (by integer or index value):

da[key_0, ..., key_n] where all of key_i are integer labelled arrays with any number of dimensions gets handled like NumPy, except instead of broadcasting numpy-style we do broadcasting xarray-style:
- If any of key_i are unlabelled, 1D arrays (e.g., numpy arrays), we convert them into an xarray.Variable along the respective dimension. 0D arrays remain scalars. This ensures that the result of broadcasting them (in the next step) will be consistent with our current "outer indexing" behavior. Unlabelled higher dimensional arrays trigger an IndexingError.
- We ensure all keys have the same dimensions/coordinates by mapping the indexing operation to da[*broadcast(key_0, ..., key_n)] (note that broadcast now includes automatic alignment).
- The result's dimensions and coordinates are copied from the broadcast keys.
- The result's values are taken by mapping each set of integer locations specified by the broadcast version of key_i to the integer position on the corresponding ith axis on da.
Labeled indexing like da.loc[key_0, ..., key_n] works exactly as above, except instead of doing integer lookup, we look up label values in the corresponding index instead.
Indexing with .isel and .sel/.reindex works like the two previous cases, except we look up axes by dimension name instead of axis position.
I haven't fully thought through the implications for assignment (da[key] = value or da.loc[key] = value), but I think it works in a straightforwardly similar fashion.
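A small sketch of how these rules play out, written against the API as it exists in released xarray versions that include vectorized indexing. The array values and the dimension name "points" are illustrative assumptions.

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.arange(12).reshape(3, 4),
    dims=("x", "y"),
    coords={"x": [10, 20, 30], "y": [0, 1, 2, 3]},
)

# Unlabelled 1D keys are placed along their own dimension: outer indexing.
outer = da[[0, 2], [1, 3]]

# Labelled integer keys that share a dimension broadcast against each
# other, so the lookup is pointwise and the result carries that dimension.
ix = xr.DataArray([0, 2], dims="points")
iy = xr.DataArray([1, 3], dims="points")
pointwise = da.isel(x=ix, y=iy)

# The same lookup by label instead of integer position.
lx = xr.DataArray([10, 30], dims="points")
ly = xr.DataArray([1, 3], dims="points")
by_label = da.sel(x=lx, y=ly)
```

The outer result keeps dimensions ("x", "y") with shape (2, 2), while both pointwise results have the single dimension "points".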

All of these methods should also work for indexing on Dataset by looping over Dataset variables in the usual way.

This framework neatly subsumes most of the major limitations with xarray's existing indexing:

- Boolean indexing on multi-dimensional arrays works in an intuitive way, for both selection and assignment.
- No more need for specialized methods (sel_points/isel_points) for pointwise indexing. If you want to select along the diagonal of an array, you simply need to supply indexers that use a new dimension. Instead of arr.sel_points(lat=stations.lat, lon=stations.lon, dim='station'), you would simply write arr.sel(lat=stations.lat, lon=stations.lon) -- the station dimension is taken automatically from the indexer.
- Other use cases for NumPy's advanced indexing that currently are impossible in xarray also automatically work. For example, nearest neighbor interpolation to a completely different grid is now as simple as ds.reindex(lon=grid.lon, lat=grid.lat, method='nearest', tolerance=0.5) or ds.reindex_like(grid, method='nearest', tolerance=0.5).
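For the station case, a minimal sketch might look like the following. The grid, the station names, and the station positions are invented for illustration, and nearest-neighbor matching is added so that the off-grid positions resolve to grid cells.

```python
import numpy as np
import xarray as xr

# Hypothetical gridded field on a small lat/lon grid.
arr = xr.DataArray(
    np.arange(12.0).reshape(3, 4),
    dims=("lat", "lon"),
    coords={"lat": [0.0, 1.0, 2.0], "lon": [10.0, 11.0, 12.0, 13.0]},
)

# Hypothetical station locations, both defined along a "station" dimension.
stations = xr.Dataset(
    {
        "lat": ("station", [0.1, 1.9]),
        "lon": ("station", [10.2, 12.9]),
    },
    coords={"station": ["a", "b"]},
)

# The "station" dimension of the result comes from the indexers themselves.
picked = arr.sel(
    lat=stations["lat"],
    lon=stations["lon"],
    method="nearest",
    tolerance=0.5,
)
```

Each station picks one grid cell, so the result is 1D along "station" rather than a 2x2 outer product.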

Questions to consider:

- How does this interact with @benbovy's enhancements for MultiIndex indexing? (Multi-index indexing #802 and Multi-index levels as coordinates #947)
- How do we handle mixed slice and array indexing? In NumPy, this is a major source of confusion, because slicing is done before broadcasting and the order of slices in the result is handled separately from broadcast indices. I think we may be able to resolve this by mapping slices in this case to 1D arrays along their respective axes, and using our normal broadcasting rules.
- Should we deprecate non-boolean indexing with [] and .loc[] and non-labelled arrays when some but not all dimensions are provided? Instead, we would require explicitly indexing like [key, ...] (yes, writing ...), which indicates "all trailing axes" like NumPy. This behavior has been suggested for new indexers in NumPy because it precludes a class of bugs where the array has an unexpected number of dimensions. On the other hand, it's not so necessary for us when we have explicit indexing by dimension name with .sel.
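To make the mixed slice and array question concrete, here is a short sketch of the NumPy behavior in question next to what xarray's outer indexing does with the same key. The array shape and dimension names are arbitrary choices for the example.

```python
import numpy as np
import xarray as xr

x = np.zeros((2, 3, 4))

# In NumPy, the scalar 0 and the list are both "advanced" indices. Because
# a slice separates them, the broadcast dimension moves to the front.
numpy_mixed = x[0, :, [0, 1]]

# Indexing in two steps keeps the axes in their original order.
numpy_chained = x[0][:, [0, 1]]

# With named dimensions, each key stays on its own axis.
da = xr.DataArray(x, dims=("a", "b", "c"))
xarray_mixed = da[0, :, [0, 1]]
```

The NumPy results have shapes (2, 3) and (3, 2) respectively, while the xarray result keeps dimensions ("b", "c") with shape (3, 2).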

xref these comments from @MaximilianR and myself

Note: I would certainly welcome help making this happen from a contributor other than myself, though you should probably wait until I finish #964 first, which lays important groundwork.


fujiisoup · 2017-07-05T10:26:54Z

What is the current status of these proposed major changes?
Is there any starting basis for them?

I have some free time at the moment.

shoyer · 2017-07-05T17:25:07Z

I haven't had the time to start working on this -- help would be very gratefully appreciated!

In rough order, I would suggest:

Write a design doc for any new API. We basically have this here, but there may be a few aspects that still need to be thought through, so I would appreciate your critical review.
Write the test suite for the new functionality. We could even submit this, with the tests marked as xfail.
Implement the core functionality and get the new tests to pass. This is the fun part :).
Get the existing test suite to pass. This will likely be the most painful part, because there will be a number of completely unrelated test failures. I can definitely help when we get to this stage.
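As a rough idea of what the second step could look like, here is a sketch of a test marked as an expected failure with pytest. The test name and the "points" dimension are hypothetical; the assertion only encodes the behavior described in the proposal above.

```python
import numpy as np
import pytest
import xarray as xr


# Non-strict xfail: the test is reported as an expected failure until the
# feature lands, and does not break the suite once it starts passing.
@pytest.mark.xfail(reason="vectorized indexing not implemented yet", strict=False)
def test_isel_pointwise_takes_dimension_from_indexer():
    da = xr.DataArray(np.arange(6).reshape(2, 3), dims=("x", "y"))
    ix = xr.DataArray([0, 1], dims="points")
    iy = xr.DataArray([2, 0], dims="points")

    actual = da.isel(x=ix, y=iy)

    # The result should take its dimension from the indexers.
    assert actual.dims == ("points",)
    np.testing.assert_array_equal(actual.values, [2, 3])
```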

fujiisoup · 2017-07-10T01:00:35Z

@shoyer Thanks for the suggestions.

I don't think I fully understand how boolean indexing works.
Could you show me some example use-cases?

I started with vectorized indexing by updating the Variable.__getitem__ method
in my forked repo (test).

shoyer · 2017-07-10T01:34:23Z

@fujiisoup My first two bullets for boolean indexing are actually new functionality, so we wouldn't need that for a first pass here. It actually would probably be better to save it for a second PR.

My third bullet on boolean indexing is basically just saying that da[key_1, ..., key_n] and ds.sel(x=key) should be handled in a consistent way with the new indexing behavior when given a boolean array. I don't think this will require any special adjustments -- it should fall out pretty immediately once you get the rest working.
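For reference, the 1D boolean case that should keep working consistently across these entry points can be sketched as follows. The data and the threshold are made up for the example.

```python
import xarray as xr

da = xr.DataArray(
    [1.0, 2.0, 3.0, 4.0],
    dims="x",
    coords={"x": [10, 20, 30, 40]},
)

# 1D boolean labelled key, already aligned with the x coordinate.
key = da.x > 20

by_sel = da.sel(x=key)       # boolean key passed by dimension name
by_isel = da.isel(x=key)     # same key through positional indexing
by_getitem = da[key.values]  # unlabelled boolean mask with []
```

All three forms select the last two elements.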

shoyer · 2017-07-10T01:49:57Z

I opened a pull request with your branch in #1473 so I can comment/view your changes. I hope that's OK!

shoyer · 2018-02-04T23:30:11Z

Vectorized indexing was implemented by #1473

I've opened a new issue #1887 for multi-dimensional boolean indexing.

shoyer added topic-indexing API design labels Aug 23, 2016

shoyer mentioned this issue Aug 23, 2016

API design for pointwise indexing #475

Open

benbovy mentioned this issue Oct 20, 2016

Follow-ups on MultIndex support #719

Closed

7 tasks

shoyer added this to the before 1.0 milestone Feb 1, 2017

shoyer added the major API changes label Feb 1, 2017

shoyer mentioned this issue Mar 28, 2017

Deprecate indexing with non-aligned DataArray objects #1333

Closed

This was referenced Apr 30, 2017

argmin / argmax behavior doesn't match documentation #1388

Closed

Sortby #1389

Merged

shoyer mentioned this issue May 31, 2017

Wrong dimension referenced in boolean indexing with .loc #1436

Closed

fujiisoup mentioned this issue Jul 1, 2017

Argmin indexes #1469

Closed

4 tasks

shoyer mentioned this issue Jul 10, 2017

WIP: indexing with broadcasting #1473

Closed

4 tasks

leezu mentioned this issue Aug 24, 2017

Assignment #1519

Open

shoyer mentioned this issue Feb 4, 2018

Boolean indexing with multi-dimensional key arrays #1887

Open

shoyer closed this as completed Feb 4, 2018
