Indexing with alignment and broadcasting #974
Comments
What is the current status of these proposed major changes? I have some free time at the moment.
I haven't had the time to start working on this -- help would be very gratefully appreciated! In rough order, I would suggest:
@shoyer Thanks for the suggestions. I don't think I fully understand how boolean indexing works. I started from the vectorizing indexing by updating
@fujiisoup My first two bullets for boolean indexing are actually new functionality, so we wouldn't need that for a first pass here. It actually would probably be better to save it for a second PR. My third bullet on boolean indexing is basically just saying that
I opened a pull request with your branch in #1473 so I can comment/view your changes. I hope that's OK!
I think we can bring all of NumPy's advanced indexing to xarray in a very consistent way, with only very minor breaks in backwards compatibility.
For boolean indexing:
- `da[key]`, where `key` is a boolean labelled array (with any number of dimensions), is made equivalent to `da.where(key.reindex_like(da), drop=True)`. This matches the existing behavior if `key` is a 1D boolean array. For multi-dimensional arrays, even though the result is now multi-dimensional, this coupled with automatic skipping of NaNs means that `da[key].mean()` gives the same result as in NumPy.
- `da[key] = value`, where `key` is a boolean labelled array, can be made equivalent to `da = da.where(*align(key.reindex_like(da), value.reindex_like(da)))` (that is, the three-argument form of `where`).
- `da[key_0, ..., key_n]`, where all of `key_i` are boolean arrays, gets handled in the usual way. It is an `IndexingError` to supply multiple labelled keys if any of them are not already aligned with the corresponding index coordinates (and share the same dimension name). If users want alignment, we suggest they simply write `da[key_0 & ... & key_n]`.
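
To make the first bullet concrete, here is a minimal sketch that emulates the proposed `da[key]` with the existing `DataArray.where(..., drop=True)`. The array contents, coordinate values, and the names `selected` and `numpy_mean` are made up for illustration. It shows the equivalence described above and is not meant as the implementation.

```python
import numpy as np
import xarray as xr

# A small 2D labelled array (values and labels are illustrative only).
da = xr.DataArray(
    np.arange(12.0).reshape(3, 4),
    dims=("x", "y"),
    coords={"x": [10, 20, 30], "y": ["a", "b", "c", "d"]},
)

# A 2D boolean labelled key with the same dimensions as `da`.
key = da > 4

# Emulate the proposed da[key]: mask out False entries with NaN, then drop
# any row or column label for which the key is False everywhere.
selected = da.where(key.reindex_like(da), drop=True)

# The result is still 2D, but reductions skip NaN by default, so they agree
# with NumPy's flattened boolean indexing.
numpy_mean = da.values[key.values].mean()
print(selected.shape, float(selected.mean()), numpy_mean)
```

The row labelled `x=10` disappears because the key is `False` for every element in it. The single masked element in the `x=20` row stays as NaN and is skipped by `mean()`.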
For vectorized indexing (by integer or index value):
- `da[key_0, ..., key_n]`, where all of `key_i` are integer labelled arrays with any number of dimensions, gets handled like NumPy, except that instead of broadcasting NumPy-style we broadcast xarray-style:
  - If any `key_i` are unlabelled 1D arrays (e.g., NumPy arrays), we convert them into an `xarray.Variable` along the respective dimension. 0D arrays remain scalars. This ensures that the result of broadcasting them (in the next step) will be consistent with our current "outer indexing" behavior. Unlabelled higher-dimensional arrays trigger an `IndexingError`.
  - The keys are broadcast against each other, as in `da[*broadcast(key_0, ..., key_n)]` (note that `broadcast` now includes automatic alignment).
  - Each broadcast `key_i` is mapped to the integer position on the corresponding `i`th axis of `da`.
- `ds.loc[key_0, ..., key_n]` works exactly as above, except that instead of doing integer lookup, we look up label values in the corresponding index.
- `.isel` and `.sel`/`.reindex` work like the two previous cases, except that we look up axes by dimension name instead of axis position.
- I have not worked through assignment (`da[key] = value` or `da.loc[key] = value`) in detail, but I think it works in a straightforwardly similar fashion.

All of these methods should also work for indexing on `Dataset` by looping over Dataset variables in the usual way.

This framework neatly addresses most of the major limitations with xarray's existing indexing:
- Dedicated methods (`sel_points`/`isel_points`) are no longer needed for pointwise indexing. If you want to select along the diagonal of an array, you simply need to supply indexers that use a new dimension. Instead of `arr.sel_points(lat=stations.lat, lon=stations.lon, dim='station')`, you would simply write `arr.sel(lat=stations.lat, lon=stations.lon)` -- the `station` dimension is taken automatically from the indexer.
- Nearest-neighbor selection onto a grid can be written as `ds.reindex(lon=grid.lon, lat=grid.lat, method='nearest', tolerance=0.5)` or `ds.reindex_like(grid, method='nearest', tolerance=0.5)`.
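
The difference between outer indexing and pointwise indexing can be sketched as follows. This assumes an xarray release that accepts labelled indexers in `.isel`/`.sel` as proposed in this issue. The grid values and the `stations` table are hypothetical and exist only for the example.

```python
import numpy as np
import xarray as xr

# A small lat/lon grid (values and coordinates are illustrative only).
arr = xr.DataArray(
    np.arange(12).reshape(3, 4),
    dims=("lat", "lon"),
    coords={"lat": [10.0, 20.0, 30.0], "lon": [100.0, 110.0, 120.0, 130.0]},
)

# Unlabelled 1D keys keep the "outer indexing" behavior: every combination
# of the selected rows and columns is returned.
outer = arr.isel(lat=[0, 2], lon=[1, 3])

# Labelled keys that share a new dimension select pointwise instead.
lat_idx = xr.DataArray([0, 2], dims="station")
lon_idx = xr.DataArray([1, 3], dims="station")
pointwise = arr.isel(lat=lat_idx, lon=lon_idx)

# The label-based form: a hypothetical table of station locations whose
# variables all live on the "station" dimension.
stations = xr.Dataset(
    {
        "lat": ("station", [10.0, 30.0]),
        "lon": ("station", [110.0, 130.0]),
    },
    coords={"station": ["A", "B"]},
)
by_label = arr.sel(lat=stations.lat, lon=stations.lon)

print(outer.shape, pointwise.dims, by_label.dims)
```

In both pointwise cases the result has a single `station` dimension taken from the indexers, so no `dim='station'` argument is needed.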

Questions to consider:
- Should we disallow indexing with `[]` and `.loc[]` and non-labelled arrays when some but not all dimensions are provided? Instead, we would require explicit indexing like `[key, ...]` (yes, writing `...`), which indicates "all trailing axes" like NumPy. This behavior has been suggested for new indexers in NumPy because it precludes a class of bugs where the array has an unexpected number of dimensions. On the other hand, it's not so necessary for us when we have explicit indexing by dimension name with `.sel`.

xref these comments from @MaximilianR and myself
Note: I would certainly welcome help making this happen from a contributor other than myself, though you should probably wait until I finish #964, first, which lays important groundwork.