Add set_xindex and drop_indexes methods #6971

benbovy · 2022-08-31T12:54:35Z

Closes Public API for setting new indexes: add a set_xindex method? #6849
Supersedes (scipy 2022 branch) Add an "options" argument to Index.from_variables() #6800
Tests added
User visible changes (including notable bug fixes) are documented in whats-new.rst
New functions/methods are listed in api.rst

This PR adds Dataset and DataArray .set_xindex and .drop_indexes methods (the latter is also discussed in #4366). I've cherry-picked the relevant commits from the scipy22 branch and added a few more commits. This PR also allows passing build options to any Index.
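As a rough illustration of the two new methods, here is a minimal sketch assuming an Xarray release that includes this PR and the xarray.indexes namespace. The dataset, the "station" coordinate, and the variable names are made up for the example.

```python
import numpy as np
import xarray as xr
from xarray.indexes import PandasIndex

# Illustrative data: "station" is a non-dimension coordinate along "x".
ds = xr.Dataset(
    {"temp": ("x", np.arange(4.0))},
    coords={"station": ("x", ["a", "b", "c", "d"])},
)

# Build an index for "station" so that label-based selection works on it.
indexed = ds.set_xindex("station", PandasIndex)
picked = indexed.sel(station="c")

# Remove the index again; the coordinate itself is kept.
unindexed = indexed.drop_indexes("station")
```

After drop_indexes the "station" coordinate is still present, only without an index attached to it.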

Some comments and open questions:

Should we make the index_cls argument of set_xindex optional?
- I.e., set_xindex(coord_names, index_cls=None, **options), where a pandas index is created by default (or a pandas multi-index if several coordinate names are given), provided that the coordinate(s) are valid 1-d candidates.
- This would be redundant with the existing set_index method, but it would be convenient if we later deprecate it.
Should we deprecate set_index and reset_index? I think we should, but probably not yet.
There's a special case for multi-indexes where set_xindex(["foo", "bar"], PandasMultiIndex) adds a dimension coordinate in addition to the "foo" and "bar" level coordinates so that it is consistent with the rest of Xarray. I find it a bit annoying, though. Probably another motivation for deprecating this dimension coordinate.
In this PR I also imported the Index base class in Xarray's root namespace.
- It is needed for custom indexes and it's just a little more convenient than importing it from xarray.core.indexes.
- Should we do the same for the PandasIndex and PandasMultiIndex subclasses? Maybe, if one wants to create a custom index inheriting from them. PandasMultiIndex factory methods could also be useful if we deprecate passing pd.MultiIndex objects as DataArray / Dataset coordinates.

It allows passing options to the constructor of a custom index class (if any). The **options arguments of Dataset.set_xindex() are passed through. Also add type annotations to set_xindex().
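A minimal sketch of the options pass-through, assuming the keyword-only options signature of Index.from_variables() that this PR settles on. TaggedIndex, its tolerance option, and the "lon" coordinate are hypothetical names used only for illustration; a real custom index would also implement selection, alignment, and so on.

```python
import numpy as np
import xarray as xr
from xarray.indexes import Index


class TaggedIndex(Index):
    """Hypothetical index that only stores its values and one build option."""

    def __init__(self, values, tolerance):
        self.values = values
        self.tolerance = tolerance

    @classmethod
    def from_variables(cls, variables, *, options):
        # Exactly one coordinate variable is expected in this sketch.
        (var,) = variables.values()
        return cls(var.values, tolerance=options.get("tolerance", 0.0))


ds = xr.Dataset(coords={"lon": ("x", np.array([0.0, 1.5, 3.0]))})

# Extra keyword arguments of set_xindex reach from_variables as `options`.
indexed = ds.set_xindex("lon", TaggedIndex, tolerance=0.25)
index = indexed.xindexes["lon"]
```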

benbovy · 2022-08-31T13:33:42Z

Also, since .drop_indexes is new API I didn't feel the need to implement the old behavior regarding pandas multi-indexes (restored in #6592 and #6798 but deprecated anyway). @dcherian what do you think?

benbovy · 2022-08-31T13:45:26Z

BTW, viewing pull-request doc builds on RTD seems broken? Clicking on the "Details" link of the corresponding check leads to a 404.

xarray/core/dataarray.py

mathause

I'm super excited to see this in - then we get to finally play with the new indexes (if I understand this correctly).

mathause · 2022-09-01T10:20:02Z

xarray/core/dataset.py

+        # coordinates do not conflict), but let's not allow this for now
+        indexed_coords = set(coord_names) & set(self._indexes)
+
+        if indexed_coords:


Does this mean that you cannot use coords in more than one index? (I am not sure how important this is, but I could imagine a use case where lat & lon are used as 1D indexes and in a KDTree.)

Yes, that's right. Allowing multiple indexes per coordinate would make many things much harder.

There are indeed some examples (like the one you mention) where it could be useful to have multiple indexes. But I think it could be solved by either switching between indexes (if building them is not too expensive) or via a custom "meta-index" that would encapsulate both kinds of indexes.

Fair enough - thanks for the clarification!
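The restriction discussed in this thread, and the "switch between indexes" workaround, can be sketched roughly as below. This assumes the check in the diff above raises a ValueError, as the surrounding code suggests; the dataset is illustrative.

```python
import xarray as xr
from xarray.indexes import PandasIndex

# "x" is a dimension coordinate, so it gets a default pandas index.
ds = xr.Dataset(coords={"x": [10, 20, 30]})

# Setting a second index on an already indexed coordinate is rejected.
try:
    ds.set_xindex("x", PandasIndex)
    already_indexed_error = None
except ValueError as err:
    already_indexed_error = err

# Switching indexes: drop the existing one first, then set the new one.
swapped = ds.drop_indexes("x").set_xindex("x", PandasIndex)
```

Here the replacement index is again a PandasIndex for simplicity; in practice the second step would use a different index class.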

Try setting a pandas (multi-)index by default.

benbovy · 2022-09-07T09:03:46Z

Should we make the index_cls argument of set_xindex optional?

I ended up doing it. It is convenient for setting a pandas index for a non-dimension coordinate, which is currently not possible to do with set_index(). For unindexed dimension coordinates (e.g., now possible after renaming coordinates or dimensions), I find the syntax set_index(x="x") a bit weird compared to set_xindex("x").
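A short sketch of the default behavior described here, assuming a release that includes this change; the coordinate names are illustrative.

```python
import xarray as xr
from xarray.indexes import PandasIndex, PandasMultiIndex

ds = xr.Dataset(
    {"temp": ("x", [1.0, 2.0, 3.0])},
    coords={
        "label": ("x", ["a", "b", "c"]),
        "rank": ("x", [1, 2, 3]),
    },
)

# One coordinate name and no index_cls: a pandas index is the default.
single = ds.set_xindex("label")

# Several coordinate names and no index_cls: a pandas multi-index is the default.
multi = ds.set_xindex(["label", "rank"])
```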

xarray/core/dataset.py

TomNicholas · 2022-09-14T04:00:22Z

In this PR I also imported the Index base class in Xarray's root namespace.

It is needed for custom indexes and it's just a little more convenient than importing it from xarray.core.indexes.
Should we do the same for the PandasIndex and PandasMultiIndex subclasses? Maybe, if one wants to create a custom index inheriting from them. PandasMultiIndex factory methods could also be useful if we deprecate passing pd.MultiIndex objects as DataArray / Dataset coordinates.

Have you thought about whether we might want to expose a separate public xarray.indexes namespace? Then, as the list of helpers for creating custom index objects grows, they could live in there, so we might have xarray.Index, but xarray.indexes.PandasIndex, xarray.indexes.PandasMultiIndex, xarray.indexes.PeriodicBoundaryIndex, xarray.indexes.OtherDomainAgnosticIndex, etc., all listed in the docs API page?

benbovy · 2022-09-14T10:33:17Z

Have you thought about whether we might want to expose a separate public xarray.indexes namespace?

Yes, I've been thinking about it and I agree: I find it cleaner than exposing all of this in Xarray's main namespace. There are a few (minor) cons, though:

- I think the indexes.py and indexing.py modules and their content are well located in core.
- We could create a xarray/indexes/__init__.py and import there a few "public" classes from core, but is it worth it? I'm not sure if the number of Xarray built-in indexes will grow much beyond PandasIndex and PandasMultiIndex. Perhaps it's preferable not to?
- Things like CFTimeIndex are already imported in Xarray's main namespace.

TomNicholas · 2022-09-14T15:11:49Z

I personally would still choose to put indexes stuff in a separate namespace, just because it's neater, but I can see it's borderline.


Illviljan · 2022-09-16T02:45:34Z

xarray/core/dataset.py

+        coord_names: Hashable | Sequence[Hashable],
+        index_cls: type[Index] | None = None,
+        **options,
+    ) -> Dataset:


Suggested change

) -> Dataset:

) -> T_Dataset:

mypy is not happy with this:

xarray/tests/test_dataset.py:3307: error: Argument 1 to "set_xindex" of "Dataset" has incompatible type "List[str]"; expected "Hashable" [arg-type]
xarray/tests/test_dataset.py:3307: note: Following member(s) of "List[str]" have conflicts:
xarray/tests/test_dataset.py:3307: note: __hash__: expected "Callable[[], int]", got "None"
xarray/tests/test_dataset.py:3307: note: Protocol member Hashable.__hash__ expected instance variable, got class variable

#6971 (comment)

strings are sequences apparently:

isinstance("str", typing.Sequence)
Out[63]: True

Trying out CoordNames = Union[str, Iterable[Hashable]] seems to be successful in #7048.
It would be nice if we aligned these tricky types, so let's try to use named variables for repeated arguments.

Technically a str is also an Iterable of Hashable :P
But the typing community is quite relaxed about violating that fact.
So as long as you don't need the two types to be "perpendicular" it should work.
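The overlap between these types can be checked with the standard library alone. The sketch below reuses the CoordNames alias from the comment above; as_name_tuple is a hypothetical helper, not part of Xarray, showing why the str case has to be handled before the generic iterable case.

```python
from collections.abc import Hashable, Iterable, Sequence
from typing import Union

# Alias as proposed in the discussion: one name or several names.
CoordNames = Union[str, Iterable[Hashable]]


def as_name_tuple(names: CoordNames) -> tuple:
    """Hypothetical helper: normalize one-or-more names to a tuple."""
    # A str is itself a Sequence (and Iterable) of str, so it must be
    # checked first or "time" would be split into single characters.
    if isinstance(names, str):
        return (names,)
    return tuple(names)


str_is_sequence = isinstance("str", Sequence)
str_is_hashable = isinstance("str", Hashable)
list_is_hashable = isinstance(["a", "b"], Hashable)
```

This also shows where the mypy error above comes from: a list is not Hashable, so a list of names cannot be passed where a single Hashable is expected.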

+1 for using a named variable like CoordNames. The tricky thing here is that the order is important. Do we use Sequence in Xarray in that case? I guess we would need to define two variables for each case where the order does / doesn't matter?

Also, I don't remember whether a single coordinate name should be str or Hashable. Should we treat it like a single dimension name or not?

I feel like this issue should be addressed more globally in Xarray than within the scope of this PR. Perhaps better to move on and merge this PR before the next release?

Usually we try to move to str | Iterable[Hashable] for "one or more dims", and Hashable for a single dim.

Usually we try to move to str | Iterable[Hashable] for "one or more dims", and Hashable for a single dim.

Probably not in all cases? For example, with DataArray.__init__(..., dims: str | Iterable[Hashable]) the type checker would allow passing a set. Recently I had to figure out what was going on with xr.DataArray(data=np.zeros((10, 5)), dims={'x', 'time'}), which mypy should actually catch with Sequence[Hashable]. Slightly off-topic: should we have two variables Dims and OrderedDims defined in xarray.core.types?

Same issue here for coordinate names. str | Sequence[Hashable] seems to work well, though.

Why should a set not be allowed?
Hasn't the order been preserved for quite some time now? I think all built-in Iterables preserve order, and internally we convert to a tuple anyway?

I don't think the order is preserved for sets (unlike dicts). This is what I can get with CPython 3.9 / Xarray v2022.6.0:

print(xr.DataArray(data=np.zeros((2, 3)), dims={'x', 'time'}))
# <xarray.DataArray (time: 2, x: 3)>
# array([[0., 0., 0.],
#        [0., 0., 0.]])
# Dimensions without coordinates: time, x

tuple({'x', 'time'})
# ('time', 'x')

Whoops you are right, that was dicts.
Then indeed we need to distinguish between dims and ordered dims.
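One way the distinction could look is sketched below. The Dims and OrderedDims aliases follow the names floated earlier in this thread, but their definitions here, and the check_ordered helper, are assumptions made for illustration rather than Xarray's actual types.

```python
from collections.abc import Hashable, Iterable, Sequence
from typing import Union

# Hypothetical aliases: order-insensitive vs. order-sensitive dimension inputs.
Dims = Union[str, Iterable[Hashable]]
OrderedDims = Union[str, Sequence[Hashable]]


def check_ordered(dims: OrderedDims) -> tuple:
    """Hypothetical runtime counterpart of the OrderedDims annotation."""
    if isinstance(dims, str):
        return (dims,)
    # Sets are Iterable but not Sequence: their iteration order is arbitrary.
    if not isinstance(dims, Sequence):
        raise TypeError(f"expected an ordered sequence of dims, got {type(dims).__name__}")
    return tuple(dims)


ordered = check_ordered(("x", "time"))

try:
    check_ordered({"x", "time"})
    set_rejected = False
except TypeError:
    set_rejected = True

# Dicts, unlike sets, do preserve insertion order.
dict_order = list(dict.fromkeys(["x", "time"]))
```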

xarray/core/dataarray.py

benbovy · 2022-09-27T11:14:07Z

In the last commit I added the xarray.indexes namespace from which we can import Index, PandasIndex and PandasMultiIndex.
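For reference, importing from the new namespace would look roughly like this, assuming a release that includes this commit. The relationships checked below reflect how the classes are organized at the time of this PR.

```python
import xarray as xr
from xarray.indexes import Index, PandasIndex, PandasMultiIndex

# The base class is reachable both from the root namespace and from
# the dedicated indexes namespace.
same_base_class = xr.Index is Index

# The built-in pandas-backed indexes derive from the base class.
pandas_is_index = issubclass(PandasIndex, Index)
multi_is_index = issubclass(PandasMultiIndex, Index)
```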

Thanks everyone for the feedback and review!

I think this is ready to merge, if we agree to address the coord_names typing issue in another PR?

benbovy and others added 17 commits August 31, 2022 09:29

temporary API to set custom indexes

3f6f637

add the temporary index API to DataArray

bf30d54

add options argument to Index.from_variables()

9de9c46

It allows passing options to the constructor of a custom index class (if any). The **options arguments of Dataset.set_xindex() are passed through. Also add type annotations to set_xindex().

fix mypy

aa403a4

remove temporary API warning

210a59a

add the Index class in Xarray's root namespace

d8c3985

improve set_xindex docstrings and add to api.rst

c4afabf

remove temp comments

fe723ce

special case for pandas multi-index dim coord

a48c853

add tests for set_xindex

01de6bd

error message tweaks

201bd05

set_xindex with 1 coord: avoid reodering coords

41c896f

mypy fixes

1ec5ca6

add Dataset and DataArray drop_indexes methods

a6caa7a

improve assert_no_index_corrupted error msg

bb07d5a

drop_indexes: add tests

ec2f8fc

add drop_indexes to api.rst

f9601b9

github-actions bot added the topic-indexing label Aug 31, 2022

benbovy added 2 commits August 31, 2022 15:47

improve docstrings of legacy methods

1a555bc

add what's new entry

0b7d582

benbovy mentioned this pull request Aug 31, 2022

Explicit indexes: next steps #6293

Open

49 tasks

Illviljan reviewed Aug 31, 2022

xarray/core/dataarray.py (outdated, resolved)

try using correct typing w/o mypy complaining

3ab0bc9

mathause reviewed Sep 1, 2022

This was referenced Sep 2, 2022

(scipy 2022 branch) Add an "options" argument to Index.from_variables() #6800

Closed

reset multi-index to single index (level): coordinate not renamed #6989

Closed

Review (re)set_index #6992

Merged

This was referenced Sep 6, 2022

Raise UserWarning when rename creates a new dimension coord #6999

Merged

Inconsistency in whether index is created with new dimension coordinate? #4417

Closed

make index_cls arg optional

9e75f95

Try setting a pandas (multi-)index by default.

mathause mentioned this pull request Sep 7, 2022

sel along 1D non-index coordinates #3925

Closed

4 tasks

benbovy mentioned this pull request Sep 9, 2022

Add documentation on custom indexes #6975

Merged

dcherian reviewed Sep 9, 2022

xarray/core/dataset.py (4 review threads, outdated and resolved)

TomNicholas mentioned this pull request Sep 13, 2022

Periodic Boundary Index #7031

Open

Illviljan reviewed Sep 16, 2022

xarray/core/dataarray.py (outdated, resolved)

benbovy mentioned this pull request Sep 20, 2022

Should Xarray stop doing automatic index-based alignment? #7045

Open

benbovy added 4 commits September 23, 2022 09:45

docstrings fixes and tweaks

00c2711

make Index.from_variables options arg keyword only

cb67612

Merge branch 'main' into add-set-xindex-and-drop-indexes

af67168

improve set_xindex invalid coordinates error msg

2cd0aa8

benbovy mentioned this pull request Sep 23, 2022

release? #7069

Closed

benbovy added 3 commits September 27, 2022 11:56

add xarray.indexes namespace

61d6e28

Merge branch 'main' into add-set-xindex-and-drop-indexes

ec08d73

Merge branch 'main' into add-set-xindex-and-drop-indexes

20dbf5a

benbovy mentioned this pull request Sep 27, 2022

Need a way to speciefy the names of coordinates from the indices which droped by DataArray.reset_index. #5874

Closed

type tweaks

b598447

headtr1ck mentioned this pull request Sep 27, 2022

Align typing of dimension inputs #7094

Open

3 tasks

benbovy merged commit e678a1d into pydata:main Sep 28, 2022

benbovy mentioned this pull request Oct 3, 2022

slice using non-index coordinates #2028

Closed

benbovy deleted the add-set-xindex-and-drop-indexes branch December 8, 2022 09:38
