Add `set_index`, `reset_index` and `reorder_levels` methods #1028

benbovy · 2016-10-03T13:22:24Z

Another item in #719.

I added tests and updated the docs, so this is ready for review.

shoyer

Very nice -- thanks for working on this!

shoyer · 2016-10-03T14:15:46Z

xarray/core/dataarray.py

+        --------
+        DataArray.reset_index
+        """
+        indexers = utils.combine_pos_and_kw_args(indexers, kw_indexers,


I know I wrote this utility (and it is currently used by Dataset.reindex), but I regret it now! I feel like it's usually better to only have a single call signature, accepting either a dictionary or **kwargs, but not both.

This goes for the other new methods, too.

I would be for using **kwargs unless we choose the alternative signatures suggested in the comments above and below.
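To make the trade-off concrete, here is a minimal sketch in plain Python. Both function names (`set_index_both` and `set_index_kwargs_only`) and their bodies are hypothetical stand-ins, not xarray's code. They only illustrate that accepting a dict and **kwargs together needs an extra guard that a single call signature avoids.

```python
def set_index_both(indexes=None, **kw_indexes):
    """Hypothetical: accepts a dict OR keyword arguments, so it must reject both at once."""
    if indexes is not None and kw_indexes:
        raise ValueError("pass either a dict or keyword arguments, not both")
    return dict(indexes) if indexes is not None else kw_indexes


def set_index_kwargs_only(**indexes):
    """Hypothetical: a single call signature, nothing to disambiguate."""
    return indexes


# The three calls below describe the same request.
a = set_index_both({"record": ["level_0", "level_1"]})
b = set_index_both(record=["level_0", "level_1"])
c = set_index_kwargs_only(record=["level_0", "level_1"])
```

With the combined signature, a call that mixes both forms has to be rejected at runtime; with the kwargs-only signature that mixed call cannot be written in the first place.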

shoyer · 2016-10-03T14:17:49Z

xarray/core/dataarray.py

+        append : bool, optional
+            If True, append the supplied indexers to the existing indexes.
+            Otherwise replace the existing indexes (default).
+        inplace : bool, optional


I'm not sure we need an inplace=True option. I guess it doesn't hurt. (Just more to test)

shoyer · 2016-10-03T14:23:10Z

xarray/core/dataarray.py

+
+        Parameters
+        ----------
+        indexers : dict, optional


Instead of array.set_index({'record': ['level_0', 'level_1']}) or array.set_index(record=['level_0', 'level_1']), it might suffice to simply use array.set_index(['level_0', 'level_1']), without specifying which dimension these get added to. We can infer the dimension names from the variables.

The upside is that accepting a list instead of a dict is a little more succinct. The downside is that it's less explicit/self-documenting.

Another downside of using a single argument is that it is less consistent with other parts of xarray's API which use keyword arguments (e.g., reindex, sel, etc.). IMHO the small additional verbosity is worth the gain in explicitness here.

That said, a single dim_levels or levels argument may be more consistent with the signature you suggest below for reset_index: reset_index(dim, levels=None,...). So in the end I don't really know which is better.

After discussion in #1030, I'm wondering if it is common to have to rename a dimension after setting a new (multi-)index, such that either set_index(['level_0', 'level_1'], name='new_dim_name') or set_index(new_dim_name=['level_0', 'level_1']) would be a useful shortcut to set_index(...).rename(...).

shoyer · 2016-10-03T14:32:23Z

xarray/core/dataarray.py

+        else:
+            return self._replace(coords=coords)
+
+    def reset_index(self, dim_levels=None, drop=False, inplace=False,


Maybe switch the signature to use separate arguments like reset_index(dim, levels=None, ...) instead of using a dict/**kwargs. This would make the usual use clearer, e.g., array.reset_index('record') instead of array.reset_index(record=None).

Also, after #1017 (optional indexes), the ability to write array.reset_index('x', drop=True) for clearing an index could be nice to have.

Makes sense!

shoyer · 2016-10-03T14:44:16Z

xarray/core/dataset.py

+            if isinstance(current_index, pd.MultiIndex):
+                names.extend(current_index.names)
+                for i in range(current_index.nlevels):
+                    arrays.append(current_index.get_level_values(i))


It might be premature optimization to worry about this, but something to watch out for is that this will refactorize the existing MultiIndex. It might be better to simply reuse existing levels and labels, if possible, and directly pass those into the MultiIndex constructor.

Note that to create levels and labels directly from an array (if necessary) you can use Categorical as used in MultiIndex.from_arrays, e.g.,

    cat = pd.Categorical(array, ordered=True)
    levels = cat.categories
    labels = cat.codes
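A minimal sketch of that suggestion, assuming a recent pandas in which the MultiIndex constructor keyword is `codes` (it was called `labels` in the pandas versions current when this thread was written). The sample data and the names `existing` and `combined` are made up for illustration; this is not the code from the PR.

```python
import numpy as np
import pandas as pd

array = np.array(["b", "a", "b", "c"])

# Factorize one array into sorted unique levels plus integer codes.
cat = pd.Categorical(array, ordered=True)
levels = cat.categories  # the unique values 'a', 'b', 'c'
codes = cat.codes        # position of each element within levels

# An existing MultiIndex already stores its levels and codes, so they can be
# passed straight to the constructor next to the newly factorized level,
# instead of being rebuilt from get_level_values().
existing = pd.MultiIndex.from_arrays([[1, 1, 2, 2]], names=["level_0"])
combined = pd.MultiIndex(
    levels=list(existing.levels) + [levels],
    codes=list(existing.codes) + [codes],
    names=list(existing.names) + ["level_1"],
)
```

Only the new array is factorized here; the levels and codes of `existing` are reused as they are.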

shoyer · 2016-10-03T14:51:32Z

xarray/core/dataarray.py

@@ -816,6 +816,118 @@ def swap_dims(self, dims_dict):
        ds = self._to_temp_dataset().swap_dims(dims_dict)
        return self._from_temp_dataset(ds)

+    def set_index(self, indexers=None, append=False, inplace=False,


A note on terminology: in xarray/pandas, indexer usually refers to the argument/key used for indexing, whereas index refers to the set of existing labels, e.g., df.index vs df.loc[indexer]. So in this case I think the argument name indexes would make more sense.

OK, I've naively copied the signature of reindex but now I get it.
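The distinction can be seen in a small pandas example. The DataFrame and its values are invented for illustration; only `df.index` and `df.loc[...]` come from the comment above.

```python
import pandas as pd

df = pd.DataFrame({"value": [10, 20, 30]}, index=["a", "b", "c"])

index = df.index            # the index: the set of existing labels
indexer = ["a", "c"]        # an indexer: a key used to select from those labels
selected = df.loc[indexer]  # the rows labelled 'a' and 'c'
```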

benbovy · 2016-10-20T10:32:57Z

Some API design questions (mostly from @shoyer's review) we need to fix:

1. We need to choose whether to use dim=indexes kwargs or fixed arg/kwarg relative to a given dimension for the signatures of .set_index(), .reset_index() and .reorder_levels().
2. Do we also allow .set_index() to rename the dimension(s) if needed, instead of doing .set_index(...).rename(...)? Is this a common use case that is worth it?
3. After discussion in WIP: Optional indexes (no more default coordinates given by range(n)) #1017, it seems that we need an easy way to (re)set indexes either to no index or to range(n).

For point 1, my preference goes to dim=indexes kwargs, especially if we need 2 and 3. It's less succinct, but it's closer to the signatures of other xarray methods like .reindex() or .sel(), and it allows (re)setting the indexes of multiple dimensions in a single call. Given 2, I find set_index(new_dim_name=['level_1', 'level_3']) a bit more elegant than set_index(['level_1', 'level_2'], name='new_dim_name'). Given 3, array.reset_index('x') seems ambiguous compared to array.reset_index(x=None) (no index) and, e.g., array.reset_index(x='range') (range(n) index).

shoyer · 2016-10-22T01:18:40Z

We need to choose whether to use dim=indexes kwargs or fixed arg/kwarg relative to a given dimension for the signatures of .set_index(), .reset_index() and .reorder_levels().

For set_index and reorder_levels, I like the kwargs or a dictionary. It's nice and explicit. But for reset_index, I think we probably want a list.

It's not at all obvious to me what array.reset_index(x=None) does. It could just as easily mean "reset nothing from x" as "reset x to have a null index". In fact, the former seems more consistent with how we handle levels. In contrast, array.reset_index(['x']) pretty clearly means that the 'x' index should be reset.

Do we also allow .set_index() to rename the dimension(s) if needed, instead of doing .set_index(...).rename(...) ? Is this a common use case that is worth it?

My inclination is yes -- this feels like a common thing to do. But we could also safely add this later.

After discussion in #1017, it seems that we need an easy way to (re)set indexes either to no index or to range(n).

We definitely need a way to reset indexes to the default, but after #1017, I'm not sure we will need a way to set them to range(n).

Unfortunately, if x is a normal (non-multi) Index, array.reset_index('x') is not well defined. We need a name for the variable that was formerly named x (or could drop it, e.g., with array.reset_index('x', drop=True) or array.drop('x')), otherwise it will still be the index for the x-axis. Or, I suppose we could rename the x dimension to something else.

One option is to add some sort of prefix or suffix to the index name when it becomes a new variable, e.g., array.reset_index('x') renames the coordinate x to x_. This seems like a probably safe choice, though I hate to add more automatic names to the API.

benbovy · 2016-11-04T17:34:33Z

Sorry for the delay @shoyer. I've read your comments above and they all seem relevant. I'll find some time next week to get back on this.

benbovy · 2016-11-07T16:37:03Z

Just committed review changes.

.reset_index() doesn't accept kwargs anymore, though I don't know what to choose between the options below (currently option A is implemented):

option A: reset_index(dim, levels=None), where dim may accept multiple dimension names (in that case levels must be a list of lists with the same length as dim, or simply None, which would then apply to all given dimensions).
option B: same as option A, reset_index(dim, levels=None), except that dim only accepts one dimension (thus a bit simpler but less flexible).
option C: reset_index(dim_or_levels), where one can provide a list of dimension(s) and/or level(s). This is the most flexible and concise, though maybe less readable. Allowing both dimensions and levels may be ambiguous too.
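One way option C could resolve a mixed list of names is sketched below in plain Python. The helper `split_dims_and_levels`, its arguments, and the sample names are all hypothetical and are not taken from the PR. The sketch also assumes that dimension names and level names never collide, which is what keeps the lookup unambiguous.

```python
def split_dims_and_levels(dims_or_levels, dim_levels):
    """Hypothetical helper: map each requested name to {dimension: levels to reset}.

    dim_levels maps a dimension name to the level names of its multi-index.
    """
    if isinstance(dims_or_levels, str):
        dims_or_levels = [dims_or_levels]

    # Reverse lookup from level name to the dimension that owns it.
    level_to_dim = {lev: dim for dim, levs in dim_levels.items() for lev in levs}

    to_reset = {}
    for name in dims_or_levels:
        if name in dim_levels:
            # A dimension name resets every level of its multi-index.
            to_reset[name] = list(dim_levels[name])
        elif name in level_to_dim:
            # A level name resets only that level of its dimension.
            levels = to_reset.setdefault(level_to_dim[name], [])
            if name not in levels:
                levels.append(name)
        else:
            raise ValueError(f"{name!r} is neither a dimension nor a level name")
    return to_reset


result = split_dims_and_levels(
    ["record", "level_a"],
    {"record": ["level_0", "level_1"], "other": ["level_a", "level_b"]},
)
```

Here `result` asks for every level of `record` and only `level_a` of `other` to be reset. If a name could be both a dimension and a level, the first branch would silently win, which is the ambiguity raised under option C.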

if x is a normal (non-multi) Index, array.reset_index('x') is not well defined

Currently .reset_index() doesn't allow resetting normal indexes, but we can wait for #1017 before merging this.

shoyer · 2016-11-09T17:40:09Z

xarray/core/dataset.py

@@ -102,6 +103,105 @@ def calculate_dimensions(variables):
    return dims


+def merge_indexes(indexes, variables, coord_names, append=False):


Can you try adding type annotations here, like the ones I started adding in core/merge.py? I'm not even running mypy yet but I think these could significantly improve readability, and are lighter weight than adding a full docstring.

shoyer · 2016-11-09T17:42:33Z

xarray/core/dataarray.py

+
+        Returns
+        -------
+        reindexed : DataArray


wrong name -- should not be reindexed

shoyer · 2016-11-09T17:43:31Z

I kind of like Option C, given that we have guaranteed level names and variables to have no conflicts.

Did you go for allowing set_index() to rename variables?

benbovy · 2016-11-15T22:54:18Z

This is ready for another round of review.

I've changed the signature of reset_index to option C. It is also almost ready for #1017 (just added two small TODOs).

Did you go for allowing set_index() to rename variables?

Not yet, but as you said we could safely add this later.

shoyer · 2016-12-16T03:33:40Z

Can we update this for optional indexes? (now on master)

benbovy · 2016-12-20T15:34:34Z

This should now behave correctly with optional indexes.

shoyer

Looks great, thanks for persevering on this! My only concern is about the best place to put this new info in the docs (see comments inline).

shoyer · 2016-12-22T02:23:06Z

doc/indexing.rst

@@ -478,6 +478,49 @@ Both ``reindex_like`` and ``align`` work interchangeably between
    # this is a no-op, because there are no shared dimension names
    ds.reindex_like(other)

+.. _multi-index handling:
+
+Multi-index handling


These docs are great, but I wouldn't call them "indexing methods" exactly. Maybe move this section to Reshaping and reorganizing data?

Makes sense!

shoyer reviewed Oct 3, 2016

shoyer mentioned this pull request Oct 3, 2016

Concatenate multiple variables into one variable with a multi-index (categories) #1030

Closed

shoyer mentioned this pull request Oct 14, 2016

to_dataset lossy #1047

Closed

benbovy mentioned this pull request Oct 20, 2016

WIP: Optional indexes (no more default coordinates given by range(n)) #1017

Merged

5 tasks

benbovy mentioned this pull request Oct 20, 2016

Follow-ups on MultIndex support #719

Closed

7 tasks

shoyer mentioned this pull request Nov 4, 2016

Transpose some but not all dimensions #1081

Closed

Benoit Bovy added 8 commits November 7, 2016 16:20

add Dataset.set_index method

aad4a09

add set_index and reset_index methods for dataarray and dataset

4fe06b3

add reorder_levels method for dataset and dataarray

08ccdc2

add tests

9f53c72

update doc

363b463

fix tests py27

4388142

review changes

f8797fa

fix unresolved rebase conflict

12c5966

benbovy force-pushed the multi-index_methods branch from 5ac5996 to 12c5966 on November 7, 2016 16:10

fix reset_index example in docs

2e3e525

shoyer reviewed Nov 9, 2016

Benoit Bovy added 4 commits November 15, 2016 15:52

Merge branch 'master' into multi-index_methods

1419c00

fix docstring

3dfa539

change signature of reset_index

60853fd

add type annotations

65ebc19

shoyer mentioned this pull request Dec 16, 2016

Things to complete before releasing xarray v0.9.0 #1167

Closed

4 tasks

Benoit Bovy added 3 commits December 20, 2016 13:51

Merge branch 'master' into multi-index_methods

bdefddf

update missing coordinate dims

83ca06b

fix and update docs

5ba2ffa

shoyer approved these changes Dec 22, 2016

updated doc

c58cb47

shoyer merged commit 7ad2544 into pydata:master Dec 27, 2016

shoyer mentioned this pull request Feb 1, 2017

set_index(keys, inplace=False) should be both a DataArray and Dataset method. #230

Closed

benbovy deleted the multi-index_methods branch August 30, 2023 09:28
