-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add set_index
, reset_index
and reorder_levels
methods
#1028
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice -- thanks for working on this!
-------- | ||
DataArray.reset_index | ||
""" | ||
indexers = utils.combine_pos_and_kw_args(indexers, kw_indexers, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know I wrote this utility (and it is currently used by Dataset.reindex
), but I regret it now! I feel like it's usually better to only have a single call signature, accepting either a dictionary or **kwargs
, but not both.
This goes for the other new methods, too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be for using **kwargs
unless we choose the alternative signatures suggested in the comments above and below.
append : bool, optional | ||
If True, append the supplied indexers to the existing indexes. | ||
Otherwise replace the existing indexes (default). | ||
inplace : bool, optional |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure we need an inplace=True
option. I guess it doesn't hurt. (Just more to test)
|
||
Parameters | ||
---------- | ||
indexers : dict, optional |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of array.set_index({'record': ['level_0', 'level_1']})
or array.set_index(record=['level_0', 'level_1'])
, it might suffice to simply use array.set_index(['level_0', 'level_1'])
, without specifying which dimension these get added to. We can infer the dimension names from the variables.
The upside is that accepting a list instead of a dict is a little more succinct. The downside is that it's less explicit/self-documenting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another downside of using a single argument is that it is less consistent with other parts of the xarray's API which use keyword arguments (e.g., reindex
, sel
, etc.). IMHO the little additional verbosity is worth the gain in explicit/self-documenting here.
That said, a single dim_levels
or levels
argument may be more consistent with the signature you suggest below for reset_index: reset_index(dim, levels=None,...)
. So finally I don't really know what is better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After discussion in #1030 , I'm wondering if it is common to have to rename a dimension after setting a new (multi-)index, such that either
set_index(['level_0', 'level_1'], name='new_dim_name')
or
set_index(new_dim_name=['level_0', 'level_1'])
would be a useful shortcut to set_index(...).rename(...)
.
else: | ||
return self._replace(coords=coords) | ||
|
||
def reset_index(self, dim_levels=None, drop=False, inplace=False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe switch the signature to use separate arguments like reset_index(dim, levels=None, ...)
instead of using a dict/**kwargs
. This would make the usual use a clearer, e.g., array.reset_index('record')
instead of array.reset_index(record=None)
.
Also, after #1017 (optional indexes), the ability to write array.reset_index('x', drop=True)
for clearing an index could be nice to have.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense!
if isinstance(current_index, pd.MultiIndex): | ||
names.extend(current_index.names) | ||
for i in range(current_index.nlevels): | ||
arrays.append(current_index.get_level_values(i)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be premature optimization to worry about this, but something to watch out for is that this will refactorize the existing MultiIndex. It might be better to simply reuse existing levels
and labels
, if possible, and directly pass those into the MultiIndex
constructor.
Note that to create levels
and labels
directly form an array (if necessary) you can use Categorical
as used in MultiIndex.from_arrays
, e.g.,
cat = pd.Categorical(array, ordered=True)
levels = cat.categories
labels = cat.codes
@@ -816,6 +816,118 @@ def swap_dims(self, dims_dict): | |||
ds = self._to_temp_dataset().swap_dims(dims_dict) | |||
return self._from_temp_dataset(ds) | |||
|
|||
def set_index(self, indexers=None, append=False, inplace=False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A note on terminology: in xarray/pandas, indexer
usually refers the argument/key used for indexing, whereas index
refers to the set of existing labels, e.g., df.index
vs df.loc[indexer]
. So in this case I think the argument name indexes
would make more sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I've naively copied the signature of reindex
but now I get it.
Some API design questions (mostly from @shoyer's review) we need to fix:
For point 1, my preference goes to |
For It's not at all obvious to me what
My inclination is yes -- this feels like a common thing to do. But we could also safely add this later.
We definitely need a way to reset indexes to the default, but after #1017, I'm not sure we will need a way to set them to Unfortunately, if One option is to add some sort of prefix or suffix to the index name when it becomes a new variable, e.g., |
Sorry for the delay @shoyer. I've read your comments above and they all seem relevant. I'll find some time next week to get back on this. |
5ac5996
to
12c5966
Compare
Just committed review changes.
Currently |
@@ -102,6 +103,105 @@ def calculate_dimensions(variables): | |||
return dims | |||
|
|||
|
|||
def merge_indexes(indexes, variables, coord_names, append=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you try adding type annotations here, like the ones I started adding in core/merge.py
? I'm not even running mypy yet but I think these could significantly improve readability, and are lighter weight than adding a full docstring.
|
||
Returns | ||
------- | ||
reindexed : DataArray |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wrong name -- should not be reindexed
I kind of like Option C, given that we have guaranteed level names and variables to have no conflicts. Did you go for allowing |
This is ready for another round of review. I've changed the signature of
Not yet, but as you said we could safely add this later. |
Can we update this for optional indexes? (now on master) |
This should now behave correctly with optional indexes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, thanks for persevering on this! My only concern is about the best place to put this new info in the docs (see comments inline).
@@ -478,6 +478,49 @@ Both ``reindex_like`` and ``align`` work interchangeably between | |||
# this is a no-op, because there are no shared dimension names | |||
ds.reindex_like(other) | |||
|
|||
.. _multi-index handling: | |||
|
|||
Multi-index handling |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These docs are great, but I wouldn't call them "indexing methods" exactly. Maybe move this section to Reshaping and reorganizing data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense!
Another item in #719.
I added tests and updated the docs, so this is ready for review.