MultiIndex and data selection #767

Closed
benbovy opened this issue Feb 17, 2016 · 9 comments
benbovy commented Feb 17, 2016

[Edited for more clarity]

First of all, I find the MultiIndex very useful and I'm looking forward to seeing the TODOs in #719 implemented in the next releases, especially the first three in the list!

Apart from these issues, I think that some other aspects could be improved, notably data selection. Or maybe I haven't correctly understood how to deal with a multi-index when selecting data...

To illustrate this, I use some fake spectral data with two discontinuous bands of different length / resolution:

In [1]: import numpy as np, pandas as pd

In [2]: import xarray as xr

In [3]: band = np.array(['foo', 'foo', 'bar', 'bar', 'bar'])

In [4]: wavenumber = np.array([4050.2, 4050.3, 4100.1, 4100.3, 4100.5])

In [5]: spectrum = np.array([1.7e-4, 1.4e-4, 1.2e-4, 1.0e-4, 8.5e-5])

In [6]: s = pd.Series(spectrum, index=[band, wavenumber])

In [7]: s.index.names = ('band', 'wavenumber')

In [8]: da = xr.DataArray(s, dims='band_wavenumber')

In [9]: da
Out[9]:
<xarray.DataArray (band_wavenumber: 5)>
array([  1.70000000e-04,   1.40000000e-04,   1.20000000e-04,
         1.00000000e-04,   8.50000000e-05])
Coordinates:
  * band_wavenumber  (band_wavenumber) object ('foo', 4050.2) ...

I extract the band 'bar' using sel:

In [10]: da_bar = da.sel(band_wavenumber='bar')

In [11]: da_bar
Out[11]:
<xarray.DataArray (band_wavenumber: 3)>
array([  1.20000000e-04,   1.00000000e-04,   8.50000000e-05])
Coordinates:
  * band_wavenumber  (band_wavenumber) object ('bar', 4100.1) ...

It selects the data the way I want, although using the dimension name here is confusing. It would be nice if we could also use the MultiIndex level names as arguments to the sel method, though I don't know how easy that would be to implement.

Furthermore, da_bar still has the 'band_wavenumber' dimension and the 'band' index level, which are not very useful anymore. Ideally, I'd like to obtain a DataArray object with a 'wavenumber' dimension / coordinate and the 'bar' band name dropped from the multi-index, which would require automatic index-level removal and/or an automatic unstack when selecting data.

Extracting the band 'bar' from the pandas Series gives something closer to what I need (see below), but using pandas alone is not an option, as my spectral data involve other dimensions (e.g., time, scans, iterations...) not shown here for simplicity.

In [12]: s_bar = s.loc['bar']

In [13]: s_bar
Out[13]:
wavenumber
4100.1    0.000120
4100.3    0.000100
4100.5    0.000085
dtype: float64
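For readers of this thread today: this is essentially the level-based selection that later landed in xarray (see the fix note at the end, #802 and #947). With a recent xarray, passing a level name directly to sel drops that level and leaves a plain 'wavenumber' dimension. A minimal, self-contained sketch, rebuilding the fake spectra from above:

```python
import numpy as np
import pandas as pd
import xarray as xr

# rebuild the fake spectra from the top of the issue
band = np.array(['foo', 'foo', 'bar', 'bar', 'bar'])
wavenumber = np.array([4050.2, 4050.3, 4100.1, 4100.3, 4100.5])
spectrum = np.array([1.7e-4, 1.4e-4, 1.2e-4, 1.0e-4, 8.5e-5])
s = pd.Series(spectrum, index=[band, wavenumber])
s.index.names = ('band', 'wavenumber')
da = xr.DataArray(s, dims='band_wavenumber')

# selecting on the 'band' level drops that level and leaves a plain
# 'wavenumber' dimension, as requested in this thread
da_bar = da.sel(band='bar')
```

Note that 'band' survives as a scalar coordinate on the result, so the information about which band was selected is not lost.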

The problem is also that the unstacked DataArray resulting from the selection has the same dimensions and size as the unstacked version of the original DataArray; the only difference is that the unselected values are replaced by NaN.

In [13]: da.unstack('band_wavenumber')
Out[13]:
<xarray.DataArray (band: 2, wavenumber: 5)>
array([[             nan,              nan,   1.20000000e-04,
          1.00000000e-04,   8.50000000e-05],
       [  1.70000000e-04,   1.40000000e-04,              nan,
                     nan,              nan]])
Coordinates:
  * band        (band) object 'bar' 'foo'
  * wavenumber  (wavenumber) float64 4.05e+03 4.05e+03 4.1e+03 4.1e+03 4.1e+03

In [14]: da_bar.unstack('band_wavenumber')
Out[14]:
<xarray.DataArray (band: 2, wavenumber: 5)>
array([[             nan,              nan,   1.20000000e-04,
          1.00000000e-04,   8.50000000e-05],
       [             nan,              nan,              nan,
                     nan,              nan]])
Coordinates:
  * band        (band) object 'bar' 'foo'
  * wavenumber  (wavenumber) float64 4.05e+03 4.05e+03 4.1e+03 4.1e+03 4.1e+03
benbovy commented Feb 18, 2016

Mmm, now I'm wondering whether the problem I explained above isn't just related to the 3rd TODO item in #719 (make levels accessible as coordinate variables).

Sorry for the noise if that's the case.

shoyer commented Feb 18, 2016

This is a really good point that honestly I had not thought carefully about before. I agree that it would be very nice to have this behavior, though. This will require a bit of internal refactoring to pass on the level information to the MultiIndex during indexing.

To remove unused levels after unstacking, you need to add an explicit dropna, e.g., da.unstack('band_wavenumber').dropna('band', how='all'). This is definitely a break from pandas, but for a good reason (IMO). See here for discussion on this point.
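A self-contained sketch of this explicit-dropna pattern, rebuilding the spectra from the issue (the boolean-mask selection and the variable names here are mine, chosen just to keep the array stacked before unstacking):

```python
import numpy as np
import pandas as pd
import xarray as xr

# rebuild the fake spectra from the issue
band = np.array(['foo', 'foo', 'bar', 'bar', 'bar'])
wavenumber = np.array([4050.2, 4050.3, 4100.1, 4100.3, 4100.5])
spectrum = np.array([1.7e-4, 1.4e-4, 1.2e-4, 1.0e-4, 8.5e-5])
s = pd.Series(spectrum, index=[band, wavenumber])
s.index.names = ('band', 'wavenumber')
da = xr.DataArray(s, dims='band_wavenumber')

# select 'bar' with a boolean mask built from the level values,
# so the result stays stacked on 'band_wavenumber'
mask = da.indexes['band_wavenumber'].get_level_values('band') == 'bar'
da_bar = da[mask]

# unstacking enumerates (band, wavenumber) combinations, filling
# missing ones with NaN, so the unused labels have to be dropped
# with explicit dropna calls
trimmed = (da_bar.unstack('band_wavenumber')
                 .dropna('band', how='all')
                 .dropna('wavenumber', how='all'))
```

After the two dropna calls only the 'bar' row and its three wavenumbers remain.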

I raised another issue for the bug related to copying MultiIndex that you had in the earlier version of this PR (#769).

More broadly, if you care about MultiIndex support, it would be great to get some help pushing it. I'm happy to answer questions, but I'm at a new job and don't have a lot of time to work on new development.

benbovy commented Feb 18, 2016

Thanks for the tip. So I finally obtain the desired result when selecting the band 'bar' by doing this:

In [21]: (da.sel(band_wavenumber='bar')
    ...:  .unstack('band_wavenumber')
    ...:  .dropna('band', how='all')
    ...:  .dropna('wavenumber', how='any')
    ...:  .sel(band='bar'))
Out[21]:
<xarray.DataArray (wavenumber: 3)>
array([  1.20000000e-04,   1.00000000e-04,   8.50000000e-05])
Coordinates:
    band        <U3 'bar'
  * wavenumber  (wavenumber) float64 4.1e+03 4.1e+03 4.1e+03

But it's still a lot of code to write for such a common operation.

I'd be happy to think more deeply about this and contribute to the development of this great package (within the limits of my skills)!

benbovy commented Mar 2, 2016

Thinking about this issue, I'd like to know what you think of the suggestions below before considering any pull request.

The following line of code gives the same result as in my previous comment, but it is more explicit and shorter:

da.unstack('band_wavenumber').sel(band='bar').dropna('wavenumber', how='any')

A nice shortcut would be to add a new xs method to DataArray and Dataset, quite similar to the xs method of pandas but with an additional dim keyword argument:

da.xs('bar', dim='band_wavenumber', level='band', drop_level=True)

As in pandas, the default value of drop_level would be True. Here, though, drop_level would instead control whether dropna is applied to all (unstacked) index levels of dim except the specified level.

I think this solution is better than, e.g., directly providing index level names as arguments to the sel method. That may be confusing, and there may be conflicts when different dimensions have the same index level names.

Another, though less elegant, solution would be to provide dictionaries to the sel method:

da.sel(band_wavenumber={'band': 'bar'})

Besides this, it would be nice if the drop_level=True behavior could be applied by default to any selection (i.e., also when using loc, sel, etc.), as in pandas. I don't know how pandas does this (I'll look into it), but at first glance this would imply checking each dimension for a multi-index and then checking the labels for each index level.

benbovy commented Mar 2, 2016

OK, I've read the discussion you referred to more carefully, and now I understand why it is preferable to call dropna explicitly. My last suggestion above is not compatible with this.

The xs method (not sure about the name) may still provide a concise way to perform a selection with an explicit unstack and dropna. Maybe it is more appropriate to use a dropna keyword instead of drop_level:

da.xs('bar', dim='band_wavenumber', level='band', dropna=True)

shoyer commented Mar 2, 2016

The good news about writing our own custom way to select levels is that, because we can avoid the stack/unstack round-trip, we can simply omit unused levels without worrying about doing dropna after unstack. So as long as we implement this in our own method (e.g., sel or xs), we can default to drop_level=True.

I would be OK with xs, but da.xs('bar', dim='band_wavenumber', level='band') feels much more verbose to me than da.sel(band_wavenumber={'band': 'bar'}). The latter solution involves inventing no new API, and because dictionaries are not hashable there's no potential conflict with existing functionality.

Last year at the SciPy conference sprints, @jonathanrocher was working on adding similar dictionary support into .loc in pandas (i.e., da.loc[{'band': 'bar'}]). I don't think he ever finished up that PR, but he might have a branch worth looking at as a starting point.

I think this solution is better than, e.g., directly providing index level names as arguments to the sel method. That may be confusing, and there may be conflicts when different dimensions have the same index level names.

This is a fair point, but such scenarios are unlikely to appear in practice. We might be able to, for example, update our handling of MultiIndexes to guarantee that level names cannot conflict with other variables. This might be done by inserting dummy-variables of some sort into the _coords dict whenever a MultiIndex is added. It would take some work to ensure this works smoothly, though.

Besides this, it would be nice if the drop_level=True behavior could be applied by default to any selection (i.e., also when using loc, sel, etc.), as in pandas. I don't know how pandas does this (I'll look into it), but at first glance this would imply checking each dimension for a multi-index and then checking the labels for each index level.

Yes, agreed. Unfortunately, the pandas code that handles this is a complete mess of spaghetti code (see pandas/core/indexing.py). You are welcome to try decoding it, but in my opinion you might be better off starting from scratch. In xarray, the function convert_label_indexer would need an updated interface that allows it to possibly return a new pandas.Index object to replace the existing index.

benbovy commented Mar 3, 2016

From this point of view I agree that da.sel(band_wavenumber={'band': 'bar'}) is a nicer solution!
I'll follow your suggestion of returning a new pandas.Index object from convert_label_indexer.

Unless I'm missing a better solution, we can use pandas.MultiIndex.get_loc_level to get both the indexer and the new pandas.Index object. However, there may still be some advanced cases where it won't behave as expected. For example, selecting both the band 'bar' and a range of wavenumber values (that doesn't exactly match the range of that band)

da.sel(band_wavenumber={'band': 'bar', 'wavenumber': slice(4000, 4100.3)})

will a priori return a stacked DataArray with the full multi-index:

In [32]: idx = da.band_wavenumber.to_index()
In [33]: idx.get_loc_level(('bar', slice(4000, 4100.3)), level=('band', 'wavenumber'))
Out[33]:
(array([False, False,  True,  True, False], dtype=bool),
 MultiIndex(levels=[['bar', 'foo'], [4050.2, 4050.3, 4100.1, 4100.3, 4100.5]],
            labels=[[0, 0], [2, 3]],
            names=['band', 'wavenumber']))

shoyer commented Mar 3, 2016

If you try doing that indexing with a pandas.Series, you actually get an error message:

In [71]: s.loc['bar', slice(4000, 4100.3)]
# ...
/Users/shoyer/conda/envs/xarray-dev/lib/python3.5/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
   1363             # nested tuple slicing
   1364             if is_nested_tuple(key, labels):
-> 1365                 locs = labels.get_locs(key)
   1366                 indexer = [ slice(None) ] * self.ndim
   1367                 indexer[axis] = locs

/Users/shoyer/conda/envs/xarray-dev/lib/python3.5/site-packages/pandas/core/index.py in get_locs(self, tup)
   5692         if not self.is_lexsorted_for_tuple(tup):
   5693             raise KeyError('MultiIndex Slicing requires the index to be fully lexsorted'
-> 5694                            ' tuple len ({0}), lexsort depth ({1})'.format(len(tup), self.lexsort_depth))
   5695
   5696         # indexer
KeyError: 'MultiIndex Slicing requires the index to be fully lexsorted tuple len (2), lexsort depth (0)'

I guess it's also worth investigating get_locs as an alternative or companion to get_loc_level.
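A quick, self-contained look at get_locs on the index from this example (the variable names are mine). As the traceback above shows, get_locs requires a fully lexsorted index, hence the sortlevel call first:

```python
import pandas as pd

# the MultiIndex from the example, in its original (non-lexsorted) order
idx = pd.MultiIndex.from_arrays(
    [['foo', 'foo', 'bar', 'bar', 'bar'],
     [4050.2, 4050.3, 4100.1, 4100.3, 4100.5]],
    names=['band', 'wavenumber'])

# get_locs needs a fully lexsorted index, so sort it first;
# sortlevel returns (sorted_index, indexer)
sorted_idx, _ = idx.sortlevel()

# integer locations of band 'bar' within the wavenumber range;
# in the sorted index, 'bar' rows come first
locs = sorted_idx.get_locs(['bar', slice(4000, 4100.3)])
```

Unlike get_loc_level, get_locs returns integer positions only and does not hand back a reduced index, so the two methods may indeed be complementary here.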

benbovy commented Sep 14, 2016

Fixed in #802 and #947.
