Concatenate multiple variables into one variable with a multi-index (categories) #1030

Closed
benbovy opened this issue Oct 3, 2016 · 3 comments

benbovy commented Oct 3, 2016

I often have to deal with datasets in this form (multiple variables of different sizes, each representing a different category, all defined on the same physical dimension but under different dimension names because their labels differ):

<xarray.Dataset>
Dimensions:     (wn_band1: 4, wn_band2: 6, wn_band3: 8)
Coordinates:
  * wn_band1    (wn_band1) float64 200.0 266.7 333.3 400.0
  * wn_band2    (wn_band2) float64 500.0 560.0 620.0 680.0 740.0 800.0
  * wn_band3    (wn_band3) float64 1.5e+03 1.643e+03 1.786e+03 1.929e+03 ...
Data variables:
    data_band3  (wn_band3) float64 0.7515 0.5302 0.6697 0.9621 0.01815 ...
    data_band1  (wn_band1) float64 0.3801 0.6649 0.01884 0.9407
    data_band2  (wn_band2) float64 0.8813 0.4481 0.2353 0.9681 0.1085 0.0835

where it would be more convenient to have the data rearranged into the following form (the variables concatenated into a single variable whose multi-index combines the category labels and the physical coordinate):

<xarray.Dataset>
Dimensions:   (spectrum: 18)
Coordinates:
  * spectrum  (spectrum) MultiIndex
  - band      (spectrum) int64 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 3 3
  - wn        (spectrum) float64 200.0 266.7 333.3 400.0 500.0 560.0 620.0 ...
Data variables:
    data      (spectrum) float64 0.3801 0.6649 0.01884 0.9407 0.8813 0.4481 ...

The latter would allow using xarray's nice features like ds.groupby('band').mean().

Currently, the best way that I've found to transform the data is something like:

import numpy as np
import pandas as pd
import xarray as xr

# Concatenate the values, the wavenumbers and one band label per sample.
data = np.concatenate([ds.data_band1, ds.data_band2, ds.data_band3])
wn = np.concatenate([ds.wn_band1, ds.wn_band2, ds.wn_band3])
band = np.concatenate([np.repeat(1, 4), np.repeat(2, 6), np.repeat(3, 8)])

midx = pd.MultiIndex.from_arrays([band, wn], names=('band', 'wn'))
ds2 = xr.Dataset({'data': ('spectrum', data)}, coords={'spectrum': midx})
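
The band sizes (4, 6, 8) are hard-coded in the snippet above. A minimal sketch of the same idea that derives them from the array lengths could look like the following. The helper name `stack_bands` and its dict-based input are hypothetical, chosen only for illustration, and the sample values are made up.

```python
import numpy as np
import pandas as pd


def stack_bands(bands):
    """Concatenate per-band arrays and build a (band, wn) MultiIndex.

    `bands` maps a band label to a (wn, data) pair of equal-length 1-D arrays.
    This helper is hypothetical; it is not part of xarray or pandas.
    """
    labels = list(bands)
    wn = np.concatenate([np.asarray(bands[k][0]) for k in labels])
    data = np.concatenate([np.asarray(bands[k][1]) for k in labels])
    # Repeat each label once per sample, so no size is hard-coded.
    band = np.repeat(labels, [len(bands[k][0]) for k in labels])
    midx = pd.MultiIndex.from_arrays([band, wn], names=("band", "wn"))
    return midx, data


# Made-up sample input: two bands with 4 and 6 samples.
midx, data = stack_bands({
    1: (np.linspace(200, 400, 4), np.arange(4.0)),
    2: (np.linspace(500, 800, 6), np.arange(6.0)),
})
```

The returned `midx` and `data` can then be wrapped in an `xr.Dataset` in the same way as in the snippet above.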

Maybe I'm missing a better way to do this? If not, it would be nice to have a convenience method for it, unless this use case is too rare to be worth it. I'm also not sure what a good API for such a method would be.

shoyer commented Oct 3, 2016

One option that gets you part way there:

arrays = [ds['data_band%d' % i].rename({'wn_band%d' % i: 'wn'}).assign_coords(band=i)
          for i in range(1, 4)]
combined = xr.concat(arrays, dim='wn')

This would still need some work (e.g., with set_index #1028) to set the MultiIndex. Ideally, maybe you could write something like combined.set_index(spectrum=['band', 'wn']) to create the new dimension and MultiIndex all at once.
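
As a rough, self-contained sketch of this partial approach: the dataset below is a random stand-in with the same layout as the one in the issue, and the band label is attached along `wn` rather than as a scalar, which is a choice made here so the result does not depend on how `xr.concat` handles differing scalar coordinates. Even without a MultiIndex, the concatenated array can already be grouped by `band`.

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in for the dataset in the issue (random values).
sizes = {1: 4, 2: 6, 3: 8}
bounds = {1: (200.0, 400.0), 2: (500.0, 800.0), 3: (1500.0, 2500.0)}
rng = np.random.default_rng(0)
ds = xr.Dataset(
    {f"data_band{i}": (f"wn_band{i}", rng.random(n)) for i, n in sizes.items()},
    coords={f"wn_band{i}": np.linspace(*bounds[i], n) for i, n in sizes.items()},
)

arrays = []
for i in sizes:
    # Give every band the same dimension name so they can be concatenated.
    da = ds[f"data_band{i}"].rename({f"wn_band{i}": "wn"})
    # Attach one band label per sample along "wn" (assumption: done
    # explicitly here instead of relying on a scalar coordinate).
    da = da.assign_coords(band=("wn", np.full(da.sizes["wn"], i)))
    arrays.append(da.rename("data"))

combined = xr.concat(arrays, dim="wn")

# Grouping works on the plain "band" coordinate, no MultiIndex required.
band_means = combined.groupby("band").mean()
```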

It does seem like something like the keys argument to pandas.concat would make sense here:
http://pandas.pydata.org/pandas-docs/stable/merging.html#more-concatenating-with-group-keys
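
For reference, a small sketch of how group keys behave in pandas itself. The series names and values below are made up for illustration; only `pd.concat` with its `keys` and `names` arguments comes from pandas.

```python
import pandas as pd

# Hypothetical per-band series, indexed by wavenumber (values are made up).
band1 = pd.Series([0.1, 0.2], index=[200.0, 400.0])
band2 = pd.Series([0.3, 0.4, 0.5], index=[500.0, 650.0, 800.0])

# `keys` adds an outer index level identifying the source of each row;
# `names` labels the levels of the resulting MultiIndex.
stacked = pd.concat([band1, band2], keys=[1, 2], names=["band", "wn"])

# Per-band statistics then follow from grouping on the outer level.
band_means = stacked.groupby(level="band").mean()
```

An xarray equivalent would additionally need a name for the new dimension, which is the API question raised in this thread.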

The API is not so obvious for us, though, because we need to supply the new dimension name and levels all at once. Maybe something like xr.concat(arrays, dim={'spectrum': ['band', 'wn']}) would work.

benbovy commented Oct 4, 2016

Thanks for the tip @shoyer !

Using something like combined.set_index(spectrum=['band', 'wn']) or xr.concat(arrays, dim={'spectrum': ['band', 'wn']}) would be nice, although it may be a bit weird to use the key spectrum to rename the wn dimension here.

For now, I'm fine with setting the MultiIndex using the more explicit, though more verbose, combined.set_index(wn=['band', 'wn']).rename({'wn': 'spectrum'})

stale bot commented Jan 26, 2019

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity.
If this issue remains relevant, please comment here; otherwise it will be marked as closed automatically.

stale bot added the stale label Jan 26, 2019
stale bot closed this as completed Feb 25, 2019