xarray.open_dataset has issues if the dataset returned by the backend contains a multiindex #7139

Closed · 4 tasks done
lukasbindreiter opened this issue Oct 7, 2022 · 5 comments · Fixed by #7150

Comments

@lukasbindreiter (Contributor)

What happened?

As a follow-up to this comment (#6752 (comment)), I'm currently trying to implement a custom NetCDF4 backend that also handles multi-indices when loading a NetCDF dataset via xr.open_dataset.

I'm using the two functions from #1077 (comment) to convert the dataset to a NetCDF-compatible version and back again.

Here is a small code example:

Creating the dataset

import xarray as xr
import pandas

def create_multiindex(**kwargs):
    return pandas.MultiIndex.from_arrays(list(kwargs.values()), names=kwargs.keys())

dataset = xr.Dataset()
dataset.coords["observation"] = ["A", "B"]
dataset.coords["wavelength"] = [0.4, 0.5, 0.6, 0.7]
dataset.coords["stokes"] = ["I", "Q"]
dataset["measurement"] = create_multiindex(
    observation=["A", "A", "B", "B"],
    wavelength=[0.4, 0.5, 0.6, 0.7],
    stokes=["I", "Q", "I", "I"],
)

Saving as NetCDF

from cf_xarray import encode_multi_index_as_compress
patched = encode_multi_index_as_compress(dataset)
patched.to_netcdf("multiindex.nc")

And loading again

from cf_xarray import decode_compress_to_multi_index
loaded = xr.open_dataset("multiindex.nc")
loaded = decode_compress_to_multi_index(loaded)
assert loaded.equals(dataset)  # works

Custom Backend

While the manual patching for saving is currently still required, I tried to at least avoid the extra function call after open_dataset by creating a custom NetCDF backend:

# registered as the "netcdf4-multiindex" backend in setup.py
from xarray.backends import NetCDF4BackendEntrypoint
from cf_xarray import decode_compress_to_multi_index

class MultiindexNetCDF4BackendEntrypoint(NetCDF4BackendEntrypoint):
    def open_dataset(self, *args, handle_multiindex=True, **kwargs):
        ds = super().open_dataset(*args, **kwargs)

        if handle_multiindex:  # here is where the restore operation happens
            ds = decode_compress_to_multi_index(ds)

        return ds
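
For completeness, registering such a backend uses the "xarray.backends" entry-point group. A minimal sketch, assuming a hypothetical package my_package with the class defined in my_package/backends.py:

# setup.py (package name and module path are hypothetical)
from setuptools import setup

setup(
    name="my-package",
    entry_points={
        "xarray.backends": [
            # the name on the left is what gets passed as engine= to xr.open_dataset
            "netcdf4-multiindex = my_package.backends:MultiindexNetCDF4BackendEntrypoint",
        ],
    },
)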

The error

>>> loaded = xr.open_dataset("multiindex.nc", engine="netcdf4-multiindex", handle_multiindex=True)  # fails

File ~/.local/share/virtualenvs/test-oePfdNug/lib/python3.8/site-packages/xarray/core/variable.py:2795, in IndexVariable.data(self, data)
   2793 @Variable.data.setter  # type: ignore[attr-defined]
   2794 def data(self, data):
-> 2795     raise ValueError(
   2796         f"Cannot assign to the .data attribute of dimension coordinate a.k.a IndexVariable {self.name!r}. "
   2797         f"Please use DataArray.assign_coords, Dataset.assign_coords or Dataset.assign as appropriate."
   2798     )

ValueError: Cannot assign to the .data attribute of dimension coordinate a.k.a IndexVariable 'measurement'. Please use DataArray.assign_coords, Dataset.assign_coords or Dataset.assign as appropriate.

but this works:

>>> loaded = xr.open_dataset("multiindex.nc", engine="netcdf4-multiindex", handle_multiindex=False)
>>> loaded = decode_compress_to_multi_index(loaded)
>>> assert loaded.equals(dataset)

So I'm guessing xarray performs some operations on the dataset returned by the backend, and one of them fails if the dataset already contains a multi-index.

What did you expect to happen?

I expected it not to matter whether decode_compress_to_multi_index is called inside the backend or afterwards; the same dataset should be returned either way.

Minimal Complete Verifiable Example

See above.

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

No response

Anything else we need to know?

I'm also open to other suggestions for how I could simplify working with multi-indices; maybe there is an approach that doesn't require a custom backend at all? For example, the simplest fallback I can think of is a plain wrapper function, sketched below.
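
A minimal sketch of that wrapper idea (open_multiindex_dataset is just a made-up helper name):

import xarray as xr
from cf_xarray import decode_compress_to_multi_index

def open_multiindex_dataset(filename, **kwargs):
    # open with the stock netcdf4 engine, then restore the multi-index
    ds = xr.open_dataset(filename, **kwargs)
    return decode_compress_to_multi_index(ds)

loaded = open_multiindex_dataset("multiindex.nc")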

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.8.10 (default, Jan 28 2022, 09:41:12) [GCC 9.3.0]
python-bits: 64
OS: Linux
OS-release: 5.10.102.1-microsoft-standard-WSL2
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.10.5
libnetcdf: 4.6.3

xarray: 2022.9.0
pandas: 1.5.0
numpy: 1.23.3
scipy: 1.9.1
netCDF4: 1.5.4
pydap: None
h5netcdf: None
h5py: 3.7.0
Nio: None
zarr: None
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.3.2
cfgrib: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: 3.6.0
cartopy: 0.19.0.post1
seaborn: None
numbagg: None
fsspec: None
cupy: None
pint: None
sparse: 0.13.0
flox: None
numpy_groupies: None
setuptools: 65.3.0
pip: 22.2.2
conda: None
pytest: 7.1.3
IPython: 8.5.0
sphinx: 4.5.0

@lukasbindreiter added labels: bug, needs triage (Oct 7, 2022)
@benbovy (Member) commented Oct 7, 2022

Hi @lukasbindreiter, could you add the whole error traceback please?

@dcherian (Contributor) commented Oct 7, 2022

I can see this type of decoding breaking some assumption in the file reading process. A full traceback would help identify where.

I think the real solution is actually #4490, so you could explicitly provide a coder.
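
(For illustration only: #4490 is a proposal, not an existing API. The idea is that you could hand open_dataset a coder object pairing the cf_xarray helpers, roughly like this hypothetical sketch:)

from cf_xarray import encode_multi_index_as_compress, decode_compress_to_multi_index

class MultiIndexCoder:
    # hypothetical dataset-level coder; no such public interface exists yet
    def encode(self, ds):
        return encode_multi_index_as_compress(ds)

    def decode(self, ds):
        return decode_compress_to_multi_index(ds)

# hypothetical usage once #4490 lands:
# xr.open_dataset("multiindex.nc", decoders=[MultiIndexCoder()])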

@dcherian added the topic-backends label and removed needs triage (Oct 7, 2022)
@lukasbindreiter (Contributor, Author)

Here is the full stacktrace:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In [12], line 7
----> 7 loaded = xr.open_dataset("multiindex.nc", engine="netcdf4-multiindex", handle_multiindex=True)
      8 print(loaded)

File ~/.local/share/virtualenvs/test-oePfdNug/lib/python3.8/site-packages/xarray/backends/api.py:537, in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, inline_array, backend_kwargs, **kwargs)
    530 overwrite_encoded_chunks = kwargs.pop("overwrite_encoded_chunks", None)
    531 backend_ds = backend.open_dataset(
    532     filename_or_obj,
    533     drop_variables=drop_variables,
    534     **decoders,
    535     **kwargs,
    536 )
--> 537 ds = _dataset_from_backend_dataset(
    538     backend_ds,
    539     filename_or_obj,
    540     engine,
    541     chunks,
    542     cache,
    543     overwrite_encoded_chunks,
    544     inline_array,
    545     drop_variables=drop_variables,
    546     **decoders,
    547     **kwargs,
    548 )
    549 return ds

File ~/.local/share/virtualenvs/test-oePfdNug/lib/python3.8/site-packages/xarray/backends/api.py:345, in _dataset_from_backend_dataset(backend_ds, filename_or_obj, engine, chunks, cache, overwrite_encoded_chunks, inline_array, **extra_tokens)
    340 if not isinstance(chunks, (int, dict)) and chunks not in {None, "auto"}:
    341     raise ValueError(
    342         f"chunks must be an int, dict, 'auto', or None. Instead found {chunks}."
    343     )
--> 345 _protect_dataset_variables_inplace(backend_ds, cache)
    346 if chunks is None:
    347     ds = backend_ds

File ~/.local/share/virtualenvs/test-oePfdNug/lib/python3.8/site-packages/xarray/backends/api.py:239, in _protect_dataset_variables_inplace(dataset, cache)
    237 if cache:
    238     data = indexing.MemoryCachedArray(data)
--> 239 variable.data = data

File ~/.local/share/virtualenvs/test-oePfdNug/lib/python3.8/site-packages/xarray/core/variable.py:2795, in IndexVariable.data(self, data)
   2793 @Variable.data.setter  # type: ignore[attr-defined]
   2794 def data(self, data):
-> 2795     raise ValueError(
   2796         f"Cannot assign to the .data attribute of dimension coordinate a.k.a IndexVariable {self.name!r}. "
   2797         f"Please use DataArray.assign_coords, Dataset.assign_coords or Dataset.assign as appropriate."
   2798     )

ValueError: Cannot assign to the .data attribute of dimension coordinate a.k.a IndexVariable 'measurement'. Please use DataArray.assign_coords, Dataset.assign_coords or Dataset.assign as appropriate.

@benbovy (Member) commented Oct 10, 2022

Looks like the backend logic needs some updates to make it compatible with the new xarray data model with explicit indexes (indexed coordinates whose name != dimension, such as multi-index levels, are now possible), e.g., here:

def _protect_dataset_variables_inplace(dataset, cache):
    for name, variable in dataset.variables.items():
        if name not in variable.dims:
            # no need to protect IndexVariable objects
            data = indexing.CopyOnWriteArray(variable._data)
            if cache:
                data = indexing.MemoryCachedArray(data)
            variable.data = data

@lukasbindreiter (Contributor, Author)

Based on your suggestion above, I tried a single-line fix, which resolved my issue: #7150

However, I'm not sure whether this is the correct approach, since I'm not all that deeply familiar with the indexing model.
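
For illustration, the kind of guard I mean (a sketch only, not necessarily the exact change merged in #7150) would skip every IndexVariable rather than rely on the name-vs-dims check:

from xarray.core.variable import IndexVariable

def _protect_dataset_variables_inplace(dataset, cache):
    # (indexing here is xarray.core.indexing, as in backends/api.py)
    for name, variable in dataset.variables.items():
        # IndexVariable.data cannot be assigned, so guard on the variable
        # type; this also covers multi-index levels where name != dimension
        if not isinstance(variable, IndexVariable):
            data = indexing.CopyOnWriteArray(variable._data)
            if cache:
                data = indexing.MemoryCachedArray(data)
            variable.data = data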
