coordinates not removed for variable encoding during reset_coords #7245

Open
4 tasks done
hmaarrfk opened this issue Nov 2, 2022 · 5 comments

Comments

@hmaarrfk
Contributor

hmaarrfk commented Nov 2, 2022

What happened?

When calling reset_coords on a dataset that is loaded from disk, the coordinates are not removed from the encoding of the variable.

This means that at save time they are re-saved as coordinates, which is annoying (and erroneous).

What did you expect to happen?

No response

Minimal Complete Verifiable Example

import numpy as np
import xarray as xr

dataset = xr.Dataset(
    data_vars={'images': (('y', 'x'),  np.zeros((10, 2)))},
    coords={'zar': 1}
)

dataset.to_netcdf('foo.nc', mode='w')

# %%
foo_loaded = xr.open_dataset('foo.nc')

foo_loaded_reset = foo_loaded.reset_coords()

# %%
assert 'zar' in foo_loaded.coords
assert 'zar' not in foo_loaded_reset.coords
assert 'zar' in foo_loaded_reset.data_vars
foo_loaded_reset.to_netcdf('bar.nc', mode='w')

# %% Now load the dataset

bar_loaded = xr.open_dataset('bar.nc')
assert 'zar' not in bar_loaded.coords, 'zar is erroneously a coordinate'

# %%
# This is the problem
assert 'zar' not in foo_loaded_reset.images.encoding['coordinates'].split(' '), "zar should not be in here"

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

No response

Anything else we need to know?

Suggested fix in dataset.py, reset_coords:

        for _, variable in obj._variables.items():
            coords_in_encoding = set(variable.encoding.get('coordinates', ' ').split(' '))
            variable.encoding['coordinates'] = ' '.join(coords_in_encoding - set(names))
        if drop:

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.9.13 | packaged by Ramona Optics | (main, Aug 31 2022, 22:30:30) 
[GCC 10.4.0]
python-bits: 64
OS: Linux
OS-release: 5.15.0-50-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.8.1

xarray: 2022.10.0
pandas: 1.5.1
numpy: 1.23.4
scipy: 1.9.3
netCDF4: 1.6.1
pydap: None
h5netcdf: 1.0.2
h5py: 3.7.0
Nio: None
zarr: 2.13.3
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2022.10.0
distributed: 2022.10.0
matplotlib: 3.6.1
cartopy: None
seaborn: None
numbagg: None
fsspec: 2022.10.0
cupy: None
pint: 0.20.1
sparse: None
flox: None
numpy_groupies: None
setuptools: 65.5.0
pip: 22.3
conda: 22.9.0
pytest: 7.2.0
IPython: 7.33.0
sphinx: 5.3.0
/home/mark/mambaforge/envs/mcam_dev/lib/python3.9/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")

hmaarrfk added the "bug" and "needs triage" (issue that has not been reviewed by an xarray team member) labels on Nov 2, 2022
@hmaarrfk
Contributor Author

hmaarrfk commented Nov 2, 2022

And if you want to have a clean encoding dictionary, you may want to do the following:

        names = set(names)
        for _, variable in obj._variables.items():
            if 'coordinates' in variable.encoding:
                coords_in_encoding = set(variable.encoding.get('coordinates').split(' '))
                remaining_coords = coords_in_encoding - names
                if len(remaining_coords) == 0:
                    del variable.encoding['coordinates']
                else:
                    variable.encoding['coordinates'] = ' '.join(remaining_coords)
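
For readers who want to try the idea outside of xarray, here is a minimal standalone sketch of the same cleanup. The helper name `remove_from_coordinates_encoding` is hypothetical, and the function only assumes that each object exposes a dict-like `.encoding` (xarray `Variable` objects appear to fit that shape, but the sketch is exercised here on plain stand-ins). Unlike the set-based snippet above, it keeps the remaining names in their original order.

```python
from types import SimpleNamespace


def remove_from_coordinates_encoding(variables, names):
    """Remove `names` from each variable's encoding['coordinates'] string.

    `variables` is any iterable of objects with a dict-like `.encoding`
    attribute (assumption: xarray Variable objects match this shape).
    """
    names = set(names)
    for variable in variables:
        coords_str = variable.encoding.get("coordinates")
        if coords_str is None:
            continue  # nothing recorded, leave the encoding untouched
        # split() without arguments also tolerates repeated whitespace
        remaining = [c for c in coords_str.split() if c not in names]
        if remaining:
            variable.encoding["coordinates"] = " ".join(remaining)
        else:
            # no names left: drop the key rather than store an empty string
            del variable.encoding["coordinates"]


# Stand-ins for variables loaded from disk (hypothetical data).
images = SimpleNamespace(encoding={"coordinates": "zar lat lon", "dtype": "float64"})
mask = SimpleNamespace(encoding={"coordinates": "zar"})
plain = SimpleNamespace(encoding={})

remove_from_coordinates_encoding([images, mask, plain], ["zar"])
print(images.encoding)
print(mask.encoding)
print(plain.encoding)
```

Using a list instead of a set for the remaining names keeps the written string stable between runs, which may matter if files are compared byte for byte.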

@hmaarrfk
Contributor Author

hmaarrfk commented Nov 2, 2022

While the above "fix" addresses the issues with renaming coordinates, I think there are plenty of use cases where we would still end up with strange or unexpected results. For example:

  1. Load a dataset with many non-indexing coordinates.
  2. Drop variables (that happen to be coordinates).
  3. Add back a variable with the same name.
  4. Upon save, the encoding dictates that it is a coordinate of a particular variable, so it is promoted to a coordinate instead of being kept as data.
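
The four steps above can be sketched with a toy model. This is not xarray's implementation: `write_attrs` and `read_coord_names` are hypothetical stand-ins that assume only two things, namely that a stale encoding['coordinates'] string is written out as-is, and that on load any variable named in such a string becomes a coordinate.

```python
def write_attrs(encodings):
    """Toy 'save': each variable's encoding is written out unchanged."""
    return {name: dict(enc) for name, enc in encodings.items()}


def read_coord_names(on_disk):
    """Toy 'load': variables named in a 'coordinates' entry become coordinates."""
    coord_names = set()
    for attrs in on_disk.values():
        coord_names.update(attrs.get("coordinates", "").split())
    # only names that actually exist in the file can be promoted
    return coord_names & set(on_disk)


# 1. Loaded dataset: 'images' remembers 'zar' as a coordinate in its encoding.
encodings = {"images": {"coordinates": "zar"}, "zar": {}}

# 2. Drop the variable that happens to be a coordinate.
del encodings["zar"]

# 3. Add back a plain data variable with the same name.
encodings["zar"] = {}

# 4. Save and reload: the stale string on 'images' promotes 'zar' again.
promoted = read_coord_names(write_attrs(encodings))
print(promoted)
```

The point of the sketch is that nothing in steps 2 and 3 touches the encoding of `images`, so the stale name survives to step 4.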

We could apply the "fix" to the drop_vars method as well, but I think it may be hard (though not impossible) to hit all the cases.

I think a more "generic", albeit breaking, fix would be to remove "coordinates" entirely from encoding after the dataset has been loaded. That said, this only "works" if dataset['variable_name'].encoding['coordinates'] is considered a private variable, that is, users are not supposed to add to it at will.
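
As a rough illustration of what removing "coordinates" from encoding after loading could look like from user code, here is a minimal sketch. The helper `drop_coordinates_encoding` is hypothetical and only assumes that each object has a dict-like `.encoding`; it is exercised on plain stand-ins rather than on a real dataset.

```python
from types import SimpleNamespace


def drop_coordinates_encoding(variables):
    """Delete the 'coordinates' key from every variable's encoding, if present."""
    for variable in variables:
        # pop with a default does nothing when the key is absent
        variable.encoding.pop("coordinates", None)


# Stand-ins for freshly loaded variables (hypothetical data).
loaded_images = SimpleNamespace(encoding={"coordinates": "zar", "dtype": "float64"})
loaded_zar = SimpleNamespace(encoding={"dtype": "int64"})

drop_coordinates_encoding([loaded_images, loaded_zar])
print(loaded_images.encoding)
print(loaded_zar.encoding)
```

With xarray, one would presumably pass `ds.variables.values()` right after `xr.open_dataset(...)`; that usage is an assumption and has not been checked here against any particular xarray version.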

@hmaarrfk
Contributor Author

hmaarrfk commented Jan 2, 2023

Kind bump

@dcherian
Contributor

This is another motivating reason for #5082. It's too hard to keep attrs or encoding in sync given Xarray's data model.

Since encoding is frequently out-of-date, it just causes a lot of problems. In general, the advice is to manually set encoding if you care about how your dataset is written to disk.

dcherian removed the "needs triage" (issue that has not been reviewed by an xarray team member) and "bug" labels on Jan 15, 2023
@hmaarrfk
Copy link
Contributor Author

Thank you for your explanation.

Do you think it is safe to "strip" encoding after "loading" the data? Or is it still used after the initial call to open_dataset?
