Decoding non-utf-8 encoded strings with the h5netcdf engine #5563

Closed
kiksekage opened this issue Jul 2, 2021 · 4 comments · Fixed by #8874

Comments

@kiksekage

What happened:
Loading a netCDF file-like object (io.BytesIO) whose attribute strings are not UTF-8 encoded with the h5netcdf engine raises a UnicodeDecodeError.

What you expected to happen:
Loading the same file from disk with the netcdf4 engine works fine. However, the netcdf4 engine doesn't support file-like objects, which is how I ran into this issue.

Traceback:
Traceback (most recent call last):
File "", line 1, in
File "/home/thns/.venv/cmems/lib/python3.8/site-packages/xarray/backends/api.py", line 242, in load_dataset
with open_dataset(filename_or_obj, **kwargs) as ds:
File "/home/thns/.venv/cmems/lib/python3.8/site-packages/xarray/backends/api.py", line 496, in open_dataset
backend_ds = backend.open_dataset(
File "/home/thns/.venv/cmems/lib/python3.8/site-packages/xarray/backends/h5netcdf_.py", line 384, in open_dataset
ds = store_entrypoint.open_dataset(
File "/home/thns/.venv/cmems/lib/python3.8/site-packages/xarray/backends/store.py", line 22, in open_dataset
vars, attrs = store.load()
File "/home/thns/.venv/cmems/lib/python3.8/site-packages/xarray/backends/common.py", line 126, in load
attributes = FrozenDict(self.get_attrs())
File "/home/thns/.venv/cmems/lib/python3.8/site-packages/xarray/backends/h5netcdf_.py", line 234, in get_attrs
return FrozenDict(read_attributes(self.ds))
File "/home/thns/.venv/cmems/lib/python3.8/site-packages/xarray/backends/h5netcdf_.py", line 75, in read_attributes
v = maybe_decode_bytes(v)
File "/home/thns/.venv/cmems/lib/python3.8/site-packages/xarray/backends/h5netcdf_.py", line 63, in maybe_decode_bytes
return txt.decode("utf-8")

Minimal Complete Verifiable Example:

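import xarray as xr
import netCDF4

# 0xC3 on its own is not valid UTF-8
title = b'\xc3'

# Write a file whose global "title" attribute holds the raw bytes
f = netCDF4.Dataset('test.nc', 'w')
f.title = title
f.close()

# Raises UnicodeDecodeError with the h5netcdf engine
xr.load_dataset("test.nc", engine="h5netcdf")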

Environment:

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.8.0 (default, Feb 25 2021, 22:10:10)
[GCC 8.4.0]
python-bits: 64
OS: Linux
OS-release: 4.15.0-136-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.0
libnetcdf: 4.7.4

xarray: 0.18.1
pandas: 1.2.4
numpy: 1.20.3
scipy: None
netCDF4: 1.5.6
pydap: None
h5netcdf: 0.11.0
h5py: 3.2.1
Nio: None
zarr: None
cftime: 1.4.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
pint: None
setuptools: 57.0.0
pip: 21.1.3
conda: None
pytest: 6.2.4
IPython: 7.25.0
sphinx: None

@kmuehlbauer
Contributor

Automatic decoding of bytes was implemented in #477 to properly decode returned bytes for CF decoding. For non-UTF-8 bytes this breaks, as shown.

HDF5 (and with it netCDF4) only has a notion of ASCII and UTF-8 as string encodings (see the h5py docs, https://docs.h5py.org/en/stable/strings.html#encodings), so the example above creates a non-standard file.

The question is what should be returned in the non-standard case if the attribute contains non-utf-8 encoded bytes? We could catch the UnicodeDecodeError and return something else (what?). But that would open the door for breakages with decoding CF metadata. I'm not sure if that can be properly resolved within xarray.

Why are those attributes in non-utf-8 encoding? Legacy data?
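
To make the "return something else (what?)" question concrete, here is a minimal sketch of what plain Python offers for the byte string from the example. It uses only the standard library; the variable names are illustrative, and none of this is what xarray currently does.

```python
raw = b"\xc3"  # the attribute value from the example above

# 0xC3 is a UTF-8 lead byte that must be followed by a continuation byte,
# so strict decoding fails.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Option 1: substitute U+FFFD for undecodable bytes (lossy).
replaced = raw.decode("utf-8", errors="replace")

# Option 2: carry the raw byte through as a lone surrogate (reversible).
escaped = raw.decode("utf-8", errors="surrogateescape")
roundtrip = escaped.encode("utf-8", errors="surrogateescape")

# Option 3: guess a legacy encoding; Latin-1 never fails but may be wrong.
guessed = raw.decode("latin-1")
```

Each option trades something away: "replace" loses the original bytes, "surrogateescape" produces strings that fail on a strict re-encode, and guessing an encoding can silently yield wrong text.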

@kmuehlbauer
Contributor

Revisiting this now. Is there any way forward here, or should we close this as "won't fix"?

@dcherian
Contributor

Can we raise a warning and leave them encoded?
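
A rough sketch of what "warn and leave them encoded" could look like. The function name and the `txt.decode("utf-8")` call are taken from the traceback above; the rest of the body, the warning category, and the message are assumptions for illustration, not the change that was eventually merged.

```python
import warnings


def maybe_decode_bytes(txt):
    """Decode bytes as UTF-8; on failure, warn and return the bytes unchanged."""
    if not isinstance(txt, bytes):
        return txt  # already a str (or some other attribute type)
    try:
        return txt.decode("utf-8")
    except UnicodeDecodeError:
        # Assumed behavior: keep the raw bytes so the file still opens.
        warnings.warn(
            f"could not decode attribute value {txt!r} as UTF-8; "
            "leaving it as bytes",
            UnicodeWarning,
            stacklevel=2,
        )
        return txt
```

With this approach the dataset loads, and callers who know the real encoding can decode the remaining bytes attributes themselves.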

@kmuehlbauer
Contributor

Should work, I can have a look.

3 participants