Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Saving and loading an array of strings changes datatype to object #7652

Closed
basnijholt opened this issue Mar 20, 2023 · 13 comments · Fixed by #7869
Closed

Saving and loading an array of strings changes datatype to object #7652

basnijholt opened this issue Mar 20, 2023 · 13 comments · Fixed by #7869

Comments

@basnijholt
Copy link

basnijholt commented Mar 20, 2023

What is your issue?

See the code below

import xarray as xr
# Make an xarray with an array of strings
da = xr.DataArray(
    data=[["a", "b"], ["c", "d"]],
    dims=["x", "y"],
    coords={"x": [0, 1], "y": [0, 1]},
)
da.to_netcdf("test.nc", mode='w')
# Load the xarray back in
da_loaded = xr.load_dataarray("test.nc")
assert da.dtype == da_loaded.dtype, "Dtypes don't match"

Now da_loaded.dtype is dtype('O').

Same happens with engine="h5netcdf".

import xarray
xarray.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.11.0 | packaged by conda-forge | (main, Jan 14 2023, 12:26:40) [Clang 14.0.6 ]
python-bits: 64
OS: Darwin
OS-release: 22.3.0
machine: arm64
processor: arm
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: None

xarray: 2023.2.0
pandas: 1.5.3
numpy: 1.24.2
scipy: 1.10.0
netCDF4: None
pydap: None
h5netcdf: 1.1.0
h5py: 3.8.0
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: 3.7.0
cartopy: None
seaborn: 0.12.2
numbagg: None
fsspec: 2023.1.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 67.3.2
pip: 23.0
conda: None
pytest: 7.2.1
mypy: 1.0.1
IPython: 8.10.0
sphinx: 5.3.0
@basnijholt basnijholt added the needs triage Issue that has not been reviewed by xarray team member label Mar 20, 2023
@jbweston
Copy link

jbweston commented Mar 20, 2023

It seems that the "string" information is stored in the encoding of the loaded dataset. set in this block:

vlen_dtype = h5py.check_dtype(vlen=var.dtype)
if vlen_dtype is str:
encoding["dtype"] = str

but this encoding is not "applied" to the dtype of the dataset's variable.

@basnijholt
Copy link
Author

basnijholt commented Mar 20, 2023

A similar problem where saving, loading, saving, loading changes the dtype bool -> int8:

import xarray as xr

da1 = xr.DataArray(data=[True], dims=["x"], coords={"x": [0]})
da1.to_netcdf("test.nc", mode="w")

da2 = xr.load_dataarray("test.nc")
da2.to_netcdf("test.nc", mode="w")

da3 = xr.load_dataarray("test.nc")
assert da1.dtype == da3.dtype, "Dtypes don't match"

@basnijholt
Copy link
Author

Another fun one where int64 -> int32:

import xarray as xr

da = xr.DataArray(data=[1], dims=["x"], coords={"x": [0]})
da.to_netcdf("test.nc", mode="w")
da2 = xr.load_dataarray("test.nc")
da.dtype, da2.dtype

@dcherian dcherian added topic-backends and removed needs triage Issue that has not been reviewed by xarray team member labels Mar 20, 2023
@kmuehlbauer
Copy link
Contributor

@basnijholt For the string issue this is somehwat kind of netcdf/numpy based issue with VLEN types.

XRef: https://unidata.github.io/netcdf4-python/#dealing-with-strings

The most flexible way to store arrays of strings is with the Variable-length (vlen) string data type. However, this requires the use of the NETCDF4 data model, and the vlen type does not map very well numpy arrays (you have to use numpy arrays of dtype=object, which are arrays of arbitrary python objects).

And numpy will create a VLEN string array if no dtype is given, like in your case.

At least netCDF4 and h5netcdf backends are consistent in their writing (creating similar hdf5-files) and reading back (object-dtype):

plain netCDF4
import netCDF4 as nc
import numpy as np
data = np.array([["a", "b"], ["c", "d"]], dtype="<U1")
print(f"source dtype: {data.dtype.str}\n", )
auto = False
with nc.Dataset("test-plain-netcdf4.nc", mode="w") as ds:
    print("Write NC-File")
    ds.set_auto_maskandscale(auto)
    ds.set_auto_chartostring(auto)
    ds.createDimension("x", size=2)
    ds.createDimension("y", size=2)
    var = ds.createVariable("da", data.dtype.str,  dimensions=("x", "y"))
    var[:] = data
    print("Variable\n")
    print(var)
    print(var.dtype)
    print("\nContents\n")
    print(var[:])
    print(var[:].dtype)
with nc.Dataset("test-plain-netcdf4.nc") as ds:
    print("\nRead NC-File")
    ds.set_auto_maskandscale(auto)
    ds.set_auto_chartostring(auto)
    da = ds["da"]
    print("Variable\n")
    print(da)
    print(da.dtype)
    da = ds["da"][:]
    print("\nContents\n")
    print(da)
    print(da.dtype)
source dtype: <U1

Write NC-File
Variable

<class 'netCDF4._netCDF4.Variable'>
vlen da(x, y)
vlen data type: <class 'str'>
unlimited dimensions: 
current shape = (2, 2)
<class 'str'>

Contents

[['a' 'b']
 ['c' 'd']]
object

Read NC-File
Variable

<class 'netCDF4._netCDF4.Variable'>
vlen da(x, y)
vlen data type: <class 'str'>
unlimited dimensions: 
current shape = (2, 2)
<class 'str'>

Contents

[['a' 'b']
 ['c' 'd']]
object
netcdf test-plain-netcdf4 {
dimensions:
	x = 2 ;
	y = 2 ;
variables:
	string da(x, y) ;
data:

 da =
  "a", "b",
  "c", "d" ;
}
HDF5 "test-plain-netcdf4.nc" {
DATASET "da" {
   DATATYPE  H5T_STRING {
      STRSIZE H5T_VARIABLE;
      STRPAD H5T_STR_NULLTERM;
      CSET H5T_CSET_UTF8;
      CTYPE H5T_C_S1;
   }
   DATASPACE  SIMPLE { ( 2, 2 ) / ( 2, 2 ) }
   DATA {
   (0,0): "a", "b",
   (1,0): "c", "d"
   }
   ATTRIBUTE "DIMENSION_LIST" {
      DATATYPE  H5T_VLEN { H5T_REFERENCE { H5T_STD_REF_OBJECT }}
      DATASPACE  SIMPLE { ( 2 ) / ( 2 ) }
      DATA {
      (0): (), ()
      }
   }
   ATTRIBUTE "_Netcdf4Coordinates" {
      DATATYPE  H5T_STD_I32LE
      DATASPACE  SIMPLE { ( 2 ) / ( 2 ) }
      DATA {
      (0): 0, 1
      }
   }
}
}
plain h5netcdf
import h5netcdf.legacyapi as h5nc
import h5py
data = np.array([["a", "b"], ["c", "d"]], dtype="<U1")
print(f"source dtype: {data.dtype.str}\n", )
with h5nc.Dataset("test-plain-h5netcdf.nc", mode="w") as ds:
    print("Write NC-File")
    ds.createDimension("x", 2)
    ds.createDimension("y", 2)
    dtype = h5py.string_dtype()
    print("Source dtype:", dtype)
    var = ds.createVariable("da", dtype,  dimensions=("x", "y"))
    var[:] = data
    print("Variable\n")
    print(var)
    print(var.dtype)
    print("\nContents\n")
    print(var[:])
    print(var[:].dtype)
with h5nc.Dataset("test-plain-h5netcdf.nc") as ds:
    print("\nRead NC-File")
    da = ds["da"]
    print("Variable\n")
    print(da)
    print(da.dtype)
    da = ds["da"][:]
    print("\nContents\n")
    print(da)
    print(da.dtype)
source dtype: <U1

Write NC-File
Source dtype: object
Variable

<h5netcdf.legacyapi.Variable '/da': dimensions ('x', 'y'), shape (2, 2), dtype <class 'str'>>
Attributes:
<class 'str'>

Contents

[['a' 'b']
 ['c' 'd']]
object

Read NC-File
Variable

<h5netcdf.legacyapi.Variable '/da': dimensions ('x', 'y'), shape (2, 2), dtype <class 'str'>>
Attributes:
<class 'str'>

Contents

[['a' 'b']
 ['c' 'd']]
object
netcdf test-plain-h5netcdf {
dimensions:
	x = 2 ;
	y = 2 ;
variables:
	string da(x, y) ;
data:

 da =
  "a", "b",
  "c", "d" ;
}
HDF5 "test-plain-h5netcdf.nc" {
DATASET "da" {
   DATATYPE  H5T_STRING {
      STRSIZE H5T_VARIABLE;
      STRPAD H5T_STR_NULLTERM;
      CSET H5T_CSET_UTF8;
      CTYPE H5T_C_S1;
   }
   DATASPACE  SIMPLE { ( 2, 2 ) / ( 2, 2 ) }
   DATA {
   (0,0): "a", "b",
   (1,0): "c", "d"
   }
   ATTRIBUTE "DIMENSION_LIST" {
      DATATYPE  H5T_VLEN { H5T_REFERENCE { H5T_STD_REF_OBJECT }}
      DATASPACE  SIMPLE { ( 2 ) / ( 2 ) }
      DATA {
      (0): (), ()
      }
   }
   ATTRIBUTE "_Netcdf4Coordinates" {
      DATATYPE  H5T_STD_I32LE
      DATASPACE  SIMPLE { ( 2 ) / ( 2 ) }
      DATA {
      (0): 0, 1
      }
   }
   ATTRIBUTE "_Netcdf4Dimid" {
      DATATYPE  H5T_STD_I32LE
      DATASPACE  SCALAR
      DATA {
      (0): 0
      }
   }
}
}

Both get written out as:

DATATYPE  H5T_STRING {
      STRSIZE H5T_VARIABLE;
      STRPAD H5T_STR_NULLTERM;
      CSET H5T_CSET_UTF8;
      CTYPE H5T_C_S1;
   }

If you use fixed length strings (eg. |S1) the dtype is preserved during roundtrip:

import xarray as xr
# Make an xarray with an array of fixed-length strings
data = np.array([["a", "b"], ["c", "d"]], dtype="|S1")
da = xr.DataArray(
    data=data,
    dims=["x", "y"],
    coords={"x": [0, 1], "y": [0, 1]},
)
da.to_netcdf("test.nc", mode='w')
# Load the xarray back in
da_loaded = xr.load_dataarray("test.nc")
assert da.dtype == da_loaded.dtype, "Dtypes don't match"
Versions
INSTALLED VERSIONS
------------------
commit: None
python: 3.11.0 | packaged by conda-forge | (main, Jan 14 2023, 12:27:40) [GCC 11.3.0]
python-bits: 64
OS: Linux
OS-release: 5.14.21-150400.24.46-default
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: de_DE.UTF-8
LOCALE: ('de_DE', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.9.1

xarray: 2023.2.0
pandas: 1.5.3
numpy: 1.24.2
scipy: 1.10.1
netCDF4: 1.6.3
pydap: None
h5netcdf: 1.1.0
h5py: 3.8.0
Nio: None
zarr: None
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.3.6
cfgrib: None
iris: None
bottleneck: None
dask: 2023.3.1
distributed: 2023.3.1
matplotlib: 3.7.1
cartopy: None
seaborn: None
numbagg: None
fsspec: 2023.3.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 67.6.0
pip: 23.0.1
conda: None
pytest: None
mypy: None
IPython: 8.11.0
sphinx: None

@kmuehlbauer
Copy link
Contributor

kmuehlbauer commented Mar 21, 2023

A similar problem where saving, loading, saving, loading changes the dtype bool -> int8:

That's an issue with netcdf file format, too, it has no bool-dtype.
XRef: #1500

data = np.array([True], dtype=bool)
with nc.Dataset("test-bool-netcdf4.nc", mode="w") as ds:
    ds.createDimension("x", size=1)
    var = ds.createVariable("da", data.dtype.str,  dimensions=("x"))
    var[:] = data
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[42], line 4
      2 with nc.Dataset("test-bool-netcdf4.nc", mode="w") as ds:
      3     ds.createDimension("x", size=1)
----> 4     var = ds.createVariable("da", data.dtype.str,  dimensions=("x"))
      5     var[:] = data

File src/netCDF4/_netCDF4.pyx:2945, in netCDF4._netCDF4.Dataset.createVariable()

File src/netCDF4/_netCDF4.pyx:4121, in netCDF4._netCDF4.Variable.__init__()

TypeError: illegal primitive data type, must be one of dict_keys(['S1', 'i1', 'u1', 'i2', 'u2', 'i4', 'u4', 'i8', 'u8', 'f4', 'f8']), got bool

Update:
Xarray is forwarding the information to the file, by adding a dtype-attribute. It looks like this information is not correctly distributed back to .encoding in the case of saving/loading/saving/loading. I'd consider that one a bug.

Reason:
While decoding the .encoding-dtype is set as original_dtype (int8 in our case), but it should either be removed or explicitely set as bool.

encoding.setdefault("dtype", original_dtype)
if "dtype" in attributes and attributes["dtype"] == "bool":
del attributes["dtype"]
data = BoolTypeArray(data)

@kmuehlbauer
Copy link
Contributor

Another fun one where int64 -> int32:

Can't reproduce this one with my environment. See above for details.

@kmuehlbauer kmuehlbauer mentioned this issue Mar 21, 2023
4 tasks
@kmuehlbauer
Copy link
Contributor

kmuehlbauer commented Mar 21, 2023

A similar problem where saving, loading, saving, loading changes the dtype bool -> int8:

@basnijholt I'd appreciate if you could test #7654 for that particular case.

Update: added another commit which handles the vlen string case.

@basnijholt
Copy link
Author

basnijholt commented Mar 22, 2023

@kmuehlbauer
Copy link
Contributor

Great, much appreciated, thanks! Let's iterate over there then.

@kmuehlbauer
Copy link
Contributor

kmuehlbauer commented Mar 22, 2023

OK, I've finally gotten to the bottom of this, so I'm writing my findings here:

int64 -> int32

This works with h5netcdf/netcdf4-backends in any case. That's a feature1 of NETCDF3-format which will be used if scipy is installed and netCDF4 is not installed (in the case of engine=None). It has not notion of int64 so this is silently cast to int32 on write.

<U1 -> object
This works with the changes applied in #7654 for h5netcdf/netcdf4-backends in normal cases (writing out as VLEN string). Again, NETCDF3 format does not have a notion of string so all strings have to be converted to an internal NC_CHAR representation. I'm not sure it will do any good to try to make this one work.

My suggestion would be, just use NETCDF4-format and one of the capable backends.

Footnotes

  1. https://docs.unidata.ucar.edu/nug/current/md_types.html

@dcherian
Copy link
Contributor

dcherian commented Mar 22, 2023

@kmuehlbauer this is amazing!

It would be very valuable to add this list of limitations to the documentation: https://docs.xarray.dev/en/stable/user-guide/io.html#netcdf

@basnijholt
Copy link
Author

@kmuehlbauer, great!

I can confirm that both problems are indeed fixed on my end when using h5netcdf and the code from your PR 🎉

@kmuehlbauer
Copy link
Contributor

@kmuehlbauer this is amazing!

It would be very valuable to add this list of limitations to the documentation: https://docs.xarray.dev/en/stable/user-guide/io.html#netcdf

I've added a bit to this over at #7654.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants