Saving and loading an array of strings changes datatype to object #7652

basnijholt · 2023-03-20T19:15:03Z

What is your issue?

See the code below

import xarray as xr
# Make an xarray with an array of strings
da = xr.DataArray(
    data=[["a", "b"], ["c", "d"]],
    dims=["x", "y"],
    coords={"x": [0, 1], "y": [0, 1]},
)
da.to_netcdf("test.nc", mode='w')
# Load the xarray back in
da_loaded = xr.load_dataarray("test.nc")
assert da.dtype == da_loaded.dtype, "Dtypes don't match"

Now da_loaded.dtype is dtype('O').

Same happens with engine="h5netcdf".

import xarray
xarray.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.11.0 | packaged by conda-forge | (main, Jan 14 2023, 12:26:40) [Clang 14.0.6 ]
python-bits: 64
OS: Darwin
OS-release: 22.3.0
machine: arm64
processor: arm
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: None

xarray: 2023.2.0
pandas: 1.5.3
numpy: 1.24.2
scipy: 1.10.0
netCDF4: None
pydap: None
h5netcdf: 1.1.0
h5py: 3.8.0
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: 3.7.0
cartopy: None
seaborn: 0.12.2
numbagg: None
fsspec: 2023.1.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 67.3.2
pip: 23.0
conda: None
pytest: 7.2.1
mypy: 1.0.1
IPython: 8.10.0
sphinx: 5.3.0

The text was updated successfully, but these errors were encountered:

jbweston · 2023-03-20T19:37:02Z

It seems that the "string" information is stored in the encoding of the loaded dataset. set in this block:

xarray/xarray/backends/h5netcdf_.py

Lines 208 to 210 in f1ff956

    
           vlen_dtype = h5py.check_dtype(vlen=var.dtype) 
        
           if vlen_dtype is str: 
        
               encoding["dtype"] = str

but this encoding is not "applied" to the dtype of the dataset's variable.

basnijholt · 2023-03-20T21:25:59Z

A similar problem where saving, loading, saving, loading changes the dtype bool -> int8:

import xarray as xr

da1 = xr.DataArray(data=[True], dims=["x"], coords={"x": [0]})
da1.to_netcdf("test.nc", mode="w")

da2 = xr.load_dataarray("test.nc")
da2.to_netcdf("test.nc", mode="w")

da3 = xr.load_dataarray("test.nc")
assert da1.dtype == da3.dtype, "Dtypes don't match"

basnijholt · 2023-03-20T21:34:30Z

Another fun one where int64 -> int32:

import xarray as xr

da = xr.DataArray(data=[1], dims=["x"], coords={"x": [0]})
da.to_netcdf("test.nc", mode="w")
da2 = xr.load_dataarray("test.nc")
da.dtype, da2.dtype

kmuehlbauer · 2023-03-21T08:37:48Z

@basnijholt For the string issue this is somehwat kind of netcdf/numpy based issue with VLEN types.

XRef: https://unidata.github.io/netcdf4-python/#dealing-with-strings

The most flexible way to store arrays of strings is with the Variable-length (vlen) string data type. However, this requires the use of the NETCDF4 data model, and the vlen type does not map very well numpy arrays (you have to use numpy arrays of dtype=object, which are arrays of arbitrary python objects).

And numpy will create a VLEN string array if no dtype is given, like in your case.

At least netCDF4 and h5netcdf backends are consistent in their writing (creating similar hdf5-files) and reading back (object-dtype):

plain netCDF4

import netCDF4 as nc
import numpy as np
data = np.array([["a", "b"], ["c", "d"]], dtype="<U1")
print(f"source dtype: {data.dtype.str}\n", )
auto = False
with nc.Dataset("test-plain-netcdf4.nc", mode="w") as ds:
    print("Write NC-File")
    ds.set_auto_maskandscale(auto)
    ds.set_auto_chartostring(auto)
    ds.createDimension("x", size=2)
    ds.createDimension("y", size=2)
    var = ds.createVariable("da", data.dtype.str,  dimensions=("x", "y"))
    var[:] = data
    print("Variable\n")
    print(var)
    print(var.dtype)
    print("\nContents\n")
    print(var[:])
    print(var[:].dtype)
with nc.Dataset("test-plain-netcdf4.nc") as ds:
    print("\nRead NC-File")
    ds.set_auto_maskandscale(auto)
    ds.set_auto_chartostring(auto)
    da = ds["da"]
    print("Variable\n")
    print(da)
    print(da.dtype)
    da = ds["da"][:]
    print("\nContents\n")
    print(da)
    print(da.dtype)

source dtype: <U1

Write NC-File
Variable

<class 'netCDF4._netCDF4.Variable'>
vlen da(x, y)
vlen data type: <class 'str'>
unlimited dimensions: 
current shape = (2, 2)
<class 'str'>

Contents

[['a' 'b']
 ['c' 'd']]
object

Read NC-File
Variable

<class 'netCDF4._netCDF4.Variable'>
vlen da(x, y)
vlen data type: <class 'str'>
unlimited dimensions: 
current shape = (2, 2)
<class 'str'>

Contents

[['a' 'b']
 ['c' 'd']]
object

netcdf test-plain-netcdf4 {
dimensions:
	x = 2 ;
	y = 2 ;
variables:
	string da(x, y) ;
data:

 da =
  "a", "b",
  "c", "d" ;
}
HDF5 "test-plain-netcdf4.nc" {
DATASET "da" {
   DATATYPE  H5T_STRING {
      STRSIZE H5T_VARIABLE;
      STRPAD H5T_STR_NULLTERM;
      CSET H5T_CSET_UTF8;
      CTYPE H5T_C_S1;
   }
   DATASPACE  SIMPLE { ( 2, 2 ) / ( 2, 2 ) }
   DATA {
   (0,0): "a", "b",
   (1,0): "c", "d"
   }
   ATTRIBUTE "DIMENSION_LIST" {
      DATATYPE  H5T_VLEN { H5T_REFERENCE { H5T_STD_REF_OBJECT }}
      DATASPACE  SIMPLE { ( 2 ) / ( 2 ) }
      DATA {
      (0): (), ()
      }
   }
   ATTRIBUTE "_Netcdf4Coordinates" {
      DATATYPE  H5T_STD_I32LE
      DATASPACE  SIMPLE { ( 2 ) / ( 2 ) }
      DATA {
      (0): 0, 1
      }
   }
}
}

plain h5netcdf

import h5netcdf.legacyapi as h5nc
import h5py
data = np.array([["a", "b"], ["c", "d"]], dtype="<U1")
print(f"source dtype: {data.dtype.str}\n", )
with h5nc.Dataset("test-plain-h5netcdf.nc", mode="w") as ds:
    print("Write NC-File")
    ds.createDimension("x", 2)
    ds.createDimension("y", 2)
    dtype = h5py.string_dtype()
    print("Source dtype:", dtype)
    var = ds.createVariable("da", dtype,  dimensions=("x", "y"))
    var[:] = data
    print("Variable\n")
    print(var)
    print(var.dtype)
    print("\nContents\n")
    print(var[:])
    print(var[:].dtype)
with h5nc.Dataset("test-plain-h5netcdf.nc") as ds:
    print("\nRead NC-File")
    da = ds["da"]
    print("Variable\n")
    print(da)
    print(da.dtype)
    da = ds["da"][:]
    print("\nContents\n")
    print(da)
    print(da.dtype)

source dtype: <U1

Write NC-File
Source dtype: object
Variable

<h5netcdf.legacyapi.Variable '/da': dimensions ('x', 'y'), shape (2, 2), dtype <class 'str'>>
Attributes:
<class 'str'>

Contents

[['a' 'b']
 ['c' 'd']]
object

Read NC-File
Variable

<h5netcdf.legacyapi.Variable '/da': dimensions ('x', 'y'), shape (2, 2), dtype <class 'str'>>
Attributes:
<class 'str'>

Contents

[['a' 'b']
 ['c' 'd']]
object

netcdf test-plain-h5netcdf {
dimensions:
	x = 2 ;
	y = 2 ;
variables:
	string da(x, y) ;
data:

 da =
  "a", "b",
  "c", "d" ;
}
HDF5 "test-plain-h5netcdf.nc" {
DATASET "da" {
   DATATYPE  H5T_STRING {
      STRSIZE H5T_VARIABLE;
      STRPAD H5T_STR_NULLTERM;
      CSET H5T_CSET_UTF8;
      CTYPE H5T_C_S1;
   }
   DATASPACE  SIMPLE { ( 2, 2 ) / ( 2, 2 ) }
   DATA {
   (0,0): "a", "b",
   (1,0): "c", "d"
   }
   ATTRIBUTE "DIMENSION_LIST" {
      DATATYPE  H5T_VLEN { H5T_REFERENCE { H5T_STD_REF_OBJECT }}
      DATASPACE  SIMPLE { ( 2 ) / ( 2 ) }
      DATA {
      (0): (), ()
      }
   }
   ATTRIBUTE "_Netcdf4Coordinates" {
      DATATYPE  H5T_STD_I32LE
      DATASPACE  SIMPLE { ( 2 ) / ( 2 ) }
      DATA {
      (0): 0, 1
      }
   }
   ATTRIBUTE "_Netcdf4Dimid" {
      DATATYPE  H5T_STD_I32LE
      DATASPACE  SCALAR
      DATA {
      (0): 0
      }
   }
}
}

Both get written out as:

DATATYPE  H5T_STRING {
      STRSIZE H5T_VARIABLE;
      STRPAD H5T_STR_NULLTERM;
      CSET H5T_CSET_UTF8;
      CTYPE H5T_C_S1;
   }

If you use fixed length strings (eg. |S1) the dtype is preserved during roundtrip:

import xarray as xr
# Make an xarray with an array of fixed-length strings
data = np.array([["a", "b"], ["c", "d"]], dtype="|S1")
da = xr.DataArray(
    data=data,
    dims=["x", "y"],
    coords={"x": [0, 1], "y": [0, 1]},
)
da.to_netcdf("test.nc", mode='w')
# Load the xarray back in
da_loaded = xr.load_dataarray("test.nc")
assert da.dtype == da_loaded.dtype, "Dtypes don't match"

Versions

INSTALLED VERSIONS
------------------
commit: None
python: 3.11.0 | packaged by conda-forge | (main, Jan 14 2023, 12:27:40) [GCC 11.3.0]
python-bits: 64
OS: Linux
OS-release: 5.14.21-150400.24.46-default
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: de_DE.UTF-8
LOCALE: ('de_DE', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.9.1

xarray: 2023.2.0
pandas: 1.5.3
numpy: 1.24.2
scipy: 1.10.1
netCDF4: 1.6.3
pydap: None
h5netcdf: 1.1.0
h5py: 3.8.0
Nio: None
zarr: None
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.3.6
cfgrib: None
iris: None
bottleneck: None
dask: 2023.3.1
distributed: 2023.3.1
matplotlib: 3.7.1
cartopy: None
seaborn: None
numbagg: None
fsspec: 2023.3.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 67.6.0
pip: 23.0.1
conda: None
pytest: None
mypy: None
IPython: 8.11.0
sphinx: None

kmuehlbauer · 2023-03-21T08:55:37Z

A similar problem where saving, loading, saving, loading changes the dtype bool -> int8:

That's an issue with netcdf file format, too, it has no bool-dtype.
XRef: #1500

data = np.array([True], dtype=bool)
with nc.Dataset("test-bool-netcdf4.nc", mode="w") as ds:
    ds.createDimension("x", size=1)
    var = ds.createVariable("da", data.dtype.str,  dimensions=("x"))
    var[:] = data

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[42], line 4
      2 with nc.Dataset("test-bool-netcdf4.nc", mode="w") as ds:
      3     ds.createDimension("x", size=1)
----> 4     var = ds.createVariable("da", data.dtype.str,  dimensions=("x"))
      5     var[:] = data

File src/netCDF4/_netCDF4.pyx:2945, in netCDF4._netCDF4.Dataset.createVariable()

File src/netCDF4/_netCDF4.pyx:4121, in netCDF4._netCDF4.Variable.__init__()

TypeError: illegal primitive data type, must be one of dict_keys(['S1', 'i1', 'u1', 'i2', 'u2', 'i4', 'u4', 'i8', 'u8', 'f4', 'f8']), got bool

Update:
Xarray is forwarding the information to the file, by adding a dtype-attribute. It looks like this information is not correctly distributed back to .encoding in the case of saving/loading/saving/loading. I'd consider that one a bug.

Reason:
While decoding the .encoding-dtype is set as original_dtype (int8 in our case), but it should either be removed or explicitely set as bool.

xarray/xarray/conventions.py

Lines 400 to 404 in f1ff956

    
           encoding.setdefault("dtype", original_dtype) 
        
           if "dtype" in attributes and attributes["dtype"] == "bool": 
        
               del attributes["dtype"] 
        
               data = BoolTypeArray(data)

kmuehlbauer · 2023-03-21T08:58:51Z

Another fun one where int64 -> int32:

Can't reproduce this one with my environment. See above for details.

kmuehlbauer · 2023-03-21T10:36:43Z

A similar problem where saving, loading, saving, loading changes the dtype bool -> int8:

@basnijholt I'd appreciate if you could test #7654 for that particular case.

Update: added another commit which handles the vlen string case.

basnijholt · 2023-03-22T04:43:46Z

Thanks a lot @kmuehlbauer!

I replied in your PR 😄

I can confirm that this fixes

Saving and loading an array of strings changes datatype to object #7652 (comment) (bool -> int8)

But not int64 -> int32, and <U1 -> O

Saving and loading an array of strings changes datatype to object #7652 (comment)

Saving and loading an array of strings changes datatype to object #7652 (comment)

kmuehlbauer · 2023-03-22T06:49:42Z

Great, much appreciated, thanks! Let's iterate over there then.

kmuehlbauer · 2023-03-22T08:12:28Z

OK, I've finally gotten to the bottom of this, so I'm writing my findings here:

int64 -> int32

This works with h5netcdf/netcdf4-backends in any case. That's a feature¹ of NETCDF3-format which will be used if scipy is installed and netCDF4 is not installed (in the case of engine=None). It has not notion of int64 so this is silently cast to int32 on write.

<U1 -> object
This works with the changes applied in #7654 for h5netcdf/netcdf4-backends in normal cases (writing out as VLEN string). Again, NETCDF3 format does not have a notion of string so all strings have to be converted to an internal NC_CHAR representation. I'm not sure it will do any good to try to make this one work.

My suggestion would be, just use NETCDF4-format and one of the capable backends.

https://docs.unidata.ucar.edu/nug/current/md_types.html ↩

dcherian · 2023-03-22T15:03:59Z

@kmuehlbauer this is amazing!

It would be very valuable to add this list of limitations to the documentation: https://docs.xarray.dev/en/stable/user-guide/io.html#netcdf

basnijholt · 2023-03-22T18:04:29Z

@kmuehlbauer, great!

I can confirm that both problems are indeed fixed on my end when using h5netcdf and the code from your PR 🎉

kmuehlbauer · 2023-03-23T12:36:17Z

@kmuehlbauer this is amazing!

It would be very valuable to add this list of limitations to the documentation: https://docs.xarray.dev/en/stable/user-guide/io.html#netcdf

I've added a bit to this over at #7654.

basnijholt added the needs triage Issue that has not been reviewed by xarray team member label Mar 20, 2023

dcherian added topic-backends and removed needs triage Issue that has not been reviewed by xarray team member labels Mar 20, 2023

kmuehlbauer mentioned this issue Mar 21, 2023

cf-coding #7654

Closed

4 tasks

dcherian added the topic-documentation label Mar 22, 2023

ghiggi mentioned this issue May 23, 2023

open_dataset with chunks="auto" fails when a netCDF4 variables/coordinates is encoded as NC_STRING #7868

Closed

kmuehlbauer mentioned this issue May 23, 2023

preserve vlen string dtypes, allow vlen string fill_values #7869

Merged

6 tasks

dcherian closed this as completed in #7869 Nov 17, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Saving and loading an array of strings changes datatype to object #7652

Saving and loading an array of strings changes datatype to object #7652

basnijholt commented Mar 20, 2023 •

edited

Loading

jbweston commented Mar 20, 2023 •

edited

Loading

basnijholt commented Mar 20, 2023 •

edited

Loading

basnijholt commented Mar 20, 2023

kmuehlbauer commented Mar 21, 2023

kmuehlbauer commented Mar 21, 2023 •

edited

Loading

kmuehlbauer commented Mar 21, 2023

kmuehlbauer commented Mar 21, 2023 •

edited

Loading

basnijholt commented Mar 22, 2023 •

edited

Loading

kmuehlbauer commented Mar 22, 2023

kmuehlbauer commented Mar 22, 2023 •

edited

Loading

dcherian commented Mar 22, 2023 •

edited

Loading

basnijholt commented Mar 22, 2023

kmuehlbauer commented Mar 23, 2023

Saving and loading an array of strings changes datatype to object #7652

Saving and loading an array of strings changes datatype to object #7652

Comments

basnijholt commented Mar 20, 2023 • edited Loading

What is your issue?

jbweston commented Mar 20, 2023 • edited Loading

basnijholt commented Mar 20, 2023 • edited Loading

basnijholt commented Mar 20, 2023

kmuehlbauer commented Mar 21, 2023

kmuehlbauer commented Mar 21, 2023 • edited Loading

kmuehlbauer commented Mar 21, 2023

kmuehlbauer commented Mar 21, 2023 • edited Loading

basnijholt commented Mar 22, 2023 • edited Loading

kmuehlbauer commented Mar 22, 2023

kmuehlbauer commented Mar 22, 2023 • edited Loading

Footnotes

dcherian commented Mar 22, 2023 • edited Loading

basnijholt commented Mar 22, 2023

kmuehlbauer commented Mar 23, 2023

basnijholt commented Mar 20, 2023 •

edited

Loading

jbweston commented Mar 20, 2023 •

edited

Loading

basnijholt commented Mar 20, 2023 •

edited

Loading

kmuehlbauer commented Mar 21, 2023 •

edited

Loading

kmuehlbauer commented Mar 21, 2023 •

edited

Loading

basnijholt commented Mar 22, 2023 •

edited

Loading

kmuehlbauer commented Mar 22, 2023 •

edited

Loading

dcherian commented Mar 22, 2023 •

edited

Loading