Add chunk-friendly code path to encode_cf_datetime and encode_cf_timedelta #8575

Merged
merged 29 commits into pydata:main from spencerkclark:dask-friendly-datetime-encoding on Jan 29, 2024

Conversation

spencerkclark
Member

@spencerkclark spencerkclark commented Dec 30, 2023

I finally had a moment to think about this some more following discussion in #8253. This PR adds a chunk-friendly code path to encode_cf_datetime and encode_cf_timedelta, which enables lazy encoding of time-like values, and by extension, preservation of chunks when writing time-like values to zarr. With these changes, the test added by @malmans2 in #8253 passes.
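
For illustration, a minimal example of what this enables (assuming dask and zarr are installed; the store name is arbitrary):

>>> import pandas as pd; import xarray as xr
>>> times = pd.date_range("2000", periods=10, freq="D")
>>> ds = xr.DataArray(times, dims=["time"], name="foo").chunk({"time": 2}).to_dataset()
>>> # Encoding now happens lazily, chunk by chunk, so the chunk structure
>>> # of the time-like variable is preserved in the on-disk store.
>>> ds.to_zarr("example.zarr", mode="w")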

Though it largely reuses existing code, the lazy encoding implemented in this PR is stricter than eager encoding in a couple ways:

  1. It requires that the encoding units and dtype either both be prescribed or both be left unspecified; prescribing only one is not supported, since that would require inferring the other from the data. In the case that neither is specified, the dtype is set to np.int64 and the units are either "nanoseconds since 1970-01-01" or "microseconds since 1970-01-01", depending on whether we are encoding np.datetime64[ns] values or cftime.datetime objects. In the case of timedelta64[ns] values, the units are set to "nanoseconds".
  2. In addition, if an integer dtype is prescribed but the units are set such that floating-point values would be required, it raises instead of modifying the units to enable integer encoding. This is a requirement because the data units may differ between chunks, so overriding them could result in inconsistent units.

As part of this PR, since dask requires we know the dtype of the array returned by the function passed to map_blocks, I also added logic to handle casting to the specified encoding dtype in an overflow-and-integer safe manner. This means an informative error message would be raised in the situation described in #8542:

OverflowError: Not possible to cast encoded times from dtype('int64') to dtype('int16') without overflow. Consider removing the dtype encoding, at which point xarray will make an appropriate choice, or explicitly switching to a larger integer dtype.
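
For example, a sketch of code that would now raise this error (nanosecond-resolution encoding cannot fit in 16-bit integers; the file name is arbitrary):

>>> import numpy as np; import pandas as pd; import xarray as xr
>>> times = pd.date_range("2000", periods=3, freq="D")
>>> ds = xr.DataArray(times, dims=["time"], name="foo").to_dataset()
>>> # int16 cannot hold times encoded as nanoseconds since 1970-01-01
>>> ds["foo"].encoding = {"units": "nanoseconds since 1970-01-01", "dtype": np.dtype("int16")}
>>> ds.to_netcdf("test.nc")  # raises the OverflowError above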

I eventually want to think about this on the decoding side as well, but that can wait for another PR.

@spencerkclark spencerkclark mentioned this pull request Dec 30, 2023
@spencerkclark spencerkclark force-pushed the dask-friendly-datetime-encoding branch from 35f8681 to e5150c9 on December 30, 2023 01:33
@spencerkclark spencerkclark changed the title Add a dask-friendly code path to encode_cf_datetime Add a dask-friendly code path to encode_cf_datetime and encode_cf_timedelta Dec 31, 2023
@spencerkclark spencerkclark changed the title Add a dask-friendly code path to encode_cf_datetime and encode_cf_timedelta Add dask-friendly code path to encode_cf_datetime and encode_cf_timedelta Dec 31, 2023
@spencerkclark spencerkclark force-pushed the dask-friendly-datetime-encoding branch from f0b9a8d to 6ebb917 on December 31, 2023 15:54
@spencerkclark spencerkclark force-pushed the dask-friendly-datetime-encoding branch from 86b591b to eea3bb7 on January 1, 2024 12:45
@spencerkclark
Member Author

OK, I think this may be ready for review.

The one awkward aspect of using nanoseconds as the fall-back encoding unit for dask-backed time fields is that it effectively requires 64-bit integers, which are not supported by the "NETCDF4_CLASSIC", "NETCDF3_64BIT", or "NETCDF3_CLASSIC" file formats, where the maximum integer size is 32 bits. E.g. you can end up with an error message like this during dtype coercion:

>>> import pandas as pd; import xarray as xr
>>> times = pd.date_range("2000", periods=10, freq="D")
>>> da = xr.DataArray(times, dims=["time"], name="foo").chunk({"time": 2})
>>> da.to_dataset().to_netcdf("test.nc", format="NETCDF3_CLASSIC")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/spencer/software/xarray/xarray/core/dataset.py", line 2310, in to_netcdf
    return to_netcdf(  # type: ignore  # mypy cannot resolve the overloads:(
  File "/Users/spencer/software/xarray/xarray/backends/api.py", line 1315, in to_netcdf
    dump_to_store(
  File "/Users/spencer/software/xarray/xarray/backends/api.py", line 1362, in dump_to_store
    store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
  File "/Users/spencer/software/xarray/xarray/backends/common.py", line 352, in store
    variables, attributes = self.encode(variables, attributes)
  File "/Users/spencer/software/xarray/xarray/backends/common.py", line 442, in encode
    variables = {k: self.encode_variable(v) for k, v in variables.items()}
  File "/Users/spencer/software/xarray/xarray/backends/common.py", line 442, in <dictcomp>
    variables = {k: self.encode_variable(v) for k, v in variables.items()}
  File "/Users/spencer/software/xarray/xarray/backends/netCDF4_.py", line 484, in encode_variable
    variable = encode_nc3_variable(variable)
  File "/Users/spencer/software/xarray/xarray/backends/netcdf3.py", line 114, in encode_nc3_variable
    data = coerce_nc3_dtype(data)
  File "/Users/spencer/software/xarray/xarray/backends/netcdf3.py", line 68, in coerce_nc3_dtype
    raise ValueError(
ValueError: could not safely cast array from dtype int64 to int32

To work around this you would either need to load the times into memory before saving, or explicitly specify a different units encoding, which we could potentially note as a possible fix in the message. The alternative would be to specify default time units like seconds, but then an error would be raised if the times had sub-second components, so it's a bit of a tradeoff. I guess ideally we could set the default units and dtype depending on the file format, but that would be a bit more involved. I'm open to other people's opinions.
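
For reference, the explicit-encoding workaround would look something like this (a sketch, continuing the session above; 'seconds since 1970-01-01' and int32 are one possible choice, and only safe if the times have no sub-second components):

>>> import numpy as np
>>> ds = da.to_dataset()
>>> # Coarser units plus a 32-bit dtype keep the encoded values within
>>> # the integer range that the netCDF3 formats can represent.
>>> ds["foo"].encoding = {"units": "seconds since 1970-01-01", "dtype": np.dtype("int32")}
>>> ds.to_netcdf("test.nc", format="NETCDF3_CLASSIC")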

@spencerkclark spencerkclark marked this pull request as ready for review January 1, 2024 14:42
@dcherian
Contributor

dcherian commented Jan 1, 2024

To work around this you would either need to load the times into memory before saving, or explicitly specify a different units encoding, which we could potentially note as a possible fix in the message.

I agree with raising a better error message when writing netCDF3. Explicitly specifying units for dask arrays (or really any time array) seems like good practice!

Contributor

@dcherian dcherian left a comment

Thanks @spencerkclark! My only comment is that it should be straightforward to apply this to Cubed arrays too.

Comment on lines 731 to 734
if isinstance(dates, np.ndarray):
return _eagerly_encode_cf_datetime(dates, units, calendar, dtype)
elif is_duck_dask_array(dates):
return _lazily_encode_cf_datetime(dates, units, calendar, dtype)
Contributor

Suggested change
if isinstance(dates, np.ndarray):
return _eagerly_encode_cf_datetime(dates, units, calendar, dtype)
elif is_duck_dask_array(dates):
return _lazily_encode_cf_datetime(dates, units, calendar, dtype)
if is_chunked_array(dates):
return _lazily_encode_cf_datetime(dates, units, calendar, dtype)
elif is_duck_array(dates):
return _eagerly_encode_cf_datetime(dates, units, calendar, dtype)

Is there a reason other arrays might fail?

Member Author

Indeed, I do not think so. It's mainly that mypy complains if the function may not always return something:

xarray/coding/times.py: note: In function "encode_cf_datetime":
xarray/coding/times.py:728: error: Missing return statement  [return]

I suppose instead of elif is_duck_array(dates) I could simply use else, since I use asarray above?
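
Concretely, something like this (a sketch):

def encode_cf_datetime(dates, units=None, calendar=None, dtype=None):
    dates = asarray(dates)
    if is_chunked_array(dates):
        return _lazily_encode_cf_datetime(dates, units, calendar, dtype)
    else:
        # asarray guarantees a duck array here, so every branch returns
        return _eagerly_encode_cf_datetime(dates, units, calendar, dtype)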

Contributor

I think so

Member Author

@spencerkclark spencerkclark Jan 2, 2024

I guess one issue that occurs to me is that as written I think _eagerly_encode_cf_datetime will always return a NumPy array, due to its reliance on doing calculations through pandas or cftime. In other words it won't necessarily fail, but it will not work quite like the other array types, where the type you put in is what you get out.

f"Got a units encoding of {units} and a dtype encoding of {dtype}."
)

num = dask.array.map_blocks(
Contributor

Suggested change
num = dask.array.map_blocks(
num = chunkmanager.map_blocks(

There's a chunkmanager dance now to handle both dask and cubed.
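
Roughly (a sketch, assuming get_chunked_array_type from xarray.core.parallelcompat; the per-chunk function name here is hypothetical):

from xarray.core.parallelcompat import get_chunked_array_type

chunkmanager = get_chunked_array_type(dates)
num = chunkmanager.map_blocks(
    _encode_cf_datetime_within_chunk,  # hypothetical per-chunk wrapper
    dates,
    units,
    calendar,
    dtype=dtype,  # dask and cubed need the output dtype up front
)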

Member Author

Ah that's neat! Is there anything special I need to do to test this or is it sufficient to demonstrate that it at least works with dask (as my tests already do)?

One other minor detail is how to handle typing with this. I gave it a shot (hopefully I'm in the right ballpark), but I still encounter this error:

xarray/coding/times.py: note: In function "_lazily_encode_cf_datetime":
xarray/coding/times.py:860: error: "T_ChunkedArray" has no attribute "dtype"  [attr-defined]

I'm not a typing expert so any guidance would be appreciated!

Contributor

Is there anything special I need to do to test this or is it sufficient to demonstrate that it at least works with dask (as my tests already do)?

I believe the dask tests are fine since the cubed tests live elsewhere.

error: "T_ChunkedArray" has no attribute "dtype" [attr-defined

cc @TomNicholas

Member

@TomNicholas TomNicholas Jan 2, 2024

I believe the dask tests are fine since the cubed tests live elsewhere.

Yep.

Is there anything special I need to do

The correct ChunkManager for the type of the chunked array needs to be detected, but you've already done that bit!

error: "T_ChunkedArray" has no attribute "dtype" [attr-defined

This can't be typed properly yet (because we don't yet have a full protocol to describe the T_DuckArray in xarray.core.types), but try changing the definition of T_ChunkedArray in parallelcompat.py to this:

T_ChunkedArray = TypeVar("T_ChunkedArray", bound=Any)

Member Author

Gotcha, thanks for the typing guidance @TomNicholas. Adding bound=Any to the T_ChunkedArray definition indeed solves the dtype attribute issue. It seems I am not able to use T_DuckArray as an input type for a function, however — I'm getting errors like this:

error: Cannot use a covariant type variable as a parameter  [misc]

Do you have any thoughts on how to handle that? I'm guessing the covariant=True parameter was added for a reason in its definition?

T_DuckArray = TypeVar("T_DuckArray", bound=Any, covariant=True)

Does #8575 (comment) make the output side of this more difficult to handle as well (at least if we want to be as accurate as possible)?

Member Author

@TomNicholas do you have any thoughts on this? From my perspective the only thing holding this PR up at this point is typing.

Contributor

IMO a type: ignore is fine.

Member

Hey, sorry for forgetting about this, @spencerkclark!

I'm guessing the covariant=True parameter was added for a reason in its definition?

Yes - that fixes some typing errors in the ChunkManager methods IIRC. Unfortunately I don't yet understand how to make that work and also fix your error 🙁

the only thing holding this PR up at this point is typing.

I agree with Deepak that typing should not hold this up. Even if it needs some ignores, the typing you've added is still very useful to indicate intent!

@spencerkclark
Member Author

I agree with raising a better error message when writing netCDF3. Explicitly specifying units for dask arrays (or really any time array) seems like good practice!

Sounds good — the error message now looks like this:

ValueError: could not safely cast array from dtype int64 to int32. A subtle cause for this can be chunked variables containing time-like values without explicitly defined dtype and units encoding values, for which xarray will attempt encoding with int64 values and maximally fine-grain units, e.g. 'nanoseconds since 1970-01-01'. To address this, specify a dtype and units encoding for these variables such that they can be encoded with int32 values. For example, use units like 'seconds since 1970-01-01' and a dtype of np.int32 if appropriate.

@spencerkclark spencerkclark changed the title Add dask-friendly code path to encode_cf_datetime and encode_cf_timedelta Add chunk-friendly code path to encode_cf_datetime and encode_cf_timedelta Jan 2, 2024
@spencerkclark
Member Author

Thanks @dcherian and @TomNicholas! Let me know if the typing looks good now—I added type: ignore comments where necessary to address #8575 (comment) following Tom's other suggestions.

As an aside, in light of #8641, I tweaked the netCDF3 coercion error message again to cover more possibilities:

ValueError: could not safely cast array from int64 to int32. While it is not always the case, a common reason for this is that xarray has deemed it safest to encode np.datetime64[ns] or np.timedelta64[ns] values with int64 values representing units of 'nanoseconds'. This is either due to the fact that the times are known to require nanosecond precision for an accurate round trip, or that the times are unknown prior to writing due to being contained in a chunked array. Ways to work around this are either to use a backend that supports writing int64 values, or to manually specify the encoding['units'] and encoding['dtype'] (e.g. 'seconds since 1970-01-01' and np.dtype('int32')) on the time variable(s) such that the times can be serialized in a netCDF3 file (note that depending on the situation, however, this latter option may result in an inaccurate round trip).

@dcherian dcherian merged commit d8c3b1a into pydata:main Jan 29, 2024
29 checks passed
@max-sixty
Collaborator

Thanks a lot, @spencerkclark!

@spencerkclark
Member Author

Thanks all for your patience—glad to see this in!

@spencerkclark spencerkclark deleted the dask-friendly-datetime-encoding branch January 30, 2024 02:17
andersy005 added a commit to TomNicholas/xarray that referenced this pull request Jan 30, 2024
* main: (153 commits)
  Add overloads to get_axis_num (pydata#8547)
  Fix CI: temporary pin pytest version to 7.4.* (pydata#8682)
  Bump the actions group with 1 update (pydata#8678)
  [namedarray] split `.set_dims()` into `.expand_dims()` and `broadcast_to()` (pydata#8380)
  Add chunk-friendly code path to `encode_cf_datetime` and `encode_cf_timedelta` (pydata#8575)
  Fix NetCDF4 C version detection (pydata#8675)
  groupby: Don't set `method` by default on flox>=0.9 (pydata#8657)
  Fix automatic broadcasting when wrapping array api class (pydata#8669)
  Fix unstack method when wrapping array api class (pydata#8668)
  Fix `variables` arg typo in `Dataset.sortby()` docstring (pydata#8670)
  dt.weekday_name - removal of function (pydata#8664)
  Add `dev` dependencies to `pyproject.toml` (pydata#8661)
  CI: Pin scientific-python/upload-nightly-action to release sha (pydata#8662)
  Update HOW_TO_RELEASE.md by clarifying where RTD build can be found (pydata#8655)
  ruff: use extend-exclude (pydata#8649)
  new whats-new section (pydata#8652)
  xfail another test on windows (pydata#8648)
  use first element of residual in _nonpolyfit_1d (pydata#8647)
  whatsnew for v2024.01.1
  implement `isnull` using `full_like` instead of `zeros_like` (pydata#7395)
  ...