Add chunk-friendly code path to encode_cf_datetime and encode_cf_timedelta #8575

Merged
merged 29 commits into pydata:main from spencerkclark:dask-friendly-datetime-encoding on Jan 29, 2024

Conversation

spencerkclark
Member

@spencerkclark spencerkclark commented Dec 30, 2023

I finally had a moment to think about this some more following discussion in #8253. This PR adds a chunk-friendly code path to encode_cf_datetime and encode_cf_timedelta, which enables lazy encoding of time-like values, and by extension, preservation of chunks when writing time-like values to zarr. With these changes, the test added by @malmans2 in #8253 passes.
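
For illustration, a minimal example of what this enables (assuming dask and zarr are installed; the store name is arbitrary):

>>> import pandas as pd; import xarray as xr
>>> times = pd.date_range("2000", periods=10, freq="D")
>>> ds = xr.DataArray(times, dims=["time"], name="foo").chunk({"time": 2}).to_dataset()
>>> # Encoding now happens lazily, chunk by chunk, so the chunk structure
>>> # of the time-like variable is preserved in the on-disk store.
>>> ds.to_zarr("example.zarr", mode="w")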

Though it largely reuses existing code, the lazy encoding implemented in this PR is stricter than eager encoding in a couple ways:

  1. It requires that the encoding units and dtype either both be prescribed or both be left unspecified; prescribing only one is not supported, since that would require inferring the other from the data. In the case that neither is specified, the dtype is set to np.int64 and the units are either "nanoseconds since 1970-01-01" or "microseconds since 1970-01-01", depending on whether we are encoding np.datetime64[ns] values or cftime.datetime objects. In the case of timedelta64[ns] values, the units are set to "nanoseconds".
  2. In addition, if an integer dtype is prescribed but the units are set such that floating-point values would be required, it raises instead of modifying the units to enable integer encoding. This is a requirement because the data units may differ between chunks, so overriding them could result in inconsistent units.

As part of this PR, since dask requires we know the dtype of the array returned by the function passed to map_blocks, I also added logic to handle casting to the specified encoding dtype in an overflow-and-integer safe manner. This means an informative error message would be raised in the situation described in #8542:

OverflowError: Not possible to cast encoded times from dtype('int64') to dtype('int16') without overflow. Consider removing the dtype encoding, at which point xarray will make an appropriate choice, or explicitly switching to a larger integer dtype.
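
For example, a sketch of code that would now raise this error (nanosecond-resolution encoding cannot fit in 16-bit integers; the file name is arbitrary):

>>> import numpy as np; import pandas as pd; import xarray as xr
>>> times = pd.date_range("2000", periods=3, freq="D")
>>> ds = xr.DataArray(times, dims=["time"], name="foo").to_dataset()
>>> # int16 cannot hold times encoded as nanoseconds since 1970-01-01
>>> ds["foo"].encoding = {"units": "nanoseconds since 1970-01-01", "dtype": np.dtype("int16")}
>>> ds.to_netcdf("test.nc")  # raises the OverflowError above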

I eventually want to think about this on the decoding side as well, but that can wait for another PR.

@spencerkclark spencerkclark mentioned this pull request Dec 30, 2023
@spencerkclark spencerkclark force-pushed the dask-friendly-datetime-encoding branch from 35f8681 to e5150c9 on December 30, 2023 01:33
@spencerkclark spencerkclark changed the title Add a dask-friendly code path to encode_cf_datetime Add a dask-friendly code path to encode_cf_datetime and encode_cf_timedelta Dec 31, 2023
@spencerkclark spencerkclark changed the title Add a dask-friendly code path to encode_cf_datetime and encode_cf_timedelta Add dask-friendly code path to encode_cf_datetime and encode_cf_timedelta Dec 31, 2023
@spencerkclark spencerkclark force-pushed the dask-friendly-datetime-encoding branch from f0b9a8d to 6ebb917 on December 31, 2023 15:54
@spencerkclark spencerkclark force-pushed the dask-friendly-datetime-encoding branch from 86b591b to eea3bb7 on January 1, 2024 12:45
@spencerkclark
Member Author

OK, I think this may be ready for review.

The one awkward aspect of using nanoseconds as the fall-back encoding unit for dask-backed time fields is that it effectively requires 64-bit integers, which are not supported by the "NETCDF4_CLASSIC", "NETCDF3_64BIT", or "NETCDF3_CLASSIC" file formats, where the maximum integer size is 32 bits. E.g. you can end up with an error message like this during dtype coercion:

>>> import pandas as pd; import xarray as xr
>>> times = pd.date_range("2000", periods=10, freq="D")
>>> da = xr.DataArray(times, dims=["time"], name="foo").chunk({"time": 2})
>>> da.to_dataset().to_netcdf("test.nc", format="NETCDF3_CLASSIC")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/spencer/software/xarray/xarray/core/dataset.py", line 2310, in to_netcdf
    return to_netcdf(  # type: ignore  # mypy cannot resolve the overloads:(
  File "/Users/spencer/software/xarray/xarray/backends/api.py", line 1315, in to_netcdf
    dump_to_store(
  File "/Users/spencer/software/xarray/xarray/backends/api.py", line 1362, in dump_to_store
    store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
  File "/Users/spencer/software/xarray/xarray/backends/common.py", line 352, in store
    variables, attributes = self.encode(variables, attributes)
  File "/Users/spencer/software/xarray/xarray/backends/common.py", line 442, in encode
    variables = {k: self.encode_variable(v) for k, v in variables.items()}
  File "/Users/spencer/software/xarray/xarray/backends/common.py", line 442, in <dictcomp>
    variables = {k: self.encode_variable(v) for k, v in variables.items()}
  File "/Users/spencer/software/xarray/xarray/backends/netCDF4_.py", line 484, in encode_variable
    variable = encode_nc3_variable(variable)
  File "/Users/spencer/software/xarray/xarray/backends/netcdf3.py", line 114, in encode_nc3_variable
    data = coerce_nc3_dtype(data)
  File "/Users/spencer/software/xarray/xarray/backends/netcdf3.py", line 68, in coerce_nc3_dtype
    raise ValueError(
ValueError: could not safely cast array from dtype int64 to int32

To work around this you would either need to load the times into memory before saving, or explicitly specify a different units encoding, which we could potentially note as a possible fix in the message. The alternative would be to specify default time units like seconds, but then an error would be raised if the times had sub-second components, so it's a bit of a tradeoff. I guess ideally we could set the default units and dtype depending on the file format, but that would be a bit more involved. I'm open to other people's opinions.
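
For reference, the explicit-encoding workaround would look something like this (a sketch, continuing the session above; 'seconds since 1970-01-01' and int32 are one possible choice, and only safe if the times have no sub-second components):

>>> import numpy as np
>>> ds = da.to_dataset()
>>> # Coarser units plus a 32-bit dtype keep the encoded values within
>>> # the integer range that the netCDF3 formats can represent.
>>> ds["foo"].encoding = {"units": "seconds since 1970-01-01", "dtype": np.dtype("int32")}
>>> ds.to_netcdf("test.nc", format="NETCDF3_CLASSIC")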

@spencerkclark spencerkclark marked this pull request as ready for review January 1, 2024 14:42
@dcherian
Contributor

dcherian commented Jan 1, 2024

To work around this you would either need to load the times into memory before saving, or explicitly specify a different units encoding, which we could potentially note as a possible fix in the message.

I agree with raising a better error message when writing netCDF3. Explicitly specifying units for dask arrays (or really any time array) seems like good practice!

Contributor

@dcherian dcherian left a comment

Thanks @spencerkclark! My only comment is that it should be straightforward to apply this to Cubed arrays too.

Comment on lines 731 to 734
if isinstance(dates, np.ndarray):
return _eagerly_encode_cf_datetime(dates, units, calendar, dtype)
elif is_duck_dask_array(dates):
return _lazily_encode_cf_datetime(dates, units, calendar, dtype)
Contributor

Suggested change
if isinstance(dates, np.ndarray):
return _eagerly_encode_cf_datetime(dates, units, calendar, dtype)
elif is_duck_dask_array(dates):
return _lazily_encode_cf_datetime(dates, units, calendar, dtype)
if is_chunked_array(dates):
return _lazily_encode_cf_datetime(dates, units, calendar, dtype)
elif is_duck_array(dates):
return _eagerly_encode_cf_datetime(dates, units, calendar, dtype)

Is there a reason other arrays might fail?

Member Author

Indeed, I do not think so. It's mainly that mypy complains if the function may not always return something:

xarray/coding/times.py: note: In function "encode_cf_datetime":
xarray/coding/times.py:728: error: Missing return statement  [return]

I suppose instead of elif is_duck_array(dates) I could simply use else, since I use asarray above?
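
Concretely, something like this (a sketch):

def encode_cf_datetime(dates, units=None, calendar=None, dtype=None):
    dates = asarray(dates)
    if is_chunked_array(dates):
        return _lazily_encode_cf_datetime(dates, units, calendar, dtype)
    else:
        # asarray guarantees a duck array here, so every branch returns
        return _eagerly_encode_cf_datetime(dates, units, calendar, dtype)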

Contributor

I think so

Member Author

@spencerkclark spencerkclark Jan 2, 2024

I guess one issue that occurs to me is that as written I think _eagerly_encode_cf_datetime will always return a NumPy array, due to its reliance on doing calculations through pandas or cftime. In other words it won't necessarily fail, but it will not work quite like the other array types, where the type you put in is what you get out.

f"Got a units encoding of {units} and a dtype encoding of {dtype}."
)

num = dask.array.map_blocks(
Contributor

Suggested change
num = dask.array.map_blocks(
num = chunkmanager.map_blocks(

There's a chunkmanager dance now to handle both dask and cubed.
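
Roughly (a sketch, assuming get_chunked_array_type from xarray.core.parallelcompat; the per-chunk function name here is hypothetical):

from xarray.core.parallelcompat import get_chunked_array_type

chunkmanager = get_chunked_array_type(dates)
num = chunkmanager.map_blocks(
    _encode_cf_datetime_within_chunk,  # hypothetical per-chunk wrapper
    dates,
    units,
    calendar,
    dtype=dtype,  # dask and cubed need the output dtype up front
)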

Member Author

Ah that's neat! Is there anything special I need to do to test this or is it sufficient to demonstrate that it at least works with dask (as my tests already do)?

One other minor detail is how to handle typing with this. I gave it a shot (hopefully I'm in the right ballpark), but I still encounter this error:

xarray/coding/times.py: note: In function "_lazily_encode_cf_datetime":
xarray/coding/times.py:860: error: "T_ChunkedArray" has no attribute "dtype"  [attr-defined]

I'm not a typing expert so any guidance would be appreciated!

Contributor

Is there anything special I need to do to test this or is it sufficient to demonstrate that it at least works with dask (as my tests already do)?

I believe the dask tests are fine since the cubed tests live elsewhere.

error: "T_ChunkedArray" has no attribute "dtype" [attr-defined

cc @TomNicholas

Member

@TomNicholas TomNicholas Jan 2, 2024

I believe the dask tests are fine since the cubed tests live elsewhere.

Yep.

Is there anything special I need to do

The correct ChunkManager for the type of the chunked array needs to be detected, but you've already done that bit!

error: "T_ChunkedArray" has no attribute "dtype" [attr-defined

This can't be typed properly yet (because we don't yet have a full protocol to describe the T_DuckArray in xarray.core.types), but try changing the definition of T_ChunkedArray in parallelcompat.py to this:

T_ChunkedArray = TypeVar("T_ChunkedArray", bound=Any)

Member Author

Gotcha, thanks for the typing guidance @TomNicholas. Adding bound=Any to the T_ChunkedArray definition indeed solves the dtype attribute issue. It seems I am not able to use T_DuckArray as an input type for a function, however — I'm getting errors like this:

error: Cannot use a covariant type variable as a parameter  [misc]

Do you have any thoughts on how to handle that? I'm guessing the covariant=True parameter was added for a reason in its definition?

T_DuckArray = TypeVar("T_DuckArray", bound=Any, covariant=True)

Does #8575 (comment) make the output side of this more difficult to handle as well (at least if we want to be as accurate as possible)?

Member Author

@TomNicholas do you have any thoughts on this? From my perspective the only thing holding this PR up at this point is typing.

Contributor

IMO a type: ignore is fine.

Member

Hey, sorry for forgetting about this, @spencerkclark!

I'm guessing the covariant=True parameter was added for a reason in its definition?

Yes - that fixes some typing errors in the ChunkManager methods IIRC. Unfortunately I don't yet understand how to make that work and also fix your error 🙁

the only thing holding this PR up at this point is typing.

I agree with Deepak that typing should not hold this up. Even if it needs some ignores, the typing you've added is still very useful to indicate intent!

@spencerkclark
Member Author

I agree with raising a better error message when writing netCDF3. Explicitly specifying units for dask arrays (or really any time array) seems like good practice!

Sounds good — the error message now looks like this:

ValueError: could not safely cast array from dtype int64 to int32. A subtle cause for this can be chunked variables containing time-like values without explicitly defined dtype and units encoding values, for which xarray will attempt encoding with int64 values and maximally fine-grain units, e.g. 'nanoseconds since 1970-01-01'. To address this, specify a dtype and units encoding for these variables such that they can be encoded with int32 values. For example, use units like 'seconds since 1970-01-01' and a dtype of np.int32 if appropriate.

@spencerkclark spencerkclark changed the title Add dask-friendly code path to encode_cf_datetime and encode_cf_timedelta Add chunk-friendly code path to encode_cf_datetime and encode_cf_timedelta Jan 2, 2024
@spencerkclark
Member Author

Thanks @dcherian and @TomNicholas! Let me know if the typing looks good now—I added type: ignore comments where necessary to address #8575 (comment) following Tom's other suggestions.

As an aside, in light of #8641, I tweaked the netCDF3 coercion error message again to cover more possibilities:

ValueError: could not safely cast array from int64 to int32. While it is not always the case, a common reason for this is that xarray has deemed it safest to encode np.datetime64[ns] or np.timedelta64[ns] values with int64 values representing units of 'nanoseconds'. This is either due to the fact that the times are known to require nanosecond precision for an accurate round trip, or that the times are unknown prior to writing due to being contained in a chunked array. Ways to work around this are either to use a backend that supports writing int64 values, or to manually specify the encoding['units'] and encoding['dtype'] (e.g. 'seconds since 1970-01-01' and np.dtype('int32')) on the time variable(s) such that the times can be serialized in a netCDF3 file (note that depending on the situation, however, this latter option may result in an inaccurate round trip).

@dcherian dcherian merged commit d8c3b1a into pydata:main Jan 29, 2024
29 checks passed
@max-sixty
Collaborator

Thanks a lot, @spencerkclark!

@spencerkclark
Member Author

Thanks all for your patience—glad to see this in!

@spencerkclark spencerkclark deleted the dask-friendly-datetime-encoding branch January 30, 2024 02:17
andersy005 added a commit to TomNicholas/xarray that referenced this pull request Jan 30, 2024
* main: (153 commits)
  Add overloads to get_axis_num (pydata#8547)
  Fix CI: temporary pin pytest version to 7.4.* (pydata#8682)
  Bump the actions group with 1 update (pydata#8678)
  [namedarray] split `.set_dims()` into `.expand_dims()` and `broadcast_to()` (pydata#8380)
  Add chunk-friendly code path to `encode_cf_datetime` and `encode_cf_timedelta` (pydata#8575)
  Fix NetCDF4 C version detection (pydata#8675)
  groupby: Don't set `method` by default on flox>=0.9 (pydata#8657)
  Fix automatic broadcasting when wrapping array api class (pydata#8669)
  Fix unstack method when wrapping array api class (pydata#8668)
  Fix `variables` arg typo in `Dataset.sortby()` docstring (pydata#8670)
  dt.weekday_name - removal of function (pydata#8664)
  Add `dev` dependencies to `pyproject.toml` (pydata#8661)
  CI: Pin scientific-python/upload-nightly-action to release sha (pydata#8662)
  Update HOW_TO_RELEASE.md by clarifying where RTD build can be found (pydata#8655)
  ruff: use extend-exclude (pydata#8649)
  new whats-new section (pydata#8652)
  xfail another test on windows (pydata#8648)
  use first element of residual in _nonpolyfit_1d (pydata#8647)
  whatsnew for v2024.01.1
  implement `isnull` using `full_like` instead of `zeros_like` (pydata#7395)
  ...