Add chunk-friendly code path to `encode_cf_datetime` and `encode_cf_timedelta` #8575
Conversation
OK I think this may be ready for review. The one awkward aspect about using nanoseconds as the fall-back encoding unit for dask-backed time fields is that it virtually requires that 64-bit integers be used, which are not supported by the netCDF3 file format.

To work around this you would either need to load the times into memory before saving, or explicitly specify a different units encoding, which we could potentially note as a possible fix in the message. The alternative would be to specify default time units like seconds, but then an error would be raised if the times had sub-second components, so it's a bit of a tradeoff. I guess ideally we could set the default units and dtype depending on the file format, but that would be a bit more involved. I'm open to other people's opinions.

I agree with raising a better error message when writing netCDF3. Explicitly specifying units for dask arrays (or really any time array) seems like good practice!
Thanks @spencerkclark! My only comment is that it should be straightforward to apply this to Cubed arrays too.
xarray/coding/times.py (outdated):

```python
if isinstance(dates, np.ndarray):
    return _eagerly_encode_cf_datetime(dates, units, calendar, dtype)
elif is_duck_dask_array(dates):
    return _lazily_encode_cf_datetime(dates, units, calendar, dtype)
```
Suggested change:

```diff
-if isinstance(dates, np.ndarray):
-    return _eagerly_encode_cf_datetime(dates, units, calendar, dtype)
-elif is_duck_dask_array(dates):
-    return _lazily_encode_cf_datetime(dates, units, calendar, dtype)
+if is_chunked_array(dates):
+    return _lazily_encode_cf_datetime(dates, units, calendar, dtype)
+elif is_duck_array(dates):
+    return _eagerly_encode_cf_datetime(dates, units, calendar, dtype)
```
Is there a reason other arrays might fail?
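As a rough illustration of the suggested dispatch (the helper `_is_chunked_array` and the `encode_times` wrapper below are hypothetical stand-ins of my own, not xarray's actual functions): route chunked arrays of any flavor to the lazy path and everything else to the eager path.

```python
import numpy as np


def _is_chunked_array(arr) -> bool:
    # Hypothetical duck-typed check: chunked arrays (dask, cubed)
    # expose a ``chunks`` attribute describing their block structure.
    return getattr(arr, "chunks", None) is not None


def encode_times(dates, units=None, calendar=None, dtype=None):
    # Dispatch sketch: lazy path for any chunked array type,
    # eager path for everything else (after coercing to ndarray).
    if _is_chunked_array(dates):
        return ("lazy", dates)
    return ("eager", np.asarray(dates))


# A plain list or ndarray takes the eager path:
kind, encoded = encode_times([0, 1, 2])
```

The advantage over an `isinstance(dates, np.ndarray)` check is that any duck array without chunks still reaches the eager branch instead of falling through.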
Indeed, I do not think so. It's mainly that mypy complains if the function may not always return something:

```
xarray/coding/times.py: note: In function "encode_cf_datetime":
xarray/coding/times.py:728: error: Missing return statement  [return]
```
I suppose instead of `elif is_duck_array(dates)` I could simply use `else`, since I use `asarray` above?
I think so
I guess one issue that occurs to me is that as written I think `_eagerly_encode_cf_datetime` will always return a NumPy array, due to its reliance on doing calculations through pandas or cftime. In other words it won't necessarily fail, but it will not work quite like the other array types, where the type you put in is what you get out.
xarray/coding/times.py (outdated):

```python
    f"Got a units encoding of {units} and a dtype encoding of {dtype}."
)

num = dask.array.map_blocks(
```
Suggested change:

```diff
-num = dask.array.map_blocks(
+num = chunkmanager.map_blocks(
```

There's a `chunkmanager` dance now to handle both dask and cubed.
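A self-contained sketch of that dance (the class and method names below are simplified assumptions, not xarray's exact `parallelcompat` API): a registry of chunk managers is consulted to find the one that recognizes the array, and `map_blocks` is then routed through it, so dask and cubed arrays each get their native implementation.

```python
import numpy as np


class ChunkManager:
    """Minimal stand-in for a chunk-manager abstraction."""

    def is_chunked_array(self, arr) -> bool:
        raise NotImplementedError

    def map_blocks(self, func, arr, dtype=None):
        raise NotImplementedError


class EagerManager(ChunkManager):
    # Fallback for plain NumPy: just apply the function directly.
    def is_chunked_array(self, arr) -> bool:
        return isinstance(arr, np.ndarray)

    def map_blocks(self, func, arr, dtype=None):
        return func(arr)


# dask and cubed would each register their own manager here.
_MANAGERS = [EagerManager()]


def get_chunked_array_type(arr) -> ChunkManager:
    # Pick whichever registered manager recognizes the array type.
    for manager in _MANAGERS:
        if manager.is_chunked_array(arr):
            return manager
    raise TypeError(f"no chunk manager registered for {type(arr)}")


chunkmanager = get_chunked_array_type(np.arange(3))
result = chunkmanager.map_blocks(lambda x: x * 2, np.arange(3))
```

The call site then never mentions `dask.array` directly; the chunked-array backend is resolved from the data itself.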
Ah that's neat! Is there anything special I need to do to test this or is it sufficient to demonstrate that it at least works with dask (as my tests already do)?
One other minor detail is how to handle typing with this. I gave it a shot (hopefully I'm in the right ballpark), but I still encounter this error:

```
xarray/coding/times.py: note: In function "_lazily_encode_cf_datetime":
xarray/coding/times.py:860: error: "T_ChunkedArray" has no attribute "dtype"  [attr-defined]
```

I'm not a typing expert so any guidance would be appreciated!
> Is there anything special I need to do to test this or is it sufficient to demonstrate that it at least works with dask (as my tests already do)?

I believe the dask tests are fine since the cubed tests live elsewhere.

> error: "T_ChunkedArray" has no attribute "dtype" [attr-defined]

cc @TomNicholas
> I believe the dask tests are fine since the cubed tests live elsewhere.

Yep.

> Is there anything special I need to do

The correct `ChunkManager` for the type of the chunked array needs to be detected, but you've already done that bit!

> error: "T_ChunkedArray" has no attribute "dtype" [attr-defined]

This can't be typed properly yet (because we don't yet have a full protocol to describe the `T_DuckArray` in `xarray.core.types.py`), but try changing the definition of `T_ChunkedArray` in `parallelcompat.py` to this:

```python
T_ChunkedArray = TypeVar("T_ChunkedArray", bound=Any)
```
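For illustration, here is the effect of `bound=Any` in isolation (my own minimal example, not code from the PR). Runtime behavior is unchanged; only the static check differs: with the bound set to `Any`, mypy treats values typed as the TypeVar as `Any`, so attribute access such as `.dtype` no longer triggers `[attr-defined]`.

```python
from typing import Any, TypeVar

import numpy as np

# bound=Any makes mypy treat T_ChunkedArray-typed values as Any,
# silencing "has no attribute" errors on duck-array attributes.
T_ChunkedArray = TypeVar("T_ChunkedArray", bound=Any)


def dtype_of(arr: T_ChunkedArray) -> np.dtype:
    # Without bound=Any, mypy would reject this attribute access
    # because an unbounded TypeVar is assumed to be a bare object.
    return arr.dtype


d = dtype_of(np.zeros(2, dtype="int64"))
```

This is a blunt instrument; a proper duck-array protocol would be the long-term fix, as noted above.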
Gotcha, thanks for the typing guidance @TomNicholas. Adding `bound=Any` to the `T_ChunkedArray` definition indeed solves the `dtype` attribute issue. It seems I am not able to use `T_DuckArray` as an input type for a function, however; I'm getting errors like this:

```
error: Cannot use a covariant type variable as a parameter  [misc]
```

Do you have any thoughts on how to handle that? I'm guessing the `covariant=True` parameter was added for a reason in its definition?

Line 176 in 41d33f5:

```python
T_DuckArray = TypeVar("T_DuckArray", bound=Any, covariant=True)
```

Does #8575 (comment) make the output side of this more difficult to handle as well (at least if we want to be as accurate as possible)?
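A minimal example of the variance restriction mypy is enforcing here (my own illustration, not code from this PR): a covariant type variable may appear in return position, but annotating an ordinary method parameter with it is rejected, because that would let a container of a base type masquerade as a container of a subtype while still accepting base-type values.

```python
from typing import Generic, TypeVar

T_co = TypeVar("T_co", covariant=True)


class Box(Generic[T_co]):
    def __init__(self, value: T_co) -> None:
        # __init__ is exempt from the variance check in mypy.
        self._value = value

    def get(self) -> T_co:
        # Covariant type variables are fine in return position.
        return self._value

    # By contrast, a method like
    #     def set(self, value: T_co) -> None: ...
    # is rejected with "Cannot use a covariant type variable as a
    # parameter", since it would make Box unsound under covariance.


b = Box(3)
```

This is why `T_DuckArray` works smoothly as an output annotation but fights back when used as a function input.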
@TomNicholas do you have any thoughts on this? From my perspective the only thing holding this PR up at this point is typing.
IMO a `type: ignore` is fine.
Hey, sorry for forgetting about this @spencerkclark!

> I'm guessing the covariant=True parameter was added for a reason in its definition?

Yes - that fixes some typing errors in the `ChunkManager` methods IIRC. Unfortunately I don't yet understand how to make that work and also fix your error 🙁

> the only thing holding this PR up at this point is typing.

I agree with Deepak that typing should not hold this up. Even if it needs some `ignore`s then the typing you've added is still very useful to indicate intent!
Sounds good — the error message now looks like this:
Thanks @dcherian and @TomNicholas! Let me know if the typing looks good now. As an aside, in light of #8641, I tweaked the netCDF3 coercion error message again to cover more possibilities.
Thanks a lot @spencerkclark!

Thanks all for your patience — glad to see this in!
I finally had a moment to think about this some more following discussion in #8253. This PR adds a chunk-friendly code path to `encode_cf_datetime` and `encode_cf_timedelta`, which enables lazy encoding of time-like values, and by extension, preservation of chunks when writing time-like values to zarr. With these changes, the test added by @malmans2 in #8253 passes.

Though it largely reuses existing code, the lazy encoding implemented in this PR is stricter than eager encoding in a couple of ways: the dtype is set to `np.int64`, and the units are either `"nanoseconds since 1970-01-01"` or `"microseconds since 1970-01-01"`, depending on whether we are encoding `np.datetime64[ns]` values or `cftime.datetime` objects. In the case of `timedelta64[ns]` values, the units are set to `"nanoseconds"`.

As part of this PR, since dask requires we know the dtype of the array returned by the function passed to `map_blocks`, I also added logic to handle casting to the specified encoding dtype in an overflow-and-integer safe manner. This means an informative error message would be raised in the situation described in #8542.

I eventually want to think about this on the decoding side as well, but that can wait for another PR.
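The overflow-and-integer-safe casting described above can be sketched roughly as follows (a simplified illustration under my own assumptions, not the exact implementation in `xarray/coding/times.py`): cast to the requested dtype, then verify that every value round-trips, raising an informative error when it does not.

```python
import numpy as np


def cast_to_int_if_safe(num: np.ndarray, dtype: np.dtype) -> np.ndarray:
    # Cast to the requested integer dtype, then check that every value
    # survived unchanged. Integer overflow in astype is silent (values
    # wrap), so the round-trip comparison is what detects it and lets
    # us raise a clear error instead of writing corrupted times.
    cast = num.astype(dtype)
    if not (cast == num).all():
        raise OverflowError(
            f"Not possible to cast encoded times from {num.dtype!r} to "
            f"{dtype!r} without overflow. Consider removing the dtype "
            "encoding or specifying coarser time units."
        )
    return cast


# Values that fit in the target dtype round-trip cleanly:
small = cast_to_int_if_safe(np.array([1, 2, 3], dtype="int64"), np.dtype("int32"))
```

Because the check runs per block inside `map_blocks`, the error surfaces lazily at compute/write time rather than when the encoding is set up.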