Fix open_mfdataset() dropping time encoding attrs (#309)
Conversation
Hey Steve, here's my first attempt at fixing #308. It is a relatively minimal update that involves adding a new function called `_keep_time_encoding()` and refactoring a few others.
I will open up a separate issue to ensure we address the FIXME comments and avoid tech debt.
xcdat/dataset.py (Outdated)
```python
def _keep_time_encoding(paths: Paths) -> Dict[Hashable, Any]:
    """
    Returns the time encoding attributes from the first dataset in a list of
    paths.

    Time encoding information is critical for several xCDAT operations such as
    temporal averaging (e.g., uses the "calendar" attr). This function is a
    workaround to the undesired xarray behavior/quirk with
    `xr.open_mfdataset()`, which drops the `.encoding` dict from the final
    merged dataset (refer to https://github.com/pydata/xarray/issues/2436).

    Parameters
    ----------
    paths : Paths
        The paths to the dataset(s).

    Returns
    -------
    Dict[Hashable, Any]
        The time encoding dictionary.
    """
    first_path = _get_first_path(paths)

    # xcdat.open_dataset() is called instead of xr.open_dataset() because
    # we want to handle decoding non-CF compliant time units as well.
    # FIXME: Remove `type: ignore` comment after properly handling the type
    # annotations in `_get_first_path()`.
    ds = open_dataset(first_path, decode_times=True, add_bounds=False)  # type: ignore

    time_coord = get_axis_coord(ds, "T")

    return time_coord.encoding
```
The new `_keep_time_encoding()` function.
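To make the workaround pattern concrete, here is a minimal standalone sketch that mimics the flow with plain dicts standing in for xarray objects. All names here (`fake_open_dataset`, `fake_open_mfdataset`, `open_mfdataset_with_encoding`) are illustrative stand-ins, not xcdat's or xarray's API:

```python
def fake_open_dataset(path):
    # A single-file open preserves the time encoding (e.g., the calendar).
    return {
        "time": {
            "encoding": {"calendar": "standard", "units": "days since 2000-01-01"}
        }
    }

def fake_open_mfdataset(paths):
    # Mimics the xarray quirk: the merged result loses `.encoding`.
    return {"time": {"encoding": {}}}

def open_mfdataset_with_encoding(paths):
    # 1. Read the time encoding from the first file *before* merging.
    time_encoding = fake_open_dataset(paths[0])["time"]["encoding"]

    # 2. Open and merge all files (the encoding is dropped here).
    ds = fake_open_mfdataset(paths)

    # 3. Reattach the saved encoding to the merged time coordinate.
    ds["time"]["encoding"] = time_encoding
    return ds

ds = open_mfdataset_with_encoding(["a.nc", "b.nc"])
print(ds["time"]["encoding"]["calendar"])  # standard
```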
```python
def _get_first_path(path: Paths) -> Optional[Union[pathlib.Path, str]]:
    """Returns the first path from a list of paths.

    Parameters
    ----------
    path : Paths
        A list of paths.

    Returns
    -------
    Optional[Union[pathlib.Path, str]]
        The first path from a list of paths.
    """
    # FIXME: This function should throw an exception if the first file
    # is not a supported type.
    # FIXME: The `type: ignore` comments should be removed after properly
    # handling the types.
    first_file: Optional[Union[pathlib.Path, str]] = None

    if isinstance(path, str) and "*" in path:
        first_file = glob(path)[0]
    elif isinstance(path, (str, pathlib.Path)):
        first_file = path
    elif isinstance(path, list):
        if any(isinstance(sublist, list) for sublist in path):
            first_file = path[0][0]  # type: ignore
        else:
            first_file = path[0]  # type: ignore

    return first_file
```
Extracted `_get_first_path()` to be reused in other functions (`_keep_time_encoding()` and `_has_cf_compliant_time()`).
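One subtlety worth noting for the glob branch: `glob()` does not guarantee any ordering, so `glob(path)[0]` can return a different file depending on the filesystem. Sorting the matches makes the "first" file deterministic. This is a standalone sketch of the caveat, not xcdat code:

```python
import os
import tempfile
from glob import glob

# Create two files whose creation order differs from their sorted order.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("tas_2001.nc", "tas_2000.nc"):
        open(os.path.join(tmp, name), "w").close()

    # sorted() guarantees the lexicographically first match is returned,
    # regardless of the order glob() happens to yield.
    first_file = sorted(glob(os.path.join(tmp, "*.nc")))[0]
    first_name = os.path.basename(first_file)

print(first_name)  # tas_2000.nc
```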
xcdat/dataset.py (Outdated)
```python
    if time_encoding is not None:
        time_dim = get_axis_dim(dataset, "T")
        dataset[time_dim].encoding = time_encoding
```
There were two options for where to include the logic that adds the `time_encoding` attrs back to the final dataset:

1. In `_postprocess_dataset()`, which is shown here. On the other hand, this function is used by both `open_dataset()` and `open_mfdataset()`, while this logic is only needed for `open_mfdataset()`.
2. Keep this logic in `open_mfdataset()`, based on the above reasoning.

I chose 1 since it is a "postprocessing" operation, although I am open to 2 since the logic is specific to `open_mfdataset()`. I will think about this more.
tests/test_dataset.py (Outdated)
```python
def test_mfdataset_keeps_time_encoding_dict(self):
    # FIXME: This test always passes because `xr.open_mfdataset()` always
    # keeps the time encoding attrs, which isn't the expected behavior.
    # Based on this test, if datasets are generated in xarray and written
    # out with `to_netcdf()` and then opened and merged using
    # `xr.open_mfdataset()`, the time encoding attributes are not dropped.
    # On the other hand, if multiple real world datasets that did not
    # originate from xarray (written out with `.to_netcdf()`) are opened
    # using `xr.open_mfdataset()`, the time encoding attrs are dropped.
    # (Refer to https://github.com/pydata/xarray/issues/2436). My theory is
    # that xarray maintains the time encoding attrs if datasets are written
    # out with `.to_netcdf()`, and drops them in other cases such as
    # opening multiple datasets from other sources.
```
A giant FIXME comment explaining why this test always passes.
One workaround might be to include a subset of a real world dataset in the test suite, although I prefer to not do this to keep the test suite lightweight and reproducible without the need to reference a specific file. We can address this in the future.
I'm taking a closer look at this since the test is failing in the build, and the branch wasn't rebased on the latest main with #302.
After fixing the expected dataset, the test still always passes even without `_keep_time_encoding()` being called. The FIXME comment is still valid.
```python
# Type annotation for the `paths` arg.
Paths = Union[
    str,
    pathlib.Path,
    List[str],
    List[pathlib.Path],
    List[List[str]],
    List[List[pathlib.Path]],
]
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extracted the `Paths` type annotation since it was reused in multiple functions and was making the function param definitions long.
I noticed xarray refactored the `paths` type annotation here: https://github.com/pydata/xarray/blob/f8fee902360f2330ab8c002d54480d357365c172/xarray/backends/api.py#L734

```python
paths: str | NestedSequence[str | os.PathLike],
```
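A recursive helper in the spirit of xarray's `NestedSequence` annotation could also handle arbitrary nesting depth, which the current two-level `path[0][0]` logic cannot. This is an illustrative sketch, not xcdat's implementation; the `first_path` name is hypothetical:

```python
import pathlib
from collections.abc import Sequence
from typing import Union

PathLike = Union[str, pathlib.Path]

def first_path(paths) -> PathLike:
    """Return the first leaf path from an arbitrarily nested sequence.

    Strings and pathlib.Path objects are leaves; any other sequence is
    recursed into via its first element.
    """
    # A str is itself a Sequence, so check for leaf types first.
    if isinstance(paths, (str, pathlib.Path)):
        return paths
    if isinstance(paths, Sequence) and len(paths) > 0:
        return first_path(paths[0])
    raise ValueError(f"No paths found in {paths!r}")

print(first_path([["a.nc", "b.nc"], ["c.nc"]]))  # a.nc
```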
```python
# FIXME: This function should throw an exception if the first file
# is not a supported type.
# FIXME: The `type: ignore` comments should be removed after properly
# handling the types.
```
I added a few FIXME comments to `_get_first_path()`. The first FIXME is a bit tricky because of the nesting of elements, so I deferred it to later.
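For the first FIXME, the validation itself could look something like the sketch below. The extension set and the `validate_first_path` name are assumptions for illustration; xcdat's actual list of supported types may differ:

```python
import pathlib

# Hypothetical set of supported extensions -- xcdat's actual list may differ.
SUPPORTED_EXTENSIONS = {".nc", ".nc4", ".xml"}

def validate_first_path(path) -> None:
    """Raise ValueError if the file is not a supported type."""
    suffix = pathlib.Path(path).suffix.lower()
    if suffix not in SUPPORTED_EXTENSIONS:
        raise ValueError(
            f"Unsupported file type {suffix!r} for {path!r}; "
            f"expected one of {sorted(SUPPORTED_EXTENSIONS)}."
        )

validate_first_path("tas_mon.nc")  # OK: no exception raised
```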
@tomvothecoder - I reviewed this PR and the fix for the mfdataset encoding works great.

Since the fix is only a couple of lines, I wonder if it makes sense to go with Option 2 and just fix this in the `xcdat.open_mfdataset` function? It might make sense to keep it in `_postprocess_dataset()` if there were instances where the encoding wasn't handled correctly during an `xcdat.open_dataset`, but I don't think this should happen.
I don't view this as an urgent PR – so it might also make sense to fix the broken test (which looks like a surprising amount of work?).
Thanks for the input. I agree.

I marked this as critical because
@pochedls, I managed to fix the test and added additional improvements.
I will merge this PR now.
tests/test_dataset.py (Outdated)
```python
# FIXME: This test always passes because `xr.open_mfdataset()` always
# keeps the time encoding attrs, which isn't the expected behavior.
# Based on this test, if datasets are generated in xarray and written
# out with `to_netcdf()` and then opened and merged using
# `xr.open_mfdataset()`, the time encoding attributes are not dropped.
# On the other hand, if multiple real world datasets that did not
# originate from xarray (written out with `.to_netcdf()`) are opened
# using `xr.open_mfdataset()`, the time encoding attrs are dropped.
# (Refer to https://github.com/pydata/xarray/issues/2436). My theory is
# that xarray maintains the time encoding attrs if datasets are written
# out with `.to_netcdf()`, and drops them in other cases such as
# opening multiple datasets from other sources.
ds1 = generate_dataset(cf_compliant=True, has_bounds=True)
ds1.to_netcdf(self.file_path1)

# Create another dataset that extends the time coordinates by 1 value,
# to mimic a multifile dataset.
ds2 = generate_dataset(cf_compliant=True, has_bounds=True)
ds2 = ds2.rename_vars({"ts": "tas"})
ds2 = ds2.isel(dict(time=slice(0, 1)))
ds2["time"].values[:] = np.array(
    ["2002-01-16T12:00:00.000000000"],
    dtype="datetime64[ns]",
)
```
My theory about xarray-generated datasets keeping time encoding attributes was wrong. Xarray drops time encoding attributes if coordinates need to be merged. If datasets share the same coordinates, no merging needs to be performed, so the time encoding attributes are maintained, which was what happened in the old version of this test.

I fixed this test by opening two datasets, with the second being an extension of the first by 1 time coordinate.
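The essence of the fix is that the second dataset's time axis must differ from the first so a real merge occurs (the case where xarray drops the encoding). A stdlib-only sketch of that idea, with illustrative monthly-midpoint values:

```python
from datetime import datetime

# Time axis of the first dataset: monthly midpoints for 2001 (illustrative).
time1 = [datetime(2001, month, 16, 12) for month in range(1, 13)]

# The second dataset extends the first by exactly one time step, so the
# merged coordinate differs from both inputs and a genuine merge of the
# time axes must happen.
time2 = [datetime(2002, 1, 16, 12)]

merged = sorted(set(time1) | set(time2))
print(len(time1), len(merged))  # 12 13
```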
# Update "original_shape" to reflect the final time coordinates shape. | ||
ds[time_dim].encoding["original_shape"] = ds[time_dim].shape |
Use the "original_shape" of the final merged time coordinates.
```python
time_coord = get_axis_coord(ds, "T")

time_encoding = time_coord.encoding
time_encoding["source"] = paths
```
Set "source" to the paths
arg.
Codecov Report

```
@@            Coverage Diff            @@
##              main      #309   +/-   ##
=========================================
  Coverage   100.00%   100.00%
=========================================
  Files           14        14
  Lines         1170      1187    +17
=========================================
+ Hits          1170      1187    +17
```
Description

`xarray.open_mfdataset` drops `.encoding` attributes, need to handle with `xcdat.open_mfdataset()` #308

Checklist

If applicable: