Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix open_mfdataset() dropping time encoding attrs #309

Merged
merged 6 commits into from
Aug 15, 2022

Conversation

tomvothecoder
Copy link
Collaborator

@tomvothecoder tomvothecoder commented Aug 11, 2022

Description

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules

If applicable:

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass with my changes (locally and CI/CD build)
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have noted that this is a breaking change for a major release (fix or feature that would cause existing functionality to not work as expected)

@tomvothecoder tomvothecoder added type: bug Inconsistencies or issues which will cause an issue or problem for users or implementors. Priority: Critical labels Aug 11, 2022
@tomvothecoder tomvothecoder self-assigned this Aug 11, 2022
Copy link
Collaborator Author

@tomvothecoder tomvothecoder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Steve, here's my first attempt at fixing #308. It is a relatively minimal update that involves adding a new function called _keep_time_encoding() and refactoring a few others.

I will open up a separate issue to ensure we address the FIXME comments and avoid tech debt.

xcdat/dataset.py Outdated
Comment on lines 409 to 452
def _keep_time_encoding(paths: Paths) -> Dict[Hashable, Any]:
"""
Returns the time encoding attributes from the first dataset in a list of
paths.

Time encoding information is critical for several xCDAT operations such as
temporal averaging (e.g., uses the "calendar" attr). This function is a
workaround to the undesired xarray behavior/quirk with
`xr.open_mfdataset()`, which drops the `.encoding` dict from the final
merged dataset (refer to https://github.com/pydata/xarray/issues/2436).

Parameters
----------
paths: Paths
The paths to the dataset(s).

Returns
-------
Dict[Hashable, Any]
The time encoding dictionary.
"""
first_path = _get_first_path(paths)

# xcdat.open_dataset() is called instead of xr.open_dataset() because
# we want to handle decoding non-CF compliant as well.
# FIXME: Remove `type: ignore` comment after properly handling the type
# annotations in `_get_first_path()`.
ds = open_dataset(first_path, decode_times=True, add_bounds=False) # type: ignore

time_coord = get_axis_coord(ds, "T")

return time_coord.encoding
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new _keep_time_encoding() function.

Comment on lines +492 to +533
def _get_first_path(path: Paths) -> Optional[Union[pathlib.Path, str]]:
"""Returns the first path from a list of paths.

Parameters
----------
path : Paths
A list of paths.

Returns
-------
str
Returns the first path from a list of paths.
"""
# FIXME: This function should throw an exception if the first file
# is not a supported type.
# FIXME: The `type: ignore` comments should be removed after properly
# handling the types.
first_file: Optional[Union[pathlib.Path, str]] = None

if isinstance(path, str) and "*" in path:
first_file = glob(path)[0]
elif isinstance(path, str) or isinstance(path, pathlib.Path):
first_file = path
elif isinstance(path, list):
if any(isinstance(sublist, list) for sublist in path):
first_file = path[0][0] # type: ignore
else:
first_file = path[0] # type: ignore

return first_file
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extracted _get_first_path() to be reused in other functions (_keep_time_encoding() and _has_cf_compliant_time()).

xcdat/dataset.py Outdated
Comment on lines 593 to 608
if time_encoding is not None:
time_dim = get_axis_dim(dataset, "T")
dataset[time_dim].encoding = time_encoding

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There were to options for where to include the logic for adding time_encoding attrs back to the final dataset:

  1. In _postprocess_dataset(), which is shown here
    • On the other hand, this function is used by both open_dataset() and open_mfdataset(), but this logic is only needed for open_mfdataset().
  2. Keep this logic stored in open_mfdataset(), based on above reasoning.

I chose 1 since it is a "postprocessing" operation, although I am open to 2 since the logic is specific to open_mfdataset().

I will think about this more.

Comment on lines 197 to 320
def test_mfdataset_keeps_time_encoding_dict(self):
# FIXME: This test always passes because `xr.open_mfdatset()` always
# keeps the time encoding attrs, which isn't the expected behavior.
# Based on this test, if datasets are generated in xarray and written
# out with `to_netcdf()` and then opened and merged using
# `xr.open_mfdataset()`, the time encoding attributes are not dropped.
# On the other hand, if multiple real world datasets that did not
# originate from xarray (written out with `.to_netcdf()`) are opened
# using `xr.open_mfdataset()`, the time encoding attrs are dropped.
# (Refer to https://github.com/pydata/xarray/issues/2436). My theory is
# that xarray maintains the time encoding attrs if datasets are written
# out with `.to_netcdf()`, and drops it for other cases such
# as opening multiple datasets from other sources.
Copy link
Collaborator Author

@tomvothecoder tomvothecoder Aug 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A giant FIXME: comment for why this test always passes.

One workaround might be to include a subset of a real world dataset in the test suite, although I prefer to not do this to keep the test suite lightweight and reproducible without the need to reference a specific file. We can address this in the future.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm taking a closer look at this since the test is failing in the build, and the branch wasn't rebased on the latest main with #302.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After fixing the expected dataset, it still always passes even without _keep_time_encoding() called.

The FIXME: comment is still valid.

Comment on lines +25 to +33
# Type annotation for the `paths` arg.
Paths = Union[
str,
pathlib.Path,
List[str],
List[pathlib.Path],
List[List[str]],
List[List[pathlib.Path]],
]
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extracted Paths type annotation since it was reused in multiple functions and making the function param definitions long.

I noticed xarray refactored the paths type annotation here: https://github.com/pydata/xarray/blob/f8fee902360f2330ab8c002d54480d357365c172/xarray/backends/api.py#L734

paths: str | NestedSequence[str | os.PathLike],

Comment on lines +505 to +520
# FIXME: This function should throw an exception if the first file
# is not a supported type.
# FIXME: The `type: ignore` comments should be removed after properly
# handling the types.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a few FIXME to _get_first_path(). The first FIXME is a bit tricky based on the nesting of elements, so I deferred it to later.

@tomvothecoder tomvothecoder requested a review from pochedls August 11, 2022 17:51
@tomvothecoder tomvothecoder force-pushed the bug/308-open_mfdataset-drop-encoding branch from a9916df to ef9c1dd Compare August 11, 2022 17:56
@tomvothecoder tomvothecoder marked this pull request as draft August 11, 2022 18:06
@tomvothecoder tomvothecoder marked this pull request as ready for review August 11, 2022 18:06
Copy link
Collaborator

@pochedls pochedls left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tomvothecoder - I reviewed this PR and the fix for the mfdataset encoding works great.

Since the fix is only a couple lines, I wonder if it makes sense to go with Option 2 and just fix this in the xcdat.open_mfdataset function? It might make sense to keep it in _postprocess_dataset() if there were instances where the encoding wasn't handled correctly during an xcdat.open_dataset, but I don't think this should happen.

I don't view this as an urgent PR – so it might also make sense to fix the broken test (which looks like a surprising amount of work?).

@tomvothecoder
Copy link
Collaborator Author

tomvothecoder commented Aug 15, 2022

Since the fix is only a couple lines, I wonder if it makes sense to go with Option 2 and just fix this in the xcdat.open_mfdataset function? It might make sense to keep it in _postprocess_dataset() if there were instances where the encoding wasn't handled correctly during an xcdat.open_dataset, but I don't think this should

Thanks for the input. I agree, _postprocess_dataset() is a reusable function so the logic for time encoding should be coupled to open_mfdataset() where it is needed. I just pushed a commit to follow option 2.

I don't view this as an urgent PR – so it might also make sense to fix the broken test (which > looks like a surprising amount of work?).

I marked this as critical because TemporalAccessor requires .encoding attrs (specifically "calendar"), which is not properly set with xr.open_mfdataset().

A possible quick fix to the test is to have two minimal, dummy .nc dataset with just a time axis and coordinates. It probably needs to be generated outside of xarray. I'll figure this out, and then merge this PR.
EDIT: I fixed the test by generating two datasets, with the second being an extension of the first one by 1 time coordinate. Refer to review comment below for more details.

@tomvothecoder tomvothecoder force-pushed the bug/308-open_mfdataset-drop-encoding branch from c2b9498 to 65181cc Compare August 15, 2022 18:07
@tomvothecoder tomvothecoder force-pushed the bug/308-open_mfdataset-drop-encoding branch from 65181cc to 86760e7 Compare August 15, 2022 18:10
Copy link
Collaborator Author

@tomvothecoder tomvothecoder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pochedls, I managed to fix the test and added additional improvements.

I will merge this PR now.

Comment on lines 309 to 319
# FIXME: This test always passes because `xr.open_mfdatset()` always
# keeps the time encoding attrs, which isn't the expected behavior.
# Based on this test, if datasets are generated in xarray and written
# out with `to_netcdf()` and then opened and merged using
# `xr.open_mfdataset()`, the time encoding attributes are not dropped.
# On the other hand, if multiple real world datasets that did not
# originate from xarray (written out with `.to_netcdf()`) are opened
# using `xr.open_mfdataset()`, the time encoding attrs are dropped.
# (Refer to https://github.com/pydata/xarray/issues/2436). My theory is
# that xarray maintains the time encoding attrs if datasets are written
# out with `.to_netcdf()`, and drops it for other cases such
# as opening multiple datasets from other sources.
ds1 = generate_dataset(cf_compliant=True, has_bounds=True)
ds1.to_netcdf(self.file_path1)

# Create another dataset that extends the time coordinates by 1 value,
# to mimic a multifile dataset.
ds2 = generate_dataset(cf_compliant=True, has_bounds=True)
ds2 = ds2.rename_vars({"ts": "tas"})
ds2 = ds2.isel(dict(time=slice(0, 1)))
ds2["time"].values[:] = np.array(
["2002-01-16T12:00:00.000000000"],
dtype="datetime64[ns]",
)
Copy link
Collaborator Author

@tomvothecoder tomvothecoder Aug 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My theory about xarray generated datasets keeping time encoding attributes was wrong. Xarray drops time encoding attributes if coordinates need to merged. If datasets share the same coordinates, no merging needs to be performed so the time encoding attributes are maintained, which was happening in the old version of this test.

I fixed this test by attempting to open two datasets, with the second being an extension of the first by 1 time coordinate.

Comment on lines +235 to +236
# Update "original_shape" to reflect the final time coordinates shape.
ds[time_dim].encoding["original_shape"] = ds[time_dim].shape
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use the "original_shape" of the final merged time coordinates.

time_coord = get_axis_coord(ds, "T")

time_encoding = time_coord.encoding
time_encoding["source"] = paths
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Set "source" to the paths arg.

@codecov-commenter
Copy link

codecov-commenter commented Aug 15, 2022

Codecov Report

Merging #309 (86760e7) into main (4c51879) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##              main      #309   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           14        14           
  Lines         1170      1187   +17     
=========================================
+ Hits          1170      1187   +17     
Impacted Files Coverage Δ
xcdat/dataset.py 100.00% <100.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@tomvothecoder tomvothecoder merged commit b6e1be8 into main Aug 15, 2022
@tomvothecoder tomvothecoder deleted the bug/308-open_mfdataset-drop-encoding branch August 15, 2022 18:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type: bug Inconsistencies or issues which will cause an issue or problem for users or implementors.
Projects
No open projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[Bug]: xarray.open_mfdataset drops .encoding attributes, need to handle with xcdat.open_mfdataset()
3 participants