Fix open_mfdataset() dropping time encoding attrs (#309)
Conversation
Hey Steve, here's my first attempt at fixing #308. It is a relatively minimal update that involves adding a new function called `_keep_time_encoding()` and refactoring a few others.
I will open up a separate issue to ensure we address the FIXME comments and avoid tech debt.
xcdat/dataset.py (Outdated)
```python
def _keep_time_encoding(paths: Paths) -> Dict[Hashable, Any]:
    """
    Returns the time encoding attributes from the first dataset in a list of
    paths.

    Time encoding information is critical for several xCDAT operations such as
    temporal averaging (e.g., uses the "calendar" attr). This function is a
    workaround to the undesired xarray behavior/quirk with
    `xr.open_mfdataset()`, which drops the `.encoding` dict from the final
    merged dataset (refer to https://github.com/pydata/xarray/issues/2436).

    Parameters
    ----------
    paths : Paths
        The paths to the dataset(s).

    Returns
    -------
    Dict[Hashable, Any]
        The time encoding dictionary.
    """
    first_path = _get_first_path(paths)

    # xcdat.open_dataset() is called instead of xr.open_dataset() because
    # we want to handle decoding non-CF compliant time units as well.
    # FIXME: Remove `type: ignore` comment after properly handling the type
    # annotations in `_get_first_path()`.
    ds = open_dataset(first_path, decode_times=True, add_bounds=False)  # type: ignore

    time_coord = get_axis_coord(ds, "T")

    return time_coord.encoding
```
The new `_keep_time_encoding()` function.
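To make the workaround pattern concrete, here is a minimal standalone sketch that mimics the flow with plain dicts standing in for xarray objects. All names here (`fake_open_dataset`, `fake_open_mfdataset`, `open_mfdataset_with_encoding`) are illustrative stand-ins, not xcdat's or xarray's API:

```python
def fake_open_dataset(path):
    # A single-file open preserves the time encoding (e.g., the calendar).
    return {
        "time": {
            "encoding": {"calendar": "standard", "units": "days since 2000-01-01"}
        }
    }

def fake_open_mfdataset(paths):
    # Mimics the xarray quirk: the merged result loses `.encoding`.
    return {"time": {"encoding": {}}}

def open_mfdataset_with_encoding(paths):
    # 1. Read the time encoding from the first file *before* merging.
    time_encoding = fake_open_dataset(paths[0])["time"]["encoding"]

    # 2. Open and merge all files (the encoding is dropped here).
    ds = fake_open_mfdataset(paths)

    # 3. Reattach the saved encoding to the merged time coordinate.
    ds["time"]["encoding"] = time_encoding
    return ds

ds = open_mfdataset_with_encoding(["a.nc", "b.nc"])
print(ds["time"]["encoding"]["calendar"])  # standard
```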
```python
def _get_first_path(path: Paths) -> Optional[Union[pathlib.Path, str]]:
    """Returns the first path from a list of paths.

    Parameters
    ----------
    path : Paths
        A list of paths.

    Returns
    -------
    Optional[Union[pathlib.Path, str]]
        The first path from a list of paths.
    """
    # FIXME: This function should throw an exception if the first file
    # is not a supported type.
    # FIXME: The `type: ignore` comments should be removed after properly
    # handling the types.
    first_file: Optional[Union[pathlib.Path, str]] = None

    if isinstance(path, str) and "*" in path:
        first_file = glob(path)[0]
    elif isinstance(path, (str, pathlib.Path)):
        first_file = path
    elif isinstance(path, list):
        if any(isinstance(sublist, list) for sublist in path):
            first_file = path[0][0]  # type: ignore
        else:
            first_file = path[0]  # type: ignore

    return first_file
```
Extracted `_get_first_path()` to be reused in other functions (`_keep_time_encoding()` and `_has_cf_compliant_time()`).
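One subtlety worth noting for the glob branch: `glob()` does not guarantee any ordering, so `glob(path)[0]` can return a different file depending on the filesystem. Sorting the matches makes the "first" file deterministic. This is a standalone sketch of the caveat, not xcdat code:

```python
import os
import tempfile
from glob import glob

# Create two files whose creation order differs from their sorted order.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("tas_2001.nc", "tas_2000.nc"):
        open(os.path.join(tmp, name), "w").close()

    # sorted() guarantees the lexicographically first match is returned,
    # regardless of the order glob() happens to yield.
    first_file = sorted(glob(os.path.join(tmp, "*.nc")))[0]
    first_name = os.path.basename(first_file)

print(first_name)  # tas_2000.nc
```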
xcdat/dataset.py (Outdated)
```python
    if time_encoding is not None:
        time_dim = get_axis_dim(dataset, "T")
        dataset[time_dim].encoding = time_encoding
```
There were two options for where to include the logic that adds the `time_encoding` attrs back to the final dataset:

1. In `_postprocess_dataset()`, which is shown here. On the other hand, this function is used by both `open_dataset()` and `open_mfdataset()`, while this logic is only needed for `open_mfdataset()`.
2. Keep this logic in `open_mfdataset()`, based on the above reasoning.

I chose 1 since it is a "postprocessing" operation, although I am open to 2 since the logic is specific to `open_mfdataset()`. I will think about this more.
tests/test_dataset.py (Outdated)
```python
def test_mfdataset_keeps_time_encoding_dict(self):
    # FIXME: This test always passes because `xr.open_mfdataset()` always
    # keeps the time encoding attrs, which isn't the expected behavior.
    # Based on this test, if datasets are generated in xarray and written
    # out with `to_netcdf()` and then opened and merged using
    # `xr.open_mfdataset()`, the time encoding attributes are not dropped.
    # On the other hand, if multiple real world datasets that did not
    # originate from xarray (written out with `.to_netcdf()`) are opened
    # using `xr.open_mfdataset()`, the time encoding attrs are dropped.
    # (Refer to https://github.com/pydata/xarray/issues/2436). My theory is
    # that xarray maintains the time encoding attrs if datasets are written
    # out with `.to_netcdf()`, and drops them in other cases such as
    # opening multiple datasets from other sources.
```
A giant FIXME comment explaining why this test always passes.
One workaround might be to include a subset of a real world dataset in the test suite, although I prefer to not do this to keep the test suite lightweight and reproducible without the need to reference a specific file. We can address this in the future.
I'm taking a closer look at this since the test is failing in the build, and the branch wasn't rebased on the latest main with #302.
After fixing the expected dataset, the test still always passes even without `_keep_time_encoding()` being called. The FIXME comment is still valid.
```python
# Type annotation for the `paths` arg.
Paths = Union[
    str,
    pathlib.Path,
    List[str],
    List[pathlib.Path],
    List[List[str]],
    List[List[pathlib.Path]],
]
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extracted the `Paths` type annotation since it was reused in multiple functions and was making the function param definitions long.
I noticed xarray refactored the `paths` type annotation here: https://github.com/pydata/xarray/blob/f8fee902360f2330ab8c002d54480d357365c172/xarray/backends/api.py#L734

```python
paths: str | NestedSequence[str | os.PathLike],
```
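A recursive helper in the spirit of xarray's `NestedSequence` annotation could also handle arbitrary nesting depth, which the current two-level `path[0][0]` logic cannot. This is an illustrative sketch, not xcdat's implementation; the `first_path` name is hypothetical:

```python
import pathlib
from collections.abc import Sequence
from typing import Union

PathLike = Union[str, pathlib.Path]

def first_path(paths) -> PathLike:
    """Return the first leaf path from an arbitrarily nested sequence.

    Strings and pathlib.Path objects are leaves; any other sequence is
    recursed into via its first element.
    """
    # A str is itself a Sequence, so check for leaf types first.
    if isinstance(paths, (str, pathlib.Path)):
        return paths
    if isinstance(paths, Sequence) and len(paths) > 0:
        return first_path(paths[0])
    raise ValueError(f"No paths found in {paths!r}")

print(first_path([["a.nc", "b.nc"], ["c.nc"]]))  # a.nc
```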
```python
# FIXME: This function should throw an exception if the first file
# is not a supported type.
# FIXME: The `type: ignore` comments should be removed after properly
# handling the types.
```
I added a few FIXME comments to `_get_first_path()`. The first FIXME is a bit tricky because of the nesting of elements, so I deferred it to later.
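For the first FIXME, the validation itself could look something like the sketch below. The extension set and the `validate_first_path` name are assumptions for illustration; xcdat's actual list of supported types may differ:

```python
import pathlib

# Hypothetical set of supported extensions -- xcdat's actual list may differ.
SUPPORTED_EXTENSIONS = {".nc", ".nc4", ".xml"}

def validate_first_path(path) -> None:
    """Raise ValueError if the file is not a supported type."""
    suffix = pathlib.Path(path).suffix.lower()
    if suffix not in SUPPORTED_EXTENSIONS:
        raise ValueError(
            f"Unsupported file type {suffix!r} for {path!r}; "
            f"expected one of {sorted(SUPPORTED_EXTENSIONS)}."
        )

validate_first_path("tas_mon.nc")  # OK: no exception raised
```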
@tomvothecoder - I reviewed this PR and the fix for the mfdataset encoding works great.

Since the fix is only a couple of lines, I wonder if it makes sense to go with Option 2 and just fix this in the `xcdat.open_mfdataset` function? It might make sense to keep it in `_postprocess_dataset()` if there were instances where the encoding wasn't handled correctly during an `xcdat.open_dataset`, but I don't think this should happen.
I don't view this as an urgent PR – so it might also make sense to fix the broken test (which looks like a surprising amount of work?).
Thanks for the input. I agree.

I marked this as critical because
@pochedls, I managed to fix the test and added additional improvements.
I will merge this PR now.
tests/test_dataset.py (Outdated)
```python
# FIXME: This test always passes because `xr.open_mfdataset()` always
# keeps the time encoding attrs, which isn't the expected behavior.
# Based on this test, if datasets are generated in xarray and written
# out with `to_netcdf()` and then opened and merged using
# `xr.open_mfdataset()`, the time encoding attributes are not dropped.
# On the other hand, if multiple real world datasets that did not
# originate from xarray (written out with `.to_netcdf()`) are opened
# using `xr.open_mfdataset()`, the time encoding attrs are dropped.
# (Refer to https://github.com/pydata/xarray/issues/2436). My theory is
# that xarray maintains the time encoding attrs if datasets are written
# out with `.to_netcdf()`, and drops them in other cases such as
# opening multiple datasets from other sources.
ds1 = generate_dataset(cf_compliant=True, has_bounds=True)
ds1.to_netcdf(self.file_path1)

# Create another dataset that extends the time coordinates by 1 value,
# to mimic a multifile dataset.
ds2 = generate_dataset(cf_compliant=True, has_bounds=True)
ds2 = ds2.rename_vars({"ts": "tas"})
ds2 = ds2.isel(dict(time=slice(0, 1)))
ds2["time"].values[:] = np.array(
    ["2002-01-16T12:00:00.000000000"],
    dtype="datetime64[ns]",
)
```
My theory about xarray-generated datasets keeping time encoding attributes was wrong. Xarray drops time encoding attributes if coordinates need to be merged. If datasets share the same coordinates, no merging needs to be performed, so the time encoding attributes are maintained, which was what happened in the old version of this test.

I fixed this test by opening two datasets, with the second being an extension of the first by 1 time coordinate.
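The essence of the fix is that the second dataset's time axis must differ from the first so a real merge occurs (the case where xarray drops the encoding). A stdlib-only sketch of that idea, with illustrative monthly-midpoint values:

```python
from datetime import datetime

# Time axis of the first dataset: monthly midpoints for 2001 (illustrative).
time1 = [datetime(2001, month, 16, 12) for month in range(1, 13)]

# The second dataset extends the first by exactly one time step, so the
# merged coordinate differs from both inputs and a genuine merge of the
# time axes must happen.
time2 = [datetime(2002, 1, 16, 12)]

merged = sorted(set(time1) | set(time2))
print(len(time1), len(merged))  # 12 13
```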
# Update "original_shape" to reflect the final time coordinates shape. | ||
ds[time_dim].encoding["original_shape"] = ds[time_dim].shape |
Use the "original_shape" of the final merged time coordinates.
```python
time_coord = get_axis_coord(ds, "T")

time_encoding = time_coord.encoding
time_encoding["source"] = paths
```
Set "source" to the paths
arg.
Codecov Report

```
@@            Coverage Diff            @@
##              main      #309   +/-   ##
=========================================
  Coverage   100.00%   100.00%
=========================================
  Files           14        14
  Lines         1170      1187    +17
=========================================
+ Hits          1170      1187    +17
```
Description

`xarray.open_mfdataset` drops `.encoding` attributes, need to handle with `xcdat.open_mfdataset()` #308

Checklist

If applicable: