
Consolidate dimension coordinates #210

Merged

Conversation

TomAugspurger (Contributor)

This consolidates dimension coordinates to address the performance
issues from having many small coordinate chunks.

Closes #209

The tests will fail right now with

```
/home/taugspurger/src/pangeo-forge/pangeo-forge-recipes/tests/recipe_tests/test_XarrayZarrRecipe.py:282: AssertionError: Left and right Dataset objects are not identical

>>>>>>>>>>>>>>>> traceback >>>>>>>>>>>>>>>>
E   AssertionError: Left and right Dataset objects are not identical

    Differing coordinates:
    L * time     (time) datetime64[ns] NaT 2010-01-02 ... 2010-01-09 2010-01-10
    R * time     (time) datetime64[ns] 2010-01-01 2010-01-02 ... 2010-01-10
    Differing data variables:
    L   foo      (time, lat, lon) float64 0.417 0.7203 0.0001144 ... 0.1179 0.3748
        long_name: Fantastic Foo
    R   foo      (time, lat, lon) float64 0.417 0.7203 0.0001144 ... 0.1179 0.3748
        long_name: Fantastic Foo
    L   bar      (time, lat, lon) int64 9 4 3 2 2 8 0 0 4 8 ... 9 8 4 4 6 0 3 3 9 5
        long_name: Beautiful Bar
    R   bar      (time, lat, lon) int64 9 4 3 2 2 8 0 0 4 8 ... 9 8 4 4 6 0 3 3 9 5
        long_name: Beautiful Bar
>>>>>>>>>>>>>>>> entering PDB >>>>>>>>>>>>>>>>
```

It's a bit hard to see, but the time value changed: the first value, 0, is now being interpreted as NaT by xarray. I'm sure I'm missing something basic with metadata when rewriting the variable, but I haven't found it yet. Posting this before I figure it out in case someone knows the answer offhand.

```python
group = zarr.open(target_mapper)
for dim in ds.dims:
    attrs = dict(group[dim].attrs)
    data = group[dim][:]
```
TomAugspurger (Contributor, Author)

This assumes the coordinate fits in-memory on a single machine. I don't know if we're assuming that anywhere else (probably in the tests), but it's probably a safe assumption for now.

Contributor

I think it's a safe assumption, since the dimension coordinates are guaranteed to be 1D. We could note this caveat in the docs on consolidate_dimension_coordinates.

It also assumes that dim in group. But that's not always the case. Xarray Datasets can have dimensions with no corresponding coordinate, which would give a KeyError here.

rabernat (Contributor) left a comment

I'm confused about how this implementation actually changes the chunks. To me it looks like you're just reading the data and writing it back to the same Zarr Array. But I am probably missing something.

```python
if consolidate_dimension_coordinates:
    logger.info("Consolidating dimension coordinate arrays")
    target_mapper = target.get_mapper()
    ds = xr.open_zarr(target_mapper)  # Probably a better way to get the dimension coords?
```
Contributor

You could make a set out of the _ARRAY_DIMENSIONS attribute on each array in the group.

TomAugspurger (Contributor, Author)

Thanks. Done in 1c36e21.


```python
for dim in ds.dims:
    attrs = dict(group[dim].attrs)
    data = group[dim][:]
    group[dim] = data
```
Contributor

AFAICT this will not actually change the chunking. Since the array already exists, this statement will stripe the data over the existing chunks.

```python
attrs = dict(group[dim].attrs)
data = group[dim][:]
group[dim] = data
group[dim].attrs.update(attrs)
```
Contributor

Why are you updating the attrs here? It doesn't look like they could have changed at all.

rabernat (Contributor)

> The value of 0 is being interpreted as a NaT by xarray now.

This may indicate that the encoding has changed. To debug, I would open the target data with decode_times=False and see what is in the time array. Then call xr.decode_cf and look at the .encoding attribute on time.

TomAugspurger (Contributor, Author) commented Sep 23, 2021

> To me it looks like you're just reading the data and writing it back to the same Zarr Array

IIUC, the difference is that group["key"][:] = value writes with the same structure (chunks, metadata, etc.), while group["key"] = value completely overwrites the old array, so it gets new chunks.

```python
In [3]: group = zarr.group()

In [4]: group["a"] = zarr.ones(10, chunks=(2,))

In [5]: data = group["a"][:]

In [6]: group["a"][:] = data

In [7]: group["a"].chunks
Out[7]: (2,)

In [8]: group["a"] = data

In [9]: group["a"].chunks
Out[9]: (10,)
```

Which should answer #210 (comment) and #210 (comment).

However, writing an array with attrs to a group like group[key] = array_with_attrs doesn't actually result in an array with attrs in the store. I haven't looked to see if this is a bug in Zarr or not.

b2fcf0d has a fix that passes tests by using group.array(...).

```python
RecipeClass, file_pattern, kwargs, ds_expected, target = netCDFtoZarr_recipe

rec = RecipeClass(file_pattern, **kwargs)
rec.consolidate_dimension_coordinates = False
```
Contributor

Perfect. My one last comment was going to be that we needed a test for both options, but that's already here.

Successfully merging this pull request may close these issues:

Ensure dimension coordinates are not chunked?