
Consolidate dimension coordinates #210

Merged

Conversation

TomAugspurger (Contributor)

This consolidates dimension coordinates to address the performance
issues from having many small coordinate chunks.

Closes #209

The tests will fail right now with

```
/home/taugspurger/src/pangeo-forge/pangeo-forge-recipes/tests/recipe_tests/test_XarrayZarrRecipe.py:282: AssertionError: Left and right Dataset objects are not identical

>>>>>>>>>>>>>>>> traceback >>>>>>>>>>>>>>>>
E   AssertionError: Left and right Dataset objects are not identical

    Differing coordinates:
    L * time     (time) datetime64[ns] NaT 2010-01-02 ... 2010-01-09 2010-01-10
    R * time     (time) datetime64[ns] 2010-01-01 2010-01-02 ... 2010-01-10
    Differing data variables:
    L   foo      (time, lat, lon) float64 0.417 0.7203 0.0001144 ... 0.1179 0.3748
        long_name: Fantastic Foo
    R   foo      (time, lat, lon) float64 0.417 0.7203 0.0001144 ... 0.1179 0.3748
        long_name: Fantastic Foo
    L   bar      (time, lat, lon) int64 9 4 3 2 2 8 0 0 4 8 ... 9 8 4 4 6 0 3 3 9 5
        long_name: Beautiful Bar
    R   bar      (time, lat, lon) int64 9 4 3 2 2 8 0 0 4 8 ... 9 8 4 4 6 0 3 3 9 5
        long_name: Beautiful Bar
>>>>>>>>>>>>>>>> entering PDB >>>>>>>>>>>>>>>>
```

It's a bit hard to see, but the time value changed: the first value, 0, is now being interpreted as NaT by xarray. I'm sure I'm missing something basic with metadata when rewriting the variable, but I haven't found it yet. Posting this before I figure it out in case someone knows the answer offhand.

```python
group = zarr.open(target_mapper)
for dim in ds.dims:
    attrs = dict(group[dim].attrs)
    data = group[dim][:]
```
TomAugspurger (Contributor, Author)

This assumes the coordinate fits in-memory on a single machine. I don't know if we're assuming that anywhere else (probably in the tests), but it's probably a safe assumption for now.

Contributor

I think it's a safe assumption, since the dimension coordinates are guaranteed to be 1D. We could note this caveat in the docs on consolidate_dimension_coordinates.

It also assumes that dim in group. But that's not always the case. Xarray Datasets can have dimensions with no corresponding coordinate, which would give a KeyError here.

rabernat (Contributor) left a comment

I'm confused about how this implementation actually changes the chunks. To me it looks like you're just reading the data and writing it back to the same Zarr Array. But I am probably missing something.

```python
if consolidate_dimension_coordinates:
    logger.info("Consolidating dimension coordinate arrays")
    target_mapper = target.get_mapper()
    ds = xr.open_zarr(target_mapper)  # Probably a better way to get the dimension coords?
```
Contributor

You could make a set out of the _ARRAY_DIMENSIONS attribute on each array in the group.

TomAugspurger (Contributor, Author)

Thanks. Done in 1c36e21.


```python
for dim in ds.dims:
    attrs = dict(group[dim].attrs)
    data = group[dim][:]
    group[dim] = data
```
Contributor

AFAICT this will not actually change the chunking. Since the array already exists, this statement will stripe the data over the existing chunks.

```python
attrs = dict(group[dim].attrs)
data = group[dim][:]
group[dim] = data
group[dim].attrs.update(attrs)
```
Contributor

Why are you updating the attrs here? It doesn't look like they could have changed at all.

rabernat (Contributor)

> The value of 0 is being interpreted as a NaT by xarray now.

This may indicate that the encoding has changed. To debug, I would open the target data with decode_times=False and see what is in the time array. Then call xr.decode_cf and look at the .encoding attribute on time.

TomAugspurger (Contributor, Author) commented Sep 23, 2021

> To me it looks like you're just reading the data and writing it back to the same Zarr Array

IIUC, the difference is that group["key"][:] = value writes with the same structure (chunks, metadata, etc.), while group["key"] = value completely overwrites the old array, so it gets new chunks.

```python
In [3]: group = zarr.group()

In [4]: group["a"] = zarr.ones(10, chunks=(2,))

In [5]: data = group["a"][:]

In [6]: group["a"][:] = data

In [7]: group["a"].chunks
Out[7]: (2,)

In [8]: group["a"] = data

In [9]: group["a"].chunks
Out[9]: (10,)
```

Which should answer #210 (comment) and #210 (comment).

However, writing an array with attrs to a group like group[key] = array_with_attrs doesn't actually result in an array with attrs in the store. I haven't looked to see if this is a bug in Zarr or not.

b2fcf0d has a fix that passes tests by using group.array(...).

```python
RecipeClass, file_pattern, kwargs, ds_expected, target = netCDFtoZarr_recipe

rec = RecipeClass(file_pattern, **kwargs)
rec.consolidate_dimension_coordinates = False
```
Contributor

Perfect. My one last comment was going to be that we needed a test for both options, but that's already here.

Successfully merging this pull request may close these issues:

Ensure dimension coordinates are not chunked?