Consolidate dimension coordinates #210
Conversation
This consolidates dimension coordinates to address the performance issues from having many small coordinate chunks. Closes pangeo-forge#209
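For context, a minimal sketch (illustrative only, not code from this PR) of the performance issue: a recipe that appends one input at a time can leave a 1-D dimension coordinate with one chunk per element, so opening the dataset means one small read per chunk. Consolidation rewrites the same values into a single chunk:

```python
# Illustrative sketch only (not from this PR): many tiny chunks vs. one chunk
# for a 1-D coordinate array.
import numpy as np
import zarr

group = zarr.group()  # in-memory group

# One chunk per element, as incremental appends can produce.
time = group.create_dataset("time", shape=(365,), chunks=(1,), dtype="i8")
time[:] = np.arange(365)
print(time.nchunks)  # 365 separate reads just to load a small 1-D coordinate

# The consolidated layout: same values, a single chunk.
group.create_dataset("time_consolidated", data=time[:], chunks=(365,))
print(group["time_consolidated"].nchunks)  # 1
```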
group = zarr.open(target_mapper)
for dim in ds.dims:
    attrs = dict(group[dim].attrs)
    data = group[dim][:]
This assumes the coordinate fits in-memory on a single machine. I don't know if we're assuming that anywhere else (probably in the tests), but it's probably a safe assumption for now.
I think it's a safe assumption, since the dimension coordinates are guaranteed to be 1D. We could note this caveat in the docs on consolidate_dimension_coordinates.

It also assumes that dim in group, but that's not always the case: Xarray Datasets can have dimensions with no corresponding coordinate, which would give a KeyError here.
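A minimal sketch of the guard being suggested here (the names follow the diff above; this is an illustration, not the PR's code):

```python
# Sketch of the suggested guard: skip dimensions that have no coordinate
# array in the Zarr group, which would otherwise raise a KeyError.
for dim in ds.dims:
    if dim not in group:
        continue  # e.g. a dimension with no corresponding coordinate variable
    attrs = dict(group[dim].attrs)
    data = group[dim][:]
```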
I'm confused about how this implementation actually changes the chunks. To me it looks like you're just reading the data and writing it back to the same Zarr Array. But I am probably missing something.
if consolidate_dimension_coordinates:
    logger.info("Consolidating dimension coordinate arrays")
    target_mapper = target.get_mapper()
    ds = xr.open_zarr(target_mapper)  # Probably a better way to get the dimension coords?
You could make a set out of the _ARRAY_DIMENSIONS attribute on each array in the group.
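A sketch of that suggestion, reusing the target_mapper from the diff above (illustrative, not necessarily the commit's exact code):

```python
# Collect dimension names directly from the Zarr metadata instead of opening
# the store with xarray: union _ARRAY_DIMENSIONS over every array in the group.
import zarr

group = zarr.open_group(target_mapper)
dims = set()
for _, arr in group.arrays():
    dims.update(arr.attrs.get("_ARRAY_DIMENSIONS", []))
```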
Thanks. Done in 1c36e21.
for dim in ds.dims:
    attrs = dict(group[dim].attrs)
    data = group[dim][:]
    group[dim] = data
AFAICT this will not actually change the chunking. Since the array already exists, this statement will stripe the data over the existing chunks.
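One way to make the intended chunking explicit, regardless of the assignment semantics discussed below, would be to recreate the coordinate array with a single chunk. A sketch building on the names in the diff, not necessarily what the PR does:

```python
# Sketch (not necessarily the PR's approach): recreate the coordinate array
# explicitly with one chunk spanning its full length, preserving its attrs.
attrs = dict(group[dim].attrs)
data = group[dim][:]
del group[dim]
group.create_dataset(dim, data=data, chunks=len(data))
group[dim].attrs.update(attrs)
```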
attrs = dict(group[dim].attrs)
data = group[dim][:]
group[dim] = data
group[dim].attrs.update(attrs)
Why are you updating the attrs here? It doesn't look like they could have changed at all.
This may indicate that the encoding has changed. To debug, I would open the target data with …
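The suggestion above is cut off. One possible way to inspect the stored encoding (an assumption, not necessarily what the reviewer had in mind) is to open the target both with CF decoding disabled and with zarr directly, using the target_mapper from the diff and the time coordinate discussed later in the thread:

```python
# Assumption: one way to check whether the stored encoding changed, not
# necessarily the approach the reviewer meant.
import xarray as xr
import zarr

raw = xr.open_zarr(target_mapper, decode_cf=False)
print(raw["time"].values[:5], dict(raw["time"].attrs))  # raw values plus units/calendar

z = zarr.open_group(target_mapper)
print(dict(z["time"].attrs), z["time"].fill_value)  # attrs and fill_value as stored
```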
IIUC, the difference is that:

In [3]: group = zarr.group()
In [4]: group["a"] = zarr.ones(10, chunks=(2,))
In [5]: data = group["a"][:]
In [6]: group["a"][:] = data
In [7]: group["a"].chunks
Out[7]: (2,)
In [8]: group["a"] = data
In [9]: group["a"].chunks
Out[9]: (10,)

Which should answer #210 (comment) and #210 (comment).

However, writing an array with attrs to a group like … b2fcf0d has a fix that passes tests by using …
RecipeClass, file_pattern, kwargs, ds_expected, target = netCDFtoZarr_recipe

rec = RecipeClass(file_pattern, **kwargs)
rec.consolidate_dimension_coordinates = False
Perfect...my one last comment was going to be that we needed a test for both options, but that's already here.
The tests will fail right now with … It's a bit hard to see, but the time value changed: the value of 0 is being interpreted as a NaT by xarray now. I'm sure I'm missing something basic with metadata when rewriting the variable, but I haven't found it yet. Posting this before I figure that out in case someone knows it offhand.
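A guess at the mechanism, stated as an assumption rather than a diagnosis: if the re-created time array picks up zarr's default fill_value of 0, xarray may treat that fill value as _FillValue during decoding, so a stored 0 is masked and then decoded to NaT. A hypothetical reproduction:

```python
# Hypothetical reproduction of the symptom (an assumption about the cause):
# a fill_value of 0 on the stored time variable can make xarray mask the
# value 0 and decode it to NaT.
import numpy as np
import xarray as xr
import zarr

store = {}
group = zarr.group(store=store)
time = group.create_dataset("time", data=np.arange(3), chunks=3, fill_value=0)
time.attrs["_ARRAY_DIMENSIONS"] = ["time"]
time.attrs["units"] = "days since 2010-01-01"
time.attrs["calendar"] = "proleptic_gregorian"

ds = xr.open_zarr(store)
print(ds.time.values)  # first entry may come back as NaT if 0 is treated as a fill value
```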