Add rechunking for Xarray datasets #52
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master      #52      +/-   ##
==========================================
+ Coverage   95.00%   97.75%   +2.75%
==========================================
  Files          10       10
  Lines         400      445      +45
  Branches       78       88      +10
==========================================
+ Hits          380      435      +55
+ Misses         10        5       -5
+ Partials       10        5       -5
Continue to review full report at Codecov.
This looks great @eric-czech, thanks for working on it.
+1
rechunker/api.py (outdated)

    copy_specs = []
    for variable in source:
        array = source[variable].copy()
Xarray's backend encoding functions are designed to work on xarray.Variable objects, so if you're going to use those I would recommend accessing source.variables[variable] instead.
The other part that is missing here is that you are only writing the data variables, not the coordinates. Iterating over source.variables should solve that problem, too.
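A minimal sketch of that distinction, using a toy dataset with illustrative names: iterating a Dataset directly yields only data variable names, while ds.variables also includes coordinates and yields xarray.Variable objects rather than DataArray objects.

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"a": (("x",), np.arange(4.0))},
    coords={"cx": (("x",), np.arange(4.0))},
)

data_var_names = set(ds)           # only data variables: {"a"}
all_var_names = set(ds.variables)  # data variables plus coords: {"a", "cx"}
var = ds.variables["a"]            # an xarray.Variable, not a DataArray
```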
Ah got it -- made some associated changes starting at https://github.com/pangeo-data/rechunker/pull/52/files#diff-5f579dcc7633808e087447d18c990e51R315.
Test with coordinates is at https://github.com/pangeo-data/rechunker/pull/52/files#diff-cbfcc614e55a07a4d1df274c34ed50b4R51.
I suspect this is only really relevant if the source is from xarray.
Ok I re-worked this one a good bit (apologies for the big changes since first review). Some notes on the latest commit at fc1b17a:
import zarr
import xarray as xr
import numpy as np
from rechunker.api import rechunk

shape = (100, 50)
ds = xr.Dataset(
    dict(
        a=(("x", "y"), np.ones(shape, dtype="f4")),
        b=(("x"), np.ones(shape[0])),
        c=(("y"), np.ones(shape[1])),
    ),
    coords=dict(
        cx=(("x"), np.ones(shape[0])),
        cy=(("y"), np.ones(shape[1])),
    ),
).chunk(chunks=25)

rechunked = rechunk(
    ds,
    target_chunks=dict(a=(10, 10), b=(10,), c=(10,)),
    max_mem="50MB",
    target_store="/tmp/store.zarr",
    target_options=dict(
        a=dict(
            compressor=zarr.Blosc(cname="zstd"),
            dtype="int16",
            scale_factor=0.1,
            _FillValue=-9999,
        )
    ),
)
print(rechunked)
<Rechunked>
* Source : <xarray.Dataset>
Dimensions: (x: 100, y: 50)
Coordinates:
cx (x) float64 dask.array<chunksize=(25,), meta=np.ndarray>
cy (y) float64 dask.array<chunksize=(25,), meta=np.ndarray>
Dimensions without coordinates: x, y
Data variables:
a (x, y) float32 dask.array<chunksize=(25, 25), meta=np.ndarray>
b (x) float64 dask.array<chunksize=(25,), meta=np.ndarray>
c (y) float64 dask.array<chunksize=(25,), meta=np.ndarray>
* Intermediate: <zarr.hierarchy.Group '/'>
* Target      : <zarr.hierarchy.Group '/'>
Defaulted the temporary store location to somewhere in the system temp dir for Zarr groups and Xarray datasets. It didn't make much sense IMO for temp_store to be optional in the API without a default being set for collections of arrays; otherwise, when it's not set, an assertion error is thrown if any one array is rechunked to a different size.
I would rather require an explicit temp directory for now. My concern is that using a local directory as a default is likely to result in unexpected errors when scaling up rechunker for "production" use cases that run on multiple machines. Perhaps there is some way we might ask Executors to provide a location for temporary storage.
Note that in the future there will be Executors that don't require temporary arrays on disk (e.g., see Beam in #36).
tests/test_rechunk.py (outdated)

    @pytest.mark.parametrize("target_chunks", [(20, 10)])
    @pytest.mark.parametrize("max_mem", ["10MB"])
    @pytest.mark.parametrize("pass_temp", [True, False])
    @pytest.mark.parametrize("executor", ["dask", api._get_executor("dask")])
It would be nice to also test non-dask executors on xarray.Dataset objects, e.g. beam. I assume this would work?
Hm it doesn't, but I can't tell if it should. I'm converting all the arrays in the dataset to dask and sending them through the dask.array.Array codepath, which is apparently also broken for all other executors. This is the trace I get in one form or another from different executors:
Traceback (most recent call last):
File "/Users/eczech/repos/pydata/rechunker/tests/test_rechunk.py", line 100, in test_rechunk_dataset
rechunked.execute()
File "/Users/eczech/repos/pydata/rechunker/rechunker/api.py", line 77, in execute
self._executor.execute_plan(self._plan, **kwargs)
File "/Users/eczech/repos/pydata/rechunker/rechunker/executors/python.py", line 31, in execute_plan
plan()
File "/Users/eczech/repos/pydata/rechunker/rechunker/executors/python.py", line 44, in _execute_all
task()
File "/Users/eczech/repos/pydata/rechunker/rechunker/executors/python.py", line 39, in _direct_array_copy
target[key] = source[key]
File "/Users/eczech/.conda/envs/rechunker-dev/lib/python3.7/site-packages/zarr/core.py", line 1115, in __setitem__
self.set_basic_selection(selection, value, fields=fields)
File "/Users/eczech/.conda/envs/rechunker-dev/lib/python3.7/site-packages/zarr/core.py", line 1210, in set_basic_selection
return self._set_basic_selection_nd(selection, value, fields=fields)
File "/Users/eczech/.conda/envs/rechunker-dev/lib/python3.7/site-packages/zarr/core.py", line 1501, in _set_basic_selection_nd
self._set_selection(indexer, value, fields=fields)
File "/Users/eczech/.conda/envs/rechunker-dev/lib/python3.7/site-packages/zarr/core.py", line 1550, in _set_selection
self._chunk_setitem(chunk_coords, chunk_selection, chunk_value, fields=fields)
File "/Users/eczech/.conda/envs/rechunker-dev/lib/python3.7/site-packages/zarr/core.py", line 1665, in _chunk_setitem
fields=fields)
File "/Users/eczech/.conda/envs/rechunker-dev/lib/python3.7/site-packages/zarr/core.py", line 1687, in _chunk_setitem_nosync
chunk = value.astype(self._dtype, order=self._order, copy=False)
File "/Users/eczech/.conda/envs/rechunker-dev/lib/python3.7/site-packages/dask/array/core.py", line 1843, in astype
"arguments: {0!s}".format(list(extra))
TypeError: astype does not take the following keyword arguments: ['order']
Two potential solutions I see are:
- Coerce all dataset variables to zarr when not using the dask executor
- Make the other executors work on dask arrays
The first seems like a bad idea, and I'm not sure how to do the second.
I suspect it would suffice to change target[key] = source[key] to target[key] = np.asarray(source[key]) inside _direct_array_copy? Or perhaps this fix could be done upstream inside __setitem__ on Zarr arrays?
Anyways, this can definitely be saved for later!
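The suggested fix could be sketched as follows (illustrative names, not rechunker's actual helper): materialize each selection as a concrete numpy array before assignment, so a zarr target's __setitem__ never ends up calling dask's astype(..., order=...) on a lazy dask array.

```python
import numpy as np

def direct_array_copy(source, target, keys):
    # Hedged sketch: coerce each selection to a plain numpy array
    # before writing it. np.asarray computes a dask array (or passes
    # a numpy array through unchanged), so the target store only ever
    # sees concrete ndarrays.
    for key in keys:
        target[key] = np.asarray(source[key])
```

With plain numpy source and target this is just a copy; the point is that the same code path would also accept a dask-backed source.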
Thanks for all the hard work happening here!
👍 to this. Our main use of rechunker is using dask in the cloud with object store, where there is no shared local filesystem. I'd like to avoid any default assumptions about the nature of the storage. Going forward, maybe we could consider adding some sort of config system for rechunker, which would allow you to specify your preferred way of creating temporary storage.
Ah of course, makes sense. In 67ee2aa, I removed the default temp store, added a better error when it's not present, and added a NotImplementedError when the source is Xarray and the executor is anything but dask. I think I should probably do the same for when the source is da.Array. Does this sound right to you both?
This sounds fine to me for now. Long term, I do think it could make sense to pass an
Ok, 8502a33 adds a similar error for dask array sources.
I see, maybe the error in #52 (comment) is actually pretty superficial? I can't tell whether or not that's hinting at a fundamental limitation.
Is there anything else you guys think I should address on this one?
Hey @shoyer sorry to keep bugging you about this one, but is there anything else you'd like me to change?
Hi @eric-czech. Thanks for your work on this! And thanks for your patience. I'm fine with merging now. I assume issues will come up as people try it out, and we can iterate as needed.
This is an attempt at #45.
I'm not sure what the best way to go about this is, but I thought I would get something working and then get thoughts from you guys on where to go next. Notes:
- The new entry point is rechunk_dataset(source: Dataset, encoding: Mapping, max_mem, target_store, temp_store, executor). I'm using encoding to indicate the target chunkings, along with any other compressor/filter options, so there is some consistency with Dataset.to_zarr.
- I think it would probably be best if Dataset/DataArray were other possible options in the main rechunk function, with the same target_chunks parameter and any other options in {target|temp}_options.
- For the sake of discussion I thought it was easier to review quickly if all the new code was in one place. I'll happily combine the functions if this is on the right track.
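For context, a per-variable encoding mapping in the style described above might look like the following hypothetical sketch (variable names and option values are illustrative, mirroring the Dataset.to_zarr convention of one options dict per variable):

```python
# Hypothetical encoding mapping: keys are variable names, values are
# per-variable options (chunks plus any compressor/filter settings).
encoding = {
    "a": {
        "chunks": (10, 10),
        "dtype": "int16",
        "scale_factor": 0.1,
        "_FillValue": -9999,
    },
    "b": {"chunks": (10,)},
}
```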