Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check for aligned chunks when writing to existing variables #8459

Merged
merged 14 commits into from
Mar 29, 2024
4 changes: 4 additions & 0 deletions doc/whats-new.rst
Original file line number Diff line number Diff line change
Expand Up @@ -64,6 +64,10 @@ Deprecations
Bug fixes
~~~~~~~~~

- Writing to unaligned zarr chunks with ``region`` will now raise an error
max-sixty marked this conversation as resolved.
Show resolved Hide resolved
(unless ``safe_chunks=False``); previously an error would only be raised on
new variables. (:pull:`8459`, :issue:`8371`)
By `Maximilian Roos <https://github.com/max-sixty>`_.
- Port `bug fix from pandas <https://github.com/pandas-dev/pandas/pull/55283>`_
to eliminate the adjustment of resample bin edges in the case that the
resampling frequency has units of days and is greater than one day
Expand Down
14 changes: 11 additions & 3 deletions xarray/backends/zarr.py
Original file line number Diff line number Diff line change
Expand Up @@ -679,6 +679,17 @@ def set_variables(self, variables, check_encoding_set, writer, unlimited_dims=No
if v.encoding == {"_FillValue": None} and fill_value is None:
v.encoding = {}

# We need to do this for both new and existing variables to ensure we're not
# writing to a partial chunk, even though we don't use the `encoding` value
# when writing to an existing variable. See
# https://github.com/pydata/xarray/issues/8371 for details.
encoding = extract_zarr_variable_encoding(
dcherian marked this conversation as resolved.
Show resolved Hide resolved
v,
raise_on_invalid=check,
name=vn,
safe_chunks=self._safe_chunks,
)

if name in self.zarr_group:
# existing variable
# TODO: if mode="a", consider overriding the existing variable
Expand Down Expand Up @@ -709,9 +720,6 @@ def set_variables(self, variables, check_encoding_set, writer, unlimited_dims=No
zarr_array = self.zarr_group[name]
else:
# new variable
encoding = extract_zarr_variable_encoding(
v, raise_on_invalid=check, name=vn, safe_chunks=self._safe_chunks
)
encoded_attrs = {}
# the magic for storing the hidden dimension data
encoded_attrs[DIMENSION_KEY] = dims
Expand Down
19 changes: 19 additions & 0 deletions xarray/tests/test_backends.py
Original file line number Diff line number Diff line change
Expand Up @@ -5434,3 +5434,22 @@
ds_region.to_zarr(
tmp_path / "test.zarr", region={"x": slice(0, 1), "y": slice(0, 1)}
)


@requires_zarr
def test_zarr_region_chunk_align(tmp_path):
"""
Check that writing to unaligned chunks with `region` fails, assuming `safe_chunks=False`.
"""
ds = (
xr.DataArray(np.arange(120).reshape(4, 3, -1), dims=list("abc"))
.rename("var1")
.to_dataset()
)

ds.chunk(5).to_zarr(tmp_path / "foo.zarr", compute=False, mode="w")

Check failure on line 5450 in xarray/tests/test_backends.py

View workflow job for this annotation

GitHub Actions / ubuntu-latest py3.10 all-but-dask

test_zarr_region_chunk_align ValueError: unrecognized chunk manager dask - must be one of: []
with pytest.raises(ValueError):
for r in range(ds.sizes["a"]):
ds.chunk(3).isel(a=[r]).to_zarr(

Check failure on line 5453 in xarray/tests/test_backends.py

View workflow job for this annotation

GitHub Actions / ubuntu-latest py3.9

test_zarr_region_chunk_align NotImplementedError: Specified zarr chunks encoding['chunks']=(4, 3, 5) for variable named 'var1' would overlap multiple dask chunks ((1,), (3,), (3, 3, 3, 1)). Writing this array in parallel with dask could lead to corrupted data. Consider either rechunking using `chunk()`, deleting or modifying `encoding['chunks']`, or specify `safe_chunks=False`.

Check failure on line 5453 in xarray/tests/test_backends.py

View workflow job for this annotation

GitHub Actions / ubuntu-latest py3.11

test_zarr_region_chunk_align NotImplementedError: Specified zarr chunks encoding['chunks']=(4, 3, 5) for variable named 'var1' would overlap multiple dask chunks ((1,), (3,), (3, 3, 3, 1)). Writing this array in parallel with dask could lead to corrupted data. Consider either rechunking using `chunk()`, deleting or modifying `encoding['chunks']`, or specify `safe_chunks=False`.

Check failure on line 5453 in xarray/tests/test_backends.py

View workflow job for this annotation

GitHub Actions / macos-latest py3.9

test_zarr_region_chunk_align NotImplementedError: Specified zarr chunks encoding['chunks']=(4, 3, 5) for variable named 'var1' would overlap multiple dask chunks ((1,), (3,), (3, 3, 3, 1)). Writing this array in parallel with dask could lead to corrupted data. Consider either rechunking using `chunk()`, deleting or modifying `encoding['chunks']`, or specify `safe_chunks=False`.

Check failure on line 5453 in xarray/tests/test_backends.py

View workflow job for this annotation

GitHub Actions / macos-latest py3.11

test_zarr_region_chunk_align NotImplementedError: Specified zarr chunks encoding['chunks']=(4, 3, 5) for variable named 'var1' would overlap multiple dask chunks ((1,), (3,), (3, 3, 3, 1)). Writing this array in parallel with dask could lead to corrupted data. Consider either rechunking using `chunk()`, deleting or modifying `encoding['chunks']`, or specify `safe_chunks=False`.

Check failure on line 5453 in xarray/tests/test_backends.py

View workflow job for this annotation

GitHub Actions / ubuntu-latest py3.9 min-all-deps

test_zarr_region_chunk_align NotImplementedError: Specified zarr chunks encoding['chunks']=(4, 3, 5) for variable named 'var1' would overlap multiple dask chunks ((1,), (3,), (3, 3, 3, 1)). Writing this array in parallel with dask could lead to corrupted data. Consider either rechunking using `chunk()`, deleting or modifying `encoding['chunks']`, or specify `safe_chunks=False`.

Check failure on line 5453 in xarray/tests/test_backends.py

View workflow job for this annotation

GitHub Actions / ubuntu-latest py3.10 flaky

test_zarr_region_chunk_align NotImplementedError: Specified zarr chunks encoding['chunks']=(4, 3, 5) for variable named 'var1' would overlap multiple dask chunks ((1,), (3,), (3, 3, 3, 1)). Writing this array in parallel with dask could lead to corrupted data. Consider either rechunking using `chunk()`, deleting or modifying `encoding['chunks']`, or specify `safe_chunks=False`.
tmp_path / "foo.zarr", region=dict(a=slice(r, r + 1))
)
Loading