Check for aligned chunks when writing to existing variables #8459

max-sixty · 2023-11-16T18:56:06Z

While I don't feel super confident that this is designed to protect against any bugs, it does solve the immediate problem in #8371 by hoisting the encoding check above the code that runs only for new variables. The encoding check is somewhat implicit, so this was easy to miss previously.

- Closes Writing to regions with unaligned chunks can lose data #8371
- Closes to_zarr silently loses data when using append_dim, if chunks are different to zarr store #8882
- Closes Possible race condition when appending to an existing zarr #8876
- Tests added
- User visible changes (including notable bug fixes) are documented in whats-new.rst

rabernat · 2023-11-20T16:33:45Z

@max-sixty thanks so much for tackling this tricky issue. Can you help me understand whether the following edge case is covered here...

For our existing "safe chunks" check, we have a special case for the last chunk, which does not have to evenly divide the zarr chunks:

xarray/xarray/backends/zarr.py

Lines 185 to 189 in bb8511e

    
    if var_chunks and enc_chunks_tuple:
        for zchunk, dchunks in zip(enc_chunks_tuple, var_chunks):
            for dchunk in dchunks[:-1]:
                if dchunk % zchunk:
                    base_error = (

(i.e. we skip the check for the last chunk, via dchunks[:-1].) That is appropriate because the implementation assumes that we are always writing from the start of the array.

With regions (and also with append), that assumption is no longer true; we now have an initial offset to consider. Imagine, for example, an array with chunksize 10, where we want to write the region slice(5, 35) (offset 5). An acceptable Dask chunking would be (5, 10, 10, 5). I'm fairly certain that the existing chunk alignment code does not consider this and would not allow it. On the other hand, it would allow (10, 10, 10), which would be wrong because of the offset!
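To make the offset scenario concrete, here is a minimal sketch of an offset-aware check. It is not xarray's implementation: the helper names `touched_zarr_chunks` and `is_region_write_safe` are hypothetical, and "safe" is assumed to mean only that no zarr chunk is written by more than one Dask chunk.

```python
from itertools import accumulate


def touched_zarr_chunks(dask_chunks, zchunk, offset=0):
    """Return, for each Dask chunk, the set of zarr chunk indices it writes to.

    Hypothetical helper for illustration; `offset` is the start of the region.
    """
    # Absolute start/stop positions of every Dask chunk within the target array.
    edges = [offset + e for e in accumulate(dask_chunks, initial=0)]
    return [
        set(range(start // zchunk, (stop - 1) // zchunk + 1))
        for start, stop in zip(edges[:-1], edges[1:])
    ]


def is_region_write_safe(dask_chunks, zchunk, offset=0):
    """True if no zarr chunk is written by more than one Dask chunk."""
    seen = set()
    for touched in touched_zarr_chunks(dask_chunks, zchunk, offset):
        if seen & touched:
            return False  # two Dask chunks would write into the same zarr chunk
        seen |= touched
    return True


# Region slice(5, 35) on an array with zarr chunk size 10:
print(is_region_write_safe((5, 10, 10, 5), 10, offset=5))  # True
print(is_region_write_safe((10, 10, 10), 10, offset=5))    # False
```

With offset 5, each chunk of (10, 10, 10) straddles a zarr chunk boundary, so neighbouring Dask chunks share a zarr chunk. A check that ignores the offset would accept that chunking and reject (5, 10, 10, 5).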

Is handling these sorts of scenarios in scope for this PR?

max-sixty · 2023-11-20T18:53:29Z

Excellent point, @rabernat! Let me consider that.

@dcherian could you confirm you agree with the change, assuming we handle that? I wasn't sure where you were at in #8371. TY!

dcherian · 2023-11-29T14:40:58Z

> Would you confirm you agree with the change assuming we handle that?

Yes! Sorry for the delay. I didn't realize that we could skip this guardrail with safe_chunks=False.

dcherian · 2024-03-27T17:03:44Z

> we now have an initial offset to consider. Imagine, for example, an array with chunksize 10, where we want to write the region slice(5, 35) (offset 5). An acceptable Dask chunking would be (5, 10, 10, 5). I'm fairly certain that the existing chunk alignment code does not consider this and would not allow it. On the other hand, it would allow (10, 10, 10), which would be wrong because of the offset!

This is a great observation, but I think we can merge this PR for a few reasons:

- This is an improvement over the status quo.
- Users making region writes should ensure that the chunk boundaries line up. I've added a warning to the docstring for this.
- Users can skip these checks and recover the old behaviour with safe_chunks=False, so they can intentionally opt in to unsafe writes.
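For users who want their chunk boundaries to line up before a region write, one possible approach is to compute the Dask chunking from the region and the zarr chunk size. This is a sketch only: `aligned_chunks_for_region` is a hypothetical helper, not part of xarray, and it assumes a one-dimensional region with a uniform zarr chunk size.

```python
def aligned_chunks_for_region(start, stop, zchunk):
    """Split [start, stop) so that every piece stays inside a single zarr chunk.

    Hypothetical helper for illustration; not an xarray API.
    """
    if not 0 <= start < stop:
        raise ValueError("expected 0 <= start < stop")
    chunks = []
    pos = start
    while pos < stop:
        # Next zarr chunk boundary strictly after `pos`, capped at the region end.
        boundary = min((pos // zchunk + 1) * zchunk, stop)
        chunks.append(boundary - pos)
        pos = boundary
    return tuple(chunks)


# Region slice(5, 35) with zarr chunk size 10:
print(aligned_chunks_for_region(5, 35, 10))  # (5, 10, 10, 5)
```

The resulting tuple could then be used to rechunk along the written dimension, for example `ds.chunk({"x": chunks})` before `ds.to_zarr(store, region={"x": slice(5, 35)})`, where the dimension name "x" and the variables `ds` and `store` are assumed for illustration.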

Of course, it'd be nice to catch this case in the future.

rsemlal-murmuration · 2024-03-28T08:50:51Z

I think this also addresses the problem in issue #8876.

Check for aligned chunks when writing to existing variables

0fff9e6

github-actions bot added the topic-backends, topic-zarr and io labels Nov 16, 2023

max-sixty mentioned this pull request Nov 16, 2023

Writing to regions with unaligned chunks can lose data #8371

Closed

5 tasks

853ddd5

dcherian reviewed Nov 16, 2023

max-sixty and others added 4 commits November 16, 2023 11:27

89e988c

Update doc/whats-new.rst

1005c9a

Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>

4337b27

Merge branch 'main' into zarr-region-chunks

9f19aca

dcherian reviewed Nov 29, 2023

dcherian and others added 2 commits January 3, 2024 17:17

Merge branch 'main' into zarr-region-chunks

8c3aeea

[pre-commit.ci] auto fixes from pre-commit.com hooks

142fe8e

for more information, see https://pre-commit.ci

martinspetlik mentioned this pull request Feb 12, 2024

calling to_zarr inside map_blocks function results in missing values #8703

Closed

5 tasks

dcherian and others added 3 commits March 27, 2024 10:26

Merge branch 'main' into zarr-region-chunks

fbc39bd

Add regression test for pydata#8459

b7c9674

Update whats-new

da79f07

dcherian added the plan to merge Final call for comments label Mar 27, 2024

dcherian approved these changes Mar 27, 2024

dcherian removed the plan to merge Final call for comments label Mar 27, 2024

Address Ryan's comment

7ed7b57

dcherian added the plan to merge Final call for comments label Mar 27, 2024

dcherian added 2 commits March 27, 2024 11:07

Update region typing

2e513ba

Update test

f7f7cd8

dcherian merged commit ffb30a8 into pydata:main Mar 29, 2024
29 checks passed

pont-us mentioned this pull request Apr 2, 2024

Some unit tests failing with xarray 2024.3.0 xcube-dev/xcube#958

Closed

max-sixty deleted the zarr-region-chunks branch April 29, 2024 03:05
