Check for aligned chunks when writing to existing variables #8459

Merged: 14 commits merged into pydata:main on Mar 29, 2024

Conversation

@max-sixty (Collaborator) commented Nov 16, 2023

While I don't feel super confident that this protects against every possible bug, it does solve the immediate problem in #8371 by hoisting the encoding check above the code that runs only for new variables. The encoding check is somewhat implicit, so this was an easy thing to miss previously.

max-sixty and others added 4 commits November 16, 2023 11:27
Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
@rabernat (Contributor)

@max-sixty thanks so much for tackling this tricky issue. Can you help me understand whether the following edge case is covered here...

For our existing "safe chunks" check, we have a special case for the last chunk, which does not have to evenly divide the zarr chunks:

```python
if var_chunks and enc_chunks_tuple:
    for zchunk, dchunks in zip(enc_chunks_tuple, var_chunks):
        for dchunk in dchunks[:-1]:
            if dchunk % zchunk:
                base_error = (
```

(i.e., we just skip the check for the last chunk via dchunks[:-1].) That is appropriate because, as implemented, the assumption is that we are always writing from the start of the array.

With regions (and also with append), that assumption is no longer true; we now have an initial offset to consider. Imagine, for example, an array with chunksize 10, where we want to write the region slice(5, 35) (offset 5). An acceptable Dask chunking would be (5, 10, 10, 5). I'm fairly certain that the existing chunk alignment code does not consider this and would not allow it. On the other hand, it would allow (10, 10, 10), which would be wrong because of the offset!
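To make that concrete, here is a minimal sketch (a hypothetical helper, not code from xarray or this PR) of what an offset-aware version of the alignment check could look like, assuming we know the start index of the region along each dimension:

```python
def region_chunks_aligned(var_chunks, zarr_chunks, region_starts):
    """Hypothetical helper: check Dask/zarr chunk alignment for an offset region write.

    var_chunks:    per-dimension tuples of Dask chunk sizes, e.g. ((5, 10, 10, 5),)
    zarr_chunks:   per-dimension zarr chunk sizes, e.g. (10,)
    region_starts: per-dimension start index of the region being written, e.g. (5,)
    """
    for dchunks, zchunk, start in zip(var_chunks, zarr_chunks, region_starts):
        offset = start
        # Every Dask chunk except the last must end on a zarr chunk boundary,
        # measured from the start of the on-disk array, not from the region start;
        # otherwise two Dask chunks would write into the same zarr chunk.
        for dchunk in dchunks[:-1]:
            if (offset + dchunk) % zchunk:
                return False
            offset += dchunk
    return True

# The example above: zarr chunk size 10, writing the region slice(5, 35).
assert region_chunks_aligned(((5, 10, 10, 5),), (10,), (5,))      # aligned
assert not region_chunks_aligned(((10, 10, 10),), (10,), (5,))    # misaligned due to offset
```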

Is handling these sorts of scenarios in scope for this PR?

@max-sixty (Collaborator, Author)

Excellent point @rabernat ! Let me consider that.

@dcherian could you confirm you agree with the change, assuming we handle that? I wasn't sure where you were at in #8371. TY!

@dcherian (Contributor) commented Nov 29, 2023

> Would you confirm you agree with the change, assuming we handle that?

Yes! Sorry for the delay. I didn't realize that we could skip this guardrail with safe_chunks=False.

@dcherian added the "plan to merge" (final call for comments) label on Mar 27, 2024
@dcherian removed the "plan to merge" (final call for comments) label on Mar 27, 2024
@dcherian (Contributor) commented Mar 27, 2024

> we now have an initial offset to consider. Imagine, for example, an array with chunksize 10, where we want to write the region slice(5, 35) (offset 5). An acceptable Dask chunking would be (5, 10, 10, 5). I'm fairly certain that the existing chunk alignment code does not consider this and would not allow it. On the other hand, it would allow (10, 10, 10), which would be wrong because of the offset!

This is a great observation, but I think we can merge this PR anyway, for three reasons:

  1. This is an improvement over the status quo.
  2. Users making region writes should ensure that the chunk boundaries line up; I've added a warning to the docstring for this.
  3. Users can skip these checks and recover the old behaviour with safe_chunks=False, so they can intentionally opt in to unsafe writes (see the usage sketch below).

Of course, it'd be nice to catch this case in the future.
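For reference on point 3, a brief usage sketch (the store path, sizes, and chunking are made up for illustration; region, mode, and safe_chunks are existing to_zarr parameters):

```python
import dask.array as da
import numpy as np
import xarray as xr

# Create a template store: 100 elements along "x", on-disk zarr chunks of 10.
xr.Dataset({"a": ("x", np.zeros(100))}).to_zarr(
    "example.zarr", encoding={"a": {"chunks": (10,)}}, mode="w"
)

# Dask chunks of 7 do not line up with the zarr chunks of 10, so the safety
# check would normally reject this write. safe_chunks=False skips the check,
# and the caller becomes responsible for ensuring that no two Dask chunks
# write to the same zarr chunk (for example, by not writing in parallel).
update = xr.Dataset({"a": ("x", da.ones(30, chunks=7))})
update.to_zarr(
    "example.zarr", region={"x": slice(5, 35)}, mode="r+", safe_chunks=False
)
```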

@dcherian added the "plan to merge" (final call for comments) label on Mar 27, 2024
@rsemlal-murmuration

I think this also addresses the problem in issue #8876.

@dcherian merged commit ffb30a8 into pydata:main on Mar 29, 2024
29 checks passed
dcherian added a commit to dcherian/xarray that referenced this pull request Apr 2, 2024
* main: (26 commits)
  [pre-commit.ci] pre-commit autoupdate (pydata#8900)
  Bump the actions group with 1 update (pydata#8896)
  New empty whatsnew entry (pydata#8899)
  Update reference to 'Weighted quantile estimators' (pydata#8898)
  2024.03.0: Add whats-new (pydata#8891)
  Add typing to test_groupby.py (pydata#8890)
  Avoid in-place multiplication of a large value to an array with small integer dtype (pydata#8867)
  Check for aligned chunks when writing to existing variables (pydata#8459)
  Add dt.date to plottable types (pydata#8873)
  Optimize writes to existing Zarr stores. (pydata#8875)
  Allow multidimensional variable with same name as dim when constructing dataset via coords (pydata#8886)
  Don't allow overwriting indexes with region writes (pydata#8877)
  Migrate datatree.py module into xarray.core. (pydata#8789)
  warn and return bytes undecoded in case of UnicodeDecodeError in h5netcdf-backend (pydata#8874)
  groupby: Dispatch quantile to flox. (pydata#8720)
  Opt out of auto creating index variables (pydata#8711)
  Update docs on view / copies (pydata#8744)
  Handle .oindex and .vindex for the PandasMultiIndexingAdapter and PandasIndexingAdapter (pydata#8869)
  numpy 2.0 copy-keyword and trapz vs trapezoid (pydata#8865)
  upstream-dev CI: Fix interp and cumtrapz (pydata#8861)
  ...
@max-sixty deleted the zarr-region-chunks branch on April 29, 2024
Labels: io, plan to merge, topic-backends, topic-zarr