Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve safe chunk validation #9527

Merged
merged 18 commits into from
Sep 22, 2024

Conversation

josephnowak
Copy link
Contributor

@max-sixty Here is the PR with all the features requested #9513 (comment), also I added more tests and improved the logic of the validation algorithm, I think now it is much more simple and covers certain cases that were not validated before, for example, the "r+" mode only allows full chunk writes even on the last chunk.

  • Closes #xxxx
  • [test_zarr_safe_chunk_region ] Tests added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
  • New functions/methods are listed in api.rst

@josephnowak josephnowak mentioned this pull request Sep 20, 2024
2 tasks
Copy link
Collaborator

@max-sixty max-sixty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really good @josephnowak ! The tests look great.

I left a couple of small comments and will have another read through of the code later, but overall 👍 . Thank you.

xarray/backends/zarr.py Outdated Show resolved Hide resolved
xarray/backends/zarr.py Show resolved Hide resolved
xarray/backends/zarr.py Outdated Show resolved Hide resolved
xarray/backends/zarr.py Outdated Show resolved Hide resolved
…ameter in order on the extract_zarr_variable_encoding method, raise the correct error if the border size is smaller than the zchunk on mode equal to r+
…ameter in order on the extract_zarr_variable_encoding method, raise the correct error if the border size is smaller than the zchunk on mode equal to r+
@max-sixty max-sixty added the plan to merge Final call for comments label Sep 20, 2024
@max-sixty
Copy link
Collaborator

Couple of typing errors, then we can merge!

@josephnowak
Copy link
Contributor Author

Hi @max-sixty, sorry for being insistent with the review but I would like to avoid any misunderstood with the validation logic and affect the usability of Xarray. The following case is going to raise an error using the mode "r+":

    arr = xr.DataArray(
        list(range(10)), dims=["a"], coords={"a": list(range(10))}, name="foo"
    ).chunk(a=3)
    arr.to_zarr(store, mode="w")
    with pytest.raises(ValueError):
        # It is not possible to write on a chunk that is not full, even when it is the last one which can be of a smaller size
        arr.isel(a=slice(9, 10)).to_zarr(store, region="auto", mode="r+")

Is it the expected behavior? if it is, it probably would be good to make "a" the default mode, because it allows partial writes on the first and last chunk inside the region which I think is more convenient in more scenarios

@max-sixty
Copy link
Collaborator

Hi @max-sixty, sorry for being insistent with the review but I would like to avoid any misunderstood with the validation logic and affect the usability of Xarray.

(tbc, thank you very much for being insistent, this is a complicated and important problem which we haven't spent enough time on, so need exactly this sort of help!)

The following case is going to raise an error using the mode "r+":

OK, that's a very good point — that writing to the full final chunk will raise an error with the proposed code. The rule should be that partial chunk writes are not allowed, but that is a full chunk write, so should be allowed. Otherwise people can't write to the full array!

Could we adjust the logic so writing to the final chunk is allowed?

if it is, it probably would be good to make "a" the default mode, because it allows partial writes on the first and last chunk inside the region which I think is more convenient in more scenarios

I weigh the cost of unsafe writes very highly, so my initial impulse is that writing to partial chunks should fail by default.

Possibly making this strict could break some existing code; possibly we should explain ourselves better. For users who want to write to a single partial chunk from a single process, it would be a fairly small change for people to pass mode="a", or allow unsafe writes.

But open to feedback from others; CC @pydata/xarray. One compromise would be to have a new mode, like rd+, for distributed writes, which would be strict like this. (though we should only do that if people feel strongly, otherwise we're at risk of config proliferation)

Comment on lines +214 to +215
if not safe_chunks:
continue
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we could do an early return outside this loop

@max-sixty
Copy link
Collaborator

OK, very nice! (I haven't re-reviewed all the code, just the most recent commit).

There are still a couple of type errors; lmk if you need a hand with those.

@josephnowak
Copy link
Contributor Author

Hi @max-sixty, sorry for being insistent with the review but I would like to avoid any misunderstood with the validation logic and affect the usability of Xarray.

(tbc, thank you very much for being insistent, this is a complicated and important problem which we haven't spent enough time on, so need exactly this sort of help!)

The following case is going to raise an error using the mode "r+":

OK, that's a very good point — that writing to the full final chunk will raise an error with the proposed code. The rule should be that partial chunk writes are not allowed, but that is a full chunk write, so should be allowed. Otherwise people can't write to the full array!

Could we adjust the logic so writing to the final chunk is allowed?

if it is, it probably would be good to make "a" the default mode, because it allows partial writes on the first and last chunk inside the region which I think is more convenient in more scenarios

I weigh the cost of unsafe writes very highly, so my initial impulse is that writing to partial chunks should fail by default.

Possibly making this strict could break some existing code; possibly we should explain ourselves better. For users who want to write to a single partial chunk from a single process, it would be a fairly small change for people to pass mode="a", or allow unsafe writes.

But open to feedback from others; CC @pydata/xarray. One compromise would be to have a new mode, like rd+, for distributed writes, which would be strict like this. (though we should only do that if people feel strongly, otherwise we're at risk of config proliferation)

Yes, I can change the logic to avoid raising an error, I will have to send the shape of the Zarr array to validate if the region is covering the last chunk.

Hi @max-sixty, sorry for being insistent with the review but I would like to avoid any misunderstood with the validation logic and affect the usability of Xarray.

(tbc, thank you very much for being insistent, this is a complicated and important problem which we haven't spent enough time on, so need exactly this sort of help!)

The following case is going to raise an error using the mode "r+":

OK, that's a very good point — that writing to the full final chunk will raise an error with the proposed code. The rule should be that partial chunk writes are not allowed, but that is a full chunk write, so should be allowed. Otherwise people can't write to the full array!

Could we adjust the logic so writing to the final chunk is allowed?

if it is, it probably would be good to make "a" the default mode, because it allows partial writes on the first and last chunk inside the region which I think is more convenient in more scenarios

I weigh the cost of unsafe writes very highly, so my initial impulse is that writing to partial chunks should fail by default.

Possibly making this strict could break some existing code; possibly we should explain ourselves better. For users who want to write to a single partial chunk from a single process, it would be a fairly small change for people to pass mode="a", or allow unsafe writes.

But open to feedback from others; CC @pydata/xarray. One compromise would be to have a new mode, like rd+, for distributed writes, which would be strict like this. (though we should only do that if people feel strongly, otherwise we're at risk of config proliferation)

I have already push

OK, very nice! (I haven't re-reviewed all the code, just the most recent commit).

There are still a couple of type errors; lmk if you need a hand with those.

I will check the type error, I thought that it was related to the code of someone else

@max-sixty
Copy link
Collaborator

Great! I'll merge, and if anyone has feedback we can make any adjustments later.

Thank you very much @josephnowak !

@max-sixty max-sixty merged commit 2a6212e into pydata:main Sep 22, 2024
28 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
plan to merge Final call for comments
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants