Writing to regions with unaligned chunks can lose data #8371
Comments
@max-sixty - thanks for the clear bug description. I wonder if we are opening the chunks with the wrong […]
If we're appending to existing chunks, is that safe to run in parallel? (even though that's not the issue here) Should we raise an error if a region doesn't fully encompass a chunk that it's writing to?
IIUC, […]
If we instead append to an existing chunk, as @jhamman suggested, then this would be safe serially but not concurrently? i.e. does […]
Writing a partial chunk in either […]. Given that the purpose of […]
OK great! Though it seems that writing with […]
Yes, that makes sense to me — i.e. if running concurrently, writing with […]
Oh indeed, there is something else buggy going on here, too! I'm not sure it's related to […]
I wonder if we can reproduce this using the zarr-python API directly? I suspect not, but it would be good to check.
FWIW, I was able to reproduce the behavior @max-sixty demonstrated with the latest Xarray/Zarr versions. The equivalent pattern is not reproducible using Zarr-Python.
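The thread doesn't include the zarr-python snippet that was tried; a minimal sketch of what "the equivalent pattern" might look like (names and sizes are my own), writing one row at a time, i.e. a partial chunk per write, into an array chunked along the first axis:

```python
import numpy as np
import zarr

data = np.arange(100, dtype="f8").reshape(10, 10)

# template array, chunked along the first axis
z = zarr.open("foo_direct.zarr", mode="w", shape=(10, 10), chunks=(5, 10), dtype="f8")

# write one row at a time, mirroring the xarray region loop
for r in range(10):
    z[r : r + 1, :] = data[r : r + 1, :]

# serial partial-chunk writes round-trip fine in zarr-python itself
np.testing.assert_array_equal(z[:], data)
```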
I believe this is using dask's threaded scheduler to write in parallel. The test passes in serial (run by specifying […]). @jhamman did you run with dask, or in serial with only Zarr-Python?
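For context, a quick way to check whether the threaded scheduler is the culprit (not necessarily what was done here) is to force dask to run serially:

```python
import dask

# dask's "single-threaded" (a.k.a. "synchronous") scheduler executes every
# task in the calling thread, so concurrent read-modify-write races disappear
with dask.config.set(scheduler="single-threaded"):
    ...  # re-run the region-writing loop from the MVCE here
```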
Great find @dcherian! If that's the case, we could: […]
Great! PR?
Yup — easier to suggest the thing than write the code! I think realistically I'm less likely to get to this soon relative to working the numbagg stuff through...
I ran with the threaded scheduler, but without trying this again, I'm skeptical this is the problem. The region writes are done one at a time in a for-loop. This should work!
I realize I was claiming it was serial, but I think @dcherian might be right, since if we only chunk on `a`:

```diff
diff --git a/xarray/core/x.py b/xarray/core/x.py
index 749f228a..b2053f5f 100644
--- a/xarray/core/x.py
+++ b/xarray/core/x.py
@@ -13,9 +13,9 @@
 def write(ds):
-    ds.chunk(5).to_zarr("foo.zarr", compute=False, mode="w")
+    ds.chunk(a=5).to_zarr("foo.zarr", compute=False, mode="w")
     for r in range(ds.sizes["a"]):
-        ds.chunk(3).isel(a=[r]).to_zarr("foo.zarr", region=dict(a=slice(r, r + 1)))
+        ds.chunk(a=3).isel(a=[r]).to_zarr("foo.zarr", region=dict(a=slice(r, r + 1)))


 def read(ds):
```

...so possibly the dask scheduler is writing multiple chunks along the other dims. (though I still think this is not great, that a […])
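To make that hypothesis concrete (a sketch with made-up dimension sizes): chunking everything to 3 still leaves several dask chunks along the non-region dimension of a single-row selection, and those are written concurrently by the threaded scheduler:

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({"x": (("a", "b"), np.arange(100.0).reshape(10, 10))})

row = ds.chunk(3).isel(a=[1])
print(row.x.chunks)  # ((1,), (3, 3, 3, 1)): four dask tasks along "b"
# If the on-disk zarr chunks are 5 along "b" (from ds.chunk(5) in the template
# write), two of those tasks read-modify-write the same zarr chunk, which the
# threaded scheduler can race on.
```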
You may want to try throwing zarr's ThreadSynchronizer in the mix to see if that resolves things here. I believe you can pass this to […]
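A sketch of that suggestion, assuming the same toy dataset as the diff above (`synchronizer` is an existing `to_zarr` keyword; behavior on newer versions may differ):

```python
import numpy as np
import xarray as xr
import zarr

ds = xr.Dataset({"x": (("a", "b"), np.arange(100.0).reshape(10, 10))})
sync = zarr.ThreadSynchronizer()  # per-chunk locks shared across threads

ds.chunk(5).to_zarr("foo_sync.zarr", compute=False, mode="w", synchronizer=sync)
for r in range(ds.sizes["a"]):
    ds.chunk(3).isel(a=[r]).to_zarr(
        "foo_sync.zarr", region={"a": slice(r, r + 1)}, synchronizer=sync
    )

# With the synchronizer guarding each zarr chunk, the concurrent partial-chunk
# writes no longer clobber each other (on the versions discussed in this thread).
np.testing.assert_array_equal(xr.open_zarr("foo_sync.zarr").x.values, ds.x.values)
```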
This also makes it work! Possibly we should add the check to […]
We already have a lot of logic in place to verify that we don't do this sort of unaligned write, e.g. xarray/backends/zarr.py, lines 176 to 199 (at commit 1411474).
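For illustration only (this is not the code at those lines, just the shape of such a check): a region write along one dimension is safe when its bounds fall on zarr chunk boundaries, except at the end of the dimension:

```python
def region_is_chunk_aligned(start: int, stop: int, chunk_size: int, dim_size: int) -> bool:
    """Hypothetical helper: does region[start:stop] cover only whole zarr chunks?"""
    start_ok = start % chunk_size == 0
    stop_ok = stop % chunk_size == 0 or stop == dim_size
    return start_ok and stop_ok


# e.g. with zarr chunks of 5 on a dimension of size 10:
assert region_is_chunk_aligned(0, 5, 5, 10)       # whole first chunk
assert not region_is_chunk_aligned(3, 4, 5, 10)   # partial chunk: risky in parallel
```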
Is the issue here that writing with […]
OK you got me, thanks for getting this close enough — #8459 fixes it. I'm not sure it's the best conceptual design, but it does fix it...
I'm not sure what the action item is. The MVCE succeeds in serial, or if the appropriate synchronizer is used. Erroring would be backwards-incompatible, wouldn't it?
But is the existing behavior intentional? The most common case of using […]. To get the old behavior, someone can pass […]
What happened?
Writing with `region` with chunks that aren't aligned can lose data. I've recreated an example below. While it's unlikely that folks are passing different values to `.chunk` for the template vs. the regions, I had an `"auto"` chunk, which can then set different chunk values.

(FWIW, this was fairly painful, and I managed to lose a lot of time by not noticing this, and then not really considering this could happen as I was trying to debug. I think we should really strive to ensure that we don't lose data / incorrectly report that we've successfully written data...)
What did you expect to happen?
If there's a risk of data loss, raise an error...
Minimal Complete Verifiable Example
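The MVCE code block did not survive extraction; based on the diff quoted in the comments above, it was presumably along these lines (dimension sizes are my guess; the chunk calls follow the diff):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({"x": (("a", "b"), np.arange(100.0).reshape(10, 10))})

# template store written with one chunking...
ds.chunk(5).to_zarr("foo.zarr", compute=False, mode="w")

# ...then each row written as a region with a *different* chunking
for r in range(ds.sizes["a"]):
    ds.chunk(3).isel(a=[r]).to_zarr("foo.zarr", region=dict(a=slice(r, r + 1)))

result = xr.open_zarr("foo.zarr")
# on the affected versions this can silently differ from ds (lost data);
# the expectation in the issue is that it should either match or raise
print((result.x.values == ds.x.values).all())
```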
MVCE confirmation
Relevant log output
No response
Anything else we need to know?
No response
Environment
INSTALLED VERSIONS
commit: ccc8f99
python: 3.9.18 (main, Aug 24 2023, 21:19:58)
[Clang 14.0.3 (clang-1403.0.22.14.1)]
python-bits: 64
OS: Darwin
OS-release: 22.6.0
machine: arm64
processor: arm
byteorder: little
LC_ALL: en_US.UTF-8
LANG: None
LOCALE: ('en_US', 'UTF-8')
libhdf5: None
libnetcdf: None
xarray: 2023.10.2.dev10+gccc8f998
pandas: 2.1.1
numpy: 1.25.2
scipy: 1.11.1
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.16.0
cftime: None
nc_time_axis: None
PseudoNetCDF: None
iris: None
bottleneck: None
dask: 2023.4.0
distributed: 2023.7.1
matplotlib: 3.5.1
cartopy: None
seaborn: None
numbagg: 0.2.3.dev30+gd26e29e
fsspec: 2021.11.1
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: 0.9.19
setuptools: 68.1.2
pip: 23.2.1
conda: None
pytest: 7.4.0
mypy: 1.6.0
IPython: 8.15.0
sphinx: 4.3.2