automatic chunking of zarr archive #4046

Open
apatlpo opened this issue May 8, 2020 · 3 comments

@apatlpo
Contributor

apatlpo commented May 8, 2020

I store data that is not chunked in a zarr archive, and the resulting zarr archive is chunked.
This may be a simple usage question.
I don't know how to turn this behavior off.

Code sample

Here is a minimal example that reproduces the issue:

import os
import numpy as np
import xarray as xr

# 200 x 800 array of ones backed by NumPy, so there are no dask chunks
ds = xr.DataArray(np.ones((200, 800))).rename('foo').to_dataset()
print('Initial chunks = {}'.format(ds.foo.chunks))
ds.to_zarr('test.zarr', mode='w')
print('zarr archives contains: {}'.format(os.listdir('test.zarr/foo')))
ds = xr.open_zarr('test.zarr')
print('Final chunks = {}'.format(ds.foo.chunks))

returns:

Initial chunks = None
zarr archives contains: ['.zarray', '.zattrs', '0.0', '0.1', '1.0', '1.1']
Final chunks = ((100, 100), (400, 400))
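
For reference, the four data files in the listing are what you would expect from a (200, 800) array split into (100, 400) chunks. Here is a rough stdlib-only sketch of that arithmetic. The helper name `chunk_keys` is made up for illustration, and it assumes the dot-separated key naming visible in the listing above:

```python
import itertools
import math


def chunk_keys(shape, chunks):
    """List dot-separated chunk keys for an array of `shape` split into `chunks`.

    Hypothetical helper for illustration only; it is not part of zarr or xarray.
    """
    # Number of chunks along each dimension (the last chunk may be partial).
    counts = [math.ceil(size / chunk) for size, chunk in zip(shape, chunks)]
    # One key per combination of chunk indices, e.g. (1, 0) -> '1.0'.
    return [
        ".".join(str(i) for i in index)
        for index in itertools.product(*(range(n) for n in counts))
    ]


print(chunk_keys((200, 800), (100, 400)))  # the chunking reported above
print(chunk_keys((200, 800), (200, 800)))  # a single chunk covering the array
```

With a single chunk covering the whole array, the only data key would be '0.0', which is the layout I was expecting.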

Expected Output

I would expect the archive not to be chunked.

Versions

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.6 | packaged by conda-forge | (default, Mar 23 2020, 23:03:20)
[GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 3.12.53-60.30-default
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.5
libnetcdf: 4.7.4

xarray: 0.15.2.dev29+g6048356
pandas: 1.0.3
numpy: 1.18.1
scipy: 1.4.1
netCDF4: 1.5.3
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.4.0
cftime: 1.1.1.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.13.0
distributed: 2.13.0
matplotlib: 3.2.1
cartopy: 0.17.0
seaborn: 0.10.0
numbagg: None
pint: None
setuptools: 46.1.3.post20200325
pip: 20.0.2
conda: None
pytest: None
IPython: 7.13.0
sphinx: None

@rabernat
Contributor

rabernat commented May 8, 2020

Thanks for raising this useful issue.

There are two ways to control Zarr chunks:

  • Specify chunks in encoding (always takes precedence)
  • Determine chunks based on Dask chunks

If neither of these is present, Xarray creates the zarr arrays with no chunks specified. In this case, zarr will choose the chunks automatically for you. This behavior is described in the Zarr docs:
https://zarr.readthedocs.io/en/stable/tutorial.html#chunk-size-and-shape

If you are feeling lazy, you can let Zarr guess a chunk shape for your data by providing chunks=True, although please note that the algorithm for guessing a chunk shape is based on simple heuristics and may be far from optimal

You can override this default per variable by specifying a single global chunk in encoding:

ds.foo.encoding['chunks'] = -1

or, at write time,

ds.to_zarr('test.zarr', mode='w', encoding={'foo': {'chunks': -1}})
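
To make the precedence concrete, here is a minimal pure-Python sketch of the decision order described above. It is not Xarray's actual implementation: `resolve_chunks` is a hypothetical name, returning `None` is just a stand-in for "let zarr choose", and taking the first dask block size per dimension is a simplification that assumes uniform dask chunks.

```python
def resolve_chunks(shape, encoding_chunks=None, dask_chunks=None):
    """Illustrative sketch of the chunk precedence; not xarray's real code.

    Returns a chunk shape, or None when the choice is left to zarr.
    """
    if encoding_chunks is not None:
        # 1. Chunks given in encoding always take precedence.
        if encoding_chunks == -1:
            # -1 requests a single global chunk spanning the whole array.
            return tuple(shape)
        return tuple(encoding_chunks)
    if dask_chunks is not None:
        # 2. Otherwise derive chunks from the dask chunks
        #    (assumption: uniform blocks, so the first block size is used).
        return tuple(blocks[0] for blocks in dask_chunks)
    # 3. Nothing specified: zarr picks chunks with its own heuristics.
    return None


print(resolve_chunks((200, 800)))                      # left to zarr
print(resolve_chunks((200, 800), encoding_chunks=-1))  # one global chunk
print(resolve_chunks((200, 800), dask_chunks=((100, 100), (400, 400))))
```

In the original example neither encoding nor dask chunks are set, so the third branch applies and zarr picks the chunks itself.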

I agree that none of this is described well in the Xarray docs. A PR to improve the docs would be most welcome. 😉

@apatlpo
Contributor Author

apatlpo commented May 8, 2020

Thanks for this speedy reply @rabernat !

Improving docs is still within my reach (I hope) and I will give it a shot.
Could this documentation improvement go in the description of the encoding parameter of xarray.Dataset.to_zarr?

@stale

stale bot commented Apr 30, 2022

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity

If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically

@stale stale bot added the stale label Apr 30, 2022
@dcherian dcherian removed the stale label Jan 18, 2023