docs on specifying chunks in to_zarr encoding arg #6542

Merged: 11 commits from delgadom:patch-1 into pydata:main on Jun 23, 2022

Conversation

delgadom
Contributor

The structure of the Dataset.to_zarr encoding argument is particular to xarray (at least, it's not immediately obvious from the zarr docs how this argument gets parsed), and it took a bit of trial and error to figure out the rules. Hoping this docs block is helpful to others!

Documentation additions only (no workflow stages)

delgadom and others added 2 commits April 29, 2022 11:36
@max-sixty
Collaborator

This looks very well-written @delgadom , thank you!

I'll let someone who knows the content better do a review too.

@delgadom
Contributor Author

hmmm, seems I've messed something up in the docs build. Apologies for the churn; will fix.

@delgadom
Contributor Author

delgadom commented May 1, 2022

Sorry, I know I said I'd fix this, but I'm having a very hard time figuring out what is wrong and how to build the docs. I had to set up a Docker image to build them because I couldn't get the ipython directive to use the right conda env on my laptop, and now I'm getting a version error when pandas encounters the default build number on my fork. I'm a bit embarrassed that I can't figure this out, but I think I might need a hand getting this across the finish line 😕

....
#13 1597.6 reading sources... [ 97%] getting-started-guide/faq
#13 1600.8 reading sources... [ 97%] getting-started-guide/index
#13 1600.8 reading sources... [ 97%] getting-started-guide/installing
#13 1601.0 reading sources... [ 97%] getting-started-guide/quick-overview
#13 1609.3 WARNING:
#13 1609.3 >>>-------------------------------------------------------------------------
#13 1609.3 Exception in /xarray/doc/getting-started-guide/quick-overview.rst at block ending on line 169
#13 1609.3 Specify :okexcept: as an option in the ipython:: block to suppress this message
#13 1609.3 ---------------------------------------------------------------------------
#13 1609.3 ImportError                               Traceback (most recent call last)
#13 1609.3 Input In [40], in <cell line: 1>()
#13 1609.3 ----> 1 series.to_xarray()
#13 1609.3
#13 1609.3 File /srv/conda/envs/xarray-doc/lib/python3.9/site-packages/pandas/core/generic.py:3173, in NDFrame.to_xarray(self)
#13 1609.3    3096 @final
#13 1609.3    3097 def to_xarray(self):
#13 1609.3    3098     """
#13 1609.3    3099     Return an xarray object from the pandas object.
#13 1609.3    3100
#13 1609.3    (...)
#13 1609.3    3171         speed    (date, animal) int64 350 18 361 15
#13 1609.3    3172     """
#13 1609.3 -> 3173     xarray = import_optional_dependency("xarray")
#13 1609.3    3175     if self.ndim == 1:
#13 1609.3    3176         return xarray.DataArray.from_series(self)
#13 1609.3
#13 1609.3 File /srv/conda/envs/xarray-doc/lib/python3.9/site-packages/pandas/compat/_optional.py:164, in import_optional_dependency(name, extra, errors, min_version)
#13 1609.3     162             return None
#13 1609.3     163         elif errors == "raise":
#13 1609.3 --> 164             raise ImportError(msg)
#13 1609.3     166 return module
#13 1609.3
#13 1609.3 ImportError: Pandas requires version '0.15.1' or newer of 'xarray' (version '0.1.dev1+gfdf7303' currently installed).
#13 1609.3
#13 1609.3 <<<-------------------------------------------------------------------------
#13 1609.3
#13 1609.3 Exception occurred:
#13 1609.3   File "/srv/conda/envs/xarray-doc/lib/python3.9/site-packages/IPython/sphinxext/ipython_directive.py", line 584, in process_input
#13 1609.3     raise RuntimeError('Non Expected exception in `{}` line {}'.format(filename, lineno))
#13 1609.3 RuntimeError: Non Expected exception in `/xarray/doc/getting-started-guide/quick-overview.rst` line 169
#13 1609.3 The full traceback has been saved in /tmp/sphinx-err-a01y480j.log, if you want to report the issue to the developers.
#13 1609.3 Please also report this if it was a user error, so that a better error message can be provided next time.
#13 1609.3 A bug report can be filed in the tracker at <https://github.com/sphinx-doc/sphinx/issues>. Thanks!
#13 1611.4 make: *** [Makefile:58: html] Error 2
#13 1611.6 ERROR conda.cli.main_run:execute(41): `conda run /bin/bash -c make html` failed. (See above for error)
#13 ERROR: executor failed running [conda run --no-capture-output -n xarray-doc /bin/bash -c make html]: exit code: 2
------
 > [10/10] RUN make html:
------
executor failed running [conda run --no-capture-output -n xarray-doc /bin/bash -c make html]: exit code: 2

@dcherian dcherian requested a review from rabernat May 2, 2022 19:32
@dcherian
Contributor

dcherian commented May 2, 2022

Does this part of the docs help?

@andersy005
Member

ImportError: Pandas requires version '0.15.1' or newer of 'xarray' (version '0.1.dev1+gfdf7303' currently installed).

Looking at the reported xarray version, I'm curious 🧐: are you using a shallow git clone of xarray? I'm able to reproduce the version issue via these steps:

git clone --depth 1 git@github.com:pydata/xarray.git
cd xarray
python -m pip install -e .
conda list xarray
# packages in environment at /Users/andersy005/mambaforge/envs/test:
#
# Name                    Version                   Build  Channel
xarray                    0.1.dev1+g126051f           dev_0    <develop>
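A sketch of one way to fix this, assuming the dev version number comes from setuptools-scm reading git tags that a shallow clone lacks: convert the shallow clone to a full one, then reinstall so the version is recomputed.

```shell
# fetch the full history and tags (a shallow clone omits the tags
# that setuptools-scm uses to derive the version)
git fetch --unshallow --tags
# reinstall so the version metadata is regenerated
python -m pip install -e . --no-deps --force-reinstall
```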

doc/user-guide/io.rst: 6 review comments (outdated, resolved)
@delgadom
Contributor Author

delgadom commented May 5, 2022

@andersy005 thank you so much! Yes, I was using a shallow clone inside the Docker version of the build. I really appreciate the review and for catching my error. I'll clean this up and push the changes.

@delgadom
Contributor Author

delgadom commented May 5, 2022

@dcherian - yep, I was trying to follow that guide closely but was still struggling to build using conda in my laptop's miniconda environment. The sphinx ipython directive kept running in the wrong conda environment, even after deleting sphinx, ipython, and ipykernel from my base env and making sure the env I was running make from had all the dependencies 🤷🏼‍♂️. The Docker image helped with the conda version issue, but then I ran afoul of the xarray version because I used a shallow clone. Oops 😞

delgadom and others added 2 commits May 5, 2022 09:42
Co-authored-by: Anderson Banihirwe <axbanihirwe@ualr.edu>
Contributor

@rabernat rabernat left a comment


This is fantastic! Thanks so much @delgadom.

Reading these docs makes me think that perhaps we should consider changing the defaults to something more intuitive. But we can address that in the future.

@dcherian dcherian enabled auto-merge (squash) May 6, 2022 14:29
@delgadom
Contributor Author

delgadom commented May 6, 2022

now I'm getting reproducible build timeouts 😭

The docs build alone (just make html) takes around an hour on my laptop for xarray/main and just a bit longer on my branch. readthedocs seems to be timing out at 31 minutes. How does this usually work? or am I doing something wrong again?

Thanks for all the help everyone!

@jakirkham

FWIW there's a similar doc page about chunk size in Dask that may be worth borrowing from

@max-sixty
Collaborator

now I'm getting reproducible build timeouts 😭

The docs build alone (just make html) takes around an hour on my laptop for xarray/main and just a bit longer on my branch. readthedocs seems to be timing out at 31 minutes. How does this usually work? or am I doing something wrong again?

Thanks for all the help everyone!

That's not fun at all. FWIW most seem to pass, so you may have been unlucky.

I restarted the build by merging main into the branch. Let's hope this completes.

Thanks a lot for both the contribution and your patience @delgadom ...

@delgadom
Contributor Author

Thank you all for being patient with this PR! Seems the build failed again for the same reason. I think there might be something wrong with my examples, though it beats me what the issue is. As far as I can tell, most builds come in somewhere in the mid-900s of seconds on readthedocs, but my branch consistently times out at 1900s. I'll see if I can muster a bit of time to run through the exact RTD build workflow and figure out what's going on, but I probably won't get to it until this weekend.

@delgadom
Contributor Author

delgadom commented May 10, 2022

@jakirkham were you thinking a reference to the dask docs for more info on optimal chunk sizing and aligning with storage? or are you suggesting the proposed docs change is too complex?

I was trying to address the lack of documentation on specifying chunks within a zarr array for non-dask arrays/coordinates, but also covering the weedsy (but common) case of datasets with a mix of dask & in-memory arrays/coords like in my example. I have been frustrated by zarr stores I've written with a couple dozen array chunks and thousands of coordinate chunks for this reason, but it's definitely a gnarly topic to cover concisely :P

@jakirkham

@jakirkham were you thinking a reference to the dask docs for more info on optimal chunk sizing and aligning with storage?

It could make sense to refer to it, or, if similar ideas come up here, it may be worth mentioning them in this change.

or are you suggesting the proposed docs change is too complex?

Not at all.

I was trying to address the lack of documentation on specifying chunks within a zarr array for non-dask arrays/coordinates, but also covering the weedsy (but common) case of datasets with a mix of dask & in-memory arrays/coords like in my example. I have been frustrated by zarr stores I've written with a couple dozen array chunks and thousands of coordinate chunks for this reason, but it's definitely a gnarly topic to cover concisely :P

If there's anything you need help with or would like to discuss, please don't hesitate to raise a Zarr issue. We also enabled GH discussions over there so if that fits better feel free to use that 🙂


.. ipython:: python

    ds = xr.tutorial.open_dataset("rasm")
Contributor


Can we make this an idealized example to avoid having to download the dataset?

Collaborator


I might be wrong, but don't we use this elsewhere, too? In that case, pooch shouldn't re-download this.

Contributor Author


That's what I thought; it's definitely used in a handful of places already, including within an ipython directive in the sphinx docs (see e.g. zarr encoding specification).
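If the download ever does become a problem, an idealized stand-in along the lines suggested might look like this (a sketch only; the shapes, variable name, and dates are arbitrary, not the real "rasm" tutorial dataset):

```python
import numpy as np
import pandas as pd
import xarray as xr

# synthetic dataset with roughly the same structure as "rasm":
# one 3-D variable over (time, y, x) with a monthly time coordinate
ds = xr.Dataset(
    {"Tair": (("time", "y", "x"), np.random.rand(36, 205, 275))},
    coords={"time": pd.date_range("1980-09-01", periods=36, freq="MS")},
)
print(ds.Tair.shape)
```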

@delgadom
Contributor Author

If there's anything you need help with or would like to discuss, please don't hesitate to raise a Zarr issue. We also enabled GH discussions over there so if that fits better feel free to use that 🙂

@jakirkham appreciate that! And just to clarify, my confusion/frustration in the past was simply around the issue that I'm documenting here, which I've finally figured out! So hopefully this will help resolve the problem for future users. I agree with Ryan that there might be some changes to the defaults that could be helpful here, though setting intuitive defaults other than zarr's could get messy for complex datasets with a mix of dask and in-memory arrays. But I think that's all on the xarray side?

@Illviljan Illviljan added plan to merge Final call for comments and removed needs review labels Jun 18, 2022
@andersy005
Member

Thank you, @delgadom!

@andersy005 andersy005 disabled auto-merge June 23, 2022 21:31
@andersy005 andersy005 merged commit ec41d65 into pydata:main Jun 23, 2022
@andersy005 andersy005 removed the plan to merge Final call for comments label Jun 23, 2022
@delgadom
Contributor Author

@andersy005 last I saw, this was still not building on readthedocs! I never figured out how to get around the build timeout. Are you sure this PR was good to go?

@andersy005
Member

My bad :) Thank you for reporting this in #6720. I'm going to look into what's going on.

dcherian added a commit to bzah/xarray that referenced this pull request Jun 24, 2022
* main: (129 commits)
  docs on specifying chunks in to_zarr encoding arg (pydata#6542)
  [skip-ci] List count under Aggregation (pydata#6711)
  Add `Dataset.dtypes` property (pydata#6706)
  try to import `UndefinedVariableError` from the new location (pydata#6701)
  DOC: note of how `chunks` can be defined (pydata#6696)
  pdyap version dependent client.open_url call (pydata#6656)
  use `pytest-reportlog` to generate upstream-dev CI failure reports (pydata#6699)
  [pre-commit.ci] pre-commit autoupdate (pydata#6694)
  Bump actions/setup-python from 3 to 4 (pydata#6692)
  Fix Dataset.where with drop=True and mixed dims (pydata#6690)
  pass kwargs through from save_mfdataset to to_netcdf (pydata#6686)
  Docs: indexing.rst finetuning (pydata#6685)
  use micromamba instead of mamba (pydata#6674)
  install the development version of `matplotlib` into the upstream-dev CI (pydata#6675)
  Add whatsnew section for v2022.06.0
  release notes for 2022.06.0rc0
  release notes for the pre-release (pydata#6676)
  more testpypi workflow fixes (pydata#6673)
  thin: add examples (pydata#6663)
  Update multidimensional-coords.ipynb (pydata#6672)
  ...
@keewis keewis mentioned this pull request Jun 25, 2022
1 task
@delgadom delgadom deleted the patch-1 branch July 1, 2022 01:20