
New inline_array kwarg for open_dataset #6566

Merged
merged 23 commits into from
May 11, 2022

Conversation

TomNicholas
Member

@TomNicholas TomNicholas commented May 2, 2022

Exposes the inline_array kwarg of dask.array.from_array in xr.open_dataset, and ds/da/variable.chunk.

Setting this to True inlines the array into the opening/chunking task, which avoids an extra array object at the start of the task graph. That's useful because the presence of that single common task connecting otherwise independent parts of the graph can confuse the graph optimizer.

With open_dataset(..., inline_array=False):

With open_dataset(..., inline_array=True):

In our case (xGCM) this is important because once inlined, the optimizer understands that all the remaining parts of the graph are embarrassingly parallel, and realizes that it can fuse all our chunk-wise padding tasks into one padding task per chunk.

I think this option could help in any case where someone is opening data from a Zarr store (the reason we had this opener task) or a netCDF file.

The value of the kwarg should be kept optional because in theory inlining is a tradeoff between fewer tasks and more memory use, but I think there might be a case for setting the default to True?

Questions:

  1. How should I test this?
  2. Should it default to False or True?
  3. inline_array or inline? (inline_array doesn't really make sense for open_dataset, which creates multiple arrays)

@rabernat @jbusecke

@rabernat
Contributor

rabernat commented May 2, 2022

Exposing this option seems like a great idea IMO.

I'm not sure of the best way to test this. I think the most basic test is just to make sure the inline_array=True option gets invoked in the test suite. Going further, one could examine the dask graph to make sure inlining is actually happening, but that sounds fragile and maybe also not xarray's responsibility. Let's just make sure it gets to dask.

```diff
@@ -2710,7 +2723,7 @@ def values(self, values):
                 f"Please use DataArray.assign_coords, Dataset.assign_coords or Dataset.assign as appropriate."
             )
 
-    def chunk(self, chunks={}, name=None, lock=False):
+    def chunk(self, chunks={}, name=None, lock=False, inline_array=False):
```
Contributor

What is the point of this function if it doesn't do anything?

Contributor

It means that Dataset.chunk doesn't have to specifically deal with IndexVariable (convenient!) but is the cause of #6204

@TomNicholas
Member Author

I think the test failure might be because our minimum dependencies CI uses dask_core=2.30, but the inline_array kwarg was added in dask_core=2021.01.0. That's actually only a few versions afterwards, so I'll try bumping the dependency in this PR.

@TomNicholas
Member Author

There is some discussion on the dask PR that added this feature about what the default value for the flag should be. They suggest that at least for datasets opened from zarr it might always be better to inline_array=True. I guess we could change the default in open_zarr, if not in open_dataset?

Tagging @shoyer because he had opinions in that dask PR discussion.

@dcherian
Contributor

dcherian commented May 4, 2022

that's actually only a few versions afterwards, so I'll try bumping the dependency in this PR.

See #6559 getting the env to work took some effort

@TomNicholas
Member Author

@dcherian thanks for the heads-up. #6559 has a recent enough version of dask, so if that gets merged then I can just pull it into this PR and avoid messing about with versions here.

@dcherian
Contributor

dcherian commented May 4, 2022

@TomNicholas can you review and check that I didn't miss anything?

@TomNicholas
Member Author

We discussed this in the team meeting today.

Questions:

  1. How should I test this?

I've added a test which simply counts the number of nodes in the dask graph and checks that it is smaller when inline_array is True.

  2. Should it default to False or True?

We decided False for now, and maybe switch it in a future PR.

  3. inline_array or inline? (inline_array doesn't really make sense for open_dataset, which creates multiple arrays)

I'll just leave it as inline_array for now.

I think this can be merged?

Contributor

@dcherian dcherian left a comment

LGTM. Thanks @TomNicholas

TomNicholas and others added 3 commits May 11, 2022 13:38
Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
@TomNicholas TomNicholas enabled auto-merge (squash) May 11, 2022 17:43
@TomNicholas
Member Author

For some reason counting the number of tasks in the dask graph via len(ds.__dask_graph__()) raises an Error on Windows.

```
>           assert num_graph_nodes(inlined) < num_graph_nodes(not_inlined)

...

>                   os.unlink(fullname)
E                   PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\RUNNER~1\\AppData\\Local\\Temp\\tmpnmm87jlx\\temp-415.nc'
```

I only need to do this in the test, so unless someone knows a more robust way to check the number of tasks in the task graph, I'll just add a skip-on-Windows marker to the test.

@dcherian
Contributor

Let's skip Windows for now.

@crusaderky this looks weird:

For some reason counting the number of tasks in the dask graph via len(ds.__dask_graph__()) raises an error on Windows.

@TomNicholas TomNicholas merged commit 0512da1 into pydata:main May 11, 2022
@TomNicholas TomNicholas deleted the inline_array branch May 11, 2022 20:35
@shoyer
Member

shoyer commented May 11, 2022

For whatever reason, Windows seems to be much stricter about requiring file handles to be explicitly closed. So my guess is that this could be solved by using open_dataset() as a context manager.
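A hedged sketch of that suggestion (the file name and data are illustrative, and it assumes a netCDF backend such as scipy or netCDF4 is installed):

```python
import os
import tempfile

import numpy as np
import xarray as xr

tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "example.nc")
xr.Dataset({"a": ("x", np.arange(4.0))}).to_netcdf(path)

# Using open_dataset as a context manager guarantees the file handle is
# closed on exit, so the temporary file can be deleted even on Windows.
with xr.open_dataset(path) as ds:
    total = float(ds["a"].sum())

os.unlink(path)  # no PermissionError: the handle was released on __exit__
```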

@crusaderky
Contributor

Let's skip Windows for now.

@crusaderky this looks weird:

For some reason counting the number of tasks in the dask graph via len(ds.__dask_graph__()) raises an error on Windows.

I think that's the context manager teardown, not the task counting

dcherian added a commit to dcherian/xarray that referenced this pull request May 20, 2022
* main: (24 commits)
  Fix overflow issue in decode_cf_datetime for dtypes <= np.uint32 (pydata#6598)
  Enable flox in GroupBy and resample (pydata#5734)
  Add setuptools as dependency in ASV benchmark CI (pydata#6609)
  change polyval dim ordering (pydata#6601)
  re-add timedelta support for polyval (pydata#6599)
  Minor Dataset.map docstr clarification (pydata#6595)
  New inline_array kwarg for open_dataset (pydata#6566)
  Fix polyval overloads (pydata#6593)
  Restore old MultiIndex dropping behaviour (pydata#6592)
  [docs] add Dataset.assign_coords example (pydata#6336) (pydata#6558)
  Fix zarr append dtype checks (pydata#6476)
  Add missing space in exception message (pydata#6590)
  Doc Link to accessors list in extending-xarray.rst (pydata#6587)
  Fix Dataset/DataArray.isel with drop=True and scalar DataArray indexes (pydata#6579)
  Add some warnings about rechunking to the docs (pydata#6569)
  [pre-commit.ci] pre-commit autoupdate (pydata#6584)
  terminology.rst: fix link to Unidata's "netcdf_dataset_components" (pydata#6583)
  Allow string formatting of scalar DataArrays (pydata#5981)
  Fix mypy issues & reenable in tests (pydata#6581)
  polyval: Use Horner's algorithm + support chunked inputs (pydata#6548)
  ...
dcherian added a commit to headtr1ck/xarray that referenced this pull request May 20, 2022
commit 398f1b6
Author: dcherian <deepak@cherian.net>
Date:   Fri May 20 08:47:56 2022 -0600

    Backward compatibility dask

commit bde40e4
Merge: 0783df3 4cae8d0
Author: dcherian <deepak@cherian.net>
Date:   Fri May 20 07:54:48 2022 -0600

    Merge branch 'main' into dask-datetime-to-numeric

    * main:
      concatenate docs style (pydata#6621)
      Typing for open_dataset/array/mfdataset and to_netcdf/zarr (pydata#6612)
      {full,zeros,ones}_like typing (pydata#6611)

commit 0783df3
Merge: 5cff4f1 8de7061
Author: dcherian <deepak@cherian.net>
Date:   Sun May 15 21:03:50 2022 -0600

    Merge branch 'main' into dask-datetime-to-numeric

    * main: (24 commits)
      Fix overflow issue in decode_cf_datetime for dtypes <= np.uint32 (pydata#6598)
      Enable flox in GroupBy and resample (pydata#5734)
      Add setuptools as dependency in ASV benchmark CI (pydata#6609)
      change polyval dim ordering (pydata#6601)
      re-add timedelta support for polyval (pydata#6599)
      Minor Dataset.map docstr clarification (pydata#6595)
      New inline_array kwarg for open_dataset (pydata#6566)
      Fix polyval overloads (pydata#6593)
      Restore old MultiIndex dropping behaviour (pydata#6592)
      [docs] add Dataset.assign_coords example (pydata#6336) (pydata#6558)
      Fix zarr append dtype checks (pydata#6476)
      Add missing space in exception message (pydata#6590)
      Doc Link to accessors list in extending-xarray.rst (pydata#6587)
      Fix Dataset/DataArray.isel with drop=True and scalar DataArray indexes (pydata#6579)
      Add some warnings about rechunking to the docs (pydata#6569)
      [pre-commit.ci] pre-commit autoupdate (pydata#6584)
      terminology.rst: fix link to Unidata's "netcdf_dataset_components" (pydata#6583)
      Allow string formatting of scalar DataArrays (pydata#5981)
      Fix mypy issues & reenable in tests (pydata#6581)
      polyval: Use Horner's algorithm + support chunked inputs (pydata#6548)
      ...

commit 5cff4f1
Merge: dfe200d 6144c61
Author: Maximilian Roos <5635139+max-sixty@users.noreply.github.com>
Date:   Sun May 1 15:16:33 2022 -0700

    Merge branch 'main' into dask-datetime-to-numeric

commit dfe200d
Author: dcherian <deepak@cherian.net>
Date:   Sun May 1 11:04:03 2022 -0600

    Minor cleanup

commit 35ed378
Author: dcherian <deepak@cherian.net>
Date:   Sun May 1 10:57:36 2022 -0600

    Support dask arrays in datetime_to_numeric
Labels
enhancement topic-dask topic-zarr Related to zarr storage library
Development

Successfully merging this pull request may close these issues.

Avoid Adapters in task graphs?
5 participants