
New inline_array kwarg for open_dataset #6566

Merged
merged 23 commits into from
May 11, 2022

Conversation

TomNicholas
Member

@TomNicholas TomNicholas commented May 2, 2022

Exposes the inline_array kwarg of dask.array.from_array in xr.open_dataset, and ds/da/variable.chunk.

Setting this to True inlines the array into the opening/chunking task, which avoids an extra array object at the start of the task graph. That's useful because the presence of that single common task connecting otherwise independent parts of the graph can confuse the graph optimizer.

With open_dataset(..., inline_array=False):

With open_dataset(..., inline_array=True):

In our case (xGCM) this is important because once inlined, the optimizer understands that all the remaining parts of the graph are embarrassingly parallel, and realizes that it can fuse all our chunk-wise padding tasks into one padding task per chunk.

I think this option could help in any case where someone is opening data from a Zarr store (the reason we had this opener task) or a netCDF file.

The value of the kwarg should be kept optional because in theory inlining is a tradeoff between fewer tasks and more memory use, but I think there might be a case for setting the default to True?

Questions:

  1. How should I test this?
  2. Should it default to False or True?
  3. inline_array or inline? (inline_array doesn't really make sense for open_dataset, which creates multiple arrays)

@rabernat @jbusecke

@rabernat
Contributor

rabernat commented May 2, 2022

Exposing this option seems like a great idea IMO.

I'm not sure of the best way to test this. I think the most basic test is just to make sure the inline_array=True option gets invoked in the test suite. Going further, one could examine the dask graph to make sure inlining is actually happening, but that sounds fragile and maybe also not xarray's responsibility. Let's just make sure it gets to dask.

```diff
@@ -2710,7 +2723,7 @@ def values(self, values):
                 f"Please use DataArray.assign_coords, Dataset.assign_coords or Dataset.assign as appropriate."
             )
 
-    def chunk(self, chunks={}, name=None, lock=False):
+    def chunk(self, chunks={}, name=None, lock=False, inline_array=False):
```
Contributor

What is the point of this function if it doesn't do anything?

Contributor

It means that Dataset.chunk doesn't have to specifically deal with IndexVariable (convenient!) but is the cause of #6204

@TomNicholas
Member Author

I think the test failure might be because our minimum dependencies CI uses dask_core=2.30, but the inline_array kwarg was added in dask_core=2021.01.0. That's actually only a few versions afterwards, so I'll try bumping the dependency in this PR.

@TomNicholas
Member Author

There is some discussion on the dask PR that added this feature about what the default value for the flag should be. They suggest that at least for datasets opened from zarr it might always be better to inline_array=True. I guess we could change the default in open_zarr, if not in open_dataset?

Tagging @shoyer because he had opinions in that dask PR discussion.

@dcherian
Contributor

dcherian commented May 4, 2022

that's actually only a few versions afterwards, so I'll try bumping the dependency in this PR.

See #6559 getting the env to work took some effort

@TomNicholas
Member Author

@dcherian thanks for the heads-up. #6559 has a recent enough version of dask, so if that gets merged then I can just pull it into this PR and avoid messing about with versions here.

@dcherian
Contributor

dcherian commented May 4, 2022

@TomNicholas can you review and check that I didn't miss anything?

@TomNicholas
Member Author

We discussed this in the team meeting today.

Questions:

  1. How should I test this?

I've added a test which simply counts the number of nodes in the dask graph and checks that it is smaller when inline_array is True.

  2. Should it default to False or True?

We decided False for now, and maybe switch it in a future PR.

  3. inline_array or inline? (inline_array doesn't really make sense for open_dataset, which creates multiple arrays)

I'll just leave it as inline_array for now.

I think this can be merged?

Contributor

@dcherian dcherian left a comment

LGTM. Thanks @TomNicholas

TomNicholas and others added 3 commits May 11, 2022 13:38
Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
@TomNicholas TomNicholas enabled auto-merge (squash) May 11, 2022 17:43
@TomNicholas
Member Author

For some reason counting the number of tasks in the dask graph via len(ds.__dask_graph__()) raises an Error on Windows.

```
>           assert num_graph_nodes(inlined) < num_graph_nodes(not_inlined)

...

>                   os.unlink(fullname)
E                   PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\RUNNER~1\\AppData\\Local\\Temp\\tmpnmm87jlx\\temp-415.nc'
```

I only need to do this in the test, so unless someone knows a more robust way to check the number of tasks in the task graph, I'll just add a skip-on-Windows marker to the test.

@dcherian
Contributor

Let's skip Windows for now.

@crusaderky this looks weird:

For some reason counting the number of tasks in the dask graph via len(ds.__dask_graph__()) raises an error on Windows.

@TomNicholas TomNicholas merged commit 0512da1 into pydata:main May 11, 2022
@TomNicholas TomNicholas deleted the inline_array branch May 11, 2022 20:35
@shoyer
Member

shoyer commented May 11, 2022

For whatever reason, Windows seems to be much stricter about requiring file handles to be explicitly closed. So my guess is that this could be solved by using open_dataset() as a context manager.
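A hedged sketch of that suggestion (the file name and data are illustrative, and it assumes a netCDF backend such as scipy or netCDF4 is installed):

```python
import os
import tempfile

import numpy as np
import xarray as xr

tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "example.nc")
xr.Dataset({"a": ("x", np.arange(4.0))}).to_netcdf(path)

# Using open_dataset as a context manager guarantees the file handle is
# closed on exit, so the temporary file can be deleted even on Windows.
with xr.open_dataset(path) as ds:
    total = float(ds["a"].sum())

os.unlink(path)  # no PermissionError: the handle was released on __exit__
```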

@crusaderky
Contributor

Let's skip Windows for now.

@crusaderky this looks weird:

For some reason counting the number of tasks in the dask graph via len(ds.__dask_graph__()) raises an error on Windows.

I think that's the context manager teardown, not the task counting

dcherian added a commit to dcherian/xarray that referenced this pull request May 20, 2022
* main: (24 commits)
  Fix overflow issue in decode_cf_datetime for dtypes <= np.uint32 (pydata#6598)
  Enable flox in GroupBy and resample (pydata#5734)
  Add setuptools as dependency in ASV benchmark CI (pydata#6609)
  change polyval dim ordering (pydata#6601)
  re-add timedelta support for polyval (pydata#6599)
  Minor Dataset.map docstr clarification (pydata#6595)
  New inline_array kwarg for open_dataset (pydata#6566)
  Fix polyval overloads (pydata#6593)
  Restore old MultiIndex dropping behaviour (pydata#6592)
  [docs] add Dataset.assign_coords example (pydata#6336) (pydata#6558)
  Fix zarr append dtype checks (pydata#6476)
  Add missing space in exception message (pydata#6590)
  Doc Link to accessors list in extending-xarray.rst (pydata#6587)
  Fix Dataset/DataArray.isel with drop=True and scalar DataArray indexes (pydata#6579)
  Add some warnings about rechunking to the docs (pydata#6569)
  [pre-commit.ci] pre-commit autoupdate (pydata#6584)
  terminology.rst: fix link to Unidata's "netcdf_dataset_components" (pydata#6583)
  Allow string formatting of scalar DataArrays (pydata#5981)
  Fix mypy issues & reenable in tests (pydata#6581)
  polyval: Use Horner's algorithm + support chunked inputs (pydata#6548)
  ...
dcherian added a commit to headtr1ck/xarray that referenced this pull request May 20, 2022
commit 398f1b6
Author: dcherian <deepak@cherian.net>
Date:   Fri May 20 08:47:56 2022 -0600

    Backward compatibility dask

commit bde40e4
Merge: 0783df3 4cae8d0
Author: dcherian <deepak@cherian.net>
Date:   Fri May 20 07:54:48 2022 -0600

    Merge branch 'main' into dask-datetime-to-numeric

    * main:
      concatenate docs style (pydata#6621)
      Typing for open_dataset/array/mfdataset and to_netcdf/zarr (pydata#6612)
      {full,zeros,ones}_like typing (pydata#6611)

commit 0783df3
Merge: 5cff4f1 8de7061
Author: dcherian <deepak@cherian.net>
Date:   Sun May 15 21:03:50 2022 -0600

    Merge branch 'main' into dask-datetime-to-numeric

    * main: (24 commits)
      Fix overflow issue in decode_cf_datetime for dtypes <= np.uint32 (pydata#6598)
      Enable flox in GroupBy and resample (pydata#5734)
      Add setuptools as dependency in ASV benchmark CI (pydata#6609)
      change polyval dim ordering (pydata#6601)
      re-add timedelta support for polyval (pydata#6599)
      Minor Dataset.map docstr clarification (pydata#6595)
      New inline_array kwarg for open_dataset (pydata#6566)
      Fix polyval overloads (pydata#6593)
      Restore old MultiIndex dropping behaviour (pydata#6592)
      [docs] add Dataset.assign_coords example (pydata#6336) (pydata#6558)
      Fix zarr append dtype checks (pydata#6476)
      Add missing space in exception message (pydata#6590)
      Doc Link to accessors list in extending-xarray.rst (pydata#6587)
      Fix Dataset/DataArray.isel with drop=True and scalar DataArray indexes (pydata#6579)
      Add some warnings about rechunking to the docs (pydata#6569)
      [pre-commit.ci] pre-commit autoupdate (pydata#6584)
      terminology.rst: fix link to Unidata's "netcdf_dataset_components" (pydata#6583)
      Allow string formatting of scalar DataArrays (pydata#5981)
      Fix mypy issues & reenable in tests (pydata#6581)
      polyval: Use Horner's algorithm + support chunked inputs (pydata#6548)
      ...

commit 5cff4f1
Merge: dfe200d 6144c61
Author: Maximilian Roos <5635139+max-sixty@users.noreply.github.com>
Date:   Sun May 1 15:16:33 2022 -0700

    Merge branch 'main' into dask-datetime-to-numeric

commit dfe200d
Author: dcherian <deepak@cherian.net>
Date:   Sun May 1 11:04:03 2022 -0600

    Minor cleanup

commit 35ed378
Author: dcherian <deepak@cherian.net>
Date:   Sun May 1 10:57:36 2022 -0600

    Support dask arrays in datetime_to_numeric
Labels
enhancement topic-dask topic-zarr Related to zarr storage library
Development

Successfully merging this pull request may close these issues.

Avoid Adapters in task graphs?
5 participants