New inline_array kwarg for open_dataset #6566
Conversation
Exposing this option seems like a great idea IMO. I'm not sure of the best way to test this. I think the most basic test is just to make sure the …
```diff
@@ -2710,7 +2723,7 @@ def values(self, values):
                 f"Please use DataArray.assign_coords, Dataset.assign_coords or Dataset.assign as appropriate."
             )

-    def chunk(self, chunks={}, name=None, lock=False):
+    def chunk(self, chunks={}, name=None, lock=False, inline_array=False):
```
What is the point of this function if it doesn't do anything?
It means that `Dataset.chunk` doesn't have to specifically deal with `IndexVariable` (convenient!) but is the cause of #6204
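For context, a minimal sketch (simplified and assumed, not xarray's actual code) of the pattern being discussed: `IndexVariable.chunk` keeps `Variable.chunk`'s signature but deliberately does nothing, so `Dataset.chunk` can call `.chunk(...)` on every variable uniformly:

```python
import dask.array as da

class Variable:
    def __init__(self, data):
        self.data = data

    def chunk(self, chunks={}, name=None, lock=False, inline_array=False):
        # Wrap the underlying data in a dask array, forwarding inline_array
        # through to dask.array.from_array.
        data = da.from_array(
            self.data,
            chunks=chunks or "auto",
            name=name,
            lock=lock,
            inline_array=inline_array,
        )
        return Variable(data)

class IndexVariable(Variable):
    def chunk(self, chunks={}, name=None, lock=False, inline_array=False):
        # Indexes must stay in memory, so this is a deliberate no-op that
        # accepts the kwargs only for signature compatibility.
        return self
```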
I think the test failure might be because our minimum dependencies CI uses an older dask version that predates `inline_array`.
There is some discussion on the dask PR that added this feature about what the default value for the flag should be. They suggest that at least for datasets opened from zarr it might always be better to inline. Tagging @shoyer because he had opinions in that dask PR discussion.
See #6559; getting the env to work took some effort.
@TomNicholas can you review and check that I didn't miss anything?
We discussed this in the team meeting today.
I've added a test which simply counts the number of nodes in the dask graph and checks that it is smaller when `inline_array=True`.
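A rough sketch of that kind of test (the file path and variable name are illustrative, and this is not the exact test added in this PR):

```python
import xarray as xr

# Open the same file chunked, with and without inlining.
ds_default = xr.open_dataset("data.nc", chunks={}, inline_array=False)
ds_inlined = xr.open_dataset("data.nc", chunks={}, inline_array=True)

# Dask collections expose their task graph via __dask_graph__(), and
# len() gives the number of keys (tasks) in that graph.
n_default = len(ds_default["var"].data.__dask_graph__())
n_inlined = len(ds_inlined["var"].data.__dask_graph__())

# Inlining embeds the source array into each chunk task instead of keeping
# a standalone "open" task, so the inlined graph should be strictly smaller.
assert n_inlined < n_default
```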
We decided on False for now, and we may switch it in a future PR.
I'll just leave it as is. I think this can be merged?
LGTM. Thanks @TomNicholas
Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
For some reason counting the number of tasks in the dask graph fails on Windows:

...

I only need to do this in the test, so unless someone knows a more robust way to check the number of tasks in the task graph, I'll just add a `skipif` for Windows to the test.
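That marker would look something like this (the test name is illustrative):

```python
import sys
import pytest

@pytest.mark.skipif(
    sys.platform == "win32",
    reason="counting tasks in the dask graph fails on Windows",
)
def test_inline_array_reduces_task_count():
    ...
```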
Let's skip Windows for now. @crusaderky this looks weird:
For whatever reason, Windows seems to be much stricter about requiring file handles to be explicitly closed. So my guess is that this could be solved by using …
I think that's the context manager teardown, not the task counting.
Exposes the `inline_array` kwarg of `dask.array.from_array` in `xr.open_dataset` and in `ds`/`da`/`variable.chunk`.

What setting this to True does is inline the array into the opening/chunking task, which avoids an extra array object at the start of the task graph. That's useful because the presence of that single common task connecting otherwise independent parts of the graph can confuse the graph optimizer.
With `open_dataset(..., inline_array=False)`:

[task graph visualization]

With `open_dataset(..., inline_array=True)`:

[task graph visualization]
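The same effect can be demonstrated at the dask level; a small illustrative snippet (an in-memory array stands in for a zarr/netCDF-backed one):

```python
import numpy as np
import dask.array as da

x = np.arange(8)

not_inlined = da.from_array(x, chunks=4, inline_array=False)
inlined = da.from_array(x, chunks=4, inline_array=True)

# Without inlining there is one standalone key holding the source array
# plus one getter task per chunk; with inlining the source array is
# embedded directly into each getter task.
print(len(not_inlined.__dask_graph__()))  # 3 keys: 1 array + 2 getters
print(len(inlined.__dask_graph__()))      # 2 keys: 2 getters
```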
In our case (xGCM) this is important because once inlined the optimizer understands that all the remaining parts of the graph are embarrassingly parallel, and realizes that it can fuse all our chunk-wise padding tasks into one padding task per chunk.
I think this option could help in any case where someone is opening data from a Zarr store (the reason we had this opener task) or a netCDF file.
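For example, a hedged usage sketch (the store path is illustrative; `chunks={}` requests the on-disk chunking):

```python
import xarray as xr

# inline_array is forwarded to dask.array.from_array when chunking
ds = xr.open_dataset("store.zarr", engine="zarr", chunks={}, inline_array=True)
```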
The value of the kwarg should be kept optional because, in theory, inlining is a tradeoff between fewer tasks and more memory use, but I think there might be a case for setting the default to True?
Questions:

- `False` or `True`?
- `inline_array` or `inline`? (`inline_array` doesn't really make sense for `open_dataset`, which creates multiple arrays)

`whats-new.rst`
@rabernat @jbusecke