Stop loading tutorial data by default #2538
Hello @jhamman! Thanks for updating the PR.
Comment last updated on November 05, 2018 at 14:17 Hours UTC
Our current tutorial datasets are 8MB and 17MB, which is pretty small. You'll definitely get better performance loading datasets of this size into NumPy arrays.
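@shoyer - absolutely we'll get better performance with numpy arrays in this case. So I'm trying to use our tutorial datasets for some examples with dask (dask/dask-examples#51). The docstring for the load_dataset function states that we can pass kwargs on to the open_dataset function but if we pass chunks to the load_dataset call currently, we still get data back as numpy arrays. We have some other options here:
1. if chunks is a kwargs, return a dataset with data as persisted dask arrays
2. provide a second function to handle returning datasets using the same logic as open_dataset (caching, dask arrays, lazy loading, etc.)
3. tell people (like me) to rechunk the dataset after the fact
(3) won't require any changes but makes it a little harder to connect the typical use pattern of open_dataset with tutorial.load_dataset.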
OK, that seems reasonable. The default behavior should cache the arrays
loaded with NumPy anyways. I would not be opposed to renaming this to
open_dataset, either.
On Sun, Nov 4, 2018 at 9:19 AM Joe Hamman wrote:
@shoyer - absolutely we'll get better
performance with numpy arrays in this case. So I'm trying to use our
tutorial datasets for some examples with dask (dask/dask-examples#51).
The docstring for the
load_dataset function states that we can pass kwargs on to the
open_dataset function but if we pass chunks to the load_dataset call
currently, we still get data back as numpy arrays. We have some other
options here:
1. if chunks is a kwargs, return a dataset with data as persisted dask
arrays
2. provide a second function to handle returning datasets using the
same logic as open_dataset (caching, dask arrays, lazy loading, etc.)
3. tell people (like me) to rechunk the dataset after the fact
(3) won't require any changes but makes it a little harder to connect the
typical use pattern of open_dataset with tutorial.load_dataset.
Sorry, to be clear, what I meant here is that by default arrays loaded with NumPy get cached after the first access/operation. Not that we need to preserve the existing behavior of
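To illustrate the caching behavior described in the comment above, here is a minimal sketch in plain Python. It is not xarray's implementation: `LazyCachedArray`, its `reads` counter, and the list-returning loader are hypothetical stand-ins, assumed only to show "read on first access, reuse afterwards".

```python
class LazyCachedArray:
    """Hypothetical stand-in for a lazily loaded array that caches its data."""

    def __init__(self, loader):
        self._loader = loader  # callable that performs the (expensive) read
        self._cache = None
        self._loaded = False
        self.reads = 0  # counts how many times the loader actually ran

    @property
    def values(self):
        # Read from "disk" only on the first access, then reuse the cache.
        if not self._loaded:
            self._cache = self._loader()
            self._loaded = True
            self.reads += 1
        return self._cache


# Nothing is read when the object is created.
arr = LazyCachedArray(lambda: [1.0, 2.0, 3.0])
reads_before_access = arr.reads

# Two separate accesses trigger only one read.
total = sum(arr.values)
mean = total / len(arr.values)
```

Under this assumption, opening a dataset lazily costs little for small files, since the first operation pulls the data into memory and later operations reuse it.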
@shoyer - I think I was tracking with you. I've gone ahead and deprecated the current
* upstream/master: (122 commits)
  add missing , and article in error message (pydata#2557)
  Add libnetcdf, libhdf5, pydap and cfgrib to xarray.show_versions() (pydata#2555)
  revert to dev version for 0.11.1
  Release xarray v0.11
  DOC: update whatsnew for xarray 0.11 release (pydata#2548)
  Drop the hack needed to use CachingFileManager as we don't use it anymore. (pydata#2544)
  add full test env for py37 ci env (pydata#2545)
  Remove old-style resample example in documentation (pydata#2543)
  Stop loading tutorial data by default (pydata#2538)
  Remove the old syntax for resample. (pydata#2541)
  Remove use of deprecated, unused keyword. (pydata#2540)
  Deprecate inplace (pydata#2524)
  Zarr chunking (GH2300) (pydata#2487)
  Include multidimensional stacking groupby in docs (pydata#2493) (pydata#2536)
  Switch enable_cftimeindex to True by default (pydata#2516)
  Raise more informative error when converting tuples to Variable. (pydata#2523)
  Global option to always keep/discard attrs on operations (pydata#2482)
  Remove tests where answers change in cftime 1.0.2.1 (pydata#2522)
  Finish deprecation cycle for DataArray.__contains__ checking array values (pydata#2520)
  Fix bug where OverflowError is not being raised (pydata#2519)
  ...
whats-new.rst for all changes and api.rst for new API
In working on an xarray/dask tutorial, I've come to realize we eagerly load the tutorial datasets in xarray.tutorial.load_dataset. I don't think we should do that, but I could be missing some rationale. I didn't open an issue, so please feel free to share thoughts here.
One option would be to create a new function (xr.tutorial.open_dataset) that does what I'm suggesting and then slowly deprecate tutorial.load_dataset. Thoughts?
xref: dask/dask-examples#51
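The proposal above (a lazy open_dataset plus a slowly deprecated load_dataset) could be sketched roughly as follows. This is a self-contained illustration, not the code in this PR: `FakeDataset`, the `"example_dataset"` name, and the warning text are hypothetical, and the real functions would fetch and open an actual file.

```python
import warnings


class FakeDataset:
    """Hypothetical stand-in for a dataset that can be loaded into memory."""

    def __init__(self, name, chunks=None):
        self.name = name
        self.chunks = chunks  # kept as-is, the way a lazy open would honor it
        self.in_memory = False

    def load(self):
        # Eagerly pull the data into memory and return the same object.
        self.in_memory = True
        return self


def open_dataset(name, **kws):
    """Open lazily: keyword arguments such as chunks are passed through."""
    return FakeDataset(name, **kws)


def load_dataset(*args, **kws):
    """Deprecated eager variant, kept as a thin wrapper during the transition."""
    warnings.warn(
        "load_dataset is deprecated; use open_dataset(...).load() instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return open_dataset(*args, **kws).load()


# Lazy by default: chunks survive and nothing is loaded yet.
lazy = open_dataset("example_dataset", chunks={"time": 100})

# The old entry point still works, but warns and loads eagerly.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    eager = load_dataset("example_dataset")
```

Keeping load_dataset as a wrapper around open_dataset(...).load() means existing tutorials keep working while new examples, including dask ones that pass chunks, can use the lazy function directly.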