-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid accessing slow .data in unstack #5906
Conversation
There are two changes here
cc @dcherian |
xarray/core/dataset.py
Outdated
# Dask arrays don't support assignment by index, which the fast unstack | ||
# function requires. | ||
# https://github.com/pydata/xarray/pull/4746#issuecomment-753282125 | ||
any(is_duck_dask_array(v.data) for v in nonindexes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same loop is being used here for each of these checks.
- would it be better to add them all in one loop?
- since they all use any() I think we can break the loop at the first sight of true value.
- sparse is a constant, waiting on a loop to finish before seems unnecessary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 6363a76 (hopefully I got that right).
Agreed with doing it all in one loop since I suspect that perhaps allocating v.data
will be more expensive that any of the checks, so we should avoid doing that as long as possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you use a normal for loop you can remove two of the v.data.
for v in nonindexes:
data_ = v.data
....
Might be faster?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that after the initial access of v.data
, which allocates the data if it isn't yet, subsequent ones will be fast. Or is there some case that I'm missing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok that may be the case. I've seen though that just calling a property can be quite slow sometimes.
Nice work! Our benchmarks show 2X speedup |
Thanks @TomAugspurger |
* main: Add typing_extensions as a required dependency (pydata#5911) pydata#5740 follow up: supress xr.ufunc warnings in tests (pydata#5914) Avoid accessing slow .data in unstack (pydata#5906) Add wradlib to ecosystem in docs (pydata#5915) Use .to_numpy() for quantified facetgrids (pydata#5886) [test-upstream] fix pd skipna=None (pydata#5899) Add var and std to weighted computations (pydata#5870) Check for path-like objects rather than Path type, use os.fspath (pydata#5879) Handle single `PathLike` objects in `open_mfdataset()` (pydata#5884)
* upstream/main: Add typing_extensions as a required dependency (pydata#5911) pydata#5740 follow up: supress xr.ufunc warnings in tests (pydata#5914) Avoid accessing slow .data in unstack (pydata#5906) Add wradlib to ecosystem in docs (pydata#5915) Use .to_numpy() for quantified facetgrids (pydata#5886) [test-upstream] fix pd skipna=None (pydata#5899) Add var and std to weighted computations (pydata#5870)
DataArray.unstack()
from checkingvariable.data
#5902pre-commit run --all-files
whats-new.rst
api.rst