Avoid accessing slow .data in unstack #5906

TomAugspurger · 2021-10-28T13:39:36Z

Closes Slow performance of DataArray.unstack() from checking variable.data #5902
Passes pre-commit run --all-files
User visible changes (including notable bug fixes) are documented in whats-new.rst
New functions/methods are listed in api.rst

Closes pydata#5902

github-actions · 2021-10-28T13:41:19Z

Unit Test Results

        6 files         6 suites 56m 25s ⏱️
16 278 tests 14 550 ✔️ 1 728 💤 0 ❌
90 864 runs 82 692 ✔️ 8 172 💤 0 ❌

Results for commit 6363a76.

♻️ This comment has been updated with latest results.

TomAugspurger · 2021-10-28T13:43:04Z

There are two changes here:

1. Only check the .data of non-index variables, done at https://github.com/pydata/xarray/pull/5906/files#diff-763e3002fd954d544b05858d8d138b828b66b6a2a0ae3cd58d2040a652f14638R4161-R4163
2. The check for whether or not a full reindex was needed was done inside a `for dim in dims` loop, but the condition doesn't actually depend on `dim`. So I lifted that check out of the for loop (doesn't matter much, since stuff is cached).
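
The two changes above can be sketched roughly as follows. This is a minimal illustration, not the code in xarray/core/dataset.py: `LazyVariable`, `needs_full_reindex`, `unstack_all`, and the `is_unsupported` predicate are hypothetical names, and the `loads` counter only exists to show how often `.data` is touched.

```python
from dataclasses import dataclass


@dataclass
class LazyVariable:
    """Hypothetical stand-in for a variable whose .data is expensive to load."""

    values: list
    loads: int = 0

    @property
    def data(self):
        self.loads += 1  # count how often the expensive access happens
        return self.values


def needs_full_reindex(variables, index_names, is_unsupported):
    # Change 1: only look at the .data of variables that are not indexes.
    nonindexes = [v for name, v in variables.items() if name not in index_names]
    return any(is_unsupported(v.data) for v in nonindexes)


def unstack_all(variables, index_names, dims, is_unsupported):
    # Change 2: the condition does not depend on `dim`, so evaluate it once,
    # before the loop, instead of once per dimension.
    full = needs_full_reindex(variables, index_names, is_unsupported)
    return [(dim, "full_reindex" if full else "fast") for dim in dims]
```

With this shape, the index variable's `.data` is never read, and each non-index variable is inspected once no matter how many dimensions are unstacked.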

cc @dcherian

Illviljan · 2021-10-28T18:55:19Z

xarray/core/dataset.py

+            # Dask arrays don't support assignment by index, which the fast unstack
+            # function requires.
+            # https://github.com/pydata/xarray/pull/4746#issuecomment-753282125
+            any(is_duck_dask_array(v.data) for v in nonindexes)


The same loop is being used here for each of these checks.

Would it be better to do them all in one loop?

Since they all use any(), I think we can break out of the loop at the first true value.

sparse is a constant, so waiting for a loop to finish before checking it seems unnecessary.

TomAugspurger · 2021-10-28

Done in 6363a76 (hopefully I got that right).

Agreed with doing it all in one loop, since I suspect that allocating v.data will be more expensive than any of the checks, so we should avoid doing that for as long as possible.
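
As a rough sketch of the single-loop idea discussed above (not the code from 6363a76): `CountingVariable`, `is_lazy`, and `is_not_list` are hypothetical stand-ins for the real variable type and array checks. The point is that `or` and `any()` both short-circuit, so `.data` is not touched once the answer is known.

```python
class CountingVariable:
    """Hypothetical variable that records each .data access."""

    def __init__(self, values):
        self._values = values
        self.accesses = 0

    @property
    def data(self):
        self.accesses += 1
        return self._values


def is_lazy(data):
    # Stand-in for a "is this a dask array?" check.
    return isinstance(data, tuple)


def is_not_list(data):
    # Stand-in for a "is this not a plain in-memory array?" check.
    return not isinstance(data, list)


def needs_full_reindex(nonindexes, sparse=False):
    # `sparse` is a constant flag, so `or` returns before any loop runs.
    # Inside the generator, `or` stops at the first true check and any()
    # stops at the first variable that needs the slow path.
    return sparse or any(
        is_lazy(v.data) or is_not_list(v.data) for v in nonindexes
    )
```

Note that a variable passing the first check still has `.data` read once per check in this form.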

Illviljan · 2021-10-28

If you use a normal for loop you can remove two of the v.data accesses.

for v in nonindexes: data_ = v.data ....

Might be faster?

TomAugspurger · 2021-10-29

I think that after the initial access of v.data, which allocates the data if it isn't allocated yet, subsequent accesses will be fast. Or is there some case that I'm missing?

Illviljan · 2021-10-29

OK, that may be the case. I've seen, though, that just calling a property can be quite slow sometimes.
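
For reference, the "normal for loop" variant suggested in this thread might look like the sketch below. It is an illustration under assumptions, not the merged code: `CountingVariable` and the `checks` argument are hypothetical, and whether it is measurably faster depends on how costly the property access is in practice.

```python
class CountingVariable:
    """Hypothetical variable that records each .data access."""

    def __init__(self, values):
        self._values = values
        self.accesses = 0

    @property
    def data(self):
        self.accesses += 1
        return self._values


def needs_full_reindex(nonindexes, checks):
    for v in nonindexes:
        data_ = v.data  # one property access per variable, reused below
        if any(check(data_) for check in checks):
            return True  # stop at the first variable needing the slow path
    return False
```

Binding `v.data` to a local name means every check sees the same object and the property is evaluated once per variable, regardless of how many checks run.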

dcherian · 2021-10-28T22:40:30Z

Nice work! Our benchmarks show 2X speedup

dcherian · 2021-10-29T15:14:35Z

Thanks @TomAugspurger

* main:
  Add typing_extensions as a required dependency (pydata#5911)
  pydata#5740 follow up: supress xr.ufunc warnings in tests (pydata#5914)
  Avoid accessing slow .data in unstack (pydata#5906)
  Add wradlib to ecosystem in docs (pydata#5915)
  Use .to_numpy() for quantified facetgrids (pydata#5886)
  [test-upstream] fix pd skipna=None (pydata#5899)
  Add var and std to weighted computations (pydata#5870)
  Check for path-like objects rather than Path type, use os.fspath (pydata#5879)
  Handle single `PathLike` objects in `open_mfdataset()` (pydata#5884)

* upstream/main:
  Add typing_extensions as a required dependency (pydata#5911)
  pydata#5740 follow up: supress xr.ufunc warnings in tests (pydata#5914)
  Avoid accessing slow .data in unstack (pydata#5906)
  Add wradlib to ecosystem in docs (pydata#5915)
  Use .to_numpy() for quantified facetgrids (pydata#5886)
  [test-upstream] fix pd skipna=None (pydata#5899)
  Add var and std to weighted computations (pydata#5870)

Tom Augspurger added 2 commits October 28, 2021 08:36

Avoid accessing slow .data in unstack

73c2a91

Closes pydata#5902

Added PR number

d0a7657

dcherian reviewed Oct 28, 2021

Update xarray/core/dataset.py

bba423f

dcherian added the run-benchmark ("Run the ASV benchmark workflow") label Oct 28, 2021

Illviljan reviewed Oct 28, 2021

Refactor checks to avoid .data more

6363a76

Illviljan added and removed the run-benchmark ("Run the ASV benchmark workflow") label Oct 28, 2021

dcherian merged commit b2ed62e into pydata:main Oct 29, 2021

TomAugspurger deleted the fix/5902-unstack-perf branch October 29, 2021 15:29

snowman2 pushed a commit to snowman2/xarray that referenced this pull request Feb 9, 2022

Avoid accessing slow .data in unstack (pydata#5906)

b27b601
