Avoid accessing slow .data in unstack #5906

Merged (4 commits) on Oct 29, 2021
1 change: 1 addition & 0 deletions doc/whats-new.rst
@@ -82,6 +82,7 @@ Bug fixes
   By `Maxime Liquet <https://github.com/maximlt>`_.
 - ``open_mfdataset()`` now accepts a single ``pathlib.Path`` object (:issue:`5881`).
   By `Panos Mavrogiorgos <https://github.com/pmav99>`_.
+- Improved performance of :py:meth:`Dataset.unstack` (:pull:`5906`). By `Tom Augspurger <https://github.com/TomAugspurger>`_.
 
 Documentation
 ~~~~~~~~~~~~~
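The slowdown this entry refers to comes from materializing index labels into an object-dtype ndarray. A minimal sketch of why that allocation is costly, using plain NumPy rather than xarray internals (`labels` and the loop are illustrative, not the actual xarray code path):

```python
import numpy as np

# Hypothetical stand-in for MultiIndex labels: one Python tuple per element.
labels = [(i, j) for i in range(1000) for j in range(10)]

# Materializing them as an object-dtype ndarray allocates and assigns one
# Python object per element -- the kind of cost `.data` paid when accessed
# on an index-backed variable.
arr = np.empty(len(labels), dtype=object)
for k, lab in enumerate(labels):
    arr[k] = lab

assert arr.dtype == object
assert arr[0] == (0, 0)
```

The per-element Python-object work scales linearly with index size, which is why the change below avoids triggering it just to run a type check.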
54 changes: 27 additions & 27 deletions xarray/core/dataset.py
@@ -4153,34 +4153,34 @@ def unstack(
             )
 
         result = self.copy(deep=False)
-        for dim in dims:
-
-            if (
-                # Dask arrays don't support assignment by index, which the fast unstack
-                # function requires.
-                # https://github.com/pydata/xarray/pull/4746#issuecomment-753282125
-                any(is_duck_dask_array(v.data) for v in self.variables.values())
-                # Sparse doesn't currently support (though we could special-case
-                # it)
-                # https://github.com/pydata/sparse/issues/422
-                or any(
-                    isinstance(v.data, sparse_array_type)
-                    for v in self.variables.values()
-                )
-                or sparse
-                # Until https://github.com/pydata/xarray/pull/4751 is resolved,
-                # we check explicitly whether it's a numpy array. Once that is
-                # resolved, explicitly exclude pint arrays.
-                # # pint doesn't implement `np.full_like` in a way that's
-                # # currently compatible.
-                # # https://github.com/pydata/xarray/pull/4746#issuecomment-753425173
-                # # or any(
-                # # isinstance(v.data, pint_array_type) for v in self.variables.values()
-                # # )
-                or any(
-                    not isinstance(v.data, np.ndarray) for v in self.variables.values()
-                )
-            ):
+
+        # we want to avoid allocating an object-dtype ndarray for a MultiIndex,
+        # so we can't just access self.variables[v].data for every variable.
+        # We only check the non-index variables.
+        # https://github.com/pydata/xarray/issues/5902
+        nonindexes = [
+            self.variables[k] for k in set(self.variables) - set(self.xindexes)
+        ]
+        # Notes for each of these cases:
+        # 1. Dask arrays don't support assignment by index, which the fast unstack
+        #    function requires.
+        #    https://github.com/pydata/xarray/pull/4746#issuecomment-753282125
+        # 2. Sparse doesn't currently support it (though we could special-case it)
+        #    https://github.com/pydata/sparse/issues/422
+        # 3. pint requires checking whether it's a NumPy array until
+        #    https://github.com/pydata/xarray/pull/4751 is resolved.
+        #    Once that is resolved, explicitly exclude pint arrays;
+        #    pint doesn't implement `np.full_like` in a way that's
+        #    currently compatible.
+        needs_full_reindex = sparse or any(
+            is_duck_dask_array(v.data)
+            or isinstance(v.data, sparse_array_type)
+            or not isinstance(v.data, np.ndarray)
+            for v in nonindexes
+        )
+
+        for dim in dims:
+            if needs_full_reindex:
                 result = result._unstack_full_reindex(dim, fill_value, sparse)
             else:
                 result = result._unstack_once(dim, fill_value)
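The two ideas in this diff — check only non-index variables, and hoist the check out of the per-dimension loop — can be sketched with stand-in types. This is a simplified illustration, not the xarray implementation: `DuckArray`, `variables`, and `indexes` are hypothetical names, and the real check also consults `is_duck_dask_array` and `sparse_array_type`.

```python
import numpy as np

class DuckArray:
    """Stand-in for a non-ndarray duck array (dask, sparse, pint, ...)."""
    def __init__(self, data):
        self.data = np.asarray(data)

variables = {
    "x": np.arange(4),           # index variable -- never inspected below
    "temp": np.ones(4),          # plain ndarray: eligible for the fast path
    "mask": DuckArray([0, 1]),   # duck array: forces the full-reindex path
}
indexes = {"x"}

# Check only the non-index variables, so an index never has to be
# materialized into an object-dtype array just to run this type test.
nonindexes = [variables[k] for k in set(variables) - indexes]

# Decide once, before looping over the dimensions to unstack,
# instead of re-evaluating the predicate for every dimension.
needs_full_reindex = any(not isinstance(v, np.ndarray) for v in nonindexes)
assert needs_full_reindex  # "mask" is not a plain ndarray
```

Hoisting matters because the old code re-ran the `any(...)` scans over all variables once per unstacked dimension; the new code pays that cost once.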