
Fast-track unstack doesn't work with dask #5346

Closed
aulemahal opened this issue May 19, 2021 · 6 comments

Comments

@aulemahal
Contributor

What happened:
Using unstack on data with the dask backend fails with a dask error.

What you expected to happen:
No failure, as with xarray 0.18.0 and earlier.

Minimal Complete Verifiable Example:

import pandas as pd
import xarray as xr

da = xr.DataArray([1] * 4, dims=('x',), coords={'x': [1, 2, 3, 4]})
dac = da.chunk()

ind = pd.MultiIndex.from_arrays(([0, 0, 1, 1], [0, 1, 0, 1]), names=("y", "z"))
dac.assign_coords(x=ind).unstack("x")

Fails with:

---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-4-3c317738ec05> in <module>
      3 
      4 ind = pd.MultiIndex.from_arrays(([0, 0, 1, 1], [0, 1, 0, 1]), names=("y", "z"))
----> 5 dac.assign_coords(x=ind).unstack("x")

~/Python/myxarray/xarray/core/dataarray.py in unstack(self, dim, fill_value, sparse)
   2133         DataArray.stack
   2134         """
-> 2135         ds = self._to_temp_dataset().unstack(dim, fill_value, sparse)
   2136         return self._from_temp_dataset(ds)
   2137 

~/Python/myxarray/xarray/core/dataset.py in unstack(self, dim, fill_value, sparse)
   4038             ):
   4039                 # Fast unstacking path:
-> 4040                 result = result._unstack_once(dim, fill_value)
   4041             else:
   4042                 # Slower unstacking path, examples of array types that

~/Python/myxarray/xarray/core/dataset.py in _unstack_once(self, dim, fill_value)
   3914                         fill_value_ = fill_value
   3915 
-> 3916                     variables[name] = var._unstack_once(
   3917                         index=index, dim=dim, fill_value=fill_value_
   3918                     )

~/Python/myxarray/xarray/core/variable.py in _unstack_once(self, index, dim, fill_value)
   1605         # sparse doesn't support item assigment,
   1606         # https://github.com/pydata/sparse/issues/114
-> 1607         data[(..., *indexer)] = reordered
   1608 
   1609         return self._replace(dims=new_dims, data=data)

~/.conda/envs/xxx/lib/python3.8/site-packages/dask/array/core.py in __setitem__(self, key, value)
   1693 
   1694         out = "setitem-" + tokenize(self, key, value)
-> 1695         dsk = setitem_array(out, self, key, value)
   1696 
   1697         graph = HighLevelGraph.from_collections(out, dsk, dependencies=[self])

~/.conda/envs/xxx/lib/python3.8/site-packages/dask/array/slicing.py in setitem_array(out_name, array, indices, value)
   1787 
   1788     # Reformat input indices
-> 1789     indices, indices_shape, reverse = parse_assignment_indices(indices, array_shape)
   1790 
   1791     # Empty slices can only be assigned size 1 values

~/.conda/envs/xxx/lib/python3.8/site-packages/dask/array/slicing.py in parse_assignment_indices(indices, shape)
   1476             n_lists += 1
   1477             if n_lists > 1:
-> 1478                 raise NotImplementedError(
   1479                     "dask is currently limited to at most one "
   1480                     "dimension's assignment index being a "

NotImplementedError: dask is currently limited to at most one dimension's assignment index being a 1-d array of integers or booleans. Got: (Ellipsis, array([0, 0, 1, 1], dtype=int8), array([0, 1, 0, 1], dtype=int8))

The example works when I go back to xarray 0.18.0.
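For context, the fast path boils down to a multi-array fancy assignment (`data[(..., *indexer)] = reordered` in `variable.py` above). A minimal NumPy sketch of that kind of assignment, which NumPy supports but dask (at this version) did not when more than one index is an integer array:

```python
import numpy as np

# Sketch of the fast unstack path's assignment: place the reordered 1-d
# values into the unstacked 2-d array via two integer-array indices.
data = np.full((2, 2), -1)
ys = np.array([0, 0, 1, 1])  # first level of the MultiIndex
zs = np.array([0, 1, 0, 1])  # second level of the MultiIndex
data[(..., ys, zs)] = [10, 20, 30, 40]
# data is now [[10, 20], [30, 40]]
```

With a dask array in place of `data`, this exact assignment raises the `NotImplementedError` shown in the traceback, because dask's `__setitem__` allowed at most one 1-d integer-array index at the time.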

Anything else we need to know?:
I saw no tests in "test_dataarray.py" and "test_dataset.py" for unstack+dask, but they might be elsewhere?
If #5315 was successful, maybe there is something specific in my example and config that is causing the error? @max-sixty @Illviljan

Proposed test, for "test_dataset.py", adapted copy of test_unstack:

    @requires_dask
    def test_unstack_dask(self):
        index = pd.MultiIndex.from_product([[0, 1], ["a", "b"]], names=["x", "y"])
        ds = Dataset({"b": ("z", [0, 1, 2, 3]), "z": index}).chunk()
        expected = Dataset(
            {"b": (("x", "y"), [[0, 1], [2, 3]]), "x": [0, 1], "y": ["a", "b"]}
        )
        for dim in ["z", ["z"], None]:
            actual = ds.unstack(dim).load()
            assert_identical(actual, expected)

Environment:

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.8.8 | packaged by conda-forge | (default, Feb 20 2021, 16:22:27)
[GCC 9.3.0]
python-bits: 64
OS: Linux
OS-release: 5.11.16-arch1-1
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: fr_CA.utf8
LOCALE: ('fr_CA', 'UTF-8')
libhdf5: 1.10.6
libnetcdf: 4.7.4

xarray: 0.18.2.dev2+g6d2a7301
pandas: 1.2.4
numpy: 1.20.2
scipy: 1.6.3
netCDF4: 1.5.6
pydap: installed
h5netcdf: 0.11.0
h5py: 3.2.1
Nio: None
zarr: 2.8.1
cftime: 1.4.1
nc_time_axis: 1.2.0
PseudoNetCDF: installed
rasterio: 1.2.2
cfgrib: 0.9.9.0
iris: 2.4.0
bottleneck: 1.3.2
dask: 2021.05.0
distributed: 2021.05.0
matplotlib: 3.4.1
cartopy: 0.19.0
seaborn: 0.11.1
numbagg: installed
pint: 0.17
setuptools: 49.6.0.post20210108
pip: 21.1
conda: None
pytest: 6.2.3
IPython: 7.22.0
sphinx: 3.5.4

@max-sixty
Collaborator

That's not good.

Great spot @aulemahal. We may well not have tests for multiple dimensions.

We could revert this commit and release a patch release. Any thoughts @Illviljan ?

@Illviljan
Contributor

Yeah, reverting it for now sounds good. Unless we can easily add a single dimension check?
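One possible shape for that "single dimension check" (a hypothetical sketch, not xarray code: the function name and the tuple-of-indices signature are assumptions): only take the fast path when at most one entry of the indexer is an integer array, since dask's `__setitem__` supports just one fancy index.

```python
import numpy as np

def can_fast_unstack(indexer):
    """Return True if at most one index in the tuple is an integer array,
    i.e. the assignment is within dask's current __setitem__ limits."""
    n_arrays = sum(1 for idx in indexer if isinstance(idx, np.ndarray))
    return n_arrays <= 1
```

A single-level unstack like `(np.array([0, 1]),)` would pass the check, while the two-level indexer from the MWE, `(np.array([0, 0, 1, 1]), np.array([0, 1, 0, 1]))`, would fall back to the slow path.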

@aulemahal
Contributor Author

I'm not sure I understand the "dimensions" thing; isn't the MWE 1D? What is the case where it should work?

@Illviljan
Contributor

Dask can do this nowadays:

import dask.array as da

x = da.array([1, 2, 3, 5])
x[(Ellipsis, [1, 2])] = 6, 7
x[(Ellipsis, [0, 3])] = 9, 8
x.compute()
Out[51]: array([9, 6, 7, 8])

But I suppose I was a little too eager to get this working and it felt like I got caught in plenty of tests.

@max-sixty
Collaborator

But I suppose I was a little too eager to get this working and it felt like I got caught in plenty of tests.

TBC, no great stress — both of us should have checked whether we had tests (and the project should have had them!).

@max-sixty max-sixty mentioned this issue May 19, 2021
@max-sixty
Collaborator

The new version is released and works with the original example. Thanks a lot @aulemahal for the issue.
