
Make dask names change when chunking Variables by different amounts. #3584

Merged: dcherian merged 9 commits into pydata:master from chunk-unique-token on Jan 10, 2020

Conversation

@dcherian (Contributor) commented on Dec 1, 2019

When rechunking by the current chunk size, name should not change.
Add a __dask_tokenize__ method for ReprObject so that this behaviour is present
when DataArrays are converted to temporary Datasets and back.

  • Closes #3350 (assert_equal and dask)
  • Tests added
  • Passes black . && mypy . && flake8
  • Fully documented, including whats-new.rst for all changes and api.rst for new API
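
A minimal sketch of the behaviour this targets, using a toy Dataset (the variable, dimension, and chunk sizes are illustrative):

import dask.array
import xarray as xr

ds = xr.Dataset({'x': (('y',), dask.array.ones(10, chunks=(3,)))})

# Re-chunking to the current chunk size should keep the dask name...
assert ds.chunk({'y': 3}).x.data.name == ds.x.data.name
# ...while chunking by a different amount should produce a new name.
assert ds.chunk({'y': 5}).x.data.name != ds.x.data.name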

@dcherian mentioned this pull request on Dec 1, 2019
…chunk-unique-token

* 'chunk-unique-token' of github.com:dcherian/xarray:
  remove more computes.
@dcherian (Contributor, Author) commented on Dec 3, 2019

The tests fail on dask == 2.8.1 with this interesting bug. Here's a reproducible example.

import dask.array
import xarray as xr

ds = xr.Dataset({'x': (('y',), dask.array.ones(10, chunks=(3,)))})
mapped = ds.map_blocks(lambda x: x)
mapped.compute()  # works

xr.testing.assert_equal(mapped, ds)  # does not work
xr.testing.assert_equal(mapped, ds.compute()) # works
xr.testing.assert_equal(mapped.compute(), ds)  # works
xr.testing.assert_equal(mapped.compute(), ds.compute())  # works

The traceback is:

~/miniconda3/envs/dcpy/lib/python3.7/site-packages/dask/array/optimization.py in optimize(dsk, keys, fuse_keys, fast_functions, inline_functions_fast_functions, rename_fused_keys, **kwargs)
     41     if isinstance(dsk, HighLevelGraph):
     42         dsk = optimize_blockwise(dsk, keys=keys)
---> 43         dsk = fuse_roots(dsk, keys=keys)
     44 
     45     # Low level task optimizations

~/miniconda3/envs/dcpy/lib/python3.7/site-packages/dask/blockwise.py in fuse_roots(graph, keys)
    819             isinstance(layer, Blockwise)
    820             and len(deps) > 1
--> 821             and not any(dependencies[dep] for dep in deps)  # no need to fuse if 0 or 1
    822             and all(len(dependents[dep]) == 1 for dep in deps)
    823         ):

~/miniconda3/envs/dcpy/lib/python3.7/site-packages/dask/blockwise.py in <genexpr>(.0)
    819             isinstance(layer, Blockwise)
    820             and len(deps) > 1
--> 821             and not any(dependencies[dep] for dep in deps)  # no need to fuse if 0 or 1
    822             and all(len(dependents[dep]) == 1 for dep in deps)
    823         ):

KeyError: 'lambda-6720ab0e3639d5c63fc06dfc66a3ce47-x'

This key is not in dependencies. The relevant code, from https://github.com/dask/dask/blob/67fb5363009c583c175cb577776a4f2f4e811410/dask/blockwise.py#L816-L826:

    for name, layer in graph.layers.items():
        deps = graph.dependencies[name]
        if (
            isinstance(layer, Blockwise)
            and len(deps) > 1
            and not any(dependencies[dep] for dep in deps)  # no need to fuse if 0 or 1
            and all(len(dependents[dep]) == 1 for dep in deps)
        ):
            new = toolz.merge(layer, *[layers[dep] for dep in deps])
            new, _ = fuse(new, keys, ave_width=len(deps))

I'm not sure whether this is a bug in fuse_roots, in HighLevelGraph.from_collections, or in how map_blocks calls HighLevelGraph.from_collections here:

graph = HighLevelGraph.from_collections(gname, graph, dependencies=[dataset])
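
For reference, a stripped-down sketch of this from_collections pattern (the layer name and the doubling function are made up for illustration):

import dask.array as da
from dask.highlevelgraph import HighLevelGraph

x = da.ones(10, chunks=(3,))
name = 'double-example'
# One task per chunk of the input collection, keyed by the new layer name.
layer = {
    (name, i): (lambda block: 2 * block, key)
    for i, key in enumerate(x.__dask_keys__())
}
# Passing dependencies=[x] is what records the layer-to-layer edge in
# HighLevelGraph.dependencies, analogous to dependencies=[dataset] above.
graph = HighLevelGraph.from_collections(name, layer, dependencies=[x])
result = da.Array(graph, name, chunks=x.chunks, dtype=x.dtype)
print(result.compute())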

cc @mrocklin

@TomAugspurger (Contributor) commented

So this is enough to fix this in Dask:

diff --git a/dask/blockwise.py b/dask/blockwise.py
index 52a36c246..84e0ecc08 100644
--- a/dask/blockwise.py
+++ b/dask/blockwise.py
@@ -818,7 +818,7 @@ def fuse_roots(graph: HighLevelGraph, keys: list):
         if (
             isinstance(layer, Blockwise)
             and len(deps) > 1
-            and not any(dependencies[dep] for dep in deps)  # no need to fuse if 0 or 1
+            and not any(dependencies.get(dep, {}) for dep in deps)  # no need to fuse if 0 or 1
             and all(len(dependents[dep]) == 1 for dep in deps)
         ):
             new = toolz.merge(layer, *[layers[dep] for dep in deps])

I'm trying to understand why we're getting this KeyError though. I want to make sure that we have a valid HighLevelGraph before making that change.

@TomAugspurger (Contributor) commented

@mrocklin if you get a chance, can you confirm that the values in HighLevelGraph.dependencies should be a subset of the keys of layers?

So in the following, the lambda-<...>-x is problematic, because it's not a key in layers?

(Pdb) pp list(self.layers)
['eq-e98e52fb2b8e27b4b5158d399330c72d',
 'lambda-0f1d0bc5e7df462d7125839aed006e04',
 'ones-c4a83f4b990021618d55e0fa61a351d6']
(Pdb) pp self.dependencies
{'eq-e98e52fb2b8e27b4b5158d399330c72d': {'lambda-0f1d0bc5e7df462d7125839aed006e04-x',
                                         'ones-c4a83f4b990021618d55e0fa61a351d6'},
 'lambda-0f1d0bc5e7df462d7125839aed006e04': {'ones-c4a83f4b990021618d55e0fa61a351d6'},
 'ones-c4a83f4b990021618d55e0fa61a351d6': set()}

That's coming from the name of the DataArray / the dask array in DataArray.data.
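
A rough way to see this with the reproducer above, reflecting the graph map_blocks produced at the time of this report (names are illustrative):

hlg = mapped.__dask_graph__()
print(list(hlg.layers))    # layer names, e.g. 'lambda-...' with no variable suffix
print(mapped.x.data.name)  # 'lambda-...-x': recorded as a dependency but not itself
                           # a layer name, hence the KeyError in fuse_roots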

@mrocklin (Contributor) commented on Dec 5, 2019

> @mrocklin if you get a chance, can you confirm that the values in HighLevelGraph.dependencies should be a subset of the keys of layers?

That sounds like a reasonable expectation, but honestly it's been a while, so I don't fully trust my knowledge here. It might be worth adding some runtime checks into the HighLevelGraph constructor to see where this might be occurring.
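
For example, a hypothetical consistency check along those lines (validate_dependencies is not an existing Dask helper, just a sketch of what such a runtime check could assert):

from dask.highlevelgraph import HighLevelGraph

def validate_dependencies(hlg: HighLevelGraph) -> None:
    # Every dependency recorded for a layer should itself be a layer name.
    for name, deps in hlg.dependencies.items():
        missing = deps - set(hlg.layers)
        if missing:
            raise ValueError(
                f"Layer {name!r} depends on names that are not layers: {missing}"
            )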

TomAugspurger added a commit to TomAugspurger/xarray that referenced this pull request Dec 5, 2019
This fixes an issue with the HighLevelGraph noted in
pydata#3584, and exposed by a recent
change in Dask to do more HLG fusion.
dcherian pushed a commit that referenced this pull request Dec 7, 2019
* Fix map_blocks HLG layering

This fixes an issue with the HighLevelGraph noted in
#3584, and exposed by a recent
change in Dask to do more HLG fusion.

* update

* black

* update
* upstream/master:
  Fix map_blocks HLG layering (pydata#3598)
  Silence sphinx warnings: Round 2 (pydata#3592)
  2x~5x speed up for isel() in most cases (pydata#3533)
  remove xarray again (pydata#3591)
  fix plotting with transposed nondim coords. (pydata#3441)
  make coarsen reductions consistent with reductions on other classes (pydata#3500)
  Resolve the version issues on RTD (pydata#3589)
  Add bottleneck & rasterio git tip to upstream-dev CI (pydata#3585)
@dcherian requested review from crusaderky and shoyer on December 7, 2019 at 06:07
…oken

* 'master' of github.com:pydata/xarray:
  Add nanmedian for dask arrays (pydata#3604)
  added pyinterp to related projects (pydata#3655)
  Allow incomplete hypercubes in combine_by_coords (pydata#3649)
  concat keeps attrs from first variable. (pydata#3637)
  Extend DatetimeAccessor properties and support `.dt` accessor for Timedelta (pydata#3612)
  update readthedocs.yml (pydata#3639)
  silence sphinx warnings round 3 (pydata#3602)
  Fix/quantile wrong errmsg (pydata#3635)
  Provide shape info in shape mismatch error. (pydata#3619)
  Minor doc fixes (pydata#3615)
  Respect user-specified coordinates attribute. (pydata#3487)
  Add Facetgrid.row_labels & Facetgrid.col_labels (pydata#3597)
  Fix pint integration tests (pydata#3600)
  Minor fix to combine_by_coords to allow for the combination of CFTimeIndexes separated by large time intervals (pydata#3543)
@dcherian (Contributor, Author) commented on Jan 8, 2020

gentle ping @crusaderky

Review thread on xarray/core/utils.py (outdated, resolved)
Co-Authored-By: crusaderky <crusaderky@gmail.com>
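
For context, ReprObject lives in xarray/core/utils.py; the __dask_tokenize__ method described in the PR summary could look roughly like this (a sketch of the idea, not necessarily the exact code merged here):

from dask.base import normalize_token

class ReprObject:
    """Object that prints as the given value, for use as a sentinel."""

    def __init__(self, value: str):
        self._value = value

    def __repr__(self) -> str:
        return self._value

    def __dask_tokenize__(self):
        # Tokenize on the type and the wrapped value so that equal ReprObjects
        # hash to the same dask token (and hence produce stable dask names).
        return normalize_token((type(self), self._value))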
@dcherian merged commit 24f9292 into pydata:master on Jan 10, 2020
@dcherian (Contributor, Author) commented

Thanks @crusaderky

@dcherian deleted the chunk-unique-token branch on January 10, 2020 at 16:11
dcherian added a commit to dcherian/xarray that referenced this pull request Jan 14, 2020
* upstream/master:
  allow passing any iterable to drop when dropping variables (pydata#3693)
  Typo on DataSet/DataArray.to_dict documentation (pydata#3692)
  Fix mypy type checking tests failure in ds.merge (pydata#3690)
  Explicitly convert result of pd.to_datetime to a timezone-naive type (pydata#3688)
  ds.merge(da) bugfix (pydata#3677)
  fix docstring for combine_first: returns a Dataset (pydata#3683)
  Add option to choose mfdataset attributes source. (pydata#3498)
  How do I add a new variable to dataset. (pydata#3679)
  Add map_blocks example to whats-new (pydata#3682)
  Make dask names change when chunking Variables by different amounts. (pydata#3584)
  raise an error when renaming dimensions to existing names (pydata#3645)
  Support swap_dims to dimension names that are not existing variables (pydata#3636)
  Add map_blocks example to docs. (pydata#3667)
  add multiindex level name checking to .rename() (pydata#3658)
dcherian added a commit to dcherian/xarray that referenced this pull request Jan 15, 2020
* upstream/master:
  Add an example notebook using apply_ufunc to vectorize 1D functions (pydata#3629)
  Use encoding['dtype'] over data.dtype when possible within CFMaskCoder.encode (pydata#3652)
  allow passing any iterable to drop when dropping variables (pydata#3693)
  Typo on DataSet/DataArray.to_dict documentation (pydata#3692)
  Fix mypy type checking tests failure in ds.merge (pydata#3690)
  Explicitly convert result of pd.to_datetime to a timezone-naive type (pydata#3688)
  ds.merge(da) bugfix (pydata#3677)
  fix docstring for combine_first: returns a Dataset (pydata#3683)
  Add option to choose mfdataset attributes source. (pydata#3498)
  How do I add a new variable to dataset. (pydata#3679)
  Add map_blocks example to whats-new (pydata#3682)
  Make dask names change when chunking Variables by different amounts. (pydata#3584)
  raise an error when renaming dimensions to existing names (pydata#3645)
  Support swap_dims to dimension names that are not existing variables (pydata#3636)
  Add map_blocks example to docs. (pydata#3667)
  add multiindex level name checking to .rename() (pydata#3658)
dcherian added a commit to dcherian/xarray that referenced this pull request Jan 21, 2020
* upstream/master: (23 commits)
  Feature/align in dot (pydata#3699)
  ENH: enable `H5NetCDFStore` to work with already open h5netcdf.File a… (pydata#3618)
  One-off isort run (pydata#3705)
  hardcoded xarray.__all__ (pydata#3703)
  Bump mypy to v0.761 (pydata#3704)
  remove DataArray and Dataset constructor deprecations for 0.15  (pydata#3560)
  Tests for variables with units (pydata#3654)
  Add an example notebook using apply_ufunc to vectorize 1D functions (pydata#3629)
  Use encoding['dtype'] over data.dtype when possible within CFMaskCoder.encode (pydata#3652)
  allow passing any iterable to drop when dropping variables (pydata#3693)
  Typo on DataSet/DataArray.to_dict documentation (pydata#3692)
  Fix mypy type checking tests failure in ds.merge (pydata#3690)
  Explicitly convert result of pd.to_datetime to a timezone-naive type (pydata#3688)
  ds.merge(da) bugfix (pydata#3677)
  fix docstring for combine_first: returns a Dataset (pydata#3683)
  Add option to choose mfdataset attributes source. (pydata#3498)
  How do I add a new variable to dataset. (pydata#3679)
  Add map_blocks example to whats-new (pydata#3682)
  Make dask names change when chunking Variables by different amounts. (pydata#3584)
  raise an error when renaming dimensions to existing names (pydata#3645)
  ...