
Make dask names change when chunking Variables by different amounts. #3584

Merged: dcherian merged 9 commits into pydata:master from chunk-unique-token on Jan 10, 2020

Conversation

@dcherian (Contributor) commented on Dec 1, 2019

When rechunking by the current chunk size, name should not change.
Add a __dask_tokenize__ method for ReprObject so that this behaviour is present
when DataArrays are converted to temporary Datasets and back.

  • Closes #3350 (assert_equal and dask)
  • Tests added
  • Passes black . && mypy . && flake8
  • Fully documented, including whats-new.rst for all changes and api.rst for new API
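
A minimal sketch of the behaviour this targets, using a toy Dataset (the variable, dimension, and chunk sizes are illustrative):

import dask.array
import xarray as xr

ds = xr.Dataset({'x': (('y',), dask.array.ones(10, chunks=(3,)))})

# Re-chunking to the current chunk size should keep the dask name...
assert ds.chunk({'y': 3}).x.data.name == ds.x.data.name
# ...while chunking by a different amount should produce a new name.
assert ds.chunk({'y': 5}).x.data.name != ds.x.data.name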

@dcherian mentioned this pull request on Dec 1, 2019
…chunk-unique-token

* 'chunk-unique-token' of github.com:dcherian/xarray:
  remove more computes.
@dcherian (Contributor, Author) commented on Dec 3, 2019

The tests fail on dask == 2.8.1 with this interesting bug. Here's a reproducible example.

import dask.array
import xarray as xr

ds = xr.Dataset({'x': (('y',), dask.array.ones(10, chunks=(3,)))})
mapped = ds.map_blocks(lambda x: x)
mapped.compute()  # works

xr.testing.assert_equal(mapped, ds)  # does not work
xr.testing.assert_equal(mapped, ds.compute()) # works
xr.testing.assert_equal(mapped.compute(), ds)  # works
xr.testing.assert_equal(mapped.compute(), ds.compute())  # works

The traceback is:

~/miniconda3/envs/dcpy/lib/python3.7/site-packages/dask/array/optimization.py in optimize(dsk, keys, fuse_keys, fast_functions, inline_functions_fast_functions, rename_fused_keys, **kwargs)
     41     if isinstance(dsk, HighLevelGraph):
     42         dsk = optimize_blockwise(dsk, keys=keys)
---> 43         dsk = fuse_roots(dsk, keys=keys)
     44 
     45     # Low level task optimizations

~/miniconda3/envs/dcpy/lib/python3.7/site-packages/dask/blockwise.py in fuse_roots(graph, keys)
    819             isinstance(layer, Blockwise)
    820             and len(deps) > 1
--> 821             and not any(dependencies[dep] for dep in deps)  # no need to fuse if 0 or 1
    822             and all(len(dependents[dep]) == 1 for dep in deps)
    823         ):

~/miniconda3/envs/dcpy/lib/python3.7/site-packages/dask/blockwise.py in <genexpr>(.0)
    819             isinstance(layer, Blockwise)
    820             and len(deps) > 1
--> 821             and not any(dependencies[dep] for dep in deps)  # no need to fuse if 0 or 1
    822             and all(len(dependents[dep]) == 1 for dep in deps)
    823         ):

KeyError: 'lambda-6720ab0e3639d5c63fc06dfc66a3ce47-x'

This key is not in dependencies. The relevant code, from https://github.com/dask/dask/blob/67fb5363009c583c175cb577776a4f2f4e811410/dask/blockwise.py#L816-L826:

    for name, layer in graph.layers.items():
        deps = graph.dependencies[name]
        if (
            isinstance(layer, Blockwise)
            and len(deps) > 1
            and not any(dependencies[dep] for dep in deps)  # no need to fuse if 0 or 1
            and all(len(dependents[dep]) == 1 for dep in deps)
        ):
            new = toolz.merge(layer, *[layers[dep] for dep in deps])
            new, _ = fuse(new, keys, ave_width=len(deps))

I'm not sure whether this is a bug in fuse_roots, in HighLevelGraph.from_collections, or in how map_blocks calls HighLevelGraph.from_collections here:

graph = HighLevelGraph.from_collections(gname, graph, dependencies=[dataset])
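
For reference, a stripped-down sketch of this from_collections pattern (the layer name and the doubling function are made up for illustration):

import dask.array as da
from dask.highlevelgraph import HighLevelGraph

x = da.ones(10, chunks=(3,))
name = 'double-example'
# One task per chunk of the input collection, keyed by the new layer name.
layer = {
    (name, i): (lambda block: 2 * block, key)
    for i, key in enumerate(x.__dask_keys__())
}
# Passing dependencies=[x] is what records the layer-to-layer edge in
# HighLevelGraph.dependencies, analogous to dependencies=[dataset] above.
graph = HighLevelGraph.from_collections(name, layer, dependencies=[x])
result = da.Array(graph, name, chunks=x.chunks, dtype=x.dtype)
print(result.compute())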

cc @mrocklin

@TomAugspurger (Contributor) commented

So this is enough to fix this in Dask:

diff --git a/dask/blockwise.py b/dask/blockwise.py
index 52a36c246..84e0ecc08 100644
--- a/dask/blockwise.py
+++ b/dask/blockwise.py
@@ -818,7 +818,7 @@ def fuse_roots(graph: HighLevelGraph, keys: list):
         if (
             isinstance(layer, Blockwise)
             and len(deps) > 1
-            and not any(dependencies[dep] for dep in deps)  # no need to fuse if 0 or 1
+            and not any(dependencies.get(dep, {}) for dep in deps)  # no need to fuse if 0 or 1
             and all(len(dependents[dep]) == 1 for dep in deps)
         ):
             new = toolz.merge(layer, *[layers[dep] for dep in deps])

I'm trying to understand why we're getting this KeyError though. I want to make sure that we have a valid HighLevelGraph before making that change.

@TomAugspurger (Contributor) commented

@mrocklin if you get a chance, can you confirm that the values in HighLevelGraph.dependencies should be a subset of the keys of layers?

So in the following, the lambda-<...>-x is problematic, because it's not a key in layers?

(Pdb) pp list(self.layers)
['eq-e98e52fb2b8e27b4b5158d399330c72d',
 'lambda-0f1d0bc5e7df462d7125839aed006e04',
 'ones-c4a83f4b990021618d55e0fa61a351d6']
(Pdb) pp self.dependencies
{'eq-e98e52fb2b8e27b4b5158d399330c72d': {'lambda-0f1d0bc5e7df462d7125839aed006e04-x',
                                         'ones-c4a83f4b990021618d55e0fa61a351d6'},
 'lambda-0f1d0bc5e7df462d7125839aed006e04': {'ones-c4a83f4b990021618d55e0fa61a351d6'},
 'ones-c4a83f4b990021618d55e0fa61a351d6': set()}

That's coming from the name of the DataArray / the dask array in DataArray.data.
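
A rough way to see this with the reproducer above, reflecting the graph map_blocks produced at the time of this report (names are illustrative):

hlg = mapped.__dask_graph__()
print(list(hlg.layers))    # layer names, e.g. 'lambda-...' with no variable suffix
print(mapped.x.data.name)  # 'lambda-...-x': recorded as a dependency but not itself
                           # a layer name, hence the KeyError in fuse_roots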

@mrocklin (Contributor) commented on Dec 5, 2019

> @mrocklin if you get a chance, can you confirm that the values in HighLevelGraph.dependencies should be a subset of the keys of layers?

That sounds like a reasonable expectation, but honestly it's been a while, so I don't fully trust my knowledge here. It might be worth adding some runtime checks into the HighLevelGraph constructor to see where this might be occurring.
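
For example, a hypothetical consistency check along those lines (validate_dependencies is not an existing Dask helper, just a sketch of what such a runtime check could assert):

from dask.highlevelgraph import HighLevelGraph

def validate_dependencies(hlg: HighLevelGraph) -> None:
    # Every dependency recorded for a layer should itself be a layer name.
    for name, deps in hlg.dependencies.items():
        missing = deps - set(hlg.layers)
        if missing:
            raise ValueError(
                f"Layer {name!r} depends on names that are not layers: {missing}"
            )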

TomAugspurger added a commit to TomAugspurger/xarray that referenced this pull request Dec 5, 2019
This fixes an issue with the HighLevelGraph noted in
pydata#3584, and exposed by a recent
change in Dask to do more HLG fusion.
dcherian pushed a commit that referenced this pull request Dec 7, 2019
* Fix map_blocks HLG layering

This fixes an issue with the HighLevelGraph noted in
#3584, and exposed by a recent
change in Dask to do more HLG fusion.

* update

* black

* update
* upstream/master:
  Fix map_blocks HLG layering (pydata#3598)
  Silence sphinx warnings: Round 2 (pydata#3592)
  2x~5x speed up for isel() in most cases (pydata#3533)
  remove xarray again (pydata#3591)
  fix plotting with transposed nondim coords. (pydata#3441)
  make coarsen reductions consistent with reductions on other classes (pydata#3500)
  Resolve the version issues on RTD (pydata#3589)
  Add bottleneck & rasterio git tip to upstream-dev CI (pydata#3585)
@dcherian requested review from crusaderky and shoyer on December 7, 2019 at 06:07
…oken

* 'master' of github.com:pydata/xarray:
  Add nanmedian for dask arrays (pydata#3604)
  added pyinterp to related projects (pydata#3655)
  Allow incomplete hypercubes in combine_by_coords (pydata#3649)
  concat keeps attrs from first variable. (pydata#3637)
  Extend DatetimeAccessor properties and support `.dt` accessor for Timedelta (pydata#3612)
  update readthedocs.yml (pydata#3639)
  silence sphinx warnings round 3 (pydata#3602)
  Fix/quantile wrong errmsg (pydata#3635)
  Provide shape info in shape mismatch error. (pydata#3619)
  Minor doc fixes (pydata#3615)
  Respect user-specified coordinates attribute. (pydata#3487)
  Add Facetgrid.row_labels & Facetgrid.col_labels (pydata#3597)
  Fix pint integration tests (pydata#3600)
  Minor fix to combine_by_coords to allow for the combination of CFTimeIndexes separated by large time intervals (pydata#3543)
@dcherian (Contributor, Author) commented on Jan 8, 2020

gentle ping @crusaderky

Review thread on xarray/core/utils.py (outdated, resolved)
Co-Authored-By: crusaderky <crusaderky@gmail.com>
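
For context, ReprObject lives in xarray/core/utils.py; the __dask_tokenize__ method described in the PR summary could look roughly like this (a sketch of the idea, not necessarily the exact code merged here):

from dask.base import normalize_token

class ReprObject:
    """Object that prints as the given value, for use as a sentinel."""

    def __init__(self, value: str):
        self._value = value

    def __repr__(self) -> str:
        return self._value

    def __dask_tokenize__(self):
        # Tokenize on the type and the wrapped value so that equal ReprObjects
        # hash to the same dask token (and hence produce stable dask names).
        return normalize_token((type(self), self._value))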
@dcherian merged commit 24f9292 into pydata:master on Jan 10, 2020
@dcherian (Contributor, Author) commented

Thanks @crusaderky

@dcherian deleted the chunk-unique-token branch on January 10, 2020 at 16:11
dcherian added a commit to dcherian/xarray that referenced this pull request Jan 14, 2020
* upstream/master:
  allow passing any iterable to drop when dropping variables (pydata#3693)
  Typo on DataSet/DataArray.to_dict documentation (pydata#3692)
  Fix mypy type checking tests failure in ds.merge (pydata#3690)
  Explicitly convert result of pd.to_datetime to a timezone-naive type (pydata#3688)
  ds.merge(da) bugfix (pydata#3677)
  fix docstring for combine_first: returns a Dataset (pydata#3683)
  Add option to choose mfdataset attributes source. (pydata#3498)
  How do I add a new variable to dataset. (pydata#3679)
  Add map_blocks example to whats-new (pydata#3682)
  Make dask names change when chunking Variables by different amounts. (pydata#3584)
  raise an error when renaming dimensions to existing names (pydata#3645)
  Support swap_dims to dimension names that are not existing variables (pydata#3636)
  Add map_blocks example to docs. (pydata#3667)
  add multiindex level name checking to .rename() (pydata#3658)
dcherian added a commit to dcherian/xarray that referenced this pull request Jan 15, 2020
* upstream/master:
  Add an example notebook using apply_ufunc to vectorize 1D functions (pydata#3629)
  Use encoding['dtype'] over data.dtype when possible within CFMaskCoder.encode (pydata#3652)
  allow passing any iterable to drop when dropping variables (pydata#3693)
  Typo on DataSet/DataArray.to_dict documentation (pydata#3692)
  Fix mypy type checking tests failure in ds.merge (pydata#3690)
  Explicitly convert result of pd.to_datetime to a timezone-naive type (pydata#3688)
  ds.merge(da) bugfix (pydata#3677)
  fix docstring for combine_first: returns a Dataset (pydata#3683)
  Add option to choose mfdataset attributes source. (pydata#3498)
  How do I add a new variable to dataset. (pydata#3679)
  Add map_blocks example to whats-new (pydata#3682)
  Make dask names change when chunking Variables by different amounts. (pydata#3584)
  raise an error when renaming dimensions to existing names (pydata#3645)
  Support swap_dims to dimension names that are not existing variables (pydata#3636)
  Add map_blocks example to docs. (pydata#3667)
  add multiindex level name checking to .rename() (pydata#3658)
dcherian added a commit to dcherian/xarray that referenced this pull request Jan 21, 2020
* upstream/master: (23 commits)
  Feature/align in dot (pydata#3699)
  ENH: enable `H5NetCDFStore` to work with already open h5netcdf.File a… (pydata#3618)
  One-off isort run (pydata#3705)
  hardcoded xarray.__all__ (pydata#3703)
  Bump mypy to v0.761 (pydata#3704)
  remove DataArray and Dataset constructor deprecations for 0.15  (pydata#3560)
  Tests for variables with units (pydata#3654)
  Add an example notebook using apply_ufunc to vectorize 1D functions (pydata#3629)
  Use encoding['dtype'] over data.dtype when possible within CFMaskCoder.encode (pydata#3652)
  allow passing any iterable to drop when dropping variables (pydata#3693)
  Typo on DataSet/DataArray.to_dict documentation (pydata#3692)
  Fix mypy type checking tests failure in ds.merge (pydata#3690)
  Explicitly convert result of pd.to_datetime to a timezone-naive type (pydata#3688)
  ds.merge(da) bugfix (pydata#3677)
  fix docstring for combine_first: returns a Dataset (pydata#3683)
  Add option to choose mfdataset attributes source. (pydata#3498)
  How do I add a new variable to dataset. (pydata#3679)
  Add map_blocks example to whats-new (pydata#3682)
  Make dask names change when chunking Variables by different amounts. (pydata#3584)
  raise an error when renaming dimensions to existing names (pydata#3645)
  ...