Make dask names change when chunking Variables by different amounts. #3584
When rechunking by the current chunk size, name should not change. Add a __dask_tokenize__ method for ReprObject so that this behaviour is present when DataArrays are converted to temporary Datasets and back.
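The intended naming behaviour can be sketched without dask. Everything below is an assumption made for illustration: `Sentinel`, `stable_token` and `chunked_name` are hypothetical names, not xarray or dask APIs, and `hashlib` stands in for dask's tokenizer. The sketch only shows that a name derived deterministically from the source name and the requested chunks stays the same for equal chunks and changes for different ones, and that a `__dask_tokenize__` method built from an object's value gives equal tokens to equal sentinels.

```python
import hashlib


class Sentinel:
    """Hypothetical stand-in for a ReprObject-like sentinel."""

    def __init__(self, value: str):
        self._value = value

    def __repr__(self) -> str:
        return self._value

    def __dask_tokenize__(self):
        # Stable token built from the type name and value, not from id(self).
        return (type(self).__name__, self._value)


def stable_token(*parts) -> str:
    """Hash the parts deterministically; a stand-in for dask's tokenizer."""
    normalized = [
        p.__dask_tokenize__() if hasattr(p, "__dask_tokenize__") else p
        for p in parts
    ]
    return hashlib.sha256(repr(normalized).encode()).hexdigest()


def chunked_name(source_name: str, chunks: tuple, prefix: str = "xarray-") -> str:
    """Derive a graph name from the source name and the requested chunks."""
    return prefix + stable_token(source_name, chunks)


same_a = chunked_name("ones-abc", (3,))
same_b = chunked_name("ones-abc", (3,))
different = chunked_name("ones-abc", (5,))
```

Here `same_a` and `same_b` are equal, while `different` is not equal to either, which is the property the title describes.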
…chunk-unique-token
* 'chunk-unique-token' of github.com:dcherian/xarray:
  remove more computes.
The tests fail on
    import dask.array
    import xarray as xr

    ds = xr.Dataset({'x': (('y',), dask.array.ones(10, chunks=(3,)))})
    mapped = ds.map_blocks(lambda x: x)
    mapped.compute()  # works
    xr.testing.assert_equal(mapped, ds)  # does not work
    xr.testing.assert_equal(mapped, ds.compute())  # works
    xr.testing.assert_equal(mapped.compute(), ds)  # works
    xr.testing.assert_equal(mapped.compute(), ds.compute())  # works
The traceback is
This key is not in
    for name, layer in graph.layers.items():
        deps = graph.dependencies[name]
        if (
            isinstance(layer, Blockwise)
            and len(deps) > 1
            and not any(dependencies[dep] for dep in deps)  # no need to fuse if 0 or 1
            and all(len(dependents[dep]) == 1 for dep in deps)
        ):
            new = toolz.merge(layer, *[layers[dep] for dep in deps])
            new, _ = fuse(new, keys, ave_width=len(deps))
I'm not sure whether this is a bug in
xarray/xarray/core/parallel.py Line 315 in 69c85b8
cc @mrocklin
So this is enough to fix this in Dask:
    diff --git a/dask/blockwise.py b/dask/blockwise.py
    index 52a36c246..84e0ecc08 100644
    --- a/dask/blockwise.py
    +++ b/dask/blockwise.py
    @@ -818,7 +818,7 @@ def fuse_roots(graph: HighLevelGraph, keys: list):
             if (
                 isinstance(layer, Blockwise)
                 and len(deps) > 1
    -            and not any(dependencies[dep] for dep in deps)  # no need to fuse if 0 or 1
    +            and not any(dependencies.get(dep, {}) for dep in deps)  # no need to fuse if 0 or 1
                 and all(len(dependents[dep]) == 1 for dep in deps)
             ):
                 new = toolz.merge(layer, *[layers[dep] for dep in deps])
I'm trying to understand why we're getting this KeyError though. I want to make sure that we have a valid HighLevelGraph before making that change.
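To see why the lookup raises and what the one-line patch changes, here is a small sketch that uses plain dicts in place of a HighLevelGraph. The layer names are shortened placeholders and `has_upstream` is a hypothetical helper, not a dask function. Note that the `.get` fallback avoids the KeyError but does not by itself make the graph consistent.

```python
# Placeholder layer names. "lambda-x" is referenced as a dependency
# but has no entry of its own, as in the failing graph.
dependencies = {
    "eq": {"lambda-x", "ones"},
    "lambda": {"ones"},
    "ones": set(),
}


def has_upstream(dependencies: dict, deps, strict: bool) -> bool:
    """Return True if any name in deps has dependencies of its own."""
    if strict:
        # Original form: raises KeyError for a name missing from the mapping.
        return any(dependencies[dep] for dep in deps)
    # Patched form: a missing name is treated as having no dependencies.
    return any(dependencies.get(dep, {}) for dep in deps)


deps = sorted(dependencies["eq"])  # sorted only to make the order deterministic

try:
    has_upstream(dependencies, deps, strict=True)
    strict_failed = False
except KeyError:
    strict_failed = True

lenient_result = has_upstream(dependencies, deps, strict=False)
```

With this input the strict lookup fails on `"lambda-x"`, while the patched lookup returns `False` and lets the fusion condition proceed.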
@mrocklin if you get a chance, can you confirm that the values in
So in the following, the
    (Pdb) pp list(self.layers)
    ['eq-e98e52fb2b8e27b4b5158d399330c72d',
     'lambda-0f1d0bc5e7df462d7125839aed006e04',
     'ones-c4a83f4b990021618d55e0fa61a351d6']
    (Pdb) pp self.dependencies
    {'eq-e98e52fb2b8e27b4b5158d399330c72d': {'lambda-0f1d0bc5e7df462d7125839aed006e04-x',
                                             'ones-c4a83f4b990021618d55e0fa61a351d6'},
     'lambda-0f1d0bc5e7df462d7125839aed006e04': {'ones-c4a83f4b990021618d55e0fa61a351d6'},
     'ones-c4a83f4b990021618d55e0fa61a351d6': set()}
That's coming from the
That sounds like a reasonable expectation, but honestly it's been a while, so I don't fully trust my knowledge here. It might be worth adding some runtime checks into the
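One form such a runtime check could take is sketched below. `find_dangling_dependencies` is a hypothetical helper, not part of dask, and it assumes the invariant under discussion: every name that appears as a dependency must also be a layer key. It operates on plain mappings and uses the layer names from the pdb output above.

```python
def find_dangling_dependencies(layers, dependencies):
    """Map each layer name to the dependency names that are not layer keys."""
    known = set(layers)
    dangling = {}
    for name, deps in dependencies.items():
        missing = set(deps) - known
        if missing:
            dangling[name] = missing
    return dangling


layers = [
    "eq-e98e52fb2b8e27b4b5158d399330c72d",
    "lambda-0f1d0bc5e7df462d7125839aed006e04",
    "ones-c4a83f4b990021618d55e0fa61a351d6",
]
dependencies = {
    "eq-e98e52fb2b8e27b4b5158d399330c72d": {
        "lambda-0f1d0bc5e7df462d7125839aed006e04-x",
        "ones-c4a83f4b990021618d55e0fa61a351d6",
    },
    "lambda-0f1d0bc5e7df462d7125839aed006e04": {
        "ones-c4a83f4b990021618d55e0fa61a351d6"
    },
    "ones-c4a83f4b990021618d55e0fa61a351d6": set(),
}

problems = find_dangling_dependencies(layers, dependencies)
```

For this graph, `problems` has a single entry: the `eq-…` layer depends on `lambda-…-x`, which is a variable-suffixed name rather than a layer key.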
This fixes an issue with the HighLevelGraph noted in pydata#3584, and exposed by a recent change in Dask to do more HLG fusion.
* Fix map_blocks HLG layering
  This fixes an issue with the HighLevelGraph noted in #3584, and exposed by a recent change in Dask to do more HLG fusion.
* update
* black
* update
* upstream/master:
  Fix map_blocks HLG layering (pydata#3598)
  Silence sphinx warnings: Round 2 (pydata#3592)
  2x~5x speed up for isel() in most cases (pydata#3533)
  remove xarray again (pydata#3591)
  fix plotting with transposed nondim coords. (pydata#3441)
  make coarsen reductions consistent with reductions on other classes (pydata#3500)
  Resolve the version issues on RTD (pydata#3589)
  Add bottleneck & rasterio git tip to upstream-dev CI (pydata#3585)
…oken
* 'master' of github.com:pydata/xarray:
  Add nanmedian for dask arrays (pydata#3604)
  added pyinterp to related projects (pydata#3655)
  Allow incomplete hypercubes in combine_by_coords (pydata#3649)
  concat keeps attrs from first variable. (pydata#3637)
  Extend DatetimeAccessor properties and support `.dt` accessor for Timedelta (pydata#3612)
  update readthedocs.yml (pydata#3639)
  silence sphinx warnings round 3 (pydata#3602)
  Fix/quantile wrong errmsg (pydata#3635)
  Provide shape info in shape mismatch error. (pydata#3619)
  Minor doc fixes (pydata#3615)
  Respect user-specified coordinates attribute. (pydata#3487)
  Add Facetgrid.row_labels & Facetgrid.col_labels (pydata#3597)
  Fix pint integration tests (pydata#3600)
  Minor fix to combine_by_coords to allow for the combination of CFTimeIndexes separated by large time intervals (pydata#3543)
gentle ping @crusaderky
Co-Authored-By: crusaderky <crusaderky@gmail.com>
Thanks @crusaderky
* upstream/master:
  allow passing any iterable to drop when dropping variables (pydata#3693)
  Typo on DataSet/DataArray.to_dict documentation (pydata#3692)
  Fix mypy type checking tests failure in ds.merge (pydata#3690)
  Explicitly convert result of pd.to_datetime to a timezone-naive type (pydata#3688)
  ds.merge(da) bugfix (pydata#3677)
  fix docstring for combine_first: returns a Dataset (pydata#3683)
  Add option to choose mfdataset attributes source. (pydata#3498)
  How do I add a new variable to dataset. (pydata#3679)
  Add map_blocks example to whats-new (pydata#3682)
  Make dask names change when chunking Variables by different amounts. (pydata#3584)
  raise an error when renaming dimensions to existing names (pydata#3645)
  Support swap_dims to dimension names that are not existing variables (pydata#3636)
  Add map_blocks example to docs. (pydata#3667)
  add multiindex level name checking to .rename() (pydata#3658)
* upstream/master:
  Add an example notebook using apply_ufunc to vectorize 1D functions (pydata#3629)
  Use encoding['dtype'] over data.dtype when possible within CFMaskCoder.encode (pydata#3652)
  allow passing any iterable to drop when dropping variables (pydata#3693)
  Typo on DataSet/DataArray.to_dict documentation (pydata#3692)
  Fix mypy type checking tests failure in ds.merge (pydata#3690)
  Explicitly convert result of pd.to_datetime to a timezone-naive type (pydata#3688)
  ds.merge(da) bugfix (pydata#3677)
  fix docstring for combine_first: returns a Dataset (pydata#3683)
  Add option to choose mfdataset attributes source. (pydata#3498)
  How do I add a new variable to dataset. (pydata#3679)
  Add map_blocks example to whats-new (pydata#3682)
  Make dask names change when chunking Variables by different amounts. (pydata#3584)
  raise an error when renaming dimensions to existing names (pydata#3645)
  Support swap_dims to dimension names that are not existing variables (pydata#3636)
  Add map_blocks example to docs. (pydata#3667)
  add multiindex level name checking to .rename() (pydata#3658)
* upstream/master: (23 commits)
  Feature/align in dot (pydata#3699)
  ENH: enable `H5NetCDFStore` to work with already open h5netcdf.File a… (pydata#3618)
  One-off isort run (pydata#3705)
  hardcoded xarray.__all__ (pydata#3703)
  Bump mypy to v0.761 (pydata#3704)
  remove DataArray and Dataset constructor deprecations for 0.15 (pydata#3560)
  Tests for variables with units (pydata#3654)
  Add an example notebook using apply_ufunc to vectorize 1D functions (pydata#3629)
  Use encoding['dtype'] over data.dtype when possible within CFMaskCoder.encode (pydata#3652)
  allow passing any iterable to drop when dropping variables (pydata#3693)
  Typo on DataSet/DataArray.to_dict documentation (pydata#3692)
  Fix mypy type checking tests failure in ds.merge (pydata#3690)
  Explicitly convert result of pd.to_datetime to a timezone-naive type (pydata#3688)
  ds.merge(da) bugfix (pydata#3677)
  fix docstring for combine_first: returns a Dataset (pydata#3683)
  Add option to choose mfdataset attributes source. (pydata#3498)
  How do I add a new variable to dataset. (pydata#3679)
  Add map_blocks example to whats-new (pydata#3682)
  Make dask names change when chunking Variables by different amounts. (pydata#3584)
  raise an error when renaming dimensions to existing names (pydata#3645)
  ...
black . && mypy . && flake8
whats-new.rst for all changes and api.rst for new API