-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize dask array equality checks. #3453
Conversation
Dask arrays with the same graph have the same name. We can use this to quickly compare dask-backed variables without computing. Fixes pydata#3068 and pydata#3311
xarray/core/duck_array_ops.py
Outdated
): | ||
# GH3068 | ||
if arr1.name == arr2.name: | ||
return True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking a little repetitive. Can you make a helper function for doing these checks?
Also note the similar logic (checking object identity) inside Variable.equals
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. We can just call lazy_array_equiv
in all these before doing the actual value comparison. Then only the asarray
cast is duplicated.
Re Variable.equals
:
xarray/xarray/core/variable.py
Lines 1560 to 1576 in fb0cf7b
def equals(self, other, equiv=duck_array_ops.array_equiv): | |
"""True if two Variables have the same dimensions and values; | |
otherwise False. | |
Variables can still be equal (like pandas objects) if they have NaN | |
values in the same locations. | |
This method is necessary because `v1 == v2` for Variables | |
does element-wise comparisons (like numpy.ndarrays). | |
""" | |
other = getattr(other, "variable", other) | |
try: | |
return self.dims == other.dims and ( | |
self._data is other._data or equiv(self.data, other.data) | |
) | |
except (TypeError, AttributeError): | |
return False |
equiv=duck_array_ops.array_equiv
by default which now calls duck_array_ops.lazy_array_equiv
first so things are good?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK I've added the identity check to lazy_array_equiv
too
* upstream/master: Remove deprecated behavior from dataset.drop docstring (pydata#3451) jupyterlab dark theme (pydata#3443) Drop groups associated with nans in group variable (pydata#3406) Allow ellipsis (...) in transpose (pydata#3421) Another groupby.reduce bugfix. (pydata#3403) add icomoon license (pydata#3448)
Ready for final review and merge. |
* upstream/master: upgrade black verison to 19.10b0 (pydata#3456) Remove outdated code related to compatibility with netcdftime (pydata#3450)
xarray/core/merge.py
Outdated
for var in variables[1:]: | ||
equals = getattr(out, compat)(var) | ||
equals = getattr(out, compat)(var, equiv=lazy_array_equiv) | ||
if not equals: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be if equals is not None
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is right (as above) but the next one is wrong. I've fixed that and added a test.
xarray/core/concat.py
Outdated
equals[k] = getattr(variables[0], compat)( | ||
var, equiv=lazy_array_equiv | ||
) | ||
if not equals[k]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same concern as below -- should this be checking against None
explicitly, which I believe lazy_array_equiv
can return?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is right. We are exiting early because equals[k] is not True
i.e. either the shapes are not equal, or one (or both) of the arrays is a numpy array or the dask names are not equal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @shoyer.
I've added more comments and made the if conditions more explicit so the logic is (hopefully) clearer now. I did find a bug in merge.py
and fixed it. I've added better tests too.
xarray/core/concat.py
Outdated
equals[k] = getattr(variables[0], compat)( | ||
var, equiv=lazy_array_equiv | ||
) | ||
if not equals[k]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is right. We are exiting early because equals[k] is not True
i.e. either the shapes are not equal, or one (or both) of the arrays is a numpy array or the dask names are not equal
xarray/core/merge.py
Outdated
for var in variables[1:]: | ||
equals = getattr(out, compat)(var) | ||
equals = getattr(out, compat)(var, equiv=lazy_array_equiv) | ||
if not equals: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is right (as above) but the next one is wrong. I've fixed that and added a test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me.
Thanks for adding the explicit checks is not True
, etc. It's definitely good to be explicit about exactly what conditions we are checking rather than relying upon "implicit false" here (even though that's the normal Python idiom).
Thanks for the review @shoyer
I myself was confused when I went through it the second time :) |
…tests * upstream/master: drop_vars; deprecate drop for variables (pydata#3475) uamiv test using only raw uamiv variables (pydata#3485) Optimize dask array equality checks. (pydata#3453)
* upstream/master: (27 commits) drop_vars; deprecate drop for variables (pydata#3475) uamiv test using only raw uamiv variables (pydata#3485) Optimize dask array equality checks. (pydata#3453) Propagate indexes in DataArray binary operations. (pydata#3481) python 3.8 tests (pydata#3477) __dask_tokenize__ (pydata#3446) Type check sentinel values (pydata#3472) Fix typo in docstring (pydata#3474) fix test suite warnings re `drop` (pydata#3460) Fix integrate docs (pydata#3469) Fix leap year condition in monthly means example (pydata#3464) Hypothesis tests for roundtrip to & from pandas (pydata#3285) unpin cftime (pydata#3463) Cleanup whatsnew (pydata#3462) enable xr.ALL_DIMS in xr.dot (pydata#3424) Merge stable into master (pydata#3457) upgrade black verison to 19.10b0 (pydata#3456) Remove outdated code related to compatibility with netcdftime (pydata#3450) Remove deprecated behavior from dataset.drop docstring (pydata#3451) jupyterlab dark theme (pydata#3443) ...
* upstream/master: add missing pint integration tests (pydata#3508) DOC: update bottleneck repo url (pydata#3507) add drop_sel, drop_vars, map to api.rst (pydata#3506) remove syntax warning (pydata#3505) Dataset.map, GroupBy.map, Resample.map (pydata#3459) tests for datasets with units (pydata#3447) fix pandas-dev tests (pydata#3491) unpin pseudonetcdf (pydata#3496) whatsnew corrections (pydata#3494) drop_vars; deprecate drop for variables (pydata#3475) uamiv test using only raw uamiv variables (pydata#3485) Optimize dask array equality checks. (pydata#3453)
* upstream/master: format indexing.rst code with black (pydata#3511) add missing pint integration tests (pydata#3508) DOC: update bottleneck repo url (pydata#3507) add drop_sel, drop_vars, map to api.rst (pydata#3506) remove syntax warning (pydata#3505) Dataset.map, GroupBy.map, Resample.map (pydata#3459) tests for datasets with units (pydata#3447) fix pandas-dev tests (pydata#3491) unpin pseudonetcdf (pydata#3496) whatsnew corrections (pydata#3494) drop_vars; deprecate drop for variables (pydata#3475) uamiv test using only raw uamiv variables (pydata#3485) Optimize dask array equality checks. (pydata#3453) Propagate indexes in DataArray binary operations. (pydata#3481) python 3.8 tests (pydata#3477)
commit d430ae0 Author: dcherian <deepak@cherian.net> Date: Wed Nov 13 08:27:04 2019 -0700 proper fix. commit 7fd69be Author: dcherian <deepak@cherian.net> Date: Wed Nov 13 08:03:26 2019 -0700 fix whats-new merge. commit 4489394 Merge: 279ff1d b74f80c Author: dcherian <deepak@cherian.net> Date: Wed Nov 13 08:03:06 2019 -0700 Merge remote-tracking branch 'upstream/master' into fix/plot-broadcast * upstream/master: format indexing.rst code with black (pydata#3511) add missing pint integration tests (pydata#3508) DOC: update bottleneck repo url (pydata#3507) add drop_sel, drop_vars, map to api.rst (pydata#3506) remove syntax warning (pydata#3505) Dataset.map, GroupBy.map, Resample.map (pydata#3459) tests for datasets with units (pydata#3447) fix pandas-dev tests (pydata#3491) unpin pseudonetcdf (pydata#3496) whatsnew corrections (pydata#3494) drop_vars; deprecate drop for variables (pydata#3475) uamiv test using only raw uamiv variables (pydata#3485) Optimize dask array equality checks. (pydata#3453) Propagate indexes in DataArray binary operations. (pydata#3481) python 3.8 tests (pydata#3477) commit 279ff1d Author: dcherian <deepak@cherian.net> Date: Wed Nov 13 08:02:44 2019 -0700 Undo the transpose change and add test to make sure transposition is right. commit c9cc698 Author: dcherian <deepak@cherian.net> Date: Wed Nov 13 08:01:39 2019 -0700 Test to make sure transpose is right commit 9b35ecf Author: dcherian <deepak@cherian.net> Date: Sat Nov 2 15:49:08 2019 -0600 Additional test. commit 7aed950 Author: dcherian <deepak@cherian.net> Date: Sat Nov 2 15:20:07 2019 -0600 make plotting work with transposed nondim coords.
Dask arrays with the same graph have the same name. We can use this to quickly
compare dask-backed variables without computing. (see #3068 (comment))
I will work on adding extra tests but could use feedback on the approach.
black . && mypy . && flake8
whats-new.rst
for all changes andapi.rst
for new API@djhoese, thanks for the great example code!