einsum for xarray #1968

fujiisoup · 2018-03-06T14:18:22Z

Closes einsum for xarray #1951
Tests added
Tests passed (for all non-documentation changes)
Fully documented, including whats-new.rst for all changes and api.rst for new API (remove if this change should not be visible to users, e.g., if it is an internal clean-up, or if this is part of a larger project that will be documented later)

Currently, lazy-einsum for dask is not yet working.

@shoyer
I think apply_ufunc supports lazy computation, but I have not yet figured out how to do this.
Could you give me some help?

xarray/core/computation.py

shoyer

Very nice!

shoyer · 2018-03-06T19:07:55Z

xarray/core/computation.py

+    subscripts = ''
+    for ds in input_core_dims:
+        subscripts += '...' + ''.join([dim_map[d] for d in ds]) + ','
+    subscripts = subscripts[:-1]  # remove last comma


It would probably be cleaner to build up subscripts as a list and use ','.join(subscripts_list) once at the end.
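A minimal sketch of that suggestion, using the names from the diff above. The helper name build_subscripts and the sample dim_map are made up for illustration; this is not the code that was merged.

```python
def build_subscripts(input_core_dims, dim_map):
    """Build one einsum term per input and join them with commas."""
    terms = ['...' + ''.join(dim_map[d] for d in dims)
             for dims in input_core_dims]
    # A single join avoids the trailing comma that had to be sliced off.
    return ','.join(terms)


# Hypothetical inputs: two arrays with core dims (x, y) and (y, z).
input_core_dims = [['x', 'y'], ['y', 'z']]
dim_map = {'x': 'a', 'y': 'b', 'z': 'c'}
subscripts = build_subscripts(input_core_dims, dim_map)
print(subscripts)  # ...ab,...bc
```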

shoyer · 2018-03-06T19:09:31Z

xarray/core/computation.py

+
+    result = apply_ufunc(np.einsum, subscripts, *arrays,
+                         input_core_dims=[[]] + input_core_dims,
+                         output_core_dims=output_core_dims, dask='allowed')


I think dask='parallelized' is what you want here -- that generates the wrapper to do this with dask. This will also require determining the result data type, probably with dtypes.result_type or even np.result_type (we don't need support for non-numeric types in einsum, so I'm pretty sure NumPy's casting rules would work fine).

dask='allowed' would be appropriate if np.einsum already supported dask arrays (but it does not).

It's possible that a dask specific einsum could be much more efficient than the auto-generated wrapper here, but certainly this is good enough for now.

Thanks. I noticed that my current implementation is not very efficient for dask.
Maybe a smaller number of input_core_dims would be better for dask?

I think it needs some improvement.

dask='parallelized' will only parallelize over broadcast dimensions, i.e., ones that don't appear in either input_core_dims or output_core_dims. So yes, it will probably be slow in many cases.

I'm still OK adding the non-optimal einsum for now and improving it later.
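To make the point about broadcast dimensions concrete, here is a small pure-Python sketch. The helper broadcast_dims and the dimension names are hypothetical and only mirror the rule described above (a dimension can be parallelized over only if it is in neither input_core_dims nor output_core_dims); it does not call xarray or dask.

```python
def broadcast_dims(array_dims, input_core_dims, output_core_dims):
    """Return the dims that are not core dims of any input or output."""
    core = set()
    for dims in list(input_core_dims) + list(output_core_dims):
        core.update(dims)
    result = []
    for dims in array_dims:
        for d in dims:
            if d not in core and d not in result:
                result.append(d)
    return result


# Hypothetical arrays with dims (time, x, y) and (time, y, z), summed over y.
array_dims = [('time', 'x', 'y'), ('time', 'y', 'z')]

# Only the summed dimension is a core dim: three dims remain to parallelize over.
few_core = broadcast_dims(array_dims, [['y'], ['y']], [[]])

# Every non-shared dim is a core dim: only 'time' remains.
many_core = broadcast_dims(array_dims, [['x', 'y'], ['y', 'z']], [['x', 'z']])

print(few_core, many_core)
```

With fewer core dimensions, more dimensions are left for the auto-generated wrapper to split work over, which is the trade-off raised in the question above.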

shoyer · 2018-03-06T19:10:17Z

xarray/core/computation.py

+    if len(arrays) < 2:
+        raise TypeError('More than two arrays must be provided')
+
+    if any(not hasattr(arr, 'dims') for arr in arrays):


Dataset also defines dims. It's probably better to explicitly use an isinstance() check.
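The difference between the two checks can be sketched with stand-in classes. DataArray and Dataset below are hypothetical placeholders, not the real xarray classes; they only share the fact that both expose a dims attribute.

```python
class DataArray:
    """Hypothetical stand-in for xarray.DataArray."""
    dims = ('x',)


class Dataset:
    """Hypothetical stand-in for xarray.Dataset, which also has dims."""
    dims = {'x': 3}


def accepted_by_hasattr(arrays):
    # Duck-typed check from the diff: passes for anything with .dims.
    return not any(not hasattr(arr, 'dims') for arr in arrays)


def accepted_by_isinstance(arrays):
    # Explicit check: passes only for DataArray instances.
    return not any(not isinstance(arr, DataArray) for arr in arrays)


mixed = [DataArray(), Dataset()]
print(accepted_by_hasattr(mixed))     # True: the Dataset slips through
print(accepted_by_isinstance(mixed))  # False
```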

shoyer · 2018-03-06T19:11:38Z

xarray/core/dataarray.py

-                    [d for d in other.dims if d not in dims])
-
-        return type(self)(new_data, new_coords.variables, new_dims)
+        # backward compat: if there is no shared dimension, we rais an Errror


It might be better to eliminate this special case. Then users can understand DataArray.dot as a simple short-cut for xarray.dot().

shoyer · 2018-03-06T19:15:55Z

xarray/core/computation.py

+    arrays = args
+    if dims is None and isinstance(args[-1], (list, tuple, basestring)):
+        dims = args[-1]
+        arrays = args[:-1]


I think it is better to require specifying dims with a keyword argument.

Our previous dot does not require dim; it sums over all the common dimensions.
I think dim=None is not surprising.

I agree, the default dims=None should be OK. I meant that dims should be a keyword only argument, not a required argument.

Here you are supporting xr.dot(a, b, 'x'), where 'x' denotes a dimension. I would require writing xr.dot(a, b, dim='x') or omitting dim altogether.

shoyer · 2018-03-06T19:16:12Z

xarray/core/computation.py

@@ -926,6 +926,86 @@ def earth_mover_distance(first_samples,
        return apply_array_ufunc(func, *args, dask=dask)


+def dot(*args, **kwargs):
+    """ dot(*arrays, dims=None)


dot(*arrays, *, dims=None) is the way to write this with Python 3's keyword only arguments.

Maybe we should keep this as dot(*arrays, **kwargs), since we have not yet dropped Python 2 support?

I was confused. def dot(*arrays, *, dims=None) is not valid syntax in Python 3, either. (There can only be one single *)

PEP 3102 says Python 3 supports the form def dot(*arrays, dim=None).
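A short sketch of the two spellings discussed in this thread. The function names dot_py3 and dot_py2_compatible are illustrative only, and the bodies just return their arguments so the binding behaviour is visible.

```python
def dot_py3(*arrays, dims=None):
    """Python 3 only: any parameter after *arrays is keyword-only."""
    return arrays, dims


def dot_py2_compatible(*arrays, **kwargs):
    """Works on Python 2 as well: pop the keyword by hand and reject extras."""
    dims = kwargs.pop('dims', None)
    if kwargs:
        raise TypeError('Invalid keyword arguments {} are given'.format(
            list(kwargs)))
    return arrays, dims


# A trailing positional 'x' is treated as another array, not as dims.
print(dot_py3('a', 'b', 'x'))                  # (('a', 'b', 'x'), None)
print(dot_py3('a', 'b', dims='x'))             # (('a', 'b'), 'x')
print(dot_py2_compatible('a', 'b', dims='x'))  # (('a', 'b'), 'x')
```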

fujiisoup · 2018-03-07T06:58:18Z

xarray/core/computation.py

+        return apply_ufunc(duck_array_ops.tensordot, *arrays, dask='allowed',
+                           input_core_dims=input_core_dims,
+                           output_core_dims=output_core_dims,
+                           kwargs={'axes': axes})


Thanks. I added a path for tensordot, which dask can compute more efficiently.

shoyer

Some feedback on the documentation (mostly grammar).

shoyer · 2018-03-07T07:24:33Z

xarray/core/computation.py

+    ----------
+    arrays: multiple DataArrays
+        arrays to compute.
+    dims: tuple of strings, optional


str or tuple of strings

shoyer · 2018-03-07T07:25:11Z

xarray/core/computation.py

+    """ dot(*arrays, *, dims=None)
+
+    einsum for xarray object, but providing simpler interface based on
+    the array dimensions.


We should lead with a more general description. Maybe:

Generalized dot product for xarray objects. Like np.einsum, but provides a simpler interface based on array dimensions.

shoyer · 2018-03-07T07:27:00Z

xarray/core/computation.py

+
+    Parameters
+    ----------
+    arrays: multiple DataArrays


*arrays: DataArray objects

shoyer · 2018-03-07T07:27:14Z

xarray/core/computation.py

+    Parameters
+    ----------
+    arrays: multiple DataArrays
+        arrays to compute.


shoyer · 2018-03-07T07:27:28Z

xarray/core/computation.py

+    arrays: multiple DataArrays
+        arrays to compute.
+    dims: tuple of strings, optional
+        Along which dimensions to be summed over.


Which dimensions to sum over.

shoyer · 2018-03-07T07:28:15Z

xarray/core/computation.py

+
+    Returns
+    -------
+    dot: same type to input.


Probably should just be "DataArray"?

shoyer · 2018-03-07T07:37:52Z

xarray/core/computation.py

+
+    common_dims = set(arrays[0].dims)
+    for arr in arrays[1:]:
+        common_dims = common_dims.intersection(set(arr.dims))


This is a slightly different choice of default dimensions than np.einsum:

np.einsum sums over any dimensions that are defined in two or more inputs.

This sums only over dimensions that are defined on all inputs.

Should we switch this behavior to match einsum?

shoyer · 2018-03-07T07:40:32Z

xarray/core/computation.py

+                            dims=['a', 'b', 'c'])
+    >>> da_c = xr.DataArray(np.arange(5 * 6).reshape(5, 6), dims=['c', 'd'])
+
+    >>> dot(da_a, da_b, dims=['a', 'b']).dims


These should use the full name xr.dot.

shoyer · 2018-03-07T07:43:33Z

xarray/core/computation.py

+    dims = kwargs.pop('dims', None)
+
+    if len(arrays) < 2:
+        raise TypeError('More than one arrays must be provided')


Do we need this special case? If not, let's remove this. For consistency, it is nice to use the same logic even for edge cases when possible. This makes it easier to think about the function.

In this case, I think a dot product of 1 array would be consistently defined by summing over dimensions listed explicitly in dims.

stickler-ci · 2018-03-07T09:23:05Z

xarray/core/computation.py

+    dims = kwargs.pop('dims', None)
+    if len(kwargs) > 0:
+        raise TypeError('Invalid keyward arguments {} are given'.format(
+            kwargs.keys()))


W1655 dict.keys referenced when not iterating
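For context, a small sketch of why the linter flags this on Python 3: formatting kwargs.keys() directly prints a dict_keys view, while a list gives the same message on both Python versions. The helper name describe_unexpected is hypothetical.

```python
def describe_unexpected(**kwargs):
    # list(kwargs) yields the keys as a plain list on Python 2 and 3.
    return 'Invalid keyword arguments {} are given'.format(list(kwargs))


message = describe_unexpected(dim='x')
print(message)  # Invalid keyword arguments ['dim'] are given

# On Python 3 the unwrapped view formats differently:
view_text = '{}'.format({'dim': 'x'}.keys())
print(view_text)  # dict_keys(['dim'])
```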

shoyer · 2018-03-07T16:49:17Z

xarray/core/computation.py

+        # find dimensions that exist in more than two arrays
+        whole_dims = []
+        for arr in arrays:
+            whole_dims += [d for d in arr.dims]


This might be a nice use for collections.Counter(), e.g.,

dim_counts = Counter()
for arr in arrays:
    dim_counts.update(arr.dims)
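A runnable sketch of the Counter idea, using plain tuples of dimension names in place of DataArray objects (an assumption made so the example needs only the standard library). It also shows how the counts separate the two defaults discussed earlier: dimensions present in two or more arrays versus dimensions present in all arrays. It assumes no array repeats a dimension name.

```python
from collections import Counter


def count_dims(dims_per_array):
    """Count in how many arrays each dimension name appears."""
    counts = Counter()
    for dims in dims_per_array:
        counts.update(dims)
    return counts


# Hypothetical dims of three arrays.
dims_per_array = [('a', 'b'), ('b', 'c'), ('c', 'd')]
counts = count_dims(dims_per_array)

# einsum-like default: sum over dims that appear in two or more arrays.
in_two_or_more = sorted(d for d, n in counts.items() if n > 1)

# Earlier default in this PR: sum only over dims shared by every array.
in_all = sorted(d for d, n in counts.items() if n == len(dims_per_array))

print(in_two_or_more)  # ['b', 'c']
print(in_all)          # []
```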

shoyer · 2018-03-07T16:50:56Z

xarray/core/computation.py

@@ -974,27 +977,30 @@ def dot(*arrays, **kwargs):
        dims = [dims]

    common_dims = set(arrays[0].dims)
+    all_dims = []


would it work to make all_dims a set instead of a list? I think that would be slightly more efficient.

I want to keep the order of occurrence in all_dims, so that the input_core_dims can be moved back to their original positions.

OK, sounds good.
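One way to get both properties, sketched under the assumption that dims are hashable names: keep a list for the order and a set for fast membership tests. The helper name unique_in_order is hypothetical and is not taken from the PR.

```python
def unique_in_order(dims_per_array):
    """Collect each dimension once, in order of first occurrence."""
    seen = set()     # fast membership test
    ordered = []     # preserves first-seen order
    for dims in dims_per_array:
        for d in dims:
            if d not in seen:
                seen.add(d)
                ordered.append(d)
    return ordered


all_dims = unique_in_order([('a', 'b'), ('b', 'c'), ('c', 'a', 'd')])
print(all_dims)  # ['a', 'b', 'c', 'd']
```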

shoyer · 2018-03-07T16:52:22Z

xarray/core/computation.py

-    if len(arrays) < 2:
-        raise TypeError('More than one arrays must be provided')
+    if len(arrays) < 2 and dims is None:
+        raise TypeError('dim must be provided for one array computation.')


If there's only one array, wouldn't dims just be any repeated dimensions on the single array?

xarray objects do not have any repeated dimensions.

This is not strictly true: #1378 . That said, we certainly don't support repeated dims well right now.

Even if we banned repeated dimensions, I still think there's no harm in supporting the trivial xr.dot(array) -> array.

OK. Updated.

shoyer · 2018-03-07T16:54:54Z

xarray/core/computation.py

@@ -926,6 +926,86 @@ def earth_mover_distance(first_samples,
        return apply_array_ufunc(func, *args, dask=dask)


+def dot(*args, **kwargs):
+    """ dot(*arrays, dims=None)


I was confused. def dot(*arrays, *, dims=None) is not valid syntax in Python 3, either. (There can only be one single *)

shoyer · 2018-03-08T02:09:53Z

xarray/core/computation.py

+    common_dims = set(arrays[0].dims)
+    all_dims = []
+    for arr in arrays[1:]:
+        common_dims = common_dims.intersection(set(arr.dims))


It might be slightly more efficient to construct common_dims with a single call to intersection?

e.g.,
common_dims = set.intersection(*[set(arr.dims) for arr in arrays])
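A quick check, with tuples of names standing in for arr.dims, that the single call gives the same result as the loop in the diff. Note that the single-call form raises TypeError when the list is empty, so the zero-array case still needs its own handling.

```python
dims_per_array = [('a', 'b', 'c'), ('b', 'c', 'd'), ('c', 'b')]

# Loop version from the diff.
common_loop = set(dims_per_array[0])
for dims in dims_per_array[1:]:
    common_loop = common_loop.intersection(set(dims))

# Single call, as suggested.
common_single = set.intersection(*[set(dims) for dims in dims_per_array])

print(common_loop == common_single)  # True

# With no arrays at all, the unbound method has nothing to operate on.
try:
    set.intersection(*[])
    empty_case_raises = False
except TypeError:
    empty_case_raises = True
print(empty_case_raises)  # True
```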

shoyer · 2018-03-08T02:11:29Z

xarray/core/computation.py

+    if len(kwargs) > 0:
+        raise TypeError('Invalid keyward arguments {} are given'.format(
+            list(kwargs.keys())))
+


What happens if you write xr.dot()? I suppose we still need to raise an error for 0 arguments.

shoyer

Let's wait a little while to see if anyone else has feedback, e.g., on the name. But this looks very nice to me!

shoyer · 2018-03-08T02:57:32Z

xarray/core/computation.py

@@ -968,15 +968,19 @@ def dot(*arrays, **kwargs):
            list(kwargs.keys())))

    if any(not isinstance(arr, DataArray) for arr in arrays):
-        raise TypeError('Only xr.DataArray and xr.Variable are supported.')
+        raise TypeError('Only xr.DataArray and xr.Variable are supported.'


We should either update the error message or isinstance() check here -- right now they are inconsistent.

max-sixty · 2018-03-08T03:50:51Z

xarray/core/computation.py

+            list(kwargs.keys())))
+
+    if any(not isinstance(arr, DataArray) for arr in arrays):
+        raise TypeError('Only xr.DataArray and xr.Variable are supported.'


Either a type checking or a docstring issue:

In [8]: v=xr.Variable(data=np.random.rand(3,4), dims=('a','b'))

In [9]: xr.dot(v,v)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-9-fac8e1cb222a> in <module>()
----> 1 xr.dot(v,v)

~/drive/workspace/xarray/xarray/core/computation.py in dot(*arrays, **kwargs)
    970     if any(not isinstance(arr, DataArray) for arr in arrays):
    971         raise TypeError('Only xr.DataArray and xr.Variable are supported.'
--> 972                         'Given {}.'.format([type(arr) for arr in arrays]))
    973
    974     if len(arrays) == 0:

TypeError: Only xr.DataArray and xr.Variable are supported.Given [<class 'xarray.core.variable.Variable'>, <class 'xarray.core.variable.Variable'>].

max-sixty · 2018-03-08T03:57:33Z

xarray/core/computation.py

+        raise TypeError('At least one array should be given.')
+
+    if isinstance(dims, basestring):
+        dims = (dims, )


FWIW you don't need the parentheses

I personally like parentheses, as I think it is more descriptive.

max-sixty · 2018-03-08T04:00:34Z

xarray/core/computation.py

+    if isinstance(dims, basestring):
+        dims = (dims, )
+    elif isinstance(dims, list):
+        dims = tuple(dims)


FWIW dims=tuple(dims) doesn't create any copies if dims is already a tuple, so you could skip the if isinstance check
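A sketch of the simplified normalization. The helper name normalize_dims is hypothetical, and str is used where the diff uses basestring (a Python 3 assumption). The string check has to stay, because tuple('xy') would split the name into characters; only the list branch becomes unnecessary.

```python
def normalize_dims(dims):
    """Return dims as a tuple of names, or None if dims is None."""
    if dims is None:
        return None
    if isinstance(dims, str):
        # A single name must be wrapped, not iterated character by character.
        return (dims, )
    # Works for lists and tuples alike; no isinstance(dims, list) needed.
    return tuple(dims)


print(normalize_dims('xy'))        # ('xy',)
print(normalize_dims(['a', 'b']))  # ('a', 'b')

already_tuple = ('a', 'b')
# In CPython, tuple() returns an exact tuple unchanged rather than copying it.
print(normalize_dims(already_tuple) is already_tuple)
```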

fujiisoup · 2018-03-08T04:34:12Z

Thanks, @maxim-lian.
I added xr.Variable support for xr.dot.

max-sixty · 2018-03-08T04:49:24Z

This is awesome. Beautiful code, immediately impactful, and the API is so simple - a testament to the benefits of named dims

Thank you @fujiisoup !

max-sixty · 2018-03-08T04:52:41Z

Do you know why the tests are failing? Do you want me to have a look?

The arrays look the same: https://travis-ci.org/pydata/xarray/jobs/350640898#L5182. Would assert_close help?

fujiisoup · 2018-03-08T05:06:14Z

I just noticed the test failures.
This was a bug caused by the undefined iteration order of sets.
Fixed. Thanks :)

fujiisoup · 2018-03-10T01:49:11Z

I'm going to merge this tomorrow if there are no further comments.

fujiisoup added 2 commits March 6, 2018 23:11

einsum for xarray

220ebcc

whats new

4239ac6

max-sixty reviewed Mar 6, 2018

shoyer reviewed Mar 6, 2018

fujiisoup added 2 commits March 7, 2018 15:48

Support dask for xr.dot.

0f472a2

Merge branch 'master' into einsum

c83d442

fujiisoup commented Mar 7, 2018

shoyer reviewed Mar 7, 2018

flake8. Add some error messages.

1c732a4

stickler-ci reviewed Mar 7, 2018

fix for sticker-ci

b8d93b0

shoyer reviewed Mar 7, 2018

fujiisoup added 2 commits March 8, 2018 09:09

Use counter

3278bf3

Always allow dims=None for xr.dot.

1ec5683

shoyer reviewed Mar 8, 2018

Simplify logic. More comments.

789cb96

shoyer approved these changes Mar 8, 2018

max-sixty reviewed Mar 8, 2018

Support variable in xr.dot

a57907c

bug fix due to the undefined order of set

693b242

Remove unused casting to set

88be319

shoyer mentioned this pull request Mar 9, 2018

0.10.2 release #1975

Closed

3 tasks

fujiisoup added 2 commits March 12, 2018 14:39

Merge branch 'master' into einsum

b3d4768

Merge branch 'master' into einsum

2bd06ef

fujiisoup merged commit 8271dff into pydata:master Mar 12, 2018

fujiisoup deleted the einsum branch March 12, 2018 06:42
