cov() and corr() #2652

Closed
wants to merge 2 commits

Conversation

hrishikeshac

Added da.corr() and da.cov() to dataarray.py. A test was added in test_dataarray.py and run with pytest.
Addresses issue #1115

The test is based on demo data and can be readily added to the user guide.
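
For context, a minimal sketch of how the proposed methods would be used (the data and coordinate names here are made up for illustration, assuming the API lands as in this PR):

import numpy as np
import xarray as xr

x = xr.DataArray(np.random.randn(100), dims='time')
y = 0.5 * x + 0.1 * xr.DataArray(np.random.randn(100), dims='time')

y.cov(x, dim='time')   # cross-covariance along the shared 'time' dimension
y.corr(x, dim='time')  # Pearson correlation along 'time'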

@pep8speaks

pep8speaks commented Jan 4, 2019

Hello @hrishikeshac! Thanks for updating the PR.

Line 2417:80: E501 line too long (85 > 79 characters)
Line 2448:80: E501 line too long (86 > 79 characters)

Line 3312:80: E501 line too long (92 > 79 characters)
Line 3313:80: E501 line too long (89 > 79 characters)
Line 3342:80: E501 line too long (100 > 79 characters)

Comment last updated on January 04, 2019 at 23:39 Hours UTC

@hrishikeshac
Author

Made the code PEP8 compatible. Apologies for not doing so earlier.


expected = pd_corr(ts1, ts2)
actual = ts1.corr(ts2)
np.allclose(expected, actual)
Collaborator


You can use assert_allclose from xarray.testing. I'm not sure whether it asserts or returns a bool.

Author


@max-sixty assert_allclose raises an AssertionError, hence I used np.allclose - it returns a bool.

Member


You need to use an actual assertion here. Otherwise this isn't testing anything -- np.allclose() could fail and we wouldn't know.
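
For illustration, the simplest fix would be to wrap the check in an assert (a sketch, not the exact test code):

assert np.allclose(expected, actual)

If both operands were xarray objects, xarray's helper could be used instead; it raises an AssertionError on mismatch rather than returning a bool (expected_da and actual_da are hypothetical names here):

xr.testing.assert_allclose(expected_da, actual_da)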

Member

@fujiisoup left a comment


Thanks for sending a PR.

self, other = broadcast(self, other)

# 2. Ignore the nans
valid_values = self.notnull() & other.notnull()
Member


It allocates more memory than dot or tensordot.
Can we use xr.dot instead of broadcasting?

Author


I used broadcast to ensure that the DataArrays get aligned and that extra dimensions (if any) in one get inserted into the other. So the broadcast used here doesn't do any arithmetic computation as such. I didn't know xr.dot could be used in such a context. Could it?

Member


xarray.dot does do alignment/broadcasting, but it definitely doesn't skip missing values, so I'm not sure it would work well here.

Member


In the typical case, I would expect the arguments for which correlation is being computed to have the same dimensions, so I don't think xarray.dot would be much faster.

other = other.where(valid_values, drop=True)
valid_count = valid_values.sum(dim)

# 3. Compute mean and standard deviation along the given dim
Member


remove 'and standard deviation'

Author


I am a little worried that users could misunderstand cov as (auto-)covariance rather than the cross-covariance we are implementing here.
Probably a function like xr.cov(x, y) is better than a method?

I can implement it as xr.cov(x, y). However, I made the implementation consistent with pd.Series cov() and corr() (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.cov.html), so I think users might be more familiar with this implementation.
If we make it a function, then maybe do it for both cov() and corr(), just to be consistent?

@fujiisoup
Member

I am a little worried that users could misunderstand cov as (auto-)covariance rather than the cross-covariance we are implementing here.
Probably a function like xr.cov(x, y) is better than a method?

@dcherian
Contributor

dcherian commented Jan 5, 2019

I agree with @fujiisoup


# 2. Ignore the nans
valid_values = self.notnull() & other.notnull()
self = self.where(valid_values, drop=True)
Member


It would be best to avoid drop=True if possible. Dropping elements can really slow things down when using dask arrays, because determining the elements to drop requires computing the arrays. In contrast, if we avoid drop=True we can build a lazy computation graph.
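
A sketch of the lazy-friendly variant being suggested, reusing the names from the snippet above (masking with NaN instead of dropping leaves the dask graph unevaluated, and the later reductions skip NaN anyway):

self = self.where(valid_values)        # mask invalid positions instead of dropping them
other = other.where(valid_values)
valid_count = valid_values.sum(dim)    # still lazy with dask arrays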

demeaned_other = other - other.mean(dim=dim)

# 4. Compute covariance along the given dim
cov = (demeaned_self * demeaned_other).sum(dim=dim) / (valid_count)
Member


I think this slightly simpler version would work:

self, other = broadcast(self, other)
valid_values = self.notnull() & other.notnull()
self = self.where(valid_values)
other = other.where(valid_values)
demeaned_self = self - self.mean(dim=dim)
demeaned_other = other - other.mean(dim=dim)
cov = (demeaned_self * demeaned_other).mean(dim=dim)

Or maybe we want to keep using valid_count for the ddof argument.
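
If ddof should be exposed, one hedged sketch of how valid_count could feed it (ddof=1 mirrors the pandas default; this is only an illustration, not the final implementation):

def cov(self, other, dim=None, ddof=1):
    self, other = broadcast(self, other)
    valid_values = self.notnull() & other.notnull()
    self = self.where(valid_values)
    other = other.where(valid_values)
    # divide by (number of valid samples along dim) - ddof
    valid_count = valid_values.sum(dim) - ddof
    demeaned_self = self - self.mean(dim=dim)
    demeaned_other = other - other.mean(dim=dim)
    return (demeaned_self * demeaned_other).sum(dim=dim) / valid_count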

Author


Testing this version. Will look into ddof.



@@ -3305,6 +3305,45 @@ def test_rank(self):
y = DataArray([0.75, 0.25, np.nan, 0.5, 1.0], dims=('z',))
assert_equal(y.rank('z', pct=True), y)

def test_corr(self):
# self: Load demo data and trim it's size
ds = xr.tutorial.load_dataset('air_temperature')
Member


Loading the tutorial datasets requires network access, which we try to avoid for tests. Can you write this test using synthetic data instead?

Author


Yes, will do. The tutorial code can be moved into an example in the documentation/user guide later.
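
For illustration, a synthetic-data version of the test could look roughly like this (shapes, names, and the comparison against pandas are all just a sketch, assuming numpy, pandas, and xarray are imported as np, pd, and xr in the test module):

def test_corr(self):
    # synthetic data: a 1-D series correlated with a small 2-D field
    ts = xr.DataArray(np.random.randn(30), dims='time',
                      coords={'time': pd.date_range('2000-01-01', periods=30)})
    da = ts + xr.DataArray(np.random.randn(30, 4), dims=('time', 'x'),
                           coords={'time': ts['time']})
    expected = [pd.Series(ts.values).corr(pd.Series(da.isel(x=i).values))
                for i in range(4)]
    actual = ts.corr(da, dim='time')
    assert np.allclose(expected, actual)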

other = other.where(valid_values, drop=True)

# 3. Compute correlation based on standard deviations and cov()
self_std = self.std(dim=dim)
Member


What value do we use for ddof? Should that be a keyword argument to this method?

@shoyer
Member

shoyer commented Jan 6, 2019

I also think making this a function is probably a good idea, even though it's different from pandas.

One question: how should these functions align their arguments? Recall that xarray does an inner join for arithmetic (though there's an option to control this), and an outer join in most other cases. It's not entirely obvious to me what the right choice is here (or if it really even matters).
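
A quick toy illustration of the two join behaviors being weighed here (made-up data, not code from this PR):

a = xr.DataArray([1.0, 2.0, 3.0], dims='x', coords={'x': [0, 1, 2]})
b = xr.DataArray([4.0, 5.0, 6.0], dims='x', coords={'x': [1, 2, 3]})

xr.align(a, b, join='inner')  # keeps only the shared labels x = 1, 2 (the arithmetic default)
xr.align(a, b, join='outer')  # keeps x = 0, 1, 2, 3 and fills the gaps with NaN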

@max-sixty
Collaborator

Probably a function like xr.cov(x, y) is better than a method?

We should be more concerned with correctness than consistency - but is having DataArray.dot consistent with DataArray.cov? If not, what's the salient difference?

@fujiisoup
Member

@max-sixty
I am not sure whether DataArray.dot is the right choice. But for the cov case, I am worried it sounds like computing the covariance of the DataArray itself rather than the cross-covariance with another DataArray.

@shoyer
Member

shoyer commented Jan 7, 2019 via email

@hrishikeshac
Author

I also think making this a function is probably a good idea, even though it's different from pandas.

One question: how should these functions align their arguments? Recall that xarray does an inner join for arithmetic (though there's an option to control this), and an outer join in most other cases. It's not entirely obvious to me what the right choice is here (or if it really even matters).

I always assumed an inner join is the way to go. I had initially implemented just align, but later changed to broadcast, since align doesn't add dimensions/labels (if missing in one of the inputs) to the output, but broadcast does. Without this, the where(valid_values) doesn't work if one input is 1-D and the other is N-D.
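
To illustrate that point with a made-up example (align alone leaves the 1-D input 1-D, while broadcast inserts the missing dimension so elementwise masking works):

ts = xr.DataArray(np.random.randn(3), dims='time')                 # 1-D
field = xr.DataArray(np.random.randn(3, 2), dims=('time', 'x'))    # 2-D

a, b = xr.align(ts, field)      # a is still 1-D; no 'x' dimension was added
a, b = xr.broadcast(ts, field)  # a is now 2-D, so (a.notnull() & b.notnull())
                                # and where(valid_values) apply elementwise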

@Hoeze

Hoeze commented Jul 14, 2019

Is this pull request still up to date?

@r-beer

r-beer commented Nov 19, 2019

Dear @Hoeze, I will (try to) finalize this merge request, as I am also very interested in this functionality.

I am new to xarray and contribution. I downloaded @hrishikeshac's code and ran the pytest tests locally.

I get 17 failed, 2088 passed, 2896 skipped, 7 xfailed, 754 warnings in 49.95s.

Is there an elegant way to share "which tests failed where", so that I don't try to fix tests that might already have been fixed in other branches?

I will already start to get a better understanding of why the tests fail and try to fix them in the meantime.

@max-sixty
Collaborator

Great @r-beer, we can help get you up & running.

Given this branch has diverged from master, I would make your own fork and merge in master; looks like you'll have some minor conflicts (more details in our contributing.rst docs, or post here if confused). You can then open up your own draft PR.

Re the tests: pytest should print a list of the tests that failed and their stack traces, do you not see anything?

@r-beer

r-beer commented Nov 19, 2019

@max-sixty, thanks for the fast response!

Yeah, I get the traceback and already started diving into it. However, I assumed that @hrishikeshac's branch "corr" wasn't up-to-date.

Shall I merge changes from master or develop into corr, before looking further into the tests?

@r-beer

r-beer commented Nov 19, 2019


I read http://xarray.pydata.org/en/stable/contributing.html - is this identical to contributing.rst?
Following those guidelines, I would use the following commands to "retrieve the changes from the master branch":

git fetch upstream
git rebase upstream/master

Where upstream = https://github.com/pydata/xarray.git?

@max-sixty
Collaborator

Yes 100%! Let me know if that doesn't work!

@r-beer

r-beer commented Nov 19, 2019

Alright, I only got two merge conflicts in dataarray.py:

minor merge conflict concerning imports:

  1. accessors -> accessor_dt
  2. broadcast has been dropped in master?
<<<<<<< HEAD
from . import (
    computation,
    dtypes,
    groupby,
    indexing,
    ops,
    pdcompat,
    resample,
    rolling,
    utils,
)
from .accessor_dt import DatetimeAccessor
from .accessor_str import StringAccessor
from .alignment import (
    _broadcast_helper,
    _get_broadcast_dims_map_common_coords,
    align,
    reindex_like_indexers,
)
=======
from .accessors import DatetimeAccessor
from .alignment import align, reindex_like_indexers, broadcast
>>>>>>> added da.corr() and da.cov() to dataarray.py. Test added in test_dataarray.py, and tested using pytest.

Secondly, some bigger merge conflicts concerning some of DataArray's methods, but they don't seem to conflict with each other:

  1. integrate(), unify_chunks() and map_blocks added in master
  2. cov() and corr() added in corr branch
  3. It seems like str = property(StringAccessor) should be at the end of dataarray's definition, for mypy reasons...
<<<<<<< HEAD
    def integrate(
        self, dim: Union[Hashable, Sequence[Hashable]], datetime_unit: str = None
    ) -> "DataArray":
        """ integrate the array with the trapezoidal rule.

        .. note::
            This feature is limited to simple cartesian geometry, i.e. dim
            must be one dimensional.

        Parameters
        ----------
        dim: hashable, or a sequence of hashable
            Coordinate(s) used for the integration.
        datetime_unit: str, optional
            Can be used to specify the unit if datetime coordinate is used.
            One of {'Y', 'M', 'W', 'D', 'h', 'm', 's', 'ms', 'us', 'ns', 'ps',
            'fs', 'as'}

        Returns
        -------
        integrated: DataArray

        See also
        --------
        numpy.trapz: corresponding numpy function

        Examples
        --------

        >>> da = xr.DataArray(np.arange(12).reshape(4, 3), dims=['x', 'y'],
        ...                   coords={'x': [0, 0.1, 1.1, 1.2]})
        >>> da
        <xarray.DataArray (x: 4, y: 3)>
        array([[ 0,  1,  2],
               [ 3,  4,  5],
               [ 6,  7,  8],
               [ 9, 10, 11]])
        Coordinates:
          * x        (x) float64 0.0 0.1 1.1 1.2
        Dimensions without coordinates: y
        >>>
        >>> da.integrate('x')
        <xarray.DataArray (y: 3)>
        array([5.4, 6.6, 7.8])
        Dimensions without coordinates: y
        """
        ds = self._to_temp_dataset().integrate(dim, datetime_unit)
        return self._from_temp_dataset(ds)

    def unify_chunks(self) -> "DataArray":
        """ Unify chunk size along all chunked dimensions of this DataArray.

        Returns
        -------

        DataArray with consistent chunk sizes for all dask-array variables

        See Also
        --------

        dask.array.core.unify_chunks
        """
        ds = self._to_temp_dataset().unify_chunks()
        return self._from_temp_dataset(ds)

    def map_blocks(
        self,
        func: "Callable[..., T_DSorDA]",
        args: Sequence[Any] = (),
        kwargs: Mapping[str, Any] = None,
    ) -> "T_DSorDA":
        """
        Apply a function to each chunk of this DataArray. This method is experimental
        and its signature may change.

        Parameters
        ----------
        func: callable
            User-provided function that accepts a DataArray as its first parameter. The
            function will receive a subset of this DataArray, corresponding to one chunk
            along each chunked dimension. ``func`` will be executed as
            ``func(obj_subset, *args, **kwargs)``.

            The function will be first run on mocked-up data, that looks like this array
            but has sizes 0, to determine properties of the returned object such as
            dtype, variable names, new dimensions and new indexes (if any).

            This function must return either a single DataArray or a single Dataset.

            This function cannot change size of existing dimensions, or add new chunked
            dimensions.
        args: Sequence
            Passed verbatim to func after unpacking, after the sliced DataArray. xarray
            objects, if any, will not be split by chunks. Passing dask collections is
            not allowed.
        kwargs: Mapping
            Passed verbatim to func after unpacking. xarray objects, if any, will not be
            split by chunks. Passing dask collections is not allowed.

        Returns
        -------
        A single DataArray or Dataset with dask backend, reassembled from the outputs of
        the function.

        Notes
        -----
        This method is designed for when one needs to manipulate a whole xarray object
        within each chunk. In the more common case where one can work on numpy arrays,
        it is recommended to use apply_ufunc.

        If none of the variables in this DataArray is backed by dask, calling this
        method is equivalent to calling ``func(self, *args, **kwargs)``.

        See Also
        --------
        dask.array.map_blocks, xarray.apply_ufunc, xarray.map_blocks,
        xarray.Dataset.map_blocks
        """
        from .parallel import map_blocks

        return map_blocks(func, self, args, kwargs)

    # this needs to be at the end, or mypy will confuse with `str`
    # https://mypy.readthedocs.io/en/latest/common_issues.html#dealing-with-conflicting-names
    str = property(StringAccessor)

=======
    def cov(self, other, dim = None):
        """Compute covariance between two DataArray objects along a shared dimension.

        Parameters
        ----------
        other: DataArray
            The other array with which the covariance will be computed
        dim: The dimension along which the covariance will be computed

        Returns
        -------
        covariance: DataArray
        """
        # 1. Broadcast the two arrays
        self, other     = broadcast(self, other)

        # 2. Ignore the nans
        valid_values    = self.notnull() & other.notnull()
        self            = self.where(valid_values, drop=True)
        other           = other.where(valid_values, drop=True)
        valid_count     = valid_values.sum(dim)

        #3. Compute mean and standard deviation along the given dim
        demeaned_self   = self - self.mean(dim = dim)
        demeaned_other  = other - other.mean(dim = dim)

        #4. Compute  covariance along the given dim
        cov             =  (demeaned_self*demeaned_other).sum(dim=dim)/(valid_count)

        return cov

    def corr(self, other, dim = None):
        """Compute correlation between two DataArray objects along a shared dimension.

        Parameters
        ----------
        other: DataArray
            The other array with which the correlation will be computed
        dim: The dimension along which the correlation will be computed

        Returns
        -------
        correlation: DataArray
        """
        # 1. Broadcast the two arrays
        self, other     = broadcast(self, other)

        # 2. Ignore the nans
        valid_values    = self.notnull() & other.notnull()
        self            = self.where(valid_values, drop=True)
        other           = other.where(valid_values, drop=True)

        # 3. Compute correlation based on standard deviations and cov()
        self_std        = self.std(dim=dim)
        other_std       = other.std(dim=dim)

        return self.cov(other, dim = dim)/(self_std*other_std)
>>>>>>> added da.corr() and da.cov() to dataarray.py. Test added in test_dataarray.py, and tested using pytest.

Can you please comment on my suggested changes (accepting either the changes from master, or both where there is no conflict)?

@max-sixty
Collaborator

Yeah, those are pretty normal. Your suggestions look right: you can keep both in both cases and eliminate any duplicate imports in the first.

(FYI I edited your comment so it displayed properly, seems you need a line break after <details>)

@r-beer

r-beer commented Nov 20, 2019

Alright, I have done so and changed basestring to str; now the tests (mostly) run through locally.

1 failed, 6539 passed, 1952 skipped, 37 xfailed, 31 warnings in 86.34s

A general question concerning collaboration on existing PRs:
Should I push the changes to my fork and then create a new PR that replaces @hrishikeshac's PR, or should I push to @hrishikeshac's fork and see whether that updates the existing PR?

Or is there another option?

PS: Permission to push to hrishikeshac:corr is denied for me.

@r-beer mentioned this pull request on Nov 20, 2019
@max-sixty
Collaborator

@r-beer great—you were right to start your own PR

@max-sixty
Collaborator

@hrishikeshac in case you come back to see this: thank you for taking it so far; your code was helpful in eventually getting this feature in. And we'd of course appreciate any additional contributions.

@max-sixty closed this May 25, 2020