
xr.cov() and xr.corr() #4089

Merged (36 commits) May 25, 2020

Conversation

@AndrewILWilliams (Contributor) commented May 23, 2020

PR for the xr.cov() and xr.corr() functionality which others have been working on. Most code adapted from @r-beer in PR #3550.

TODO:

CHECKLIST:

@pep8speaks commented May 23, 2020

Hello @AndrewWilliams3142! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-05-25 13:55:29 UTC

@AndrewILWilliams (Contributor, Author) commented May 24, 2020

The current problem is that we can't use Pandas to fully test xr.cov() or xr.corr(): once you convert the DataArrays to a Series or a DataFrame for testing, you can't easily index them with a dim parameter. See @r-beer's comment here: #3550 (comment).

As such, I think it may make sense just to test a few low-dimensional cases, e.g.:

>>> da_a = xr.DataArray(
...     np.random.random((3, 21, 4)),
...     coords={"time": pd.date_range("2000-01-01", freq="1D", periods=21)},
...     dims=("a", "time", "x"),
... )

>>> da_b = xr.DataArray(
...     np.random.random((3, 21, 4)),
...     coords={"time": pd.date_range("2000-01-01", freq="1D", periods=21)},
...     dims=("a", "time", "x"),
... )

>>> xr.cov(da_a, da_b, 'time')
<xarray.DataArray (a: 3, x: 4)>
array([[-0.01824046,  0.00373796, -0.00601642, -0.00108818],
       [ 0.00686132, -0.02680119, -0.00639433, -0.00868691],
       [-0.00889806,  0.02622817, -0.01022208, -0.00101257]])
Dimensions without coordinates: a, x
>>> xr.cov(da_a, da_b, 'time').sel(a=0,x=0)
<xarray.DataArray ()>
array(-0.01824046)
>>> da_a.sel(a=0,x=0).to_series().cov(da_b.sel(a=0,x=0).to_series())
-0.018240458880158048

So, while it's easy to check that a few individual points from xr.cov() agree with the pandas implementation, it would require a loop over (a,x) in order to check all of the points for this example. Do people have thoughts about this?
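The loop-based check described here could be sketched roughly as follows (a hedged example, not the PR's actual test code; it assumes a version of xarray that ships xr.cov, and relies on xr.cov and pandas' Series.cov both defaulting to ddof=1):

```python
import numpy as np
import pandas as pd
import xarray as xr

da_a = xr.DataArray(
    np.random.random((3, 21, 4)),
    coords={"time": pd.date_range("2000-01-01", freq="1D", periods=21)},
    dims=("a", "time", "x"),
)
da_b = xr.DataArray(
    np.random.random((3, 21, 4)),
    coords={"time": pd.date_range("2000-01-01", freq="1D", periods=21)},
    dims=("a", "time", "x"),
)

actual = xr.cov(da_a, da_b, dim="time")

# compare every (a, x) point against the pandas implementation
for i in range(da_a.sizes["a"]):
    for j in range(da_a.sizes["x"]):
        expected = (
            da_a.isel(a=i, x=j).to_series().cov(da_b.isel(a=i, x=j).to_series())
        )
        np.testing.assert_allclose(actual.isel(a=i, x=j).values, expected)
```

This is O(a * x) calls into pandas, which is fine for a small test array but is exactly the looping overhead discussed below.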

I think it would also make sense to have some test cases where we don't use Pandas at all, but instead specify the expected output manually:

>>> da_a = xr.DataArray([[1, 2], [1, np.nan]], dims=["x", "time"])
>>> expected = xr.DataArray([1.0, np.nan], dims=["x"])
>>> actual = xr.corr(da_a, da_a, dim="time")
>>> assert_allclose(actual, expected)

Does this seem like a good way forward?

@keewis (Collaborator) commented May 24, 2020

If you want to test individual values without reimplementing the function in the tests (which is what I suspect comparing with the result of np.cov would require), that might be the only way.

If not, you could also check properties of covariance / correlation matrices, e.g. that assert_allclose(xr.cov(a, b) / (a.std() * b.std()), xr.corr(a, b)) holds (I'm not sure if I remember that formula correctly), or that the diagonal of the auto-covariance matrix equals the variance of the array (for a 1D vector; not sure about more dimensions).
If you decide to test using properties, you could also extend our small collection of tests using hypothesis (see #1846).
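The property checks suggested here can be sketched like this (a hedged example assuming a version of xarray that provides xr.cov and xr.corr; the only requirement is that the ddof used for cov matches the one used for the standard deviations):

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)
a = xr.DataArray(rng.random(50), dims="time")
b = xr.DataArray(rng.random(50), dims="time")

# property 1: corr(a, b) == cov(a, b) / (std(a) * std(b)),
# as long as cov and std use a consistent ddof
expected_corr = xr.cov(a, b, dim="time", ddof=1) / (
    a.std(dim="time", ddof=1) * b.std(dim="time", ddof=1)
)
xr.testing.assert_allclose(xr.corr(a, b, dim="time"), expected_corr)

# property 2: the auto-covariance of a 1D vector is its variance
xr.testing.assert_allclose(
    xr.cov(a, a, dim="time", ddof=1), a.var(dim="time", ddof=1)
)
```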

@AndrewILWilliams (Contributor, Author) commented May 24, 2020

One problem I came across here is that pandas automatically ignores np.nan values in any corr or cov calculation. This is hard-coded into the package and there's no skipna=False option, sadly, so what I've done in the tests is to use the numpy implementation that pandas is built on (see, for example, here).
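A minimal sketch of that workaround: compute the expected value with the plain numpy covariance formula, so NaNs propagate instead of being silently dropped as pandas does (expected_cov is an illustrative helper, not the PR's actual test code):

```python
import numpy as np

def expected_cov(x, y, ddof=1):
    # plain covariance formula; any NaN in x or y makes the result NaN
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - ddof)

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])
print(expected_cov(x, y))      # matches np.cov(x, y)[0, 1]

x_nan = np.array([1.0, np.nan, 3.0, 4.0])
print(expected_cov(x_nan, y))  # nan -- NaNs are not skipped
```

Contrast this with pd.Series(x_nan).cov(pd.Series(y)), which would drop the NaN pair and still return a number.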

Current tests implemented are (in pseudocode...):

  • assert_allclose(xr.cov(a, b) / (a.std() * b.std()), xr.corr(a, b))
  • assert_allclose(xr.cov(a, a) * (N - 1), ((a - a.mean()) ** 2).sum())
  • For the example in my previous comment, I now have a loop over all values of (a, x) to reconstruct the covariance / correlation matrix, and check it with an assert_allclose(...).
  • Add more test arrays, with/without np.nans -- done

@keewis I tried reading the Hypothesis docs and got a bit overwhelmed, so I've stuck with example-based tests for now.

@AndrewILWilliams marked this pull request as ready for review May 24, 2020 20:39
@mathause (Collaborator)

Currently corr needs to sanitize the inputs twice, which is inefficient. One way around this is to define an internal method which can do both, depending on a method keyword (no need to write extra tests for this IMHO):

def corr(da_a, da_b, dim=None, ddof=0):
    return _cov_corr(da_a, da_b, dim=dim, ddof=ddof, method="corr")


def cov(da_a, da_b, dim=None, ddof=0):
    return _cov_corr(da_a, da_b, dim=dim, ddof=ddof, method="cov")


def _cov_corr(da_a, da_b, dim=None, ddof=0, method=None):
    # ... compute cov ...
    if method == "cov":
        return cov

    # ... normalize to compute corr ...
    return corr

Maybe you could use xr.apply_ufunc instead of looping in the tests (might be overkill).
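The apply_ufunc idea could look roughly like this (a sketch, not the PR's test code; pandas_cov_1d is an illustrative helper, and vectorize=True simply moves the loop over the non-core dimensions into numpy's np.vectorize machinery):

```python
import numpy as np
import pandas as pd
import xarray as xr

da_a = xr.DataArray(np.random.random((3, 21, 4)), dims=("a", "time", "x"))
da_b = xr.DataArray(np.random.random((3, 21, 4)), dims=("a", "time", "x"))

def pandas_cov_1d(x, y):
    # reference implementation for a single pair of 1D vectors
    return pd.Series(x).cov(pd.Series(y))

expected = xr.apply_ufunc(
    pandas_cov_1d,
    da_a,
    da_b,
    input_core_dims=[["time"], ["time"]],  # reduce along "time"
    vectorize=True,  # loop over the remaining (a, x) dims under the hood
)

xr.testing.assert_allclose(xr.cov(da_a, da_b, dim="time"), expected)
```

This is not faster than the explicit loop (vectorize=True is still a Python-level loop), but it keeps the test body to a single comparison.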

@mathause (Collaborator)

Could you also add a test for the TypeError?

with raises_regex(TypeError, "Only xr.DataArray is supported"):
    xr.corr(xr.Dataset(), xr.Dataset())

@AndrewILWilliams (Contributor, Author) commented May 25, 2020

> Could you also add a test for the TypeError?
>
>     with raises_regex(TypeError, "Only xr.DataArray is supported"):
>         xr.corr(xr.Dataset(), xr.Dataset())

Where do you mean, sorry? Isn't this already there in corr()?

if any(not isinstance(arr, (Variable, DataArray)) for arr in [da_a, da_b]):
    raise TypeError(
        "Only xr.DataArray and xr.Variable are supported. "
        "Given {}.".format([type(arr) for arr in [da_a, da_b]])
    )

EDIT: Scratch that, I get what you mean :)

@mathause (Collaborator) left a comment:

Four more nits ;)

@AndrewILWilliams (Contributor, Author)

One more thing, actually: is there an argument for not defining da_a_std and demeaned_da_a, and instead performing the operations in place? Defining these variables makes the code more readable, but in #3550 (comment) and #3550 (comment) the reviewer suggests it is inefficient.

@mathause (Collaborator)

If you insist ;)

da_a -= da_a.mean(dim=dim)

is indeed marginally faster. As they are already aligned, we don't have to worry about this.

@AndrewILWilliams (Contributor, Author)

> If you insist ;)
>
>     da_a -= da_a.mean(dim=dim)
>
> is indeed marginally faster. As they are already aligned, we don't have to worry about this.

Sweet! On second thought, I might leave it for now... the sun is too nice today. Can always have it as a future PR or something. :)

@max-sixty (Collaborator)

Awesome, @AndrewWilliams3142! Very excited we have this.

Thanks for the review @mathause

Hitting merge; any other feedback is welcome and we can iterate.

@max-sixty merged commit 3194b3e into pydata:master May 25, 2020
@AndrewILWilliams deleted the corrcov branch May 25, 2020 17:11
dcherian added a commit to dcherian/xarray that referenced this pull request May 25, 2020
* upstream/master:
  Improve interp performance (pydata#4069)
  Auto chunk (pydata#4064)
  xr.cov() and xr.corr() (pydata#4089)
  allow multiindex levels in plots (pydata#3938)
  Fix bool weights (pydata#4075)
  fix dangerous default arguments (pydata#4006)
@kefirbandi (Contributor)

Just a small comment: in the docs (http://xarray.pydata.org/en/latest/generated/xarray.cov.html#xarray.cov) there is a typo: da_a is declared twice; the second should really be da_b.

@keewis (Collaborator) commented May 26, 2020

Thanks. Do you want to put in a PR fixing that?

@kefirbandi (Contributor)

Well, actually, I was thinking that correcting it would take ~30 seconds for someone who works on the code daily. For me, I think, it would be quite a bit of overhead for a single character...

@AndrewILWilliams (Contributor, Author)

@kefirbandi I didn't want to step on your toes, but I'm happy to put in a PR to fix the typo. :)

@AndrewILWilliams mentioned this pull request May 26, 2020
@kefirbandi (Contributor)

@AndrewWilliams3142 I see. Thanks.


Successfully merging this pull request may close these issues.

Function for regressing/correlating multiple fields?