
xr.cov() and xr.corr() #4089

Merged (36 commits) May 25, 2020

Conversation

@AndrewILWilliams (Contributor) commented May 23, 2020

PR for the xr.cov() and xr.corr() functionality which others have been working on. Most code adapted from @r-beer in PR #3550.

TODO:

CHECKLIST:

@pep8speaks commented May 23, 2020

Hello @AndrewWilliams3142! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-05-25 13:55:29 UTC

@AndrewILWilliams (Contributor, Author) commented May 24, 2020

The current problem is that we can't use Pandas to fully test xr.cov() or xr.corr(): once you convert the DataArrays to a Series or a DataFrame for testing, you can't easily index them with a dim parameter. See @r-beer's comment here: #3550 (comment).

As such, I think it may make sense just to test a few low-dimensional cases, e.g.:

>>> da_a = xr.DataArray(
...     np.random.random((3, 21, 4)),
...     coords={"time": pd.date_range("2000-01-01", freq="1D", periods=21)},
...     dims=("a", "time", "x"),
... )

>>> da_b = xr.DataArray(
...     np.random.random((3, 21, 4)),
...     coords={"time": pd.date_range("2000-01-01", freq="1D", periods=21)},
...     dims=("a", "time", "x"),
... )

>>> xr.cov(da_a, da_b, 'time')
<xarray.DataArray (a: 3, x: 4)>
array([[-0.01824046,  0.00373796, -0.00601642, -0.00108818],
       [ 0.00686132, -0.02680119, -0.00639433, -0.00868691],
       [-0.00889806,  0.02622817, -0.01022208, -0.00101257]])
Dimensions without coordinates: a, x
>>> xr.cov(da_a, da_b, 'time').sel(a=0,x=0)
<xarray.DataArray ()>
array(-0.01824046)
>>> da_a.sel(a=0,x=0).to_series().cov(da_b.sel(a=0,x=0).to_series())
-0.018240458880158048

So, while it's easy to check that a few individual points from xr.cov() agree with the pandas implementation, it would require a loop over (a,x) in order to check all of the points for this example. Do people have thoughts about this?
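The loop-based check described here could be sketched roughly as follows (a hedged example, not the PR's actual test code; it assumes a version of xarray that ships xr.cov, and relies on xr.cov and pandas' Series.cov both defaulting to ddof=1):

```python
import numpy as np
import pandas as pd
import xarray as xr

da_a = xr.DataArray(
    np.random.random((3, 21, 4)),
    coords={"time": pd.date_range("2000-01-01", freq="1D", periods=21)},
    dims=("a", "time", "x"),
)
da_b = xr.DataArray(
    np.random.random((3, 21, 4)),
    coords={"time": pd.date_range("2000-01-01", freq="1D", periods=21)},
    dims=("a", "time", "x"),
)

actual = xr.cov(da_a, da_b, dim="time")

# compare every (a, x) point against the pandas implementation
for i in range(da_a.sizes["a"]):
    for j in range(da_a.sizes["x"]):
        expected = (
            da_a.isel(a=i, x=j).to_series().cov(da_b.isel(a=i, x=j).to_series())
        )
        np.testing.assert_allclose(actual.isel(a=i, x=j).values, expected)
```

This is O(a * x) calls into pandas, which is fine for a small test array but is exactly the looping overhead discussed below.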

I think it would also make sense to have some test cases where we don't use Pandas at all, but instead specify the expected output manually:

>>> da_a = xr.DataArray([[1, 2], [1, np.nan]], dims=["x", "time"])
>>> expected = xr.DataArray([1.0, np.nan], dims=["x"])
>>> actual = xr.corr(da_a, da_a, dim="time")
>>> assert_allclose(actual, expected)

Does this seem like a good way forward?

@keewis (Collaborator) commented May 24, 2020

If you want to test individual values without reimplementing the function in the tests (which is what I suspect comparing with the result of np.cov would require), that might be the only way.

If not, you could also check properties of covariance / correlation matrices, e.g. that assert_allclose(xr.cov(a, b) / (a.std() * b.std()), xr.corr(a, b)) holds (I'm not sure if I remember that formula correctly), or that the diagonal of the auto-covariance matrix equals the variance of the array (for a 1D vector; not sure about more dimensions).
If you decide to test using properties, you could also extend our small collection of tests using hypothesis (see #1846).
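The property checks suggested here can be sketched like this (a hedged example assuming a version of xarray that provides xr.cov and xr.corr; the only requirement is that the ddof used for cov matches the one used for the standard deviations):

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)
a = xr.DataArray(rng.random(50), dims="time")
b = xr.DataArray(rng.random(50), dims="time")

# property 1: corr(a, b) == cov(a, b) / (std(a) * std(b)),
# as long as cov and std use a consistent ddof
expected_corr = xr.cov(a, b, dim="time", ddof=1) / (
    a.std(dim="time", ddof=1) * b.std(dim="time", ddof=1)
)
xr.testing.assert_allclose(xr.corr(a, b, dim="time"), expected_corr)

# property 2: the auto-covariance of a 1D vector is its variance
xr.testing.assert_allclose(
    xr.cov(a, a, dim="time", ddof=1), a.var(dim="time", ddof=1)
)
```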

@AndrewILWilliams (Contributor, Author) commented May 24, 2020

One problem I came across here is that pandas automatically ignores np.nan values in any corr or cov calculation. This is hard-coded into the package and there's no skipna=False option, sadly, so what I've done in the tests is to use the numpy implementation that pandas is built on (see, for example, here).
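A minimal sketch of that workaround: compute the expected value with the plain numpy covariance formula, so NaNs propagate instead of being silently dropped as pandas does (expected_cov is an illustrative helper, not the PR's actual test code):

```python
import numpy as np

def expected_cov(x, y, ddof=1):
    # plain covariance formula; any NaN in x or y makes the result NaN
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - ddof)

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])
print(expected_cov(x, y))      # matches np.cov(x, y)[0, 1]

x_nan = np.array([1.0, np.nan, 3.0, 4.0])
print(expected_cov(x_nan, y))  # nan -- NaNs are not skipped
```

Contrast this with pd.Series(x_nan).cov(pd.Series(y)), which would drop the NaN pair and still return a number.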

Current tests implemented are (in pseudocode...):

  • assert_allclose(xr.cov(a, b) / (a.std() * b.std()), xr.corr(a, b))
  • assert_allclose(xr.cov(a, a) * (N - 1), ((a - a.mean()) ** 2).sum())
  • For the example in my previous comment, I now have a loop over all values of (a, x) to reconstruct the covariance / correlation matrix, and check it with an assert_allclose(...).
  • Add more test arrays, with/without np.nans -- done

@keewis I tried reading the Hypothesis docs and got a bit overwhelmed, so I've stuck with example-based tests for now.

@AndrewILWilliams marked this pull request as ready for review May 24, 2020 20:39
@mathause (Collaborator)

Currently corr needs to sanitize the inputs twice, which is inefficient. One way around this is to define an internal method which can do both, depending on a method keyword (no need to write extra tests for this IMHO):

def corr(da_a, da_b, dim=None, ddof=0):
    return _cov_corr(da_a, da_b, dim=dim, ddof=ddof, method="corr")


def cov(da_a, da_b, dim=None, ddof=0):
    return _cov_corr(da_a, da_b, dim=dim, ddof=ddof, method="cov")


def _cov_corr(da_a, da_b, dim=None, ddof=0, method=None):
    # ... compute cov ...
    if method == "cov":
        return cov

    # ... normalize to compute corr ...
    return corr

Maybe you could use xr.apply_ufunc instead of looping in the tests (might be overkill).
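The apply_ufunc idea could look roughly like this (a sketch, not the PR's test code; pandas_cov_1d is an illustrative helper, and vectorize=True simply moves the loop over the non-core dimensions into numpy's np.vectorize machinery):

```python
import numpy as np
import pandas as pd
import xarray as xr

da_a = xr.DataArray(np.random.random((3, 21, 4)), dims=("a", "time", "x"))
da_b = xr.DataArray(np.random.random((3, 21, 4)), dims=("a", "time", "x"))

def pandas_cov_1d(x, y):
    # reference implementation for a single pair of 1D vectors
    return pd.Series(x).cov(pd.Series(y))

expected = xr.apply_ufunc(
    pandas_cov_1d,
    da_a,
    da_b,
    input_core_dims=[["time"], ["time"]],  # reduce along "time"
    vectorize=True,  # loop over the remaining (a, x) dims under the hood
)

xr.testing.assert_allclose(xr.cov(da_a, da_b, dim="time"), expected)
```

This is not faster than the explicit loop (vectorize=True is still a Python-level loop), but it keeps the test body to a single comparison.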

@mathause (Collaborator)

Could you also add a test for the TypeError?

with raises_regex(TypeError, "Only xr.DataArray is supported"):
    xr.corr(xr.Dataset(), xr.Dataset())

@AndrewILWilliams (Contributor, Author) commented May 25, 2020

> Could you also add a test for the TypeError?
>
>     with raises_regex(TypeError, "Only xr.DataArray is supported"):
>         xr.corr(xr.Dataset(), xr.Dataset())

Where do you mean, sorry? Isn't this already there in corr()?

if any(not isinstance(arr, (Variable, DataArray)) for arr in [da_a, da_b]):
    raise TypeError(
        "Only xr.DataArray and xr.Variable are supported. "
        "Given {}.".format([type(arr) for arr in [da_a, da_b]])
    )

EDIT: Scratch that, I get what you mean :)

@mathause (Collaborator) left a comment:

Four more nits ;)

@AndrewILWilliams (Contributor, Author)

One more thing, actually: is there an argument for not defining da_a_std and demeaned_da_a, and instead performing the operations in place? Defining these variables makes the code more readable, but in #3550 (comment) and #3550 (comment) the reviewer suggests it is inefficient.

@mathause (Collaborator)

If you insist ;)

da_a -= da_a.mean(dim=dim)

is indeed marginally faster. As they are already aligned, we don't have to worry about this.

@AndrewILWilliams (Contributor, Author)

> If you insist ;)
>
>     da_a -= da_a.mean(dim=dim)
>
> is indeed marginally faster. As they are already aligned, we don't have to worry about this.

Sweet! On second thought, I might leave it for now... the sun is too nice today. Can always have it as a future PR or something. :)

@max-sixty (Collaborator)

Awesome, @AndrewWilliams3142! Very excited we have this.

Thanks for the review @mathause

Hitting merge; any other feedback is welcome and we can iterate.

@max-sixty merged commit 3194b3e into pydata:master May 25, 2020
@AndrewILWilliams deleted the corrcov branch May 25, 2020 17:11
dcherian added a commit to dcherian/xarray that referenced this pull request May 25, 2020
* upstream/master:
  Improve interp performance (pydata#4069)
  Auto chunk (pydata#4064)
  xr.cov() and xr.corr() (pydata#4089)
  allow multiindex levels in plots (pydata#3938)
  Fix bool weights (pydata#4075)
  fix dangerous default arguments (pydata#4006)
@kefirbandi (Contributor)

Just a small comment: in the docs (http://xarray.pydata.org/en/latest/generated/xarray.cov.html#xarray.cov) there is a typo: da_a is declared twice; the second should really be da_b.

@keewis (Collaborator) commented May 26, 2020

Thanks. Do you want to put in a PR fixing that?

@kefirbandi (Contributor)

Well, actually, I was thinking that correcting it would take ~30 seconds for someone who works on the code daily. For me, I think, it would be quite a bit of overhead for a single character...

@AndrewILWilliams (Contributor, Author)

@kefirbandi I didn't want to step on your toes, but I'm happy to put in a PR to fix the typo. :)

@AndrewILWilliams mentioned this pull request May 26, 2020
@kefirbandi (Contributor)

@AndrewWilliams3142 I see. Thanks.


Successfully merging this pull request may close these issues.

Function for regressing/correlating multiple fields?