cov() and corr() #2652
Conversation
Hello @hrishikeshac! Thanks for updating the PR. Comment last updated on January 04, 2019 at 23:39 UTC.
Made the code PEP8 compatible. Apologies for not doing so earlier.
```python
expected = pd_corr(ts1, ts2)
actual = ts1.corr(ts2)
np.allclose(expected, actual)
```
You can use assert_allclose from xarray.testing. I'm not sure whether that is asserting or returning a bool.
@max-sixty assert_allclose gives an AssertionError, hence I used np.allclose - it returns a bool.
You need to use an actual assertion here. Otherwise this isn't testing anything -- np.allclose() could fail and we wouldn't know.
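For illustration, a minimal self-contained sketch of the fix being requested. The data is synthetic, and xr.corr is the function-style spelling that eventually landed in xarray; the PR itself spelled this as a ts1.corr(ts2) method:

```python
import numpy as np
import xarray as xr

ts1 = xr.DataArray(np.random.randn(30), dims='time')
ts2 = xr.DataArray(np.random.randn(30), dims='time')

expected = ts1.to_series().corr(ts2.to_series())  # pandas reference value
actual = float(xr.corr(ts1, ts2, dim='time'))     # xarray result

# np.allclose only *returns* a bool, so it must be wrapped in an assert
# (or replaced by a raising helper like xarray.testing.assert_allclose)
# for the test to be able to fail.
assert np.allclose(expected, actual)
```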
Thanks for sending a PR.
```python
self, other = broadcast(self, other)

# 2. Ignore the nans
valid_values = self.notnull() & other.notnull()
```
It allocates more memory than dot or tensordot. Can we use xr.dot instead of broadcasting?
I used broadcast to ensure that the DataArrays get aligned and that extra dimensions (if any) in one get inserted into the other. So the broadcast implemented here doesn't do any arithmetic computation as such. I didn't know xr.dot could be used in such a context. Could it?
xarray.dot does do alignment/broadcasting, but it definitely doesn't skip missing values, so I'm not sure it would work well here.
In the typical case, I would expect the arguments for which correlation is being computed to have the same dimensions, so I don't think xarray.dot would be much faster.
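A toy comparison of the two approaches under discussion (the shapes are arbitrary illustrations). With no missing values the results agree; xr.dot saves memory by never materializing the broadcast product, but it does not skip NaNs:

```python
import numpy as np
import xarray as xr

a = xr.DataArray(np.random.randn(100, 50), dims=('time', 'space'))
b = xr.DataArray(np.random.randn(100), dims='time')

# broadcast-based: b is expanded to shape (100, 50) before multiplying
a2, b2 = xr.broadcast(a, b)
via_broadcast = (a2 * b2).sum('time')

# xr.dot contracts over the shared 'time' dim without allocating the
# broadcast product
via_dot = xr.dot(a, b)

assert np.allclose(via_broadcast, via_dot)
```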
```python
other = other.where(valid_values, drop=True)
valid_count = valid_values.sum(dim)

# 3. Compute mean and standard deviation along the given dim
```
remove 'and standard deviation'
I am a little worried that users could misunderstand: cov usually means (auto-)covariance rather than the cross-covariance we are implementing here. Probably a function like xr.cov(x, y) is better than a method?
I can implement it as xr.cov(x, y). However, I made the implementation consistent with pandas.Series cov() and corr() (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.cov.html), so I think users might be more familiar with this implementation. If we make it a function, then maybe do it for both cov() and corr(), just to be consistent?
> I am a little worried that users could misunderstand …

I agree with @fujiisoup.
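For reference, the function-style spelling proposed here is what eventually landed in xarray as xr.cov and xr.corr; a brief usage sketch with synthetic inputs:

```python
import numpy as np
import xarray as xr

x = xr.DataArray(np.random.randn(20), dims='time')
y = xr.DataArray(np.random.randn(20), dims='time')

xr.cov(x, y, dim='time')   # cross-covariance along 'time'
xr.corr(x, y, dim='time')  # Pearson correlation along 'time'
```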
```python
# 2. Ignore the nans
valid_values = self.notnull() & other.notnull()
self = self.where(valid_values, drop=True)
```
It would be best to avoid drop=True if possible. Dropping elements can really slow things down when using dask arrays, because determining the elements to drop requires computing the arrays. In contrast, if we avoid drop=True we can build a lazy computation graph.
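A toy illustration of this concern (assumes dask is installed; names and values are illustrative):

```python
import numpy as np
import xarray as xr

da = xr.DataArray([1.0, np.nan, 3.0, 4.0], dims='x').chunk(2)
valid = da.notnull()

masked = da.where(valid)              # stays a lazy dask graph
print(type(masked.data))              # a dask array; nothing computed yet

trimmed = da.where(valid, drop=True)  # must evaluate `valid` up front to
                                      # decide which elements to drop
```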
```python
demeaned_other = other - other.mean(dim=dim)

# 4. Compute covariance along the given dim
cov = (demeaned_self * demeaned_other).sum(dim=dim) / valid_count
```
I think this slightly simpler version would work:

```python
self, other = broadcast(self, other)
valid_values = self.notnull() & other.notnull()
self = self.where(valid_values)
other = other.where(valid_values)
demeaned_self = self - self.mean(dim=dim)
demeaned_other = other - other.mean(dim=dim)
cov = (demeaned_self * demeaned_other).mean(dim=dim)
```

Or maybe we want to keep using valid_count for the ddof argument.
Testing this version. Will look into ddof.
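A sketch of how valid_count could feed a ddof argument (a hypothetical helper, not the PR's final code): divide by (valid_count - ddof) instead of taking a plain mean.

```python
import xarray as xr

def _cov(da, other, dim=None, ddof=1):
    # Align/broadcast, then mask both arrays to their shared valid points
    da, other = xr.broadcast(da, other)
    valid_values = da.notnull() & other.notnull()
    da = da.where(valid_values)
    other = other.where(valid_values)
    valid_count = valid_values.sum(dim)

    # Demean and contract; skipna=True (the default) ignores masked values
    demeaned_da = da - da.mean(dim=dim)
    demeaned_other = other - other.mean(dim=dim)
    return (demeaned_da * demeaned_other).sum(dim=dim) / (valid_count - ddof)
```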
```
@@ -3305,6 +3305,45 @@ def test_rank(self):
        y = DataArray([0.75, 0.25, np.nan, 0.5, 1.0], dims=('z',))
        assert_equal(y.rank('z', pct=True), y)

    def test_corr(self):
        # self: Load demo data and trim its size
        ds = xr.tutorial.load_dataset('air_temperature')
```
Loading the tutorial datasets requires network access, which we try to avoid for tests. Can you write this test using synthetic data instead?
Yes, will do. The tutorial code can be moved into an 'example' in the documentation/user guide later.
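One possible network-free replacement (the shapes, seed, and NaN placement here are illustrative, not the PR's final test):

```python
import numpy as np
import xarray as xr

rs = np.random.RandomState(42)
array = rs.randn(10, 3)
array[2, 1] = np.nan  # include a missing value so NaN handling is exercised
da = xr.DataArray(array, dims=('time', 'space'),
                  coords={'time': np.arange(10), 'space': ['a', 'b', 'c']})
```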
```python
other = other.where(valid_values, drop=True)

# 3. Compute correlation based on standard deviations and cov()
self_std = self.std(dim=dim)
```
What value do we use for ddof? Should that be a keyword argument to this method?
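A sketch of keeping the two in sync (a hypothetical helper; DataArray.std does accept a ddof keyword, so corr can pass the same value to both the covariance and the standard deviations):

```python
def _corr(da, other, dim=None, ddof=0):
    # Mask both inputs to their shared valid points first
    valid = da.notnull() & other.notnull()
    da, other = da.where(valid), other.where(valid)

    # Covariance and std computed with the same ddof
    cov = (((da - da.mean(dim)) * (other - other.mean(dim))).sum(dim)
           / (valid.sum(dim) - ddof))
    return cov / (da.std(dim=dim, ddof=ddof) * other.std(dim=dim, ddof=ddof))
```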
I also think making this a function is probably a good idea, even though it's different from pandas. One question: how should these functions align their arguments? Recall that xarray does an inner join when aligning arguments for arithmetic.
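A toy demonstration of that alignment behavior (values are arbitrary):

```python
import xarray as xr

a = xr.DataArray([1.0, 2.0, 3.0], dims='x', coords={'x': [0, 1, 2]})
b = xr.DataArray([4.0, 5.0, 6.0], dims='x', coords={'x': [1, 2, 3]})

print((a * b).x.values)  # [1 2] -- only the shared labels survive the inner join
```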
We should be more concerned with correctness than consistency - but is having …
@max-sixty I am not sure whether DataArray.dot is the right choice. But I am wondering: for the cov case, it sounds like computing a covariance of the DataArray itself rather than the cross covariance with another DataArray.
I agree that the case for DataArray.dot is questionable. It sort of makes sense because numpy and pandas both have it as a method, but the @ operator is really a cleaner way to express this now that we're Python 3 only. (Speaking of which, why don't we support @ in xarray yet? :).)
I always assumed an …
Is this pull request still up to date?
Dear @Hoeze, I will (try to) finalize this pull request, as I am also very interested in this functionality. I am new to xarray and to contributing. I downloaded @hrishikeshac's code and ran the pytest tests locally; some of them fail. Is there an elegant way to share "which tests failed where", to avoid me trying to fix tests that might already have been fixed in other branches? In the meantime I will start working out why the tests fail and try to fix them.
Great @r-beer, we can be helpful in getting you up & running. Given this branch has diverged from master, I would make your own fork and merge in master; it looks like you'll have some minor conflicts (more details in our contributing.rst docs, or post here if confused). You can then open up your own draft PR. Re the tests: pytest should print a list of the tests that failed and their stack traces - do you not see anything?
@max-sixty, thanks for the fast response! Yeah, I get the traceback and have already started diving into it. However, I assumed that @hrishikeshac's branch "corr" wasn't up to date. Shall I merge changes from master or develop into corr before looking further into the tests?
I read http://xarray.pydata.org/en/stable/contributing.html - is this identical to contributing.rst? Where upstream = https://github.com/pydata/xarray.git?
Yes 100%! Let me know if that doesn't work!
Alright, I only got two merge conflicts in dataarray.py. First, a minor merge conflict concerning imports:

```
<<<<<<< HEAD
from . import (
    computation,
    dtypes,
    groupby,
    indexing,
    ops,
    pdcompat,
    resample,
    rolling,
    utils,
)
from .accessor_dt import DatetimeAccessor
from .accessor_str import StringAccessor
from .alignment import (
    _broadcast_helper,
    _get_broadcast_dims_map_common_coords,
    align,
    reindex_like_indexers,
)
=======
from .accessors import DatetimeAccessor
from .alignment import align, reindex_like_indexers, broadcast
>>>>>>> added da.corr() and da.cov() to dataarray.py. Test added in test_dataarray.py, and tested using pytest.
```

Secondly, some bigger merge conflicts concerning some of dataarray's methods, but they seem not to conflict with each other. Can you please comment on my suggested changes (accepting either the changes from master, or both where there are no conflicts)?
Yeah, those are pretty normal. Your suggestions look right: you can keep both on both sets and eliminate any duplicate imports in the first. (FYI I edited your comment so it displayed properly; seems you need a line break after …)
Alright, I have done so, and the test results changed to: 1 failed, 6539 passed, 1952 skipped, 37 xfailed, 31 warnings in 86.34s. A general question concerning collaboration on existing PRs: shall I open my own PR, or is there another option? PS: Permission to push to hrishikeshac:corr is denied for me.
@r-beer great—you were right to start your own PR.
@hrishikeshac, in case you come back to see this: thank you for taking it so far; your code was helpful in eventually getting this feature in. And we'd of course appreciate any additional contributions.
Added da.corr() and da.cov() to dataarray.py. Test added in test_dataarray.py, and tested using pytest.
Concerns issue #1115
The test is based on demo data and can be readily added to the user guide.