Feature/weighted #2922
Conversation
I updated the PR
Before I continue, it would be nice to get some feedback.
As mentioned by @aaronspring, esmlab already implemented weighted statistic functions. Similarly, statsmodels offers them for 1D data without handling of NaNs (docs / code). Thus it should be feasible to implement further statistics here (weighted …
Thank you, @mathause, for sending a PR, and sorry for the late response.
It looks like a clean implementation to me.
Only a few comments.
xarray/core/weighted.py (Outdated)

```python
"""

# we need to mask DATA values that are nan; else the weights are wrong
masked_weights = self.weights.where(self.obj.notnull())
```
If `weights` has an additional dimension to those of `self.obj`, `masked_weights` may consume large memory. Can we avoid this, for example by separating `mask` from `weights`?

```python
mask = xr.where(self.obj.notnull(), 1, 0)  # binary mask
sum_of_weights = xr.dot(mask, weights)
```

Do you think the above is worth doing?
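For intuition, the saving can be sketched with plain NumPy (a stand-in for xarray, with made-up shapes): the binary mask contracted against the weights gives the same sum of weights as masking the broadcast weights, without materializing an intermediate of the data's full shape.

```python
import numpy as np

data = np.array([[1.0, np.nan], [3.0, 4.0]])   # shape (x=2, y=2)
weights = np.array([0.25, 0.75])               # shape (y=2)

# naive version: broadcasts weights up to data's full shape
masked_weights = np.where(np.isnan(data), np.nan, weights)
sum_naive = np.nansum(masked_weights, axis=-1)

# mask/dot version: contraction, no full-size masked-weights array
mask = (~np.isnan(data)).astype(float)         # binary mask
sum_dot = mask @ weights                       # plays the role of xr.dot(mask, weights)
```

Both give `[0.25, 1.0]` here; the second never holds weights broadcast to the data's shape.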
done - I do not have any performance tests (memory or speed)
xarray/core/weighted.py (Outdated)

```python
# calculate weighted sum
return (self.obj * self.weights).sum(dim, axis=axis, skipna=skipna,
                                     **kwargs)
```
For the same reason as above, `xr.where(self.obj, self.obj, 0).dot(self.weights)` may work if `skipna` is `True`?
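The equivalence being relied on here can be checked with plain NumPy (a sketch; `@` plays the role of `xr.dot`): zero-filling the NaNs before contracting gives the same result as multiply-then-`nansum`, because a zero-filled entry contributes nothing to the sum.

```python
import numpy as np

data = np.array([1.0, np.nan, 3.0])
weights = np.array([0.2, 0.3, 0.5])

# multiply-then-reduce with skipna behaviour
sum_mul = np.nansum(data * weights)

# zero-fill the NaNs first, then contract (the dot-friendly spelling)
sum_dot = np.where(np.isnan(data), 0.0, data) @ weights

assert np.isclose(sum_mul, sum_dot)  # both 1.7
```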
done - I do not have any performance tests (memory or speed)
Hi @mathause - We really appreciate your contribution. Sorry your PR has stalled! Do you think you can respond to @fujiisoup's review and add documentation? Then we can get this merged.
Thanks, I am still very interested to get this in. I don't think I'll manage before my holidays, though.
I finally made some time to work on this - although I feel far from finished...

Questions: …
xarray/core/weighted.py (Outdated)

```python
axis : int or sequence of int, optional
    Axis(es) over which to apply the weighted `{fcn}`. Only one of the
    'dim' and 'axis' arguments can be supplied. If neither are supplied,
    then the weighted `{fcn}` is calculated over all axes.
```
Maybe remove, as `xr.dot` does not provide this.
Agree re removing (or am I missing something—why was this here?)
For consistency it would be nice to offer `axis` here. Originally `sum` was implemented as `(weights * da).sum(...)` and we got `axis` for free. With `dot` it is not as straightforward any more. Honestly, I never use `axis` with xarray, so I suppose it is fine to only implement it if anyone ever requests it...
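For illustration only (a hypothetical helper, not part of the PR): keeping `axis` on top of a dim-based `xr.dot` path would need a translation step roughly like this, which is the extra bookkeeping being weighed here.

```python
def axis_to_dims(all_dims, axis):
    """Translate integer axis argument(s) into dimension names."""
    if axis is None:
        return None
    if isinstance(axis, int):
        axis = (axis,)
    return tuple(all_dims[a] for a in axis)

# e.g. on an array with dims ("time", "lat", "lon")
assert axis_to_dims(("time", "lat", "lon"), 1) == ("lat",)
assert axis_to_dims(("time", "lat", "lon"), (0, 2)) == ("time", "lon")
```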
OK. Unless I'm missing something, let's remove `axis`.
Thanks for making this progress! I left a few comments & questions
xarray/core/weighted.py (Outdated)

```python
    return sum_of_weights.where(valid_weights)


def _weighted_sum(
```
Should we just put these on the `Weighted` object? (Not a huge deal, trying to reduce indirection if possible though.)
Do you mean like so?

```python
class Weighted:
    ...

    def _weighted_sum(self, da, dims, skipna, **kwargs):
        pass
```

Yes, good idea. This might actually even work better, as it avoids the hassle with the non-sanitized weights.
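A NumPy-backed sketch of that layout (hypothetical names and signatures; the real class operates on DataArray/Dataset objects and dimension names rather than axes):

```python
import numpy as np

class Weighted:
    """Reductions live directly on the Weighted object."""

    def __init__(self, obj, weights):
        self.obj = np.asarray(obj, dtype=float)
        self.weights = np.asarray(weights, dtype=float)

    def _sum_of_weights(self, axis=None):
        # binary mask so weights of NaN entries do not count
        mask = (~np.isnan(self.obj)).astype(float)
        return (mask * self.weights).sum(axis=axis)

    def _weighted_sum(self, axis=None, skipna=True):
        da = np.where(np.isnan(self.obj), 0.0, self.obj) if skipna else self.obj
        return (da * self.weights).sum(axis=axis)

    def mean(self, axis=None, skipna=True):
        return self._weighted_sum(axis, skipna) / self._sum_of_weights(axis)
```

With this layout, `Weighted([1.0, float("nan"), 3.0], [1, 1, 2]).mean()` evaluates to 7/3, since the NaN entry contributes neither to the sum nor to the sum of weights.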
xarray/core/weighted.py (Outdated)

```python
# need to mask invalid DATA as dot does not implement skipna
if skipna or skipna is None:
    return where(da.isnull(), 0.0, da).dot(weights, dims=dims)
```
```diff
- return where(da.isnull(), 0.0, da).dot(weights, dims=dims)
+ return da.fillna(0.0).dot(weights, dims=dims)
```

?
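The two spellings are equivalent; a quick check with a pandas Series standing in for the DataArray (both expose `isnull` and `fillna`):

```python
import numpy as np
import pandas as pd

da = pd.Series([1.0, np.nan, 3.0])

a = np.where(da.isnull(), 0.0, da)  # the original spelling
b = da.fillna(0.0).to_numpy()       # the suggested, more readable spelling

assert np.array_equal(a, b)
```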
xarray/core/weighted.py (Outdated)

```python
""" Calculate the sum of weights, accounting for missing values """

# we need to mask DATA values that are nan; else the weights are wrong
mask = where(da.notnull(), 1, 0)  # binary mask
```
```diff
- mask = where(da.notnull(), 1, 0)  # binary mask
+ mask = da.notnull()
```

`bool` works with `dot`, from my limited checks.
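That a boolean mask contracts cleanly can be verified with NumPy (`@` standing in for `xr.dot`), since matmul promotes `bool` to the weights' float dtype:

```python
import numpy as np

data = np.array([[1.0, np.nan], [3.0, 4.0]])
weights = np.array([0.25, 0.75])

bool_mask = ~np.isnan(data)                 # dtype=bool, no explicit 0/1 cast
int_mask = np.where(~np.isnan(data), 1, 0)  # the explicit 0/1 version

assert np.allclose(bool_mask @ weights, int_mask @ weights)
```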
xarray/core/weighted.py (Outdated)

```python
class DatasetWeighted(Weighted):
    def _dataset_implementation(self, func, **kwargs) -> "Dataset":
```
Not for this PR, but we do this lots of places; we should find a common approach at some point!
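The pattern in question, reduced to a dict-of-arrays sketch (hypothetical stand-ins for a Dataset and its data variables): the Dataset-level method simply maps the per-DataArray reduction over every variable.

```python
import numpy as np

def dataset_implementation(data_vars, weights, func, **kwargs):
    # apply the per-array reduction to every data variable
    return {key: func(da, weights, **kwargs) for key, da in data_vars.items()}

def weighted_sum(da, weights):
    return np.nansum(da * weights)

ds = {"a": np.array([1.0, 2.0]), "b": np.array([np.nan, 4.0])}
result = dataset_implementation(ds, np.array([0.5, 0.5]), weighted_sum)
# result == {"a": 1.5, "b": 2.0}
```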
xarray/core/weighted.py
Outdated
|
||
weighted[key] = func(da, self.weights, **kwargs) | ||
|
||
return Dataset(weighted, coords=self.obj.coords) |
Will this work if `coords` are on dimensions that are reduced / removed by the function?
Thanks @dcherian! One question concerning `:py:class:` from my side.

Just came to my mind: I should probably add a note that `DataArray([1]).weighted(DataArray([0]))` returns `NaN`.
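That behaviour, mimicked in NumPy (a sketch of the masking logic, not the actual xarray code): a zero sum of weights is masked to NaN, so an all-zero-weights mean comes out as NaN rather than an inf from division by zero.

```python
import numpy as np

def weighted_mean(data, weights):
    sum_of_weights = (~np.isnan(data)).astype(float) @ weights
    # mask a zero sum of weights so the division yields NaN, not inf
    sum_of_weights = np.where(sum_of_weights != 0, sum_of_weights, np.nan)
    return (np.where(np.isnan(data), 0.0, data) @ weights) / sum_of_weights

assert np.isnan(weighted_mean(np.array([1.0]), np.array([0.0])))
assert weighted_mean(np.array([1.0, 3.0]), np.array([1.0, 1.0])) == 2.0
```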
This is going in. Thanks @mathause. This is a major contribution!
Great! Thanks for all the feedback and support!
Thanks @mathause !
I realize this is a bit late, but I'm still concerned about memory usage, specifically in https://github.com/pydata/xarray/blob/master/xarray/core/weighted.py#L130 and https://github.com/pydata/xarray/blob/master/xarray/core/weighted.py#L143. I would have implemented this using …
We do those sorts of operations fairly frequently, so it's not unique here. Generally users should expect to have available ~3x the memory of an array for most operations. @seth-p it's great you've taken an interest in the project! Is there any chance we could harness that into some contributions? 😄
tldr: if someone knows how to do memory profiling with reasonable effort, this can still be changed.

It's certainly not too late to change the "backend" of the weighting functions. I once tried to profile the memory usage but I gave up at some point (I think I would have needed to annotate a ton of functions, also in numpy). @fujiisoup suggested using … Also, I think it should not be very difficult to write something that can be passed to …

xarray/xarray/tests/test_weighted.py, line 161 in e8a284f
So there would be three possibilities: (1) the current implementation (using …
@max-sixty, I wish I could, but I'm afraid that I cannot submit code due to employer limitations.
@mathause, ideally …

Either way, this only addresses the … Also, perhaps the test … Maybe I'm more sensitive to this than others, but I regularly deal with 10-100GB arrays.
@mathause, have you considered using these functions? …
There is some stuff I can do to reduce the memory footprint if …

Yes, this would be nice. What could be done, though, is to only do …

I assume so. I don't know what kind of temporary variables …

Again, this could be avoided if …

Do you want to leave it away for performance reasons? Because it was a deliberate decision to not support …

No, it's important to make sure this stuff works for large arrays. However, using …

None of your suggested functions support … I am all for supporting more functions, but currently I am happy we got a weighted sum and mean into xarray after 5(!) years! Further libraries that support weighted operations: …
All good points:

Good idea, though I don't know what the performance hit would be of the extra check (in the case that `da` does contain NaNs, so the check is for naught).

Well, …

Yes. You can continue not supporting NaNs in the weights, yet not explicitly check that there are no NaNs (optionally, if the caller assures you that there are no NaNs).

Correct. These have nothing to do with the NaNs issue. For profiling memory usage, I use …
* upstream/master:
  * add spacing in the versions section of the issue report (pydata#3876)
  * map_blocks: allow user function to add new unindexed dimension. (pydata#3817)
  * Delete associated indexes when deleting coordinate variables. (pydata#3840)
  * remove macos build while waiting for libwebp fix (pydata#3875)
  * Fix html repr on non-str keys (pydata#3870)
  * Allow ellipsis to be used in stack (pydata#3826)
  * Improve where docstring (pydata#3836)
  * Add DataArray.pad, Dataset.pad, Variable.pad (pydata#3596)
  * Fix some warnings (pydata#3864)
  * Feature/weighted (pydata#2922)
  * Fix recombination in groupby when changing size along the grouped dimension (pydata#3807)
  * Blacken the doctest code in docstrings (pydata#3857)
  * Fix multi-index with categorical values. (pydata#3860)
  * Fix interp bug when indexer shares coordinates with array (pydata#3758)
  * Fix alignment with join="override" when some dims are unindexed (pydata#3839)
`whats-new.rst` for all changes and `api.rst` for new API

I took a shot at the weighted function - I added a `DataArrayWeighted` class that currently only implements `mean`. So there is still quite a bit missing (e.g. `DatasetWeighted`), but let me know what you think.

There are quite a number of difficult edge cases with invalid data that can be discussed.
- replace `NaN` in the `weights` with `0`
- … `0` it returns `NaN` (and not `inf`)
- … `NaN` (could be `1`)

It could be good to add all edge-case logic to a separate function, but I am not sure if this is possible...
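A NumPy sketch of the edge-case decisions listed above (hypothetical helper, not the PR's code): NaN weights are zeroed out, and a zero sum of weights yields NaN rather than inf.

```python
import numpy as np

def weighted_mean(data, weights):
    # edge case: NaN in the weights is treated as zero weight
    weights = np.where(np.isnan(weights), 0.0, weights)
    sum_of_weights = (~np.isnan(data)).astype(float) @ weights
    # edge case: a zero sum of weights gives NaN, not inf
    sum_of_weights = np.where(sum_of_weights != 0, sum_of_weights, np.nan)
    return (np.where(np.isnan(data), 0.0, data) @ weights) / sum_of_weights

data = np.array([1.0, 2.0, np.nan])
assert weighted_mean(data, np.array([np.nan, 1.0, 1.0])) == 2.0   # NaN weight -> 0
assert np.isnan(weighted_mean(data, np.array([0.0, 0.0, 1.0])))   # zero sum of weights
```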