Feature/weighted #2922
Conversation
I updated the PR
Before I continue, it would be nice to get some feedback.
As mentioned by @aaronspring, esmlab already implemented weighted statistic functions. Similarly, statsmodels offers them for 1D data without handling of NaNs (docs / code). Thus it should be feasible to implement further statistics here (weighted …
Thank you, @mathause, for sending a PR, and sorry for the late response.
It looks like a clean implementation to me.
Only a few comments.
xarray/core/weighted.py (Outdated)

```python
"""

# we need to mask DATA values that are nan; else the weights are wrong
masked_weights = self.weights.where(self.obj.notnull())
```
If `weights` has an additional dimension to those of `self.obj`, `masked_weights` may consume large memory. Can we avoid this, for example by separating `mask` from `weights`?

```python
mask = xr.where(self.obj.notnull(), 1, 0)  # binary mask
sum_of_weights = xr.dot(mask, weights)
```

Do you think the above is worth doing?
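For intuition, the saving can be sketched with plain NumPy (a stand-in for xarray, with made-up shapes): the binary mask contracted against the weights gives the same sum of weights as masking the broadcast weights, without materializing an intermediate of the data's full shape.

```python
import numpy as np

data = np.array([[1.0, np.nan], [3.0, 4.0]])   # shape (x=2, y=2)
weights = np.array([0.25, 0.75])               # shape (y=2)

# naive version: broadcasts weights up to data's full shape
masked_weights = np.where(np.isnan(data), np.nan, weights)
sum_naive = np.nansum(masked_weights, axis=-1)

# mask/dot version: contraction, no full-size masked-weights array
mask = (~np.isnan(data)).astype(float)         # binary mask
sum_dot = mask @ weights                       # plays the role of xr.dot(mask, weights)
```

Both give `[0.25, 1.0]` here; the second never holds weights broadcast to the data's shape.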
done - I do not have any performance tests (memory or speed)
xarray/core/weighted.py (Outdated)

```python
# calculate weighted sum
return (self.obj * self.weights).sum(dim, axis=axis, skipna=skipna,
                                     **kwargs)
```
For the same reason as above, `xr.where(self.obj, self.obj, 0).dot(self.weights)` may work if `skipna` is `True`?
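The equivalence being relied on here can be checked with plain NumPy (a sketch; `@` plays the role of `xr.dot`): zero-filling the NaNs before contracting gives the same result as multiply-then-`nansum`, because a zero-filled entry contributes nothing to the sum.

```python
import numpy as np

data = np.array([1.0, np.nan, 3.0])
weights = np.array([0.2, 0.3, 0.5])

# multiply-then-reduce with skipna behaviour
sum_mul = np.nansum(data * weights)

# zero-fill the NaNs first, then contract (the dot-friendly spelling)
sum_dot = np.where(np.isnan(data), 0.0, data) @ weights

assert np.isclose(sum_mul, sum_dot)  # both 1.7
```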
done - I do not have any performance tests (memory or speed)
Hi @mathause - We really appreciate your contribution. Sorry your PR has stalled! Do you think you can respond to @fujiisoup's review and add documentation? Then we can get this merged.
Thanks, I am still very interested to get this in. I don't think I'll manage before my holidays, though.
I finally made some time to work on this - although I feel far from finished...

Questions: …
xarray/core/weighted.py (Outdated)

```python
axis : int or sequence of int, optional
    Axis(es) over which to apply the weighted `{fcn}`. Only one of the
    'dim' and 'axis' arguments can be supplied. If neither are supplied,
    then the weighted `{fcn}` is calculated over all axes.
```
Maybe remove, as `xr.dot` does not provide this.
Agree re removing (or am I missing something—why was this here?)
For consistency it would be nice to offer `axis` here. Originally `sum` was implemented as `(weights * da).sum(...)` and we got `axis` for free. With `dot` it is not as straightforward any more. Honestly, I never use `axis` with xarray, so I suppose it is fine to only implement it if anyone ever requests it...
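For illustration only (a hypothetical helper, not part of the PR): keeping `axis` on top of a dim-based `xr.dot` path would need a translation step roughly like this, which is the extra bookkeeping being weighed here.

```python
def axis_to_dims(all_dims, axis):
    """Translate integer axis argument(s) into dimension names."""
    if axis is None:
        return None
    if isinstance(axis, int):
        axis = (axis,)
    return tuple(all_dims[a] for a in axis)

# e.g. on an array with dims ("time", "lat", "lon")
assert axis_to_dims(("time", "lat", "lon"), 1) == ("lat",)
assert axis_to_dims(("time", "lat", "lon"), (0, 2)) == ("time", "lon")
```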
OK. Unless I'm missing something, let's remove `axis`.
Thanks for making this progress! I left a few comments & questions
xarray/core/weighted.py (Outdated)

```python
    return sum_of_weights.where(valid_weights)


def _weighted_sum(
```
Should we just put these on the `Weighted` object? (Not a huge deal, trying to reduce indirection if possible though.)
Do you mean like so?

```python
class Weighted:
    ...

    def _weighted_sum(self, da, dims, skipna, **kwargs):
        pass
```

Yes, good idea. This might actually even work better, as it avoids the hassle with the non-sanitized weights.
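A NumPy-backed sketch of that layout (hypothetical names and signatures; the real class operates on DataArray/Dataset objects and dimension names rather than axes):

```python
import numpy as np

class Weighted:
    """Reductions live directly on the Weighted object."""

    def __init__(self, obj, weights):
        self.obj = np.asarray(obj, dtype=float)
        self.weights = np.asarray(weights, dtype=float)

    def _sum_of_weights(self, axis=None):
        # binary mask so weights of NaN entries do not count
        mask = (~np.isnan(self.obj)).astype(float)
        return (mask * self.weights).sum(axis=axis)

    def _weighted_sum(self, axis=None, skipna=True):
        da = np.where(np.isnan(self.obj), 0.0, self.obj) if skipna else self.obj
        return (da * self.weights).sum(axis=axis)

    def mean(self, axis=None, skipna=True):
        return self._weighted_sum(axis, skipna) / self._sum_of_weights(axis)
```

With this layout, `Weighted([1.0, float("nan"), 3.0], [1, 1, 2]).mean()` evaluates to 7/3, since the NaN entry contributes neither to the sum nor to the sum of weights.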
xarray/core/weighted.py (Outdated)

```python
# need to mask invalid DATA as dot does not implement skipna
if skipna or skipna is None:
    return where(da.isnull(), 0.0, da).dot(weights, dims=dims)
```
```diff
- return where(da.isnull(), 0.0, da).dot(weights, dims=dims)
+ return da.fillna(0.0).dot(weights, dims=dims)
```

?
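The two spellings are equivalent; a quick check with a pandas Series standing in for the DataArray (both expose `isnull` and `fillna`):

```python
import numpy as np
import pandas as pd

da = pd.Series([1.0, np.nan, 3.0])

a = np.where(da.isnull(), 0.0, da)  # the original spelling
b = da.fillna(0.0).to_numpy()       # the suggested, more readable spelling

assert np.array_equal(a, b)
```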
xarray/core/weighted.py (Outdated)

```python
""" Calculate the sum of weights, accounting for missing values """

# we need to mask DATA values that are nan; else the weights are wrong
mask = where(da.notnull(), 1, 0)  # binary mask
```
```diff
- mask = where(da.notnull(), 1, 0)  # binary mask
+ mask = da.notnull()
```

`bool` works with `dot`, from my limited checks.
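That a boolean mask contracts cleanly can be verified with NumPy (`@` standing in for `xr.dot`), since matmul promotes `bool` to the weights' float dtype:

```python
import numpy as np

data = np.array([[1.0, np.nan], [3.0, 4.0]])
weights = np.array([0.25, 0.75])

bool_mask = ~np.isnan(data)                 # dtype=bool, no explicit 0/1 cast
int_mask = np.where(~np.isnan(data), 1, 0)  # the explicit 0/1 version

assert np.allclose(bool_mask @ weights, int_mask @ weights)
```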
xarray/core/weighted.py (Outdated)

```python
class DatasetWeighted(Weighted):
    def _dataset_implementation(self, func, **kwargs) -> "Dataset":
```
Not for this PR, but we do this lots of places; we should find a common approach at some point!
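The pattern in question, reduced to a dict-of-arrays sketch (hypothetical stand-ins for a Dataset and its data variables): the Dataset-level method simply maps the per-DataArray reduction over every variable.

```python
import numpy as np

def dataset_implementation(data_vars, weights, func, **kwargs):
    # apply the per-array reduction to every data variable
    return {key: func(da, weights, **kwargs) for key, da in data_vars.items()}

def weighted_sum(da, weights):
    return np.nansum(da * weights)

ds = {"a": np.array([1.0, 2.0]), "b": np.array([np.nan, 4.0])}
result = dataset_implementation(ds, np.array([0.5, 0.5]), weighted_sum)
# result == {"a": 1.5, "b": 2.0}
```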
xarray/core/weighted.py
Outdated
|
||
weighted[key] = func(da, self.weights, **kwargs) | ||
|
||
return Dataset(weighted, coords=self.obj.coords) |
Will this work if `coords` are on dimensions that are reduced / removed by the function?
Thanks @dcherian! One question concerning `:py:class:` from my side.

Just came to my mind: I should probably add a note that `DataArray([1]).weighted(DataArray([0]))` returns `NaN`.
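That behaviour, mimicked in NumPy (a sketch of the masking logic, not the actual xarray code): a zero sum of weights is masked to NaN, so an all-zero-weights mean comes out as NaN rather than an inf from division by zero.

```python
import numpy as np

def weighted_mean(data, weights):
    sum_of_weights = (~np.isnan(data)).astype(float) @ weights
    # mask a zero sum of weights so the division yields NaN, not inf
    sum_of_weights = np.where(sum_of_weights != 0, sum_of_weights, np.nan)
    return (np.where(np.isnan(data), 0.0, data) @ weights) / sum_of_weights

assert np.isnan(weighted_mean(np.array([1.0]), np.array([0.0])))
assert weighted_mean(np.array([1.0, 3.0]), np.array([1.0, 1.0])) == 2.0
```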
This is going in. Thanks @mathause. This is a major contribution!
Great! Thanks for all the feedback and support!
Thanks @mathause !
I realize this is a bit late, but I'm still concerned about memory usage, specifically in https://github.com/pydata/xarray/blob/master/xarray/core/weighted.py#L130 and https://github.com/pydata/xarray/blob/master/xarray/core/weighted.py#L143. I would have implemented this using …
We do those sorts of operations fairly frequently, so it's not unique here. Generally users should expect to have available ~3x the memory of an array for most operations. @seth-p it's great you've taken an interest in the project! Is there any chance we could harness that into some contributions? 😄
tldr: if someone knows how to do memory profiling with reasonable effort, this can still be changed.

It's certainly not too late to change the "backend" of the weighting functions. I once tried to profile the memory usage but I gave up at some point (I think I would have needed to annotate a ton of functions, also in numpy). @fujiisoup suggested using … Also, I think it should not be very difficult to write something that can be passed to …

xarray/xarray/tests/test_weighted.py, line 161 in e8a284f
So there would be three possibilities: (1) the current implementation (using …
@max-sixty, I wish I could, but I'm afraid that I cannot submit code due to employer limitations.
@mathause, ideally …

Either way, this only addresses the … Also, perhaps the test … Maybe I'm more sensitive to this than others, but I regularly deal with 10-100GB arrays.
@mathause, have you considered using these functions? …
There is some stuff I can do to reduce the memory footprint if …

Yes, this would be nice. What could be done, though, is to only do …

I assume so. I don't know what kind of temporary variables …

Again, this could be avoided if …

Do you want to leave it away for performance reasons? Because it was a deliberate decision to not support …

No, it's important to make sure this stuff works for large arrays. However, using …

None of your suggested functions support … I am all for supporting more functions, but currently I am happy we got a weighted sum and mean into xarray after 5(!) years! Further libraries that support weighted operations: …
All good points:

Good idea, though I don't know what the performance hit would be of the extra check (in the case that `da` does contain NaNs, so the check is for naught).

Well, …

Yes. You can continue not supporting NaNs in the weights, yet not explicitly check that there are no NaNs (optionally, if the caller assures you that there are no NaNs).

Correct. These have nothing to do with the NaNs issue. For profiling memory usage, I use …
* upstream/master:
  * add spacing in the versions section of the issue report (pydata#3876)
  * map_blocks: allow user function to add new unindexed dimension. (pydata#3817)
  * Delete associated indexes when deleting coordinate variables. (pydata#3840)
  * remove macos build while waiting for libwebp fix (pydata#3875)
  * Fix html repr on non-str keys (pydata#3870)
  * Allow ellipsis to be used in stack (pydata#3826)
  * Improve where docstring (pydata#3836)
  * Add DataArray.pad, Dataset.pad, Variable.pad (pydata#3596)
  * Fix some warnings (pydata#3864)
  * Feature/weighted (pydata#2922)
  * Fix recombination in groupby when changing size along the grouped dimension (pydata#3807)
  * Blacken the doctest code in docstrings (pydata#3857)
  * Fix multi-index with categorical values. (pydata#3860)
  * Fix interp bug when indexer shares coordinates with array (pydata#3758)
  * Fix alignment with join="override" when some dims are unindexed (pydata#3839)
`whats-new.rst` for all changes and `api.rst` for new API

I took a shot at the weighted function - I added a `DataArrayWeighted` class that currently only implements `mean`. So there is still quite a bit missing (e.g. `DatasetWeighted`), but let me know what you think.

There are quite a number of difficult edge cases with invalid data that can be discussed.
- replace `NaN` in the `weights` with `0`
- … `0` it returns `NaN` (and not `inf`)
- … `NaN` (could be `1`)

It could be good to add all edge-case logic to a separate function, but I am not sure if this is possible...
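A NumPy sketch of the edge-case decisions listed above (hypothetical helper, not the PR's code): NaN weights are zeroed out, and a zero sum of weights yields NaN rather than inf.

```python
import numpy as np

def weighted_mean(data, weights):
    # edge case: NaN in the weights is treated as zero weight
    weights = np.where(np.isnan(weights), 0.0, weights)
    sum_of_weights = (~np.isnan(data)).astype(float) @ weights
    # edge case: a zero sum of weights gives NaN, not inf
    sum_of_weights = np.where(sum_of_weights != 0, sum_of_weights, np.nan)
    return (np.where(np.isnan(data), 0.0, data) @ weights) / sum_of_weights

data = np.array([1.0, 2.0, np.nan])
assert weighted_mean(data, np.array([np.nan, 1.0, 1.0])) == 2.0   # NaN weight -> 0
assert np.isnan(weighted_mean(data, np.array([0.0, 0.0, 1.0])))   # zero sum of weights
```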