
Feature/weighted #2922

Merged - 60 commits merged into pydata:master from feature/weighted on Mar 19, 2020

Conversation

@mathause (Collaborator) commented Apr 26, 2019

I took a shot at the weighted function - I added a DataArrayWeighted class that currently only implements mean. There is still quite a bit missing (e.g. DatasetWeighted), but let me know what you think.

import numpy as np
import xarray as xr
da = xr.DataArray([1, 2])
weights = xr.DataArray([4, 6])
da.weighted(weights).mean()

# <xarray.DataArray ()>
# array(1.6)

There are quite a number of difficult edge cases with invalid data that can be discussed.

  • I decided to replace all NaN in the weights with 0.
  • If the weights sum to 0, the result is NaN (and not inf):

    weights = xr.DataArray([0, 0])
    da.weighted(weights).mean()

  • The following returns NaN (it could arguably be 1):

    da = xr.DataArray([1, np.nan])
    weights = xr.DataArray([1, 0])
    da.weighted(weights).mean(skipna=False)

It could be good to move all the edge-case logic into a separate function, but I am not sure this is possible... (A small sketch of the first two rules follows below.)
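A minimal sketch of the first two rules, assuming the semantics described above (illustrative only, not the PR's code):

import numpy as np
import xarray as xr

da = xr.DataArray([1.0, 2.0])

# NaN in the weights counts as 0 (sanitized explicitly here for illustration)
weights = xr.DataArray([4.0, np.nan]).fillna(0)
da.weighted(weights).mean()       # -> 1.0: only the first element contributes

# weights that sum to 0 yield NaN, not inf
zero_weights = xr.DataArray([0.0, 0.0])
da.weighted(zero_weights).mean()  # -> NaN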

@pep8speaks commented Apr 26, 2019

Hello @mathause! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-03-18 01:42:05 UTC

@mathause (Collaborator, Author):

I updated the PR

  • added a weighted sum method
  • tried to clean the tests

Before I continue, it would be nice to get some feedback.

  • Currently I only do very simple tests - is that enough? How could these be generalized without re-implementing the functions just to obtain the expected results?
  • Eventually it would be cool to be able to do da.groupby(regions).weighted(weights).mean() - do you see a possibility for this with the current implementation?

As mentioned by @aaronspring, esmlab already implements weighted statistics functions. Similarly, statsmodels implements them for 1D data without handling of NaNs (docs / code). Thus it should be feasible to implement further statistics here (e.g. weighted std).

(Resolved review threads on xarray/core/common.py, xarray/core/weighted.py (×4), and xarray/tests/test_weighted.py.)
@fujiisoup (Member) left a comment

Thank you, @mathause, for sending a PR, and sorry for the late response.

It looks like a clean implementation to me. Only a few comments.

(Resolved review threads on xarray/core/weighted.py and xarray/tests/test_weighted.py.)
"""

# we need to mask DATA values that are nan; else the weights are wrong
masked_weights = self.weights.where(self.obj.notnull())
@fujiisoup (Member) commented:

If weights has an additional dimension relative to those of self.obj, masked_weights may consume a lot of memory. Can we avoid this, for example by separating the mask from the weights?

mask = xr.where(self.obj.notnull(), 1, 0)  # binary mask
sum_of_weights = xr.dot(mask, weights)

Do you think the above is worth doing?
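A self-contained version of this suggestion (dimension names and values made up for illustration):

import numpy as np
import xarray as xr

da = xr.DataArray([[1.0, np.nan], [2.0, 3.0]], dims=("x", "y"))
weights = xr.DataArray([0.25, 0.75], dims="y")

# binary mask of valid entries; avoids materializing a full masked-weights array
mask = xr.where(da.notnull(), 1, 0)
sum_of_weights = xr.dot(mask, weights, dims="y")
# -> [0.25, 1.0]: the NaN at (x=0, y=1) picks up no weight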

@mathause (Collaborator, Author) replied:

done - I do not have any performance tests (memory or speed)


# calculate weighted sum
return (self.obj * self.weights).sum(dim, axis=axis, skipna=skipna,
                                     **kwargs)
@fujiisoup (Member) commented:
For the same reason as above,

xr.where(self.obj, self.obj, 0).dot(self.weights)

may work if skipna is True?
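A runnable sketch of that idea; the mask condition is spelled notnull() here, since dot/einsum has no skipna and NaNs must be zeroed out explicitly:

import numpy as np
import xarray as xr

da = xr.DataArray([1.0, np.nan, 2.0])
weights = xr.DataArray([0.2, 0.3, 0.5])

weighted_sum = xr.where(da.notnull(), da, 0.0).dot(weights)
# -> 1.2, matching (da * weights).sum(skipna=True) but without the
#    elementwise-product temporary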

@mathause (Collaborator, Author) replied:

done - I do not have any performance tests (memory or speed)

@rabernat (Contributor):

Hi @mathause - We really appreciate your contribution. Sorry your PR has stalled!

Do you think you can respond to @fujiisoup's review and add documentation? Then we can get this merged.

@mathause (Collaborator, Author):

Thanks, I am still very interested to get this in. I don't think I'll manage before my holidays, though.

@mathause (Collaborator, Author) commented Oct 17, 2019

I finally made some time to work on this - although I feel far from finished...

  • added a DatasetWeighted class
  • for this I pulled the functionality out of the DataArrayWeighted class into standalone functions taking da and weights as input
  • the tests need more work
  • implemented the functionality using xr.dot -> this makes the logic a bit more complicated
  • I think the failure in Linux py37-upstream-dev is unrelated

Questions:

  • does this implementation look reasonable to you?
  • xr.dot does not have an axis keyword - is it fine if I leave it out of my functions?
  • flake8 fails because I use @overload for typing -> should I remove this?
  • currently the functionality exists three times: once as _weighted_sum, once as da.weighted.sum(), and once as ds.weighted().sum() - how do I best test this?

(Six resolved review threads on xarray/core/weighted.py.)
Comment on lines 14 to 17
axis : int or sequence of int, optional
Axis(es) over which to apply the weighted `{fcn}`. Only one of the
'dim' and 'axis' arguments can be supplied. If neither are supplied,
then the weighted `{fcn}` is calculated over all axes.
@mathause (Collaborator, Author) commented:

Maybe remove this, as xr.dot does not provide it.

Collaborator:

Agree re removing (or am I missing something—why was this here?)

@mathause (Collaborator, Author) replied:

For consistency it would be nice to offer axis here. Originally, sum was implemented as (weights * da).sum(...), so we got axis for free. With dot it is not as straightforward any more (see the snippet below). Honestly, I never use axis with xarray, so I suppose it is fine to only implement it if anyone ever requests it...
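To illustrate the trade-off (a hypothetical snippet; reductions in the xarray of this era still accepted axis):

import xarray as xr

da = xr.DataArray([1.0, 2.0])
weights = xr.DataArray([4.0, 6.0])

# elementwise formulation: positional axes come for free via sum(axis=...)
(weights * da).sum(axis=0, skipna=True)

# xr.dot only takes dimension names, so axis has no direct equivalent
xr.dot(da, weights, dims="dim_0")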

Collaborator:

OK. Unless I'm missing something, let's remove axis.

(Resolved review thread on xarray/core/weighted.py.)
@max-sixty (Collaborator) left a comment

Thanks for making this progress! I left a few comments & questions.

return sum_of_weights.where(valid_weights)


def _weighted_sum(
Collaborator:

Should we just put these on the Weighted object? (Not a huge deal; trying to reduce indirection where possible, though.)

@mathause (Collaborator, Author) replied:

Do you mean like so?

class Weighted:
    ...

    def _weighted_sum(self, da, dims, skipna, **kwargs):
        pass

Yes, good idea - this might actually even work better, as it avoids the hassle with the non-sanitized weights. (A fuller sketch follows below.)
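A slightly fuller sketch of that shape, assembled from the ideas in this thread (method bodies assumed, not the PR's final code):

import xarray as xr

class Weighted:
    def __init__(self, obj, weights):
        self.obj = obj
        self.weights = weights

    def _sum_of_weights(self, da, dims=None):
        # boolean mask so NaNs in the data do not pick up any weight
        mask = da.notnull()
        return xr.dot(mask, self.weights, dims=dims)

    def _weighted_sum(self, da, dims=None, skipna=None, **kwargs):
        # zero out invalid data first, since dot does not implement skipna
        if skipna or skipna is None:
            da = da.fillna(0.0)
        return xr.dot(da, self.weights, dims=dims)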


# need to mask invalid DATA as dot does not implement skipna
if skipna or skipna is None:
    return where(da.isnull(), 0.0, da).dot(weights, dims=dims)
Collaborator:
Suggested change:
-    return where(da.isnull(), 0.0, da).dot(weights, dims=dims)
+    return da.fillna(0.0).dot(weights, dims=dims)

?

(Review thread on xarray/core/weighted.py.)
""" Calcualte the sum of weights, accounting for missing values """

# we need to mask DATA values that are nan; else the weights are wrong
mask = where(da.notnull(), 1, 0) # binary mask
Collaborator:
Suggested change:
-    mask = where(da.notnull(), 1, 0)  # binary mask
+    mask = da.notnull()

bool works with dot from my limited checks



class DatasetWeighted(Weighted):
    def _dataset_implementation(self, func, **kwargs) -> "Dataset":
Collaborator:
Not for this PR, but we do this in lots of places; we should find a common approach at some point!


weighted[key] = func(da, self.weights, **kwargs)

return Dataset(weighted, coords=self.obj.coords)
Collaborator:
Will this work if coords are on dimensions that are reduced / removed by the function?
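For context, the two code fragments quoted above stitched into one piece (the loop header is reconstructed and may differ from the PR's exact code):

class DatasetWeighted(Weighted):
    def _dataset_implementation(self, func, **kwargs) -> "Dataset":
        # apply the DataArray-level function to every data variable
        weighted = {}
        for key, da in self.obj.data_vars.items():
            weighted[key] = func(da, self.weights, **kwargs)

        return Dataset(weighted, coords=self.obj.coords)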

@mathause (Collaborator, Author) left a comment

Thanks @dcherian! One question concerning :py:class: from my side.

Just came to my mind: I should probably add a note that DataArray([1]).weighted(DataArray([0])) returns NaN.

(Two resolved review threads on xarray/core/common.py.)
@dcherian (Contributor):

This is going in.

Thanks @mathause. This is a major contribution!

@dcherian dcherian merged commit df614b9 into pydata:master Mar 19, 2020
@mathause (Collaborator, Author):

Great! Thanks for all the feedback and support!

@max-sixty (Collaborator):

Thanks @mathause !

@jhamman (Member) commented Mar 19, 2020

Big time!!!! Thanks @mathause! #422 was opened in June of 2015, amazing.

@max-sixty (Collaborator):

#422 was opened in June of 2015, amazing.

😂

@mathause props for the persistence...

@seth-p (Contributor) commented Mar 20, 2020

I realize this is a bit late, but I'm still concerned about memory usage, specifically in https://github.com/pydata/xarray/blob/master/xarray/core/weighted.py#L130 and https://github.com/pydata/xarray/blob/master/xarray/core/weighted.py#L143.
If da.sizes = {'dim_0': 100000, 'dim_1': 100000}, the two lines above will cause da.weighted(weights).mean('dim_0') to create two simultaneous temporary 100000x100000 arrays, which could be problematic.

I would have implemented this using apply_ufunc, so that one creates these temporary variables only on as small an array as absolutely necessary -- in this case just of size sizes['dim_0'] = 100000. (Much as I would like to, I'm afraid I'm not able to contribute code.) Of course this won't help in the case one is summing over all dimensions, but might as well minimize memory usage in some cases even if not in all.
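A sketch of what such an apply_ufunc approach could look like; the kernel _np_weighted_mean is hypothetical, not seth-p's or the PR's code:

import numpy as np
import xarray as xr

def _np_weighted_mean(data, weights):
    # NaN-aware weighted mean over the trailing axis
    mask = ~np.isnan(data)
    sum_of_weights = np.sum(np.where(mask, weights, 0.0), axis=-1)
    weighted_sum = np.nansum(data * weights, axis=-1)
    return weighted_sum / sum_of_weights

da = xr.DataArray(np.random.rand(3, 4), dims=("dim_0", "dim_1"))
weights = xr.DataArray([0.2, 0.3, 0.5], dims="dim_0")

# reduce over dim_0 only; apply_ufunc moves core dims to the end,
# so the kernel sees dim_0 as its last axis
result = xr.apply_ufunc(
    _np_weighted_mean, da, weights,
    input_core_dims=[["dim_0"], ["dim_0"]],
)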

@max-sixty (Collaborator):

We do those sorts of operations fairly frequently, so it's not unique here. Generally users should expect to have available ~3x the memory of an array for most operations.

@seth-p it's great you've taken an interest in the project! Is there any chance we could harness that into some contributions? 😄

@mathause (Collaborator, Author) commented Mar 20, 2020

tl;dr: if someone knows how to do memory profiling with reasonable effort, this can still be changed.

It's certainly not too late to change the "backend" of the weighting functions. I once tried to profile the memory usage but I gave up at some point (I think I would have needed to annotate a ton of functions, also in numpy).

@fujiisoup suggested using xr.dot(a, b) (instead of (a * b).sum()) to ameliorate part of the memory footprint. This is done; however, it comes with a performance penalty, so an alternative code path might actually be beneficial. (Edit: I now think xr.dot is actually faster, except for very small arrays.)

Also mask is an array of dtype bool, which should help.

I think it should not be very difficult to write something that can be passed to apply_ufunc, probably similar to:

def expected_weighted(da, weights, dim, skipna, operation):

So there would be three possibilities: (1) the current implementation (using xr.dot(a, b)); (2) something similar to expected_weighted (using (a * b).sum()); (3) xr.apply_ufunc(expected_weighted, a, b, ...). I assume (2) is fastest with the largest memory footprint, but I cannot tell about (1) and (3). A sketch of option (2) follows below.
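A sketch of option (2), mirroring the signature above (assumed semantics, consistent with the edge-case rules discussed earlier in this thread):

def expected_weighted(da, weights, dim, skipna, operation):
    # reference implementation via elementwise multiply-and-sum;
    # clearer than xr.dot but with a larger memory footprint
    weighted_sum = (da * weights).sum(dim, skipna=skipna)
    if operation == "sum":
        return weighted_sum

    # only weights at valid data locations count towards the normalization
    masked_weights = weights.where(da.notnull())
    sum_of_weights = masked_weights.sum(dim, skipna=True)
    # a zero sum of weights yields NaN, not inf
    sum_of_weights = sum_of_weights.where(sum_of_weights != 0)
    if operation == "sum_of_weights":
        return sum_of_weights

    # operation == "mean"
    return weighted_sum / sum_of_weights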

@seth-p (Contributor) commented Mar 20, 2020

@max-sixty, I wish I could, but I'm afraid that I cannot submit code due to employer limitations.

@seth-p (Contributor) commented Mar 20, 2020

@mathause, ideally dot() would support skipna, so you could eliminate the da = da.fillna(0.0) and pass the skipna down the line. But alas it doesn't...

(da * weights).sum(dim=dim, skipna=skipna) would likely make things worse, I think, as it would necessarily create a temporary array of size at least that of da, no?

Either way, this only addresses the da = da.fillna(0.0), not the mask = da.notnull().

Also, perhaps the test if weights.isnull().any() in Weighted.__init__() should be optional?

Maybe I'm more sensitive to this than others, but I regularly deal with 10-100GB arrays.

@seth-p (Contributor) commented Mar 20, 2020

@mathause, have you considered using these functions?

@mathause (Collaborator, Author):

There is some stuff I can do to reduce the memory footprint if skipna=False or not da.isnull().any(). Also, the functions should support dask arrays out of the box.


ideally dot() would support skipna, so you could eliminate the da = da.fillna(0.0) and pass the skipna down the line. But alas it doesn't...

Yes, this would be nice. xr.dot uses np.einsum which is quite a beast that I don't entirely see through. I don't expect it to support NaNs any time soon.

What could be done, though, is to only do da = da.fillna(0.0) if da contains NaNs.
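In code, the guard might look like this (a sketch; whether the extra isnull().any() scan pays for itself depends on the data):

# only pay for the fillna copy when there actually are NaNs to fill
if da.isnull().any():
    da = da.fillna(0.0)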

(da * weights).sum(dim=dim, skipna=skipna) would likely make things worse, I think, as it would necessarily create a temporary array of sized at least da, no?

I assume so. I don't know what kind of temporary variables np.einsum creates. Also np.einsum is wrapped in xr.apply_ufunc so all kinds of magic is going on.

Either way, this only addresses the da = da.fillna(0.0), not the mask = da.notnull().

Again this could be avoided if skipna=False or if (and only if) there are no NaNs in da.

Also, perhaps the test if weights.isnull().any() in Weighted.__init__() should be optional?

Do you want to leave it away for performance reasons? Because it was a deliberate decision to not support NaNs in the weights and I don't think this is going to change.

Maybe I'm more sensitive to this than others, but I regularly deal with 10-100GB arrays.

No, it's important to make sure this stuff works for large arrays. However, using xr.dot already incurs quite a performance penalty, which I am not super happy about.

have you considered using these functions? [...]

None of your suggested functions support NaNs so they won't work.

I am all in to support more functions, but currently I am happy we got a weighted sum and mean into xarray after 5(!) years!

Further libraries that support weighted operations:

  • esmlab (xarray-based, supports NaN)
  • statsmodels (numpy-based, does not support NaN)

@seth-p (Contributor) commented Mar 20, 2020

All good points:

What could be done, though is to only do da = da.fillna(0.0) if da contains NaNs.

Good idea, though I don't know what the performance hit would be of the extra check (in the case that da does contain NaNs, so the check is for naught).

I assume so. I don't know what kind of temporary variables np.einsum creates. Also np.einsum is wrapped in xr.apply_ufunc so all kinds of magic is going on.

Well, (da * weights) will be at least as large as da. I'm not certain, but I don't think np.einsum creates huge temporary arrays.

Do you want to leave it away for performance reasons? Because it was a deliberate decision to not support NaNs in the weights and I don't think this is going to change.

Yes. You can continue not supporting NaNs in the weights, yet not explicitly check that there are no NaNs (optionally, if the caller assures you that there are no NaNs).

None of your suggested functions support NaNs so they won't work.

Correct. These have nothing to do with the NaNs issue.

For profiling memory usage, I use psutil.Process(os.getpid()).memory_info().rss for current usage and resource.getrusage(resource.RUSAGE_SELF).ru_maxrss for peak usage (on Linux).
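For example, a small helper along those lines (assuming psutil is installed and Linux semantics for ru_maxrss, which reports kilobytes):

import os
import resource

import psutil

def report_memory():
    # current resident set size, in bytes
    current = psutil.Process(os.getpid()).memory_info().rss
    # peak resident set size since process start, in kilobytes on Linux
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(f"current: {current / 1e9:.2f} GB, peak: {peak / 1e6:.2f} GB")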

dcherian added a commit to dcherian/xarray that referenced this pull request Mar 22, 2020
* upstream/master:
  add spacing in the versions section of the issue report (pydata#3876)
  map_blocks: allow user function to add new unindexed dimension. (pydata#3817)
  Delete associated indexes when deleting coordinate variables. (pydata#3840)
  remove macos build while waiting for libwebp fix (pydata#3875)
  Fix html repr on non-str keys (pydata#3870)
  Allow ellipsis to be used in stack (pydata#3826)
  Improve where docstring (pydata#3836)
  Add DataArray.pad, Dataset.pad, Variable.pad (pydata#3596)
  Fix some warnings (pydata#3864)
  Feature/weighted (pydata#2922)
  Fix recombination in groupby when changing size along the grouped dimension (pydata#3807)
  Blacken the doctest code in docstrings (pydata#3857)
  Fix multi-index with categorical values. (pydata#3860)
  Fix interp bug when indexer shares coordinates with array (pydata#3758)
  Fix alignment with join="override" when some dims are unindexed (pydata#3839)
@mathause mathause deleted the feature/weighted branch February 3, 2021 14:47
@huard mentioned this pull request Mar 22, 2021
Successfully merging this pull request may close these issues: add average function

8 participants