
Option to skip tests in weighted() #4541

Closed
jbusecke opened this issue Oct 26, 2020 · 16 comments · Fixed by #4559

Comments

@jbusecke (Contributor)

When working with large dask-array weights, this check triggers computation of the array. This affects xgcm's ability to layer operations lazily.

Would you be open to implementing an option to skip this check, maybe with a warning displayed? Happy to submit a PR.

@max-sixty (Collaborator) commented Oct 27, 2020

I don't have that much context on xgcm so others may know better on this.

Could you help me understand in what context you're running the tests?

IIRC we used to have --skip-slow here, but it wasn't used that much and so is no longer there. It's definitely possible to add that sort of flag.

@dcherian (Contributor) commented Oct 27, 2020

The relevant context is that .any() will trigger computation on a dask array. Maybe we skip the check using is_duck_dask_array?
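A minimal sketch of that idea. The helper name mirrors xarray's `is_duck_dask_array`, but this version uses plain duck typing so no dask import is needed, and the surrounding `check_weights` function is hypothetical, not xarray's actual code:

```python
import numpy as np

def is_duck_dask_array(arr):
    # Duck-typed check in the spirit of xarray's helper: dask arrays
    # carry a task graph (.dask) and a .compute() method.
    return hasattr(arr, "dask") and hasattr(arr, "compute")

def check_weights(weights):
    # Hypothetical validation step: skip the eager NaN check for
    # dask-backed weights so building the Weighted object stays lazy.
    if is_duck_dask_array(weights):
        return  # the user is then responsible for NaN-free weights
    if np.isnan(np.asarray(weights)).any():
        raise ValueError("`weights` cannot contain missing values.")
```

The trade-off is exactly the one discussed below: for dask weights, NaNs would no longer be caught up front.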

@max-sixty (Collaborator)

Sorry, I completely misunderstood! I thought you were asking about skipping tests as in pytest, hence my confusion.

For sure re skipping those checks with dask arrays.

@jbusecke (Contributor, Author)

Sorry if my initial issue was unclear.
So you favor not having a 'skip' kwarg and instead just internally skipping the call to .any() if weights is a dask array?

@max-sixty (Collaborator)

Sorry if my initial issue was unclear.

Not at all, my mistake

So you favor not having a 'skip' kwarg and instead just internally skipping the call to .any() if weights is a dask array?

👍

@jbusecke (Contributor, Author)

What would happen in this case if a dask array with NaNs is passed? Would this somehow silently influence the results, or would it not matter (in which case I wonder what the check was for)?
If this could lead to undetected errors, I would still consider a kwarg the safer alternative, especially for new users.

@max-sixty (Collaborator) commented Oct 27, 2020

If it leads to incorrect results, I agree. If it leads to a lazy error (even if more confusing), or a result array full of NaNs, then I think it's fine. Not super confident on the latter case, tbc.

If we want more control, I would advocate for using a standard kwarg that offers control over the computation — e.g. skip_na often gives more performance in exchange for (edit: the user) ensuring no NaNs — rather than an idiosyncratic kwarg derived from the internals of this implementation.

@jbusecke (Contributor, Author)

Sounds good. I'll see if I can make some time to test and put up a PR this week.

@mathause (Collaborator)

weights cannot contain NaNs, else the result will just be NaN, even with skipna=True. Then again, weights rarely contain NaNs, so this check is a bit of a trade-off between speed and convenience. A kwarg could certainly make sense (it was also requested before). I would probably not call it skipna, though. Maybe check_weights? Or check_nan? (Better names welcome.)

I think da.isnull().any() is lazy and it's the if that makes it eager. So an alternative would be to keep the statement lazy, but I don't know how that would be done.
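A toy model (not dask itself) of why the `if` is the eager step: a deferred scalar only computes when something coerces it to `bool`, and that coercion is exactly what an `if` statement performs. The `LazyScalar` class here is purely illustrative:

```python
class LazyScalar:
    """Toy stand-in for a 0-d dask array: holds a deferred computation."""

    def __init__(self, thunk):
        self._thunk = thunk
        self.computed = False

    def compute(self):
        # explicitly materialize the value, like dask's .compute()
        self.computed = True
        return self._thunk()

    def __bool__(self):
        # `if flag:` calls bool(flag), which is what forces computation
        return bool(self.compute())

flag = LazyScalar(lambda: True)  # analogous to da.isnull().any()
assert not flag.computed         # building the expression is lazy
if flag:                         # the `if` coerces to bool -> computes
    pass
assert flag.computed
```

This is why moving the check *inside* the graph (rather than testing its result in Python control flow) is what keeps everything lazy.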

The relevant test is here:

def test_weighted_weights_nan_raises(as_dataset, weights):

@mathause (Collaborator)

The other possibility would be to do sth like:

def __init__(..., skipna=False):

    if skipna:
        weights = weights.fillna(0)

We did decide not to do this somewhere in the discussion; I'm not entirely sure anymore why.

@jbusecke (Contributor, Author)

Thanks @mathause, I was wondering how much of a performance trade-off .fillna(0) is on a dask array with no NaNs, compared to the check.

I favor this, since it allows slicing before the computation is triggered. In my current situation I do a bunch of operations on a large multi-model dataset; the weights are time- and member-dependent, and I am trying to save each member separately. Having the computation triggered for the full dataset is problematic, and fillna(0) avoids that (I'm working with a hacked branch where I simply removed the check for NaNs).

@mathause (Collaborator)

The discussion goes back to here: #2922 (comment) (by @dcherian)

I decided to replace all NaN in the weights with 0.

Can we raise an error instead? It should be easy for the user to do weights.fillna(0) instead of relying on xarray's magical behaviour.

Thinking a bit more about this, I now favour the isnull().any() test and would add a check_weights kwarg. I would even be fine with setting check_weights=False per default and saying the user is responsible for supplying valid weights (but I'd want others to weigh in here).

In addition, a.isnull().any() is quite a bit faster than a.fillna(0), even if there are no NaNs present. This is mostly true for numpy arrays, not so much for dask (by my limited tests). On the other hand, the isnull().any() test is a small percentage of the total time (#3883 (comment)).


I am also not entirely sure I understand where your issue lies. You eventually have to compute, right? Do you do something between w = data.weighted(weights) and w.mean()?

Ah maybe I understand, your data looks like:

  • data: <xarray.DataArray (time: 1000, models: 1)>
  • weights: <xarray.DataArray (time: 1000, models: 100)>

And now weights gets checked for all 100 models where only one would be relevant. Is this correct? (So another workaround would be to use xr.align before passing weights to weighted.)


My limited speed tests:

import numpy as np
import xarray as xr

a = xr.DataArray(np.random.randn(1000, 1000, 10, 10))
%timeit a.isnull().any()
%timeit a.fillna(0)

b = xr.DataArray(np.random.randn(1000, 1000, 10, 10)).chunk(100)
%timeit b.isnull().any()
%timeit b.fillna(0)

@dcherian (Contributor)

The discussion goes back to here: #2922 (comment) (by @dcherian)

Ah, sorry! I was thinking of weights as being numpy arrays, not so much dask arrays.

Do you do something between w = data.weighted(weights) and w.mean()?

Yeah I think this is the issue. .weighted should be lazy.

Thinking a bit more about this I now favour the isnull().any() test and would add a check_weights kwargs.

This would be OK. We could also drop the check and let users deal with it, and also add a warning to the docstring.

@dcherian (Contributor)

Another option would be to put the check in a .map_blocks call for dask arrays. This would only run and raise at compute time.

@jbusecke (Contributor, Author)

Another option would be to put the check in a .map_blocks call for dask arrays. This would only run and raise at compute time.

Uh that sounds great actually. Same functionality, no triggered computation, and no intervention needed from the user. Should I try to implement this?

@mathause (Collaborator)

Yes that would be great.
