
Option to skip tests in weighted() #4541

Closed
jbusecke opened this issue Oct 26, 2020 · 16 comments · Fixed by #4559

Comments

@jbusecke (Contributor)

When working with large dask-array weights, this check triggers computation of the array. This affects xgcm's ability to layer operations lazily.

Would you be open to implementing an option to skip this check, maybe with a warning displayed? Happy to submit a PR.

@max-sixty (Collaborator) commented Oct 27, 2020

I don't have that much context on xgcm so others may know better on this.

Could you help me understand in what context you're running the tests?

IIRC we used to have --skip-slow here, but it wasn't used that much and so is no longer there. It's definitely possible to add that sort of flag.

@dcherian (Contributor) commented Oct 27, 2020

The relevant context is that .any() will trigger computation on a dask array. Maybe we skip the check using is_duck_dask_array?
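A minimal sketch of that idea. The helper name mirrors xarray's `is_duck_dask_array`, but this version uses plain duck typing so no dask import is needed, and the surrounding `check_weights` function is hypothetical, not xarray's actual code:

```python
import numpy as np

def is_duck_dask_array(arr):
    # Duck-typed check in the spirit of xarray's helper: dask arrays
    # carry a task graph (.dask) and a .compute() method.
    return hasattr(arr, "dask") and hasattr(arr, "compute")

def check_weights(weights):
    # Hypothetical validation step: skip the eager NaN check for
    # dask-backed weights so building the Weighted object stays lazy.
    if is_duck_dask_array(weights):
        return  # the user is then responsible for NaN-free weights
    if np.isnan(np.asarray(weights)).any():
        raise ValueError("`weights` cannot contain missing values.")
```

The trade-off is exactly the one discussed below: for dask weights, NaNs would no longer be caught up front.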

@max-sixty (Collaborator)

Sorry, I completely misunderstood! I thought you were asking about skipping tests as in pytest, hence my confusion.

For sure re skipping those checks with dask arrays.

@jbusecke (Contributor, Author)

Sorry if my initial issue was unclear.
So you favor not having a 'skip' kwarg and instead just internally skipping the call to .any() if weights is a dask array?

@max-sixty (Collaborator)

Sorry if my initial issue was unclear.

Not at all, my mistake

So you favor not having a 'skip' kwarg and instead just internally skipping the call to .any() if weights is a dask array?

👍

@jbusecke (Contributor, Author)

What would happen in this case if a dask array with NaNs is passed? Would this somehow silently influence the results, or would it not matter (in which case I wonder what the check was for)?
If this could lead to undetected errors, I would still consider a kwarg the safer alternative, especially for new users.

@max-sixty (Collaborator) commented Oct 27, 2020

If it leads to incorrect results, I agree. If it leads to a lazy error (even if more confusing), or a result array full of NaNs, then I think it's fine. Not super confident on the latter case, tbc.

If we want more control, I would advocate for using a standard kwarg that offers control over the computation — e.g. skip_na often gives more performance in exchange for (edit: the user) ensuring no NaNs — rather than an idiosyncratic kwarg derived from the internals of this implementation.

@jbusecke (Contributor, Author)

Sounds good. I'll see if I can make some time to test and put up a PR this week.

@mathause (Collaborator)

weights cannot contain NaNs, else the result will just be NaN, even with skipna=True. Then again, weights rarely contain NaNs, so this check is a bit of a trade-off between speed and convenience. A kwarg could certainly make sense (it was also requested before). I would probably not call it skipna, though. Maybe check_weights? Or check_nan? (Better names welcome.)

I think da.isnull().any() is lazy and it's the if that makes it eager. So an alternative would be to keep the statement lazy, but I don't know how that would be done.
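A toy model (not dask itself) of why the `if` is the eager step: a deferred scalar only computes when something coerces it to `bool`, and that coercion is exactly what an `if` statement performs. The `LazyScalar` class here is purely illustrative:

```python
class LazyScalar:
    """Toy stand-in for a 0-d dask array: holds a deferred computation."""

    def __init__(self, thunk):
        self._thunk = thunk
        self.computed = False

    def compute(self):
        # explicitly materialize the value, like dask's .compute()
        self.computed = True
        return self._thunk()

    def __bool__(self):
        # `if flag:` calls bool(flag), which is what forces computation
        return bool(self.compute())

flag = LazyScalar(lambda: True)  # analogous to da.isnull().any()
assert not flag.computed         # building the expression is lazy
if flag:                         # the `if` coerces to bool -> computes
    pass
assert flag.computed
```

This is why moving the check *inside* the graph (rather than testing its result in Python control flow) is what keeps everything lazy.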

The relevant test is here:

def test_weighted_weights_nan_raises(as_dataset, weights):

@mathause (Collaborator)

The other possibility would be to do sth like:

def __init__(..., skipna=False):

    if skipna:
        weights = weights.fillna(0)

We did decide not to do this somewhere in the discussion; I'm not entirely sure anymore why.

@jbusecke (Contributor, Author)

Thanks @mathause, I was wondering how much of a performance trade-off .fillna(0) is on a dask array with no NaNs, compared to the check.

I favor this, since it allows slicing before the computation is triggered. In my current situation I do a bunch of operations on a large multi-model dataset; the weights are time- and member-dependent, and I am trying to save each member separately. Having the computation triggered for the full dataset is problematic, and fillna(0) avoids that (I'm working with a hacked branch where I simply removed the check for NaNs).

@mathause (Collaborator)

The discussion goes back to here: #2922 (comment) (by @dcherian)

I decided to replace all NaN in the weights with 0.

Can we raise an error instead? It should be easy for the user to do weights.fillna(0) instead of relying on xarray's magical behaviour.

Thinking a bit more about this, I now favour the isnull().any() test and would add a check_weights kwarg. I would even be fine with setting check_weights=False per default and saying the user is responsible for supplying valid weights (but I'd want others to weigh in here).

In addition, a.isnull().any() is quite a bit faster than a.fillna(0), even if there are no NaNs present. This is mostly true for numpy arrays, not so much for dask (by my limited tests). On the other hand, the isnull().any() test is a small percentage of the total time (#3883 (comment)).


I am also not entirely sure I understand where your issue lies. You eventually have to compute, right? Do you do something between w = data.weighted(weights) and w.mean()?

Ah maybe I understand, your data looks like:

  • data: <xarray.DataArray (time: 1000, models: 1)>
  • weights: <xarray.DataArray (time: 1000, models: 100)>

And now weights gets checked for all 100 models where only one would be relevant. Is this correct? (So another workaround would be to use xr.align before passing weights to weighted.)


My limited speed tests:

import numpy as np
import xarray as xr

a = xr.DataArray(np.random.randn(1000, 1000, 10, 10))
%timeit a.isnull().any()
%timeit a.fillna(0)

b = xr.DataArray(np.random.randn(1000, 1000, 10, 10)).chunk(100)
%timeit b.isnull().any()
%timeit b.fillna(0)

@dcherian (Contributor)

The discussion goes back to here: #2922 (comment) (by @dcherian)

Ah, sorry! I was thinking of weights as being numpy arrays, not so much dask arrays.

Do you do something between w = data.weighted(weights) and w.mean()?

Yeah I think this is the issue. .weighted should be lazy.

Thinking a bit more about this I now favour the isnull().any() test and would add a check_weights kwargs.

This would be OK. We could also drop the check and let users deal with it, and also add a warning to the docstring.

@dcherian (Contributor)

Another option would be to put the check in a .map_blocks call for dask arrays. This would only run and raise at compute time.

@jbusecke (Contributor, Author)

Another option would be to put the check in a .map_blocks call for dask arrays. This would only run and raise at compute time.

Uh that sounds great actually. Same functionality, no triggered computation, and no intervention needed from the user. Should I try to implement this?

@mathause (Collaborator)

Yes that would be great.
