Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rolling: allow control over padding #2007

Open
mathause opened this issue Mar 22, 2018 · 20 comments · May be fixed by #3587
Open

rolling: allow control over padding #2007

mathause opened this issue Mar 22, 2018 · 20 comments · May be fixed by #3587

Comments

@mathause
Copy link
Collaborator

Code Sample, a copy-pastable example if possible

import numpy as np
import xarray as xr

x = np.arange(1, 366)
y = np.random.randn(365)
ds = xr.DataArray(y, dims=dict(dayofyear=x))

ds.rolling(center=True, dayofyear=31).mean()

Problem description

rolling cannot directly handle periodic boundary conditions (lon, dayofyear, ...), but could be very helpful to e.g. calculate climate indices. Also I cannot really think of an easy way to append the first elements to the end of the dataset and then calculate rolling.

Is there a way to do this? Should xarray support this feature?

This might also belong to SO...

@rabernat
Copy link
Contributor

Very useful suggestion.

Also I cannot really think of an easy way to append the first elements to the end of the dataset and then calculate rolling.

We already support a different type of "rolling" with periodicity
http://xarray.pydata.org/en/latest/generated/xarray.DataArray.roll.html?highlight=roll
and it is straightforward to apply roll operations at the variable level:
https://github.com/pydata/xarray/blob/master/xarray/core/variable.py#L1007-L1026

I suspect this would not be too hard to implement.

@max-sixty
Copy link
Collaborator

ds.rolling(center=True, dayofyear=31).mean()

What do you mean by rolling here? As a reduce operation over a rolling window, like a moving average (rolling in xarray)? Or rolling around the end of a dataarray when shifting (roll in xarray)? Or a mix of both?

Specifically, what's the dayofyear=31 doing?

@mathause
Copy link
Collaborator Author

mathause commented Mar 22, 2018

Probably a mix of both - I want to compute a moving average, but with periodic boundaries. rolling sets window_length - 1 elements to nan. However I want to calculate these like so:

running_mean_0 = xr.concat([ds[-15:], ds[:16]]).mean()
running_mean_1 = xr.concat([ds[-14:], ds[:17]]).mean()

and so on...

@mathause
Copy link
Collaborator Author

I think what I want is like the filter function in R with circular=True.

I found two possibilities but they are quite "hand made" and certainly not very efficient

Solution with slicing:

# take the last and first elements and append/ prepend them
first = ds[:15]
last = ds[-15:]
extended = xr.concat([last, ds, first], 'dayofyear')
# do the rolling on the extended ds and get rid of NaNs
sol1 = extended.rolling(dayofyear=31, center=True).mean().dropna('dayofyear')

Solution with roll:

roll1 = ds.roll(dayofyear=150).rolling(dayofyear=31, center=True).mean()
roll2 = ds.rolling(dayofyear=31, center=True).mean()
sol2 = xr.concat([roll1, roll2], dim='r').mean('r')

Call rolling on original and rolled dataset, and put them together again.

@fujiisoup
Copy link
Member

I think the implementation would be not so difficult by supporting more flexible fill_value option in xr.DataArrayRolling.construct method.

Maybe fill_value='periodic' would be a possible API,

da.rolling(dayofyear=31).construct('window', fill_value='periodic').mean('window')

@mathause, any interest in contributing?

@max-sixty
Copy link
Collaborator

Though I'm not sure you need the construct machinery.

IIUC you need to copy a window-sized amount of data from the front of the array onto the back.

You could do that with construct-like machinery, which would save a copy - though not a large copy

@mathause mathause mentioned this issue Mar 23, 2018
4 tasks
@fujiisoup
Copy link
Member

@maxim-lian , do you agree to add this feature?
Although the same behavior can be realized by adding head/tail values to the original array and truncate them after the computation, the periodic option would significantly simplify this.

@max-sixty
Copy link
Collaborator

@fujiisoup Yes for sure - I think it would be good. I think there are two salient questions:

  • Where this lives in the API: Should this be under construct? I don't think the kwarg should be called fill_value - that traditionally has a specific meaning of "the value to replace NaN with". I don't understand the periodic reference, but likely I'm missing something. Could be wrap=True, or roll=True?
  • How it's implemented - do you have a view here? Can you use the construct machinery? Or we make a new array and run rolling over it?

@fujiisoup
Copy link
Member

I don't think the kwarg should be called fill_value - that traditionally has a specific meaning of "the value to replace NaN with".

Agreed. dask.ghost has boundaries keyword, for which we can choose between periodic, reflect, and any constant.
I think this would be a good reference.
Maybe we can deprecate fill_value keyword and replace it by boundaries?
(I slightly regret that I choose fill_value keyword in construct).

How it's implemented - do you have a view here?

Only a slight modification of construct machinery realizes this (see #2011).
I think this option should be available only in construct method (not in the traditional rolling constructor) for the sake of simplicity (according to this comment).

@max-sixty
Copy link
Collaborator

#2011 looks good - I didn't realize numpy already had pad

Agree with your other comments. Thanks as ever @fujiisoup

@raybellwaves
Copy link
Contributor

I was going to suggest this feature so glad others are interested.

In my use case I would like to smooth a daily climatology. My colleague uses matlab and uses https://www.mathworks.com/matlabcentral/fileexchange/52688-nan-tolerant-fast-smooth

Using the slice solution as @mathause showed above, it would look something like (using code from http://xarray.pydata.org/en/stable/examples/weather-data.html#toy-weather-data)

import numpy as np
import pandas as pd
import xarray as xr

times = pd.date_range('2000-01-01', '2010-12-31', name='time')
annual_cycle = np.sin(2 * np.pi * (times.dayofyear.values / 366 - 0.28))
noise = 15 * np.random.rand(annual_cycle.size)
data = 10 + (15 * annual_cycle) + noise
da = xr.DataArray(data, coords=[times], dims='time')
#da.plot()
#Check variability at one day
#da.groupby('time.dayofyear').std('time')[0]
da_clim = da.groupby('time.dayofyear').mean('time')
_da_clim = xr.concat([da_clim[-15:], da_clim, da_clim[:15]], 'dayofyear')
da_clim_smooth = _da_clim.rolling(dayofyear=31, center=True).mean().dropna('dayofyear')
#da_clim_smooth.plot()

@shoyer shoyer mentioned this issue Dec 13, 2018
@mathause
Copy link
Collaborator Author

mathause commented Jun 6, 2019

I am coming back to @shoyer suggestion in #2011 - your idea would be to do first a pad and then a rolling operation as e.g.:

import numpy as np
import xarray as xr

x = np.arange(1, 366)
y = np.random.randn(365)
ds = xr.DataArray(y, dims=dict(dayofyear=x))

ds.pad(dayofyear=15, mode='wrap').rolling(center=True, dayofyear=31).mean()

@shoyer
Copy link
Member

shoyer commented Jun 6, 2019

I think you may need to do cropping afterwards, too, before taking the mean.

@mathause mathause linked a pull request Sep 13, 2020 that will close this issue
4 tasks
@mathause
Copy link
Collaborator Author

mathause commented Dec 17, 2020

I just need to find the three warmest consecutive months from a temperature dataset for my work, so I thought I add a complete example. First, create an example dataset with monthly temperature:

import xarray as xr
import numpy as np
import pandas as pd

time = pd.date_range("2000", periods=12 * 30, freq="M")
temp = np.sin((time.month - 5) / 6 * np.pi) + np.random.randn(*time.shape) * 0.3
da = xr.DataArray(temp, dims=["time"], coords=dict(time=time))
print(da)
<xarray.DataArray (time: 360)>
array([-0.676731, -0.812742, -1.367547, ...,  0.186731,  0.237676, -0.343879])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-31 2000-02-29 ... 2029-12-31

Currently we can achieve this like:

n_months = 3

monthly = da.groupby("time.month").mean()
padded = monthly.pad(month=n_months, mode="wrap")
rolled = padded.rolling(center=True, month=n_months).mean(skipna=False)
sliced = rolled.isel(month=slice(3, -3))

central_month = sliced.idxmax()

Implementing pad_mode in rolling would allow to do:

monthly = da.groupby("time.month").mean()
rolled = monthly.rolling(center=True, month=n_months, pad_mode="wrap").mean(skipna=False)

central_month = rolled.idxmax()

@dcherian
Copy link
Contributor

I think we should do pad_kwargs so that monthly.rolling(center-True, month=n_months, pad_kwargs=dict(mode="constant", constant_values=5)} is possible.

@mdgoldberg
Copy link

mdgoldberg commented Jul 7, 2021

Hello! First of all, thanks so much for those of you who contribute to xarray, I've found it super useful as an n-dimensional extension of pandas!

I was just wondering what the current state of this issue is? I'm running into exactly the issue described in #4743 which seems like a bug; that issue was closed as a dupe of this. Are we just waiting for someone to implement something here, or are there other blockers?

@dcherian
Copy link
Contributor

dcherian commented Jul 7, 2021

This should be easy now so we just need someone to try it out.

This is where the padding happens so the kwarg needs to be passed all the way down to Variable.rolling_window

padded = var.pad(pads, mode="constant", constant_values=fill_value)

What should the API be? .rolling(..., pad_kwargs={...}) or just have .pad(...).rolling(..., pad=False) i.e. have the user pad beforehand? The only advantage to the first one AFAICT is that the user doesn't have to specify the window length (or padding length) twice.

@kmsquire
Copy link
Contributor

kmsquire commented Jul 8, 2021

.pad(...).rolling(..., pad=False)

For this API, it seems that the only thing that would need to be implemented would be adding a pad keyword argument to rolling, defaulting to True. Is that correct?

@dcherian
Copy link
Contributor

dcherian commented Jul 8, 2021

Yes. I think we might want this anyway; for rolling without any padding at all.

@kmsquire
Copy link
Contributor

I added support for .rolling(..., pad=False) in #5603. The basic implementation was simple, but getting it working for bottleneck/dask took a little more work.

That fixes, e.g., #4743, but I don't think it's a complete fix for this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants