Strided rolling.construct results in unexpected behavior #4743

griffinmilsap · 2020-12-30T17:07:31Z

Minimal Complete Verifiable Example: Consider the following -

def print_da( da ):
    print( da.values )
    return da

arr = xr.DataArray( np.arange( 20 ), dims = [ 'time' ] )
arr.rolling( time = 4, min_periods = 4 ) \
   .construct( 'window', stride = 4 ) \
   .groupby( 'time' ) \
   .apply( print_da )

What you expected to happen:

[0. 1. 2. 3.]
[4. 5. 6. 7.]
[ 8.  9. 10. 11.]
[12. 13. 14. 15.]
[16. 17. 18. 19.]

What happened: Output:

[nan nan nan  0.]
[1. 2. 3. 4.]
[5. 6. 7. 8.]
[ 9. 10. 11. 12.]
[13. 14. 15. 16.]

Calling construct on a rolling window with a stride parameter results in unexpected behavior. After reading through some of the documentation, I now understand that rolling.construct mostly ignores the min_periods argument -- so it may be valid to say that this is expected output. That said, it is actually impossible to achieve the windowing result I desire using the current implementation of xarray without reconstructing the whole array in memory, which leads to wasteful memory usage, and in some cases, expansion of the array beyond my machine's memory capabilities.

Anything else we need to know?:
My use case here is for a strided FFT computation along timeseries data (spectrogram) which is one of the proposed use cases of the rolling.construct method. I desire a rolling constant-size window with strided overlap that sweeps across the data with no NaNs, resulting in any number of full windows (min_periods = None (window size) ). I can get close to my desired result with:

arr.rolling( time = 4 ) \
   .construct( 'window', stride = 4 ) \
   .dropna( 'time' ) \
   .groupby( 'time' ) \
   .apply( print_da )

but this gives me the (expected) result:

[1. 2. 3. 4.]
[5. 6. 7. 8.]
[ 9. 10. 11. 12.]
[13. 14. 15. 16.]

Note how the first sample of the data is dropped

My synthesized data has a particular phase property that requires the first window consists of the first n samples of the data; basically I need that first sample. I can get my desired result by not using the strided keyword argument:

arr.rolling( time = 4 ) \
   .construct( 'window' ) \
   .dropna( 'time' ) \
   .isel( time = slice( None, None, 4 ) ) \
   .groupby( 'time' ) \
   .apply( print_da )

But this implementation inflates the representation of the data in memory and is very very slow for obvious reasons. On my actual timeseries dataset, this method causes a memory error after inflating 200 MB dataset to several tens of gigabytes.

There are obviously other ways to achieve this result, but none of them met my needs. One possibility would be to use groupby_bins, but that requires me to specify the number of bins I want along the axis. I actually care less about the number of bins, moreso that they're all the same size and have a consistent stride -- functionality that pointed me to rolling. I could also use rolling.reduce followed by a dropna, but that method requires me to rewrite my chunked analysis method to operate only on the raw data of the array, without access to coordinates associated with that chunk. I actually find the rolling.reduce methodology to be quite counterintuitive and would prefer a rolling.apply method instead, but that's a separate feature request.

Environment:

Output of xr.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 3.7.6 | packaged by conda-forge | (default, Jun 1 2020, 18:57:50) [GCC 7.5.0] python-bits: 64 OS: Linux OS-release: 4.15.0-115-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: en_US.UTF-8 LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 libhdf5: 1.10.6 libnetcdf: None

xarray: 0.15.1
pandas: 1.0.5
numpy: 1.18.5
scipy: 1.4.1
netCDF4: None
pydap: None
h5netcdf: 0.8.1
h5py: 2.10.0
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.19.0
distributed: None
matplotlib: 3.2.2
cartopy: None
seaborn: 0.10.1
numbagg: None
setuptools: 47.3.1.post20200616
pip: 20.1.1
conda: 4.8.3
pytest: None
IPython: 7.15.0
sphinx: 3.1.1

The text was updated successfully, but these errors were encountered:

dcherian · 2020-12-30T17:12:37Z

I think this is a good feature request. rolling pads with nans by default. We should allow users to turn that off or specify other padding options (see #2007, #3587)

griffinmilsap · 2020-12-30T17:37:14Z

@dcherian -- Very good call. Simply being able to turn off NaN padding would permit my use case. Would it make more sense to close this bug report and submit a feature request with that functionality? I can submit that if so :)

dcherian · 2020-12-30T18:29:46Z

I'll close this as a dupe of #2007, we can keep the conversation going there.

dcherian closed this as completed Dec 30, 2020

mdgoldberg mentioned this issue Jul 7, 2021

rolling: allow control over padding #2007

Open

kmsquire mentioned this issue Jul 14, 2021

Allow no padding for rolling windows #5603

Open

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Strided rolling.construct results in unexpected behavior #4743

Strided rolling.construct results in unexpected behavior #4743

griffinmilsap commented Dec 30, 2020

dcherian commented Dec 30, 2020

griffinmilsap commented Dec 30, 2020

dcherian commented Dec 30, 2020

Strided rolling.construct results in unexpected behavior #4743

Strided rolling.construct results in unexpected behavior #4743

Comments

griffinmilsap commented Dec 30, 2020

dcherian commented Dec 30, 2020

griffinmilsap commented Dec 30, 2020

dcherian commented Dec 30, 2020