Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Strided rolling.construct results in unexpected behavior #4743

Closed
griffinmilsap opened this issue Dec 30, 2020 · 3 comments · May be fixed by #5603
Closed

Strided rolling.construct results in unexpected behavior #4743

griffinmilsap opened this issue Dec 30, 2020 · 3 comments · May be fixed by #5603

Comments

@griffinmilsap
Copy link

Minimal Complete Verifiable Example: Consider the following -

def print_da( da ):
    print( da.values )
    return da

arr = xr.DataArray( np.arange( 20 ), dims = [ 'time' ] )
arr.rolling( time = 4, min_periods = 4 ) \
   .construct( 'window', stride = 4 ) \
   .groupby( 'time' ) \
   .apply( print_da )

What you expected to happen:

[0. 1. 2. 3.]
[4. 5. 6. 7.]
[ 8.  9. 10. 11.]
[12. 13. 14. 15.]
[16. 17. 18. 19.]

What happened: Output:

[nan nan nan  0.]
[1. 2. 3. 4.]
[5. 6. 7. 8.]
[ 9. 10. 11. 12.]
[13. 14. 15. 16.]

Calling construct on a rolling window with a stride parameter results in unexpected behavior. After reading through some of the documentation, I now understand that rolling.construct mostly ignores the min_periods argument -- so it may be valid to say that this is expected output. That said, it is actually impossible to achieve the windowing result I desire using the current implementation of xarray without reconstructing the whole array in memory, which leads to wasteful memory usage, and in some cases, expansion of the array beyond my machine's memory capabilities.

Anything else we need to know?:
My use case here is for a strided FFT computation along timeseries data (spectrogram) which is one of the proposed use cases of the rolling.construct method. I desire a rolling constant-size window with strided overlap that sweeps across the data with no NaNs, resulting in any number of full windows (min_periods = None (window size) ). I can get close to my desired result with:

arr.rolling( time = 4 ) \
   .construct( 'window', stride = 4 ) \
   .dropna( 'time' ) \
   .groupby( 'time' ) \
   .apply( print_da )

but this gives me the (expected) result:

[1. 2. 3. 4.]
[5. 6. 7. 8.]
[ 9. 10. 11. 12.]
[13. 14. 15. 16.]

Note how the first sample of the data is dropped

My synthesized data has a particular phase property that requires the first window consists of the first n samples of the data; basically I need that first sample. I can get my desired result by not using the strided keyword argument:

arr.rolling( time = 4 ) \
   .construct( 'window' ) \
   .dropna( 'time' ) \
   .isel( time = slice( None, None, 4 ) ) \
   .groupby( 'time' ) \
   .apply( print_da )

But this implementation inflates the representation of the data in memory and is very very slow for obvious reasons. On my actual timeseries dataset, this method causes a memory error after inflating 200 MB dataset to several tens of gigabytes.

There are obviously other ways to achieve this result, but none of them met my needs. One possibility would be to use groupby_bins, but that requires me to specify the number of bins I want along the axis. I actually care less about the number of bins, moreso that they're all the same size and have a consistent stride -- functionality that pointed me to rolling. I could also use rolling.reduce followed by a dropna, but that method requires me to rewrite my chunked analysis method to operate only on the raw data of the array, without access to coordinates associated with that chunk. I actually find the rolling.reduce methodology to be quite counterintuitive and would prefer a rolling.apply method instead, but that's a separate feature request.

Environment:

Output of xr.show_versions() INSTALLED VERSIONS ------------------ commit: None python: 3.7.6 | packaged by conda-forge | (default, Jun 1 2020, 18:57:50) [GCC 7.5.0] python-bits: 64 OS: Linux OS-release: 4.15.0-115-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: en_US.UTF-8 LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 libhdf5: 1.10.6 libnetcdf: None

xarray: 0.15.1
pandas: 1.0.5
numpy: 1.18.5
scipy: 1.4.1
netCDF4: None
pydap: None
h5netcdf: 0.8.1
h5py: 2.10.0
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.19.0
distributed: None
matplotlib: 3.2.2
cartopy: None
seaborn: 0.10.1
numbagg: None
setuptools: 47.3.1.post20200616
pip: 20.1.1
conda: 4.8.3
pytest: None
IPython: 7.15.0
sphinx: 3.1.1

@dcherian
Copy link
Contributor

I think this is a good feature request. rolling pads with nans by default. We should allow users to turn that off or specify other padding options (see #2007, #3587)

@griffinmilsap
Copy link
Author

@dcherian -- Very good call. Simply being able to turn off NaN padding would permit my use case. Would it make more sense to close this bug report and submit a feature request with that functionality? I can submit that if so :)

@dcherian
Copy link
Contributor

I'll close this as a dupe of #2007, we can keep the conversation going there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants