Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Efficient rolling 'trick' #1978

Closed
max-sixty opened this issue Mar 10, 2018 · 4 comments
Closed

Efficient rolling 'trick' #1978

max-sixty opened this issue Mar 10, 2018 · 4 comments

Comments

@max-sixty
Copy link
Collaborator

Based off http://www.rigtorp.se/2011/01/01/rolling-statistics-numpy.html, we wrote up a function that 'tricks' numpy into presenting an array that looks rolling, but without the O^2 memory requirements

Would people be interested in this going into xarray?

It seems to work really well on a few use-cases, but I imagine it's enough trickery that we might not want to support it in xarray.
And, to be clear, it's strictly worse where we have rolling algos. But where we don't, you get a rolling apply without the python loops.

def rolling_window_numpy(a, window):
    """ Make an array appear to be rolling, but using only a view
    http://www.rigtorp.se/2011/01/01/rolling-statistics-numpy.html
    """
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)


def rolling_window(da, span, dim=None, new_dim='dim_0'):
    """ Adds a rolling dimension to a DataArray using only a view """
    original_dims = da.dims
    da = da.transpose(*tuple(d for d in da.dims if d != dim) + (dim,))

    result = apply_ufunc(
        rolling_window_numpy,
        da,
        output_core_dims=((new_dim,),),
        kwargs=(dict(window=span)))

    return result.transpose(*(original_dims + (new_dim,)))

# tests

import numpy as np
import pandas as pd
import pytest
import xarray as xr


@pytest.fixture
def da(dims):
    return xr.DataArray(
        np.random.rand(5, 10, 15), dims=(list('abc'))).transpose(*dims)


@pytest.fixture(params=[
    list('abc'),
    list('bac'),
    list('cab'),
])
def dims(request):
    return request.param


def test_iterate_imputation_fills_missing(sample_data):
    sample_data.iloc[2, 2] = pd.np.nan
    result = iterate_imputation(sample_data)
    assert result.shape == sample_data.shape
    assert result.notnull().values.all()


def test_rolling_window(da, dims):

    result = rolling_window(da, 3, dim='c', new_dim='x')

    assert result.transpose(*list('abcx')).shape == (5, 10, 13, 3)

    # should be a view, so doesn't have any larger strides
    assert np.max(result.values.strides) == 10 * 15 * 8


def test_rolling_window_values():

    da = xr.DataArray(np.arange(12).reshape(2, 6), dims=('item', 'date'))

    rolling = rolling_window(da, 3, dim='date', new_dim='rolling_date')

    expected = sum([11, 10, 9])
    result = rolling.sum('rolling_date').isel(item=1, date=-1)
    assert result == expected
@jhamman
Copy link
Member

jhamman commented Mar 10, 2018

I'll let @fujiisoup / @shoyer confirm but this looks eerily similar to #1837.

@shoyer
Copy link
Member

shoyer commented Mar 10, 2018

This is exactly what @fujiisoup just implemented in #1837.

It's a nice trick. If the window is large, it's not quite as fast using a clever algorithm for rolling sums like bottleneck, but it still gives something like a 100x performance boost over the naive loop we used to use.

@fujiisoup
Copy link
Member

Yes.

In [7]: da.rolling(date=3).construct('rolling_date')
Out[7]: 
<xarray.DataArray (item: 2, date: 6, rolling_date: 3)>
array([[[nan, nan,  0.],
        [nan,  0.,  1.],
        [ 0.,  1.,  2.],
        [ 1.,  2.,  3.],
        [ 2.,  3.,  4.],
        [ 3.,  4.,  5.]],

       [[nan, nan,  6.],
        [nan,  6.,  7.],
        [ 6.,  7.,  8.],
        [ 7.,  8.,  9.],
        [ 8.,  9., 10.],
        [ 9., 10., 11.]]])
Dimensions without coordinates: item, date, rolling_date

does the similar thing (rolling_window in your example).

FYI, using sum without skipna option for such a strided DataArray (in your test_rolling_window_values) is not a good idea.
We internally use np.nansum and this copies the entire array once. It throws away the advantage of the strided trick.
sum(skipna=False) is memory efficient.

@max-sixty
Copy link
Collaborator Author

this looks eerily similar to #1837

🤦‍♂️ I think my memory is getting worse every day! I glanced at this myself back in Jan.

Closing!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants