[FEA] Rolling standard deviation #8695

Closed
beckernick opened this issue Jul 8, 2021 · 4 comments · Fixed by #9097

@beckernick (Member)

Today, I can calculate rolling average, sum, and a variety of other aggregations. I'd like to also calculate the rolling standard deviation.

As an example, I might have a large set of sensor data. To make sure my sensors are behaving within normal range, I'd like to measure the rolling standard deviation and post-process the results to alert me if any window's standard deviation exceeds an acceptable threshold.
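
As a sketch of that post-processing (illustrative only; the data and threshold here are made up):

import pandas as pd

readings = pd.Series([10.0, 10.1, 9.9, 10.0, 25.0, 10.2])
threshold = 2.0  # hypothetical acceptable range
alerts = readings.rolling(3).std() > threshold  # True where a window is out of range
print(alerts)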

Spark differentiates between the sample and population standard deviation (stddev_samp vs stddev_pop), while pandas instead parameterizes the std function with an argument for degrees of freedom (ddof).
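
Assuming the pandas default of ddof=1, the two conventions line up as follows (a sketch, not from the original report):

import pandas as pd

s = pd.Series([10, 3, 4, 2, -3, 9, 10])
sample_std = s.rolling(3).std(ddof=1)      # matches Spark's stddev_samp
population_std = s.rolling(3).std(ddof=0)  # matches Spark's stddev_pop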

Pandas:

from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql import functions as F
import pandas as pd

spark = SparkSession.builder \
    .master("local") \
    .getOrCreate()

df = pd.DataFrame({
    "a": [10,3,4,2,-3,9,10],
    "b": [10,23,-4,2,-3,9,19],
    "c": [10,-23,-4,21,-3,19,19],
})

print(df.a.rolling(3).std())
0         NaN
1         NaN
2    3.785939
3    1.000000
4    3.605551
5    6.027714
6    7.234178
Name: a, dtype: float64

Spark:

sdf = spark.createDataFrame(df)
sdf.createOrReplaceTempView("df")

sdf.withColumn(
    "std",
    F.stddev_samp("a").over(Window.rowsBetween(-2, 0))
).show()
+---+---+---+------------------+
|  a|  b|  c|               std|
+---+---+---+------------------+
| 10| 10| 10|              null|
|  3| 23|-23| 4.949747468305833|
|  4| -4| -4|3.7859388972001824|
|  2|  2| 21|               1.0|
| -3| -3| -3| 3.605551275463989|
|  9|  9| 19| 6.027713773341708|
| 10| 19| 19| 7.234178138070234|
+---+---+---+------------------+
@beckernick beckernick added feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. labels Jul 8, 2021
@beckernick beckernick added this to the Time Series Analysis milestone Jul 14, 2021
@isVoid isVoid self-assigned this Jul 20, 2021
@beckernick (Member, Author) commented Jul 20, 2021

Chatted with @isVoid offline to discuss this in the context of data types (decimal, datetime, and timedelta).

Datetime

Neither Spark nor pandas supports this operation on built-in datetime types.

from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql import functions as F
import pandas as pd
import numpy as np

np.random.seed(12)

spark = SparkSession.builder \
    .master("local") \
    .getOrCreate()

nrows = 100
keycol = [0] * (nrows//2) + [1] * (nrows//2)

df = pd.DataFrame({
    "key": keycol,
    "a": np.random.randint(0, 100, nrows),
    "b": np.random.randint(0, 100, nrows),
    "c": np.random.randint(0, 100, nrows),
    "d": pd.date_range(start="2001-01-01", periods=nrows, freq="D"),
})
df["e"] = pd.to_timedelta(df.d.astype("int"))


# df.rolling(4).d.std().head(10)  # NotImplementedError

sdf = spark.createDataFrame(df)
sdf.createOrReplaceTempView("df")

# sdf.withColumn(
#     "std",
#     F.stddev_samp("d").over(Window.rowsBetween(-2, 0))
# ).show(5) # AnalysisException

Decimal

Spark supports this operation on Decimal types. Pandas doesn't have a builtin decimal type, but will succeed with an object column of Decimals.

new = sdf.withColumn("b_decimal", sdf.b.cast("Decimal"))
new.select(["b_decimal"]).withColumn(
    "std",
    F.stddev_samp("b_decimal").over(Window.rowsBetween(-2, 0))
).show(5)
+---------+------------------+
|b_decimal|               std|
+---------+------------------+
|       68|              null|
|       25|30.405591591021544|
|       44|21.548395145191982|
|       22|11.930353445448853|
|       69|23.515952032609693|
+---------+------------------+
only showing top 5 rows

from decimal import Decimal
s = pd.Series([Decimal("10.0"), Decimal("10.0"), Decimal("11.0")])
s.rolling(2).std()
0         NaN
1    0.000000
2    0.707107
dtype: float64

Timedelta

Pandas does not support this operation on the timedelta dtype, and I believe Spark does not have an analogous type to timedelta (please correct me if I'm wrong!).

df.e.rolling(2).std()
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
/raid/nicholasb/miniconda3/envs/rapids-21.08/lib/python3.8/site-packages/pandas/core/window/rolling.py in _apply_series(self, homogeneous_func, name)
    368             input = obj.values if name != "count" else notna(obj.values).astype(int)
--> 369             values = self._prep_values(input)
    370         except (TypeError, NotImplementedError) as err:

/raid/nicholasb/miniconda3/envs/rapids-21.08/lib/python3.8/site-packages/pandas/core/window/rolling.py in _prep_values(self, values)
    276         elif needs_i8_conversion(values.dtype):
--> 277             raise NotImplementedError(
    278                 f"ops for {self._window_type} for this "

NotImplementedError: ops for Rolling for this dtype timedelta64[ns] are not implemented
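
A possible workaround (my suggestion, not from the thread) is to compute on the integer nanosecond view and reinterpret the result as a timedelta:

import pandas as pd

e = pd.Series(pd.to_timedelta([1, 2, 4, 8], unit="s"))
std_ns = e.astype("int64").rolling(2).std()  # nanoseconds as int64
print(pd.to_timedelta(std_ns, unit="ns"))    # back to timedelta64[ns]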

@harrism (Member) commented Jul 20, 2021

@sameerz for Spark

@revans2 (Contributor) commented Jul 27, 2021

From the Spark perspective, we really would like to be able to do stddev_samp and stddev_pop. I am neither a data scientist nor a statistician, so I don't know if there is a way for us to get stddev_samp, stddev_pop, and degrees of freedom from the same core aggregation. If there is, we are happy to use it, even if it requires some extra post-processing. Spark only supports stddev_samp and stddev_pop on double values; it will automatically convert many other types to doubles before doing the computation.

Spark is trying to become more ANSI compliant and is adding some timedelta-like support, but that is not something the RAPIDS plugin is working on right now. Spark does support a CalendarInterval type, which combines month, day, and microsecond intervals, but it is mostly used for operations like adding 3 months and 2 days to a date column. You can have a column of CalendarIntervals, but it is not common.

@revans2 (Contributor) commented Jul 27, 2021

OK, I looked at the math used by Spark to calculate stddev_pop vs stddev_samp and the ddof explanation in #8809, so it looks like it will work for us.
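
The relationship that makes this work (my summary, not quoting #8809): with m2 the sum of squared deviations from the window mean and n the window count, stddev = sqrt(m2 / (n - ddof)), so ddof=1 yields stddev_samp and ddof=0 yields stddev_pop.

import math

vals = [10.0, 3.0, 4.0]
n, mean = len(vals), sum(vals) / len(vals)
m2 = sum((x - mean) ** 2 for x in vals)  # sum of squared deviations
print(math.sqrt(m2 / (n - 1)))  # stddev_samp, ~3.785939 (matches the outputs above)
print(math.sqrt(m2 / n))        # stddev_pop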

rapids-bot (bot) pushed a commit that referenced this issue Sep 8, 2021
Part 1 of #8695 

This PR adds support for `STD` and `VARIANCE` rolling aggregations in libcudf.
- Supported types include numeric types and fixed point types. Chrono types are not supported - see thread in issue.

Implementation notes:
- [Welford](https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm)'s algorithm is used

Authors:
  - Michael Wang (https://github.com/isVoid)

Approvers:
  - MithunR (https://github.com/mythrocks)
  - David Wendt (https://github.com/davidwendt)

URL: #8809
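
For context, the commit above says Welford's online algorithm is used; here is a minimal Python illustration of the update step (illustrative only, not the actual libcudf/CUDA implementation):

def welford_variance(values, ddof=1):
    n, mean, m2 = 0, 0.0, 0.0  # m2 accumulates squared deviations from the mean
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n         # update running mean
        m2 += delta * (x - mean)  # uses deviation from both old and new mean
    return m2 / (n - ddof)

print(welford_variance([10, 3, 4]) ** 0.5)  # ~3.785939, matches the pandas output above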
rapids-bot (bot) pushed a commit that referenced this issue Sep 15, 2021
…rolling.std` (#9097)

Closes #8695
Closes #8696 

This PR creates bindings for the rolling variance and standard deviation aggregations. Unlike pandas, the underlying libcudf implementation computes each window independently of other windows.

Authors:
  - Michael Wang (https://github.com/isVoid)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Sheilah Kirui (https://github.com/skirui-source)

URL: #9097
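
With these bindings in place, usage should mirror pandas (a sketch, assuming a GPU and a cuDF release that includes #9097):

import cudf

s = cudf.Series([10, 3, 4, 2, -3, 9, 10])
print(s.rolling(3).std())        # ddof defaults to 1, as in pandas
print(s.rolling(3).var(ddof=0))  # population variance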