[FEA] Rolling standard deviation #8695
Chatted with @isVoid offline to discuss this in the context of data types (decimal, datetime, and timedelta).

**Datetime**

Neither Spark nor pandas support this operation on built-in datetime types.

```python
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql import functions as F
import pandas as pd
import numpy as np

np.random.seed(12)

spark = SparkSession.builder \
    .master("local") \
    .getOrCreate()

nrows = 100
keycol = [0] * (nrows//2) + [1] * (nrows//2)

df = pd.DataFrame({
    "key": keycol,
    "a": np.random.randint(0, 100, nrows),
    "b": np.random.randint(0, 100, nrows),
    "c": np.random.randint(0, 100, nrows),
    "d": pd.date_range(start="2001-01-01", periods=nrows, freq="D"),
})
df["e"] = pd.to_timedelta(df.d.astype("int"))

# df.rolling(4).d.std().head(10)  # NotImplementedError

sdf = spark.createDataFrame(df)
sdf.createOrReplaceTempView("df")

# sdf.withColumn(
#     "std",
#     F.stddev_samp("d").over(Window.rowsBetween(-2, 0))
# ).show(5)  # AnalysisException
```

**Decimal**

```python
new = sdf.withColumn("b_decimal", sdf.b.cast("Decimal"))
new.select(["b_decimal"]).withColumn(
    "std",
    F.stddev_samp("b_decimal").over(Window.rowsBetween(-2, 0))
).show(5)
```

```
+---------+------------------+
|b_decimal|               std|
+---------+------------------+
|       68|              null|
|       25|30.405591591021544|
|       44|21.548395145191982|
|       22|11.930353445448853|
|       69|23.515952032609693|
+---------+------------------+
only showing top 5 rows
```

```python
from decimal import Decimal

s = pd.Series([Decimal("10.0"), Decimal("10.0"), Decimal("11.0")])
s.rolling(2).std()
```

```
0         NaN
1    0.000000
2    0.707107
dtype: float64
```

**Timedelta**

Pandas does not support this operation on the timedelta dtype, and I believe Spark does not have an analogous type to timedelta (please correct me if I'm wrong!).

```python
df.e.rolling(2).std()
```

```
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
/raid/nicholasb/miniconda3/envs/rapids-21.08/lib/python3.8/site-packages/pandas/core/window/rolling.py in _apply_series(self, homogeneous_func, name)
    368             input = obj.values if name != "count" else notna(obj.values).astype(int)
--> 369             values = self._prep_values(input)
    370         except (TypeError, NotImplementedError) as err:

/raid/nicholasb/miniconda3/envs/rapids-21.08/lib/python3.8/site-packages/pandas/core/window/rolling.py in _prep_values(self, values)
    276         elif needs_i8_conversion(values.dtype):
--> 277             raise NotImplementedError(
    278                 f"ops for {self._window_type} for this "

NotImplementedError: ops for Rolling for this dtype timedelta64[ns] are not implemented
```
@sameerz for Spark
From the Spark perspective we really would like to be able to do `stddev_samp` and `stddev_pop`. I am not a data scientist nor a statistician, so I don't know if there is a way for us to get `stddev_samp`, `stddev_pop`, and degrees of freedom from the same core aggregation. If there is, we are happy to use it, even if it requires some extra post-processing.

Spark only supports `stddev_samp` and `stddev_pop` on double values; it will automatically convert many other types to doubles before doing the computation. Spark is trying to become more ANSI compliant and is adding some timedelta-like support, but that is not something the RAPIDS plugin is working on right now. Spark does support a `CalendarInterval` type, which is a combination of month, day, and microsecond intervals, but it is mostly used for operations like "add 3 months and 2 days to a date column". You can have a column of `CalendarIntervals`, but it is not common.
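For what it's worth, both flavors do fall out of a single core aggregation; the only difference is the divisor (the ddof). A minimal Python sketch with made-up helper names, just to illustrate the shared aggregation plus post-processing:

```python
import numpy as np

def window_moments(values):
    """One pass over a window: count, mean, and sum of squared deviations (M2)."""
    x = np.asarray(values, dtype="float64")
    n = x.size
    mean = x.mean()
    m2 = ((x - mean) ** 2).sum()
    return n, mean, m2

def stddev(m2, n, ddof):
    """Shared post-processing: ddof=1 gives the sample std, ddof=0 the population std."""
    return float("nan") if n - ddof <= 0 else (m2 / (n - ddof)) ** 0.5

n, mean, m2 = window_moments([68, 25, 44])
print(stddev(m2, n, ddof=1))  # ~ stddev_samp, ≈ 21.548 (matches the Spark output above)
print(stddev(m2, n, ddof=0))  # ~ stddev_pop
```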
OK, I looked at the math Spark uses to calculate stddev_pop vs stddev_samp and the ddof explanation in #8809, so it looks like this will work for us.
Part 1 of #8695. This PR adds support for `STD` and `VARIANCE` rolling aggregations in libcudf.

- Supported types include numeric types and fixed-point types. Chrono types are not supported - see thread in issue.

Implementation notes:

- [Welford](https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm)'s algorithm is used

Authors:
- Michael Wang (https://github.com/isVoid)

Approvers:
- MithunR (https://github.com/mythrocks)
- David Wendt (https://github.com/davidwendt)

URL: #8809
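For context, a rough Python sketch of the textbook Welford recurrence the PR references (not the libcudf implementation, just the idea):

```python
def welford_std(values, ddof=1):
    """Single-pass standard deviation via Welford's online recurrence."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    if count - ddof <= 0:
        return float("nan")
    return (m2 / (count - ddof)) ** 0.5

print(welford_std([68, 25, 44]))  # ≈ 21.548, matching the Spark window output above
```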
…rolling.std` (#9097)

Closes #8695
Closes #8696

This PR creates bindings for rolling aggregations for variance and standard deviation. Unlike pandas, the underlying implementation from libcudf computes each window independently from other windows.

Authors:
- Michael Wang (https://github.com/isVoid)

Approvers:
- Vyas Ramasubramani (https://github.com/vyasr)
- Sheilah Kirui (https://github.com/skirui-source)

URL: #9097
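A quick usage sketch, assuming the bindings expose the pandas-style rolling API (treat the exact signature as an assumption):

```python
import cudf

s = cudf.Series([68.0, 25.0, 44.0, 22.0, 69.0])
# Rolling sample standard deviation over 3-row windows, mirroring pandas' Rolling.std
print(s.rolling(3).std())
```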
Today, I can calculate rolling average, sum, and a variety of other aggregations. I'd like to also calculate the rolling standard deviation.
As an example, I might have a large set of sensor data. To make sure my sensors are behaving within a normal range, I'd like to measure the rolling standard deviation and post-process the results to alert me if any window's standard deviation exceeds an acceptable threshold.
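As a hedged illustration of that workflow in pandas (synthetic data; the series name, window size, and threshold are all made up):

```python
import numpy as np
import pandas as pd

# Synthetic sensor readings: mostly well-behaved noise around 50
readings = pd.Series(np.random.default_rng(0).normal(50, 5, 1000), name="sensor_a")

rolling_std = readings.rolling(window=60).std()  # sample std over each 60-sample window
threshold = 10.0
alerts = rolling_std[rolling_std > threshold]    # windows whose spread is out of range
print(alerts.head())
```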
Spark differentiates between the sample and population standard deviation (`stddev_samp` vs `stddev_pop`), while pandas instead parameterizes the `std` function with an argument for degrees of freedom (`ddof`).

Pandas:
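(A hedged sketch of the pandas parameterization, with made-up data:)

```python
import pandas as pd

s = pd.Series([68.0, 25.0, 44.0, 22.0, 69.0])
s.rolling(3).std(ddof=1)  # sample standard deviation (the default)
s.rolling(3).std(ddof=0)  # population standard deviation
```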
Spark:
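(A hedged sketch of the Spark side, reusing the `sdf` DataFrame built earlier in this thread:)

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.rowsBetween(-2, 0)  # current row plus the two preceding rows
sdf.withColumn("std_samp", F.stddev_samp("b").over(w)) \
   .withColumn("std_pop", F.stddev_pop("b").over(w)) \
   .show(5)
```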