Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support VARIANCE and STD aggregation in rolling op #8809

Merged
merged 40 commits into from
Sep 8, 2021

Conversation

isVoid
Copy link
Contributor

@isVoid isVoid commented Jul 21, 2021

Part 1 of #8695

This PR adds support to STD and VARIANCE rolling aggregations in libcudf.

  • Supported types include numeric types and fixed point types. Chrono types are not supported - see thread in issue.

Implementation notes:

@github-actions github-actions bot added Python Affects Python cuDF API. libcudf Affects libcudf (C++/CUDA) code. labels Jul 21, 2021
@isVoid isVoid added this to the Time Series Analysis milestone Jul 21, 2021
@isVoid isVoid added 2 - In Progress Currently a work in progress feature request New feature or request non-breaking Non-breaking change labels Jul 21, 2021
@isVoid isVoid self-assigned this Jul 23, 2021
@github-actions github-actions bot removed the Python Affects Python cuDF API. label Jul 23, 2021
@isVoid isVoid changed the base branch from branch-21.08 to branch-21.10 July 23, 2021 21:56
@isVoid isVoid marked this pull request as ready for review July 23, 2021 22:10
@isVoid isVoid requested a review from a team as a code owner July 23, 2021 22:10
@isVoid isVoid removed the 2 - In Progress Currently a work in progress label Jul 23, 2021
@isVoid
Copy link
Contributor Author

isVoid commented Aug 25, 2021

Per discussion above, the current rolling aggregation returns null under following corner cases (assume count is the number of valid observations in the window):

  • count == 0
  • count < min_periods
  • ddof > count (results in negative variance) [EDIT 08/30/2021] - now valid but NaN, because it has valid input but mathematically ill defined

When ddof == count, the corresponding row is valid, the result is div by zero values (inf/nan).

@ttnghia
Copy link
Contributor

ttnghia commented Aug 25, 2021

FYI: In Spark, there is an option allowing you to specify whether to output a null or a NaN when divide-by-zero: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CentralMomentAgg.scala#L67

@isVoid
Copy link
Contributor Author

isVoid commented Aug 25, 2021

@ttnghia Is it a global switch or specific to std/var aggregation?

@ttnghia
Copy link
Contributor

ttnghia commented Aug 25, 2021

It's just specific to std/var aggregation. I'm not sure if this option is exposed to the user though. @revans2 do you have any idea?

@revans2
Copy link
Contributor

revans2 commented Aug 25, 2021

It's just specific to std/var aggregation. I'm not sure if this option is exposed to the user though. @revans2 do you have any idea?

It is a global config in newer versions of Spark spark.sql.legacy.statisticalAggregate. It was added as a part of Spark 3.1.0 apparently as a bug fix to make the Spark SQL more ANSI complaint https://issues.apache.org/jira/browse/SPARK-13860. But like I said before we don't have to push this down onto CUDF to support. If CUDF wants to support it being configurable, that is great. But we can do a cheap non-null count aggregation on the same column and then swap out any values based on 0 or 1 counts to fix up the CUDF results to match Spark as needed.

Co-authored-by: Christopher Harris <xixonia@gmail.com>
@isVoid
Copy link
Contributor Author

isVoid commented Aug 27, 2021

@revans2 Thanks for the info. As mentioned several times by Bobby, for var/std case, doing a COUNT_VALID rolling aggregation on the input column can resolve the ambiguity if desired. Per discussion above, configurable null/nan has greater impact to the aggregation API as a whole, and is out of scope of this PR.

@mythrocks
Copy link
Contributor

This appears to be just a copy of the GPU code for which it is testing but in CPU code. So a changes to one would likely require changing the other. The expected output is guaranteed to match the results if the test code logic always matches regardless if the algorithm is correct.

In defense of @isVoid's PR, the oldest rolling-window code was this way for a bit. I've tried moving to using more explicit (manually verifiable) values for the more recent feature additions. It will be worthwhile to dismantle the old tests and write explicit ones for them. As @davidwendt, @isVoid, etc. have noted, we should probably do that in a follow-up.

Copy link
Contributor

@mythrocks mythrocks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

cpp/src/rolling/rolling_detail.cuh Outdated Show resolved Hide resolved
cpp/src/rolling/rolling_detail.cuh Outdated Show resolved Hide resolved
cpp/src/rolling/rolling_detail.cuh Show resolved Hide resolved
@mythrocks
Copy link
Contributor

Regarding the change in behaviour since my last review (i.e. specifically when count < ddof: My understanding of Spark's stddev function is that it's equivalent to having ddof=1. The proposed behaviour doesn't look like it messes things up for Spark. Still 👍.

Copy link
Contributor

@davidwendt davidwendt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

isVoid and others added 2 commits September 1, 2021 09:17
Co-authored-by: David Wendt <45795991+davidwendt@users.noreply.github.com>
@isVoid
Copy link
Contributor Author

isVoid commented Sep 7, 2021

rerun tests

1 similar comment
@isVoid
Copy link
Contributor Author

isVoid commented Sep 7, 2021

rerun tests

@isVoid isVoid added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Sep 7, 2021
@isVoid
Copy link
Contributor Author

isVoid commented Sep 8, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 9b532ab into rapidsai:branch-21.10 Sep 8, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants