[REVIEW] Fix rolling-window count for null input #6344
Please update the changelog in order to start CI tests. View the gpuCI docs here. |
Force-pushed from 0e6f2f4 to b062211.
This looks like it would fix the issue for Spark, where we cannot do a COUNT on a column for a rolling window. @kkraus14, is this going to mess anything up on the Python side? |
Force-pushed from b062211 to 8211a68.
(Apologies for the force-push. I thought I'd get the fix for the code-break in before the review begins.) |
cc @shwina @brandon-b-miller who worked on rolling windows |
I don't think we rely on this operation producing a particular result internally for anything on the python side, so I don't think anything will end up happening except for the associated changes propagating to the user facing version of this API. I think this might address #5580 and is related to pandas changes discussed in pandas-dev/pandas#34466 and the issues linked to therein. Honestly, when @shwina and I looked at this a few months back, we had a hard time parsing out the logic behind the pandas behavior. It seemed like no matter how we preprocessed I'm curious to see what happens if we build this branch and remove the |
Thank you for the link, @brandon-b-miller. Your assessment is accurate.
(He means the preceding row, right? This was an informative discussion.) Regardless, it appears that behaviour of count in |
Some of the python tests seem to be failing.
Is there an easy way to run a specific Python test? I suspect these tests will need their expected output modified. |
You can use |
The failure is quite interesting. Reproduced here in full, to save on browser tabs:
For some odd reason, the fifth input causes a |
I suspect that the Pandas series rolling-count If my hunch is right, we might need to move this test to |
There's some tests in
|
On the face of it, my code seems to be producing the incorrect result. I will investigate. |
Connected offline and found this to be part of the issue https://github.com/rapidsai/cudf/blob/branch-0.16/python/cudf/cudf/core/window/rolling.py#L233-L234 |
@brandon-b-miller is right, of course.
I'll post a correction for the Python tests. |
Drat. Removing the |
I'll look into this today. |
Looks good. A couple of small things. My review can replace @trxcllnt's if he is busy.
1. Switched explicit for-loop to thrust::count_if().
2. Fixed spelling in rolling-window Python tests.
3. Switched SFINAE from return-type to template parameter.
rerun tests |
@shwina, might I please bug you to have a look at the Python bits, at your earliest convenience? |
Thanks for the review, @harrism! |
1. Added test for expected values for rolling_window count.
2. Fixed convention for parametrized agg values.
3. Reformatting.
LGTM. Thanks @mythrocks!
Closes #6343.
Fixes COUNT_ALL, COUNT_VALID for window functions.
In rolling_window() operations, COUNT_VALID/COUNT_ALL should only return null rows if the min_periods requirement is not satisfied. For all other cases, the count produced must be valid, even if the input row is null.
As it currently stands, the COUNT* rolling_window() operation returns null if even one of its input rows is null. That behaviour, while correct for aggregations like SUM, is incorrect for COUNT.
This commit should fix the problem.
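The intended semantics can be sketched in plain Python. This is only an illustration of the rule described above, not the libcudf implementation: the function name `rolling_count` and its parameters are hypothetical, `None` stands in for a null row, `preceding` is assumed to include the current row, and `min_periods` is assumed to be checked against the number of rows in the window rather than the number of valid rows.

```python
from typing import List, Optional


def rolling_count(
    values: List[Optional[int]],
    preceding: int,
    following: int,
    min_periods: int,
    count_nulls: bool,
) -> List[Optional[int]]:
    """Hypothetical reference model of a rolling COUNT.

    count_nulls=True models COUNT_ALL, count_nulls=False models COUNT_VALID.
    An output row is null (None) only when the window holds fewer than
    min_periods rows (an assumption about how min_periods is applied).
    """
    n = len(values)
    out: List[Optional[int]] = []
    for i in range(n):
        # Window bounds, clamped to the column; preceding includes row i.
        start = max(0, i - preceding + 1)
        end = min(n, i + following + 1)
        window = values[start:end]
        if len(window) < min_periods:
            out.append(None)  # min_periods not satisfied: null output row
        elif count_nulls:
            out.append(len(window))  # COUNT_ALL: every row counts
        else:
            # COUNT_VALID: only non-null rows count; the result is still
            # a valid number, even when it is 0.
            out.append(sum(v is not None for v in window))
    return out


col = [1, None, 3, None, None]
print(rolling_count(col, preceding=2, following=0, min_periods=1, count_nulls=False))
print(rolling_count(col, preceding=2, following=0, min_periods=1, count_nulls=True))
print(rolling_count(col, preceding=2, following=0, min_periods=2, count_nulls=False))
```

Under these assumptions, a window made up entirely of nulls yields a valid COUNT_VALID of 0 rather than a null, and the only null output is the first row when min_periods is 2, because its window holds a single row.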