[REVIEW] Fix rolling-window count for null input #6344

mythrocks · 2020-09-29T06:00:11Z

Closes #6343.

Fixes COUNT_ALL, COUNT_VALID for window functions.

In rolling_window() operations, COUNT_VALID/COUNT_ALL should only return null rows if the min_periods requirement is not satisfied. For all other cases, the count produced must be valid, even if the input row is null.

As it currently stands, the COUNT* rolling_window() operation returns null if even one of its input rows is null. That behaviour, while correct for aggregations like SUM, is incorrect for COUNT.

This commit should fix the problem.

GPUtester · 2020-09-29T06:00:41Z

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

revans2 · 2020-09-29T14:52:16Z

This looks like it would fix the issue for Spark where we cannot do a COUNT on a column for a rolling window.

@kkraus14 is this going to mess anything up for python side

mythrocks · 2020-09-29T15:06:25Z

(Apologies for the force-push. I thought I'd get the fix for the code-break in before the review begins.)

mythrocks · 2020-09-29T15:08:26Z

@kkraus14 is this going to mess anything up for python side

A more complete accounting of the problem is available at #6343, in case the description here is too terse.

kkraus14 · 2020-09-29T15:09:50Z

@kkraus14 is this going to mess anything up for python side

cc @shwina @brandon-b-miller who worked on rolling windows

brandon-b-miller · 2020-09-29T15:35:33Z

I don't think we rely on this operation producing a particular result internally for anything on the python side, so I don't think anything will end up happening except for the associated changes propagating to the user facing version of this API.

I think this might address #5580 and is related to pandas changes discussed in pandas-dev/pandas#34466 and the issues linked to therein. Honestly, when @shwina and I looked at this a few months back, we had a hard time parsing out the logic behind the pandas behavior. It seemed like no matter how we preprocessed min_periods and win_size with count, there was always a set of cases where we got a different answer than pandas and it was very hard to pin down what was happening.

I'm curious to see what happens if we build this branch and remove the xfails which mark those tests right now - I'll build this branch this afternoon and try it and report back here.

mythrocks · 2020-09-29T18:22:45Z

I think this might address #5580 and is related to pandas changes discussed in pandas-dev/pandas#34466

Thank you for the link, @brandon-b-miller. Your assessment is accurate.
As a side-note, was anyone else confused with pandas-dev's nomenclature regarding the "element above"?

When your window size is an even number, with center=True you actually include 1 more element above the anchor than below the anchor.

(He means the preceding row, right? This was an informative discussion.) Regardless, it appears that behaviour of count in pandas < 1.0 was incorrect as well.

mythrocks · 2020-09-29T19:30:22Z

Some of the python tests seem to be failing.

=========================== short test summary info ============================
FAILED cudf/tests/test_rolling.py::test_rolling_with_offset[count] - Assertio...
= 1 failed, 53475 passed, 2206 skipped, 954 xfailed, 4040 xpassed, 55395 warnings in 3978.62s (1:06:18) =

Is there an easy way to run a specific Python test? I suspect these tests will need their expected output modified.

brandon-b-miller · 2020-09-29T19:32:43Z

You can use py.test test_rolling.py -k test_rolling_with_offset[count] to run just that one test

mythrocks · 2020-09-30T06:46:38Z

The failure is quite interesting. Reproduced here in full, to save on browser tabs:

agg = 'count'
    @pytest.mark.parametrize(
        "agg", ["sum", pytest.param("min"), pytest.param("max"), "mean", "count"]
    )
    def test_rolling_with_offset(agg):
        psr = pd.Series(
            [1, 2, 4, 4, np.nan, 9],
            index=[
                pd.Timestamp("20190101 09:00:00"),
                pd.Timestamp("20190101 09:00:01"),
                pd.Timestamp("20190101 09:00:02"),
                pd.Timestamp("20190101 09:00:04"),
                pd.Timestamp("20190101 09:00:07"),
                pd.Timestamp("20190101 09:00:08"),
            ],
        )
        gsr = cudf.from_pandas(psr)
>       assert_eq(
            getattr(psr.rolling("2s"), agg)().fillna(-1),
            getattr(gsr.rolling("2s"), agg)().fillna(-1),
            check_dtype=False,
        )

cudf/tests/test_rolling.py:125: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pandas/_libs/testing.pyx:67: in pandas._libs.testing.assert_almost_equal
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   AssertionError: Series are different
E   
E   Series values are different (16.66667 %)
E   [index]: [2019-01-01T09:00:00.000000000, 2019-01-01T09:00:01.000000000, 2019-01-01T09:00:02.000000000, 2019-01-01T09:00:04.000000000, 2019-01-01T09:00:07.000000000, 2019-01-01T09:00:08.000000000]
E   [left]:  [1.0, 2.0, 2.0, 1.0, -1.0, 1.0]
E   [right]: [1, 2, 2, 1, 0, 1]

For some odd reason, the fifth input causes a nan to be produced, (and replaced with -1.0). I'm poring through the pandas documentation now, to see if I can make sense of this. On the face of it, it would appear that in pandas, the nan rows cause a nan output, but only in the indexed column case.

mythrocks · 2020-09-30T08:20:36Z

I suspect that the Pandas series rolling-count min_periods fix has not been extended to "indexed" columns (using time-range based windows). I have requested the author for clarification on this behaviour.

If my hunch is right, we might need to move this test to xfail for the moment.

brandon-b-miller · 2020-09-30T12:52:44Z

There's some tests in test_rolling.py that are marked as xfail right now due to these issues. I tried removing the marks, but I'm still seeing failures. Here's an example. Any idea what might be happening?

>>> import pandas as pd
>>> import cudf
>>> psr = pd.Series([1,1,1,1])
>>> gsr = cudf.Series([1,1,1,1])
>>> psr.rolling(2,2,True).count()
0    NaN
1    2.0
2    2.0
3    2.0
dtype: float64
>>> gsr.rolling(2,2,True).count()
0    1
1    2
2    2
3    2
dtype: int32
>>> pd.__version__
'1.1.2'

mythrocks · 2020-09-30T17:52:00Z

Any idea what might be happening?

On the face of it, my code seems to be producing the incorrect result. I will investigate.

brandon-b-miller · 2020-09-30T20:52:40Z

Connected offline and found this to be part of the issue https://github.com/rapidsai/cudf/blob/branch-0.16/python/cudf/cudf/core/window/rolling.py#L233-L234

mythrocks · 2020-09-30T22:30:41Z

@brandon-b-miller is right, of course. cudf::rolling_window() seems to be working correctly, as tested with a temporarily added test-case:

42: [ RUN      ] UntypedRollingTest.TestMinPeriods
42: Input vector: 1,1,1,1
42: count(preceding=2, following=0, min_periods=2) : NULL,2,2,2
42: count(preceding=2, following=0, min_periods=1) : 1,2,2,2
42: count(preceding=2, following=0, min_periods=0) : 1,2,2,2
42: [       OK ] UntypedRollingTest.TestMinPeriods (0 ms)

I'll post a correction for the Python tests.

mythrocks · 2020-10-01T00:47:49Z

Drat. Removing the min_periods hack from rolling.py breaks test_rolling_groupby_*. This will need a more delicate hand.

brandon-b-miller · 2020-10-01T12:25:08Z

I'll look into this today.

harrism

Looks good. Couple small things. My review can replace @trxcllnt 's if he is busy.

cpp/src/rolling/rolling.cu

python/cudf/cudf/tests/test_rolling.py

1. Switched explicit for-loop to thrust::count_if(). 2. Fixed spelling in rolling-window python tests.

3. Switched SFINAE from return-type to template parameter.

mythrocks · 2020-10-06T22:18:08Z

@harrism, thank you for your review and suggestions. Please let me know if this PR is looking alright to go in. (I'm a little nervous about rebasing #6277 in time for tomorrow.)

cpp/src/rolling/rolling.cu

mythrocks · 2020-10-09T16:59:30Z

rerun tests

mythrocks · 2020-10-12T00:44:18Z

@shwina, might I please bug you to have a look at the Python bits, at your earliest convenience?

mythrocks · 2020-10-12T00:44:36Z

Thanks for the review, @harrism!

python/cudf/cudf/tests/test_rolling.py

1. Added test for expected values for rolling_window count 2. Fixed convention for parametrized agg values

3. Reformatting

shwina

LGTM. Thanks @mythrocks!

mythrocks · 2020-10-13T05:16:41Z

Thanks for the review, @shwina (and everyone). Thanks for merging this, @harrism!

mythrocks requested a review from a team as a code owner September 29, 2020 06:00

mythrocks requested review from trxcllnt and vuule September 29, 2020 06:00

mythrocks changed the title ~~Fix rolling-window count for null input~~ [WIP] Fix rolling-window count for null input Sep 29, 2020

mythrocks force-pushed the window-count-fix branch from 0e6f2f4 to b062211 Compare September 29, 2020 06:09

mythrocks self-assigned this Sep 29, 2020

harrism added 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. labels Sep 29, 2020

Fix rolling-window count for null input

8211a68

mythrocks force-pushed the window-count-fix branch from b062211 to 8211a68 Compare September 29, 2020 15:02

sameerz mentioned this pull request Sep 29, 2020

[FEA] Support window count for a column NVIDIA/spark-rapids#838

Closed

Merge remote-tracking branch 'origin/branch-0.16' into window-count-fix

4da93f5

harrism requested changes Oct 6, 2020

View reviewed changes

cpp/src/rolling/rolling.cu Outdated Show resolved Hide resolved

cpp/src/rolling/rolling.cu Outdated Show resolved Hide resolved

python/cudf/cudf/tests/test_rolling.py Show resolved Hide resolved

mythrocks added 2 commits October 5, 2020 22:08

Rolling Window count fix: Review fixes:

1298b0c

1. Switched explicit for-loop to thrust::count_if(). 2. Fixed spelling in rolling-window python tests.

Merge remote-tracking branch 'origin/branch-0.16' into window-count-fix

8697cf1

mythrocks removed the request for review from trxcllnt October 6, 2020 05:10

mythrocks added 2 commits October 5, 2020 22:42

Rolling Window count fix: Review fixes:

9a4b168

3. Switched SFINAE from return-type to template parameter.

Merge remote-tracking branch 'origin/branch-0.16' into window-count-fix

45003cd

mythrocks added 4 commits October 6, 2020 23:17

Merge remote-tracking branch 'origin/branch-0.16' into window-count-fix

6264fcc

Merge remote-tracking branch 'origin/branch-0.16' into window-count-fix

c982404

Rolling Window count fix: Fixed formatting

d901023

Rolling Window count fix: Fixed process_rolling_window for COUNT_ALL

f7a8c90

harrism reviewed Oct 8, 2020

View reviewed changes

cpp/src/rolling/rolling.cu Outdated Show resolved Hide resolved

Rolling Window count fix: Use is_valid_no_check().

f34835b

harrism approved these changes Oct 11, 2020

View reviewed changes

shwina reviewed Oct 12, 2020

View reviewed changes

python/cudf/cudf/tests/test_rolling.py Show resolved Hide resolved

kkraus14 reviewed Oct 12, 2020

View reviewed changes

python/cudf/cudf/tests/test_rolling.py Outdated Show resolved Hide resolved

mythrocks added 3 commits October 12, 2020 11:34

Rolling Window count fix: Python review

62709b8

1. Added test for expected values for rolling_window count 2. Fixed convention for parametrized agg values

Merge remote-tracking branch 'origin/branch-0.16' into window-count-fix

7f9b391

Rolling Window count fix: Python review

6445c5d

3. Reformatting

shwina approved these changes Oct 12, 2020

View reviewed changes

kkraus14 added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Oct 12, 2020

harrism merged commit 4d5db1f into rapidsai:branch-0.16 Oct 13, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[REVIEW] Fix rolling-window count for null input #6344

[REVIEW] Fix rolling-window count for null input #6344

mythrocks commented Sep 29, 2020

GPUtester commented Sep 29, 2020

revans2 commented Sep 29, 2020

mythrocks commented Sep 29, 2020

mythrocks commented Sep 29, 2020

kkraus14 commented Sep 29, 2020

brandon-b-miller commented Sep 29, 2020

mythrocks commented Sep 29, 2020 •

edited

Loading

mythrocks commented Sep 29, 2020

brandon-b-miller commented Sep 29, 2020

mythrocks commented Sep 30, 2020

mythrocks commented Sep 30, 2020

brandon-b-miller commented Sep 30, 2020

mythrocks commented Sep 30, 2020

brandon-b-miller commented Sep 30, 2020

mythrocks commented Sep 30, 2020

mythrocks commented Oct 1, 2020

brandon-b-miller commented Oct 1, 2020

harrism left a comment

mythrocks commented Oct 6, 2020

mythrocks commented Oct 9, 2020

mythrocks commented Oct 12, 2020

mythrocks commented Oct 12, 2020

shwina left a comment

mythrocks commented Oct 13, 2020

[REVIEW] Fix rolling-window count for null input #6344

[REVIEW] Fix rolling-window count for null input #6344

Conversation

mythrocks commented Sep 29, 2020

GPUtester commented Sep 29, 2020

revans2 commented Sep 29, 2020

mythrocks commented Sep 29, 2020

mythrocks commented Sep 29, 2020

kkraus14 commented Sep 29, 2020

brandon-b-miller commented Sep 29, 2020

mythrocks commented Sep 29, 2020 • edited Loading

mythrocks commented Sep 29, 2020

brandon-b-miller commented Sep 29, 2020

mythrocks commented Sep 30, 2020

mythrocks commented Sep 30, 2020

brandon-b-miller commented Sep 30, 2020

mythrocks commented Sep 30, 2020

brandon-b-miller commented Sep 30, 2020

mythrocks commented Sep 30, 2020

mythrocks commented Oct 1, 2020

brandon-b-miller commented Oct 1, 2020

harrism left a comment

Choose a reason for hiding this comment

mythrocks commented Oct 6, 2020

mythrocks commented Oct 9, 2020

mythrocks commented Oct 12, 2020

mythrocks commented Oct 12, 2020

shwina left a comment

Choose a reason for hiding this comment

mythrocks commented Oct 13, 2020

mythrocks commented Sep 29, 2020 •

edited

Loading