Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Series rolling count ignores min_periods #30923

Merged

Conversation

fujiaxiang
Copy link
Member

@fujiaxiang fujiaxiang commented Jan 11, 2020

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. Looks generally good just a question

pandas/core/window/rolling.py Outdated Show resolved Hide resolved
@WillAyd WillAyd added the Window rolling, ewma, expanding label Jan 11, 2020
@fujiaxiang
Copy link
Member Author

As mentioned in my previous reply to a change request, I think we should properly handles min_periods when it is explicitly set by user, but the default behavior should remain the same, i.e. min_periods=0 by default (it defaults to None in method signature but inside is set to 0).

The reason is that this count method is meant to count the non-NAs values in the window, and we don't want to require all values in the window to be valid in order to produce a valid count, which is what other similar APIs do (e.g. rolling.mean sets min_periods=window, effectively requires all values in the window to be valid, otherwise produces NaN)

However, if you feel strongly we should make the default behavior exactly the same as other APIs like rolling.mean, let me know and I will update this PR. I have the code ready. The new code updates 63 existing test cases because of the change in default behavior.

pandas/core/window/rolling.py Outdated Show resolved Hide resolved
@jreback jreback added this to the 1.0.0 milestone Jan 20, 2020
@jreback jreback added the Bug label Jan 20, 2020
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok with this for 1.0.0 if you can move this note

Updated the behavior of rolling and expanding count so that it becomes consistent with all other rolling and expanding functions.
Also updated many test cases to reflect this change of behavior.
@fujiaxiang fujiaxiang changed the title BUG: Series rolling count ignores min_periods ENH: Series rolling count ignores min_periods Jan 21, 2020
@fujiaxiang
Copy link
Member Author

After further investigation, I realize we can in fact make count function behave consistently with other rolling and expanding functions.

I have pushed the code changes, and also updated many test cases that were "incorrect".

Now the count function correctly handles min_periods argument, which was ignored previously.

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think looks good but given scope of change I don't think this is suitable for 1.0 - it isn't backwards compatible in its current form with the change to how count works, so probably best for 1.1 @jreback ?

pandas/core/window/rolling.py Outdated Show resolved Hide resolved
@fujiaxiang
Copy link
Member Author

After a few rounds of debugging and code clean-up, now rolling.count behaves completely consistent with other APIs (e.g rolling.mean).

@jreback mentioned the fix is ok for V1.0 at 2 commits ago (bfe10f0), @WillAyd thinks the scope of change is too large for V1.0 since last commit (c0046b6).

Would it be a good idea to roll-out the backward-compatible smaller fix at 2 commits ago for V1.0, and the longer-term complete fix for V1.1 with a separate PR?

@fujiaxiang
Copy link
Member Author

Is there a way to retrigger checks without new commits? Recently CI seems to randomly fail a lot.

@TomAugspurger
Copy link
Contributor

If you merge master & repush then the CI should be fixed.

@mroeschke
Copy link
Member

I think we would still need some sort of DeprecationWarning on count or min_periods (although the parameter wasn't working before) since the default behavior of the function would change with the latest commit you have.

Another reason why it may just be simpler to keep the default min_periods to 0 and have the PR be a bug fix that min_periods now is respected.

@fujiaxiang fujiaxiang changed the title ENH: Series rolling count ignores min_periods BUG: Series rolling count ignores min_periods Jan 24, 2020
@fujiaxiang
Copy link
Member Author

fujiaxiang commented Jan 24, 2020

@mroeschke I have changed default min_periods back to 0. Could you review?

I have kept most of the changes in the test cases, which now explicitly specifies min_periods (min_periods=0 mostly, min_periods=min_periods otherwise where the effect is the same and consistent with others). This would make it easier when we decide to go-ahead with changing the default behavior in future versions.

Should I also create a new issue to add the DeprecationWarning and the actual change for next version?

@fujiaxiang
Copy link
Member Author

Ah thanks for highlighting that @fujiaxiang. Sorry for not totally following the entire conversation to date.

I have a slight preference on keeping the default min_periods for count to be 0 instead of window so count is consistent with count functions on other pandas objects (e.g. DataFrame.count.

Otherwise, I do agree with @WillAyd and think changing the default min_periods is a bit aggressive to include 1.0.

@mroeschke btw changing the default min_periods doesn't make the result that different from DataFrame.count. Unlike rolling.mean or rolling.sum etc, the min_periods argument for rolling.count only takes effect on the window size, and doesn't care how many NA values are in the window. If we change its default value to the same as window, same as other APIs, we would only see NAs in the first few rows (which would be obvious to user that this is caused by min_periods), and the rest are consistent.

@mroeschke
Copy link
Member

Yes please create a new issue for possibly changing the default behavior of rolling.count with respect to min_periods. Good to discuss the change outside this PR.

@mroeschke
Copy link
Member

One small whatsnew comment otherwise LGTM

@fujiaxiang
Copy link
Member Author

Yes please create a new issue for possibly changing the default behavior of rolling.count with respect to min_periods. Good to discuss the change outside this PR.

Created new issue #31302. Let me know what you guys think there.

@jreback
Copy link
Contributor

jreback commented Jan 25, 2020

does this fully close #26996 ?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. @mroeschke this is a straight bug fix now?

pandas/core/window/rolling.py Outdated Show resolved Hide resolved
@jreback jreback merged commit 823bf6c into pandas-dev:master Jan 26, 2020
@jreback
Copy link
Contributor

jreback commented Jan 26, 2020

thanks for the patch @fujiaxiang

@jreback
Copy link
Contributor

jreback commented Jan 26, 2020

hmm, this was supposed to backport. very odd. @fujiaxiang can you push a PR to the 1.0.x branch as well (the bot is supposed to do this).

@fujiaxiang
Copy link
Member Author

PR created (#31319). It has some conflict with target branch in .travis.yml, @jreback could you help resolve?

@fujiaxiang fujiaxiang deleted the series_rolling_count_ignores_min_periods branch January 26, 2020 04:45
keechongtan added a commit to keechongtan/pandas that referenced this pull request Jan 27, 2020
…ndexing-1row-df

* upstream/master: (194 commits)
  DOC Remove Python 2 specific comments from documentation (pandas-dev#31198)
  Follow up PR: pandas-dev#28097 Simplify branch statement (pandas-dev#29243)
  BUG: DatetimeIndex.snap incorrectly setting freq (pandas-dev#31188)
  Move DataFrame.info() to live with similar functions (pandas-dev#31317)
  ENH: accept a dictionary in plot colors (pandas-dev#31071)
  PERF: add shortcut to Timestamp constructor (pandas-dev#30676)
  CLN/MAINT: Clean and annotate stata reader and writers (pandas-dev#31072)
  REF: define _get_slice_axis in correct classes (pandas-dev#31304)
  BUG: DataFrame.floordiv(ser, axis=0) not matching column-wise bheavior (pandas-dev#31271)
  PERF: optimize is_scalar, is_iterator (pandas-dev#31294)
  BUG: Series rolling count ignores min_periods (pandas-dev#30923)
  xfail sparse warning; closes pandas-dev#31310 (pandas-dev#31311)
  REF: DatetimeIndex.get_value wrap DTI.get_loc (pandas-dev#31314)
  CLN: internals.managers (pandas-dev#31316)
  PERF: avoid copies if possible in fill_binop (pandas-dev#31300)
  Add test for multiindex json (pandas-dev#31307)
  BUG: passing TDA and wrong freq to TimedeltaIndex (pandas-dev#31268)
  BUG: inconsistency between PeriodIndex.get_value vs get_loc (pandas-dev#31172)
  CLN: remove _set_subtyp (pandas-dev#31301)
  CI: Updated version of macos image (pandas-dev#31292)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Window rolling, ewma, expanding
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Series.rolling min_period is ignored and NA behaves strangely
5 participants