Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG SeriesGroupBy.mean() overflowed on some integer arrays (#22487) #22653

Merged
merged 1 commit into from
Sep 18, 2018

Conversation

troels
Copy link
Contributor

@troels troels commented Sep 9, 2018

When integer arrays contained integers that were outside
the range of int64, the conversion would overflow.
Instead only allow allow safe casting and if a safe cast can not
be done, cast to float64 instead.

@pep8speaks
Copy link

Hello @troels! Thanks for submitting the PR.

@codecov
Copy link

codecov bot commented Sep 10, 2018

Codecov Report

Merging #22653 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #22653      +/-   ##
==========================================
+ Coverage   92.17%   92.17%   +<.01%     
==========================================
  Files         169      169              
  Lines       50708    50711       +3     
==========================================
+ Hits        46740    46743       +3     
  Misses       3968     3968
Flag Coverage Δ
#multiple 90.58% <100%> (ø) ⬆️
#single 42.35% <0%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/groupby/ops.py 96.81% <100%> (+0.02%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0976e12...5340c0b. Read the comment docs.

pandas/tests/arrays/test_integer.py Outdated Show resolved Hide resolved
@@ -471,7 +471,12 @@ def _cython_operation(self, kind, values, how, axis, min_count=-1,
if (values == iNaT).any():
values = ensure_float64(values)
else:
values = values.astype('int64', copy=False)
try:
values = values.astype('int64', copy=False, casting='safe')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm well this certainly works though I'm wondering if there's not a more comprehensive way that we should be handling (ex: adding the casting="safe" call to our algos templates)

cc @jreback and @jbrockmendel for any input

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose this is something that occurs many places, as e.g. int and uint64 can not always be cast into int64 and thinking they can, just for satisfying is_integer_dtype is wrong.

A few places around the code seems to have the same problem, e.g.:
pandas/core/algorithms.py:427
pandas/core/groupby/generic.py:1168

But as far as I can see, there isn't a general solution to this except for fixing the code in the respective places.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can u take this code and make a function out of it and put where we defined ensure_float64
define

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback I've created a function in dtypes/common.py now.

doc/source/whatsnew/v0.24.0.txt Outdated Show resolved Hide resolved
@WillAyd WillAyd added Groupby Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels Sep 11, 2018
@@ -471,7 +471,12 @@ def _cython_operation(self, kind, values, how, axis, min_count=-1,
if (values == iNaT).any():
values = ensure_float64(values)
else:
values = values.astype('int64', copy=False)
try:
values = values.astype('int64', copy=False, casting='safe')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can u take this code and make a function out of it and put where we defined ensure_float64
define

…#22487)

When integer arrays contained integers that could were outside
the range of int64, the conversion would overflow.
Instead only allow allow safe casting and if a safe cast can not
be done, cast to float64 instead.
@jreback jreback added this to the 0.24.0 milestone Sep 18, 2018
@jreback
Copy link
Contributor

jreback commented Sep 18, 2018

lgtm. @troels there might be some other places you can now use this routine (in groupby), but can certainly be done as a followup.

back to you @WillAyd

@WillAyd WillAyd merged commit 79f7762 into pandas-dev:master Sep 18, 2018
@WillAyd
Copy link
Member

WillAyd commented Sep 18, 2018

Thanks @troels - very nice change!

aeltanawy pushed a commit to aeltanawy/pandas that referenced this pull request Sep 20, 2018
Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Groupby
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DataFrame.gropuby().mean() incorrect result
5 participants