BUG SeriesGroupBy.mean() overflowed on some integer array (pandas-dev…
…#22487)

When integer arrays contained integers that were outside
the range of int64, the conversion would overflow.
Instead, only allow safe casting, and if a safe cast cannot
be done, cast to float64 instead.
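
The behaviour described above can be reproduced with NumPy alone. The following is a minimal sketch, not the pandas code path itself: `to_int64_or_float64` is a hypothetical stand-in that mirrors the helper added in this commit, and the sample arrays are made up for illustration.

```python
import numpy as np


def to_int64_or_float64(arr, copy=False):
    """Hypothetical stand-in mirroring the helper added in this commit."""
    try:
        # casting='safe' refuses any cast that could change a value,
        # so uint64 -> int64 raises TypeError instead of wrapping.
        return arr.astype("int64", copy=copy, casting="safe")
    except TypeError:
        return arr.astype("float64", copy=copy)


small = np.array([1, 2, 3], dtype="int32")
big = np.array([4970, 18446744073699999744], dtype="uint64")

# The old code path: a plain astype uses casting='unsafe' by default,
# so the large uint64 value silently wraps around to a negative int64.
wrapped = big.astype("int64")

print(wrapped[1])                       # -9551872
print(to_int64_or_float64(small).dtype)  # int64
print(to_int64_or_float64(big).dtype)    # float64
```

The fallback trades exactness for range: float64 can hold the magnitude of any uint64 value, but only to about 53 bits of precision.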
troels committed Sep 16, 2018
1 parent 1c500fb commit 6e8045b
Showing 4 changed files with 39 additions and 1 deletion.
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.24.0.txt
@@ -761,6 +761,7 @@ Groupby/Resample/Rolling
 - Bug in :meth:`Resampler.apply` when passing postiional arguments to applied func (:issue:`14615`).
 - Bug in :meth:`Series.resample` when passing ``numpy.timedelta64`` to ``loffset`` kwarg (:issue:`7687`).
 - Bug in :meth:`Resampler.asfreq` when frequency of ``TimedeltaIndex`` is a subperiod of a new frequency (:issue:`13022`).
+- Bug in :meth:`SeriesGroupBy.mean` when values were integral but could not fit inside of int64, overflowing instead. (:issue:`22487`)

Sparse
^^^^^^
27 changes: 27 additions & 0 deletions pandas/core/dtypes/common.py
@@ -90,6 +90,33 @@ def ensure_categorical(arr):
     return arr


+def ensure_int64_or_float64(arr, copy=False):
+    """
+    Ensure that an dtype array of some integer dtype
+    has an int64 dtype if possible
+    If it's not possible, potentially because of overflow,
+    convert the array to float64 instead.
+
+    Parameters
+    ----------
+    arr : array-like
+          The array whose data type we want to enforce.
+    copy: boolean
+          Whether to copy the original array or reuse
+          it in place, if possible.
+
+    Returns
+    -------
+    out_arr : The input array cast as int64 if
+              possible without overflow.
+              Otherwise the input array cast to float64.
+    """
+    try:
+        return arr.astype('int64', copy=copy, casting='safe')
+    except TypeError:
+        return arr.astype('float64', copy=copy)
+
+
 def is_object_dtype(arr_or_dtype):
     """
     Check whether an array-like or dtype is of the object dtype.
3 changes: 2 additions & 1 deletion pandas/core/groupby/ops.py
@@ -23,6 +23,7 @@
     ensure_float64,
     ensure_platform_int,
     ensure_int64,
+    ensure_int64_or_float64,
     ensure_object,
     needs_i8_conversion,
     is_integer_dtype,
@@ -471,7 +472,7 @@ def _cython_operation(self, kind, values, how, axis, min_count=-1,
             if (values == iNaT).any():
                 values = ensure_float64(values)
             else:
-                values = values.astype('int64', copy=False)
+                values = ensure_int64_or_float64(values)
         elif is_numeric and not is_complex_dtype(values):
             values = ensure_float64(values)
         else:
9 changes: 9 additions & 0 deletions pandas/tests/groupby/test_function.py
@@ -1125,3 +1125,12 @@ def h(df, arg3):
     expected = pd.Series([4, 8, 12], index=pd.Int64Index([1, 2, 3]))

     tm.assert_series_equal(result, expected)
+
+
+def test_groupby_mean_no_overflow():
+    # Regression test for (#22487)
+    df = pd.DataFrame({
+        "user": ["A", "A", "A", "A", "A"],
+        "connections": [4970, 4749, 4719, 4704, 18446744073699999744]
+    })
+    assert df.groupby('user')['connections'].mean()['A'] == 3689348814740003840
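
The arithmetic behind this regression test can be checked in plain Python, without pandas. This is only an illustrative sketch: `wrap_to_int64` is a hypothetical helper that emulates two's-complement wraparound, standing in for what the old unchecked `astype('int64')` did to the largest value.

```python
import math

values = [4970, 4749, 4719, 4704, 18446744073699999744]


def wrap_to_int64(n):
    """Hypothetical helper: reinterpret an unsigned 64-bit value as int64."""
    return (n + 2**63) % 2**64 - 2**63


# Before the fix: the last value wraps to a negative number,
# which drags the mean far below every real observation.
wrapped = [wrap_to_int64(v) for v in values]
mean_wrapped = sum(wrapped) / len(wrapped)

# After the fix: values that do not fit in int64 are averaged as floats.
mean_float = sum(float(v) for v in values) / len(values)

print(wrapped[-1])    # -9551872
print(mean_wrapped)   # -1906546.0
print(math.isclose(mean_float, 3689348814740003840, rel_tol=1e-12))  # True
```

The float result is compared with a relative tolerance here because float64 carries only about 53 bits of precision at this magnitude.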
