DataFrame.groupby().mean() incorrect result #22487

Closed
tinchoroman opened this issue Aug 23, 2018 · 3 comments · Fixed by #22653

@tinchoroman

tinchoroman commented Aug 23, 2018

Does anybody know why I get different results when I apply the same operation to the same DataFrame, once directly and once through groupby?
With groupby, the mean comes back negative even though all values are positive.

from pandas import DataFrame
df = DataFrame({"user": ["A", "A", "A", "A", "A"],
                "connections": [18446744073699999744, 4970, 4749, 4719, 4704]})

df.mean()

connections 3.689349e+18
dtype: float64

df.groupby("user")["connections"].mean()

user
A -1906546.0
Name: connections, dtype: float64

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 142 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.0
pytest: 3.5.1
pip: 10.0.1
setuptools: 39.1.0
Cython: 0.28.2
numpy: 1.14.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: 1.7.4
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.3
numexpr: 2.6.5
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.3
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.4
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.7
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@WillAyd
Member

WillAyd commented Aug 23, 2018

Can you try on master? This looks like an integer overflow somewhere. If it is still present, investigation and PRs are always welcome.
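The overflow hypothesis can be checked by hand with the values from the report. The sketch below assumes that the unsigned 64-bit values are reinterpreted as signed int64 somewhere in the groupby path; `wrap_to_int64` is a hypothetical helper written for illustration, not a pandas function. It uses only the standard library.

```python
import struct

def wrap_to_int64(value):
    """Reinterpret the 64 bits of an unsigned integer as a signed int64.

    Hypothetical helper: it mimics an unchecked uint64 -> int64 cast.
    """
    return struct.unpack("<q", struct.pack("<Q", value))[0]

# Values from the report; the first one exceeds the int64 maximum (2**63 - 1).
connections = [18446744073699999744, 4970, 4749, 4719, 4704]

wrapped = [wrap_to_int64(v) for v in connections]
wrapped_mean = sum(wrapped) / len(wrapped)
true_mean = sum(connections) / len(connections)

print(wrapped[0])    # -9551872
print(wrapped_mean)  # -1906546.0
```

The wrapped mean equals the -1906546.0 that groupby returned, which is consistent with an int64 overflow, while `true_mean` is about 3.689349e+18, the value `df.mean()` reports.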

@tinchoroman
Author

Hi WillAyd, thanks for your prompt response. This is the first time I have posted an issue. Could you please explain a little further what exactly you mean by "try on master"? Thanks in advance!

@tinchoroman
Author

I've upgraded to the latest version and the problem persists. Following the line of investigation that WillAyd suggested, the same example with float numbers works fine.

df = DataFrame({"user":["A", "A", "A", "A", "A"],
           "connections":[18446744073699999744.0, 4970.0, 4749.0, 4719.0, 4704.0]})

df.mean()
connections    3.689349e+18
dtype: float64

df.groupby("user")["connections"].mean()
user
A    3.689349e+18
Name: connections, dtype: float64

df.mean()[0] == df.groupby("user")["connections"].mean()[0]
True

troels added a commit to troels/pandas that referenced this issue Sep 9, 2018
…#22487)

When integer arrays contained integers that were outside
the range of int64, the conversion would overflow.
Instead, only allow safe casting, and if a safe cast cannot
be done, cast to float64 instead.
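The idea in this commit message can be sketched with NumPy. This is an illustration under stated assumptions, not the code from the actual patch: `to_int64_or_float64` is a hypothetical helper, and it uses `np.can_cast` with `casting="safe"` to decide whether an int64 conversion is lossless.

```python
import numpy as np

def to_int64_or_float64(values):
    """Cast to int64 only when that is a safe cast, else fall back to float64.

    Hypothetical helper illustrating the commit message; not pandas code.
    """
    arr = np.asarray(values)
    # uint64 -> int64 is not a "safe" cast: large values would wrap around.
    if np.can_cast(arr.dtype, np.int64, casting="safe"):
        return arr.astype(np.int64)
    return arr.astype(np.float64)

# Values from the report, stored as uint64.
big = np.array([18446744073699999744, 4970, 4749, 4719, 4704], dtype=np.uint64)
small = np.array([1, 2, 3], dtype=np.int32)

big_converted = to_int64_or_float64(big)      # falls back to float64
small_converted = to_int64_or_float64(small)  # safely widened to int64

print(big_converted.dtype, big_converted.mean())
print(small_converted.dtype)
```

With the float64 fallback the mean stays positive, at the cost of the precision float64 loses for integers of this magnitude.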
troels added a commit to troels/pandas that referenced this issue Sep 11, 2018
troels added a commit to troels/pandas that referenced this issue Sep 16, 2018
@jreback jreback added this to the 0.24.0 milestone Sep 18, 2018