sum of a column of an empty dataframe #19813

Closed
therblack opened this issue Feb 21, 2018 · 4 comments
Labels
Dtype Conversions (unexpected or buggy dtype conversions), Numeric Operations (arithmetic, comparison, and logical operations)


therblack commented Feb 21, 2018

Code Sample, a copy-pastable example if possible

import pandas

# The following gives (0.0, False)
print(pandas.Series([]).sum(), pandas.DataFrame(columns=['col1'])['col1'].sum())

Problem description

In pandas 0.22, the sum of a column of an empty DataFrame is False. In earlier versions (0.18.1 at least), the result was 0.

While this is consistent with the default dtype of an empty DataFrame column being object, it is not consistent with the 0.22 statement that the sum of an empty Series is 0.0.
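For context on the 0.22 statement mentioned above, here is a minimal sketch of how `sum()` treats an empty float Series. It assumes pandas 0.22 or later, where `sum()` accepts a `min_count` argument; the variable names are only illustrative.

```python
import math

import pandas as pd

# An explicitly float64 empty Series, so the result does not depend on
# which default dtype pandas picks for empty data.
empty = pd.Series([], dtype="float64")

# With the default min_count=0, the sum of no values is 0.0.
default_sum = empty.sum()

# Requiring at least one valid value turns the empty sum into NaN.
strict_sum = empty.sum(min_count=1)

print(default_sum)
print(math.isnan(strict_sum))
```

The object-dtype column from the report does not follow this float path, which is where the unexpected False comes from.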

Expected Output

(0.0, 0.0)

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-693.11.1.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: 3.3.0
pip: 9.0.1
setuptools: 36.5.0.post20170921
Cython: 0.27.3
numpy: 1.13.3
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.1
openpyxl: 2.4.9
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@TomAugspurger
Contributor

There are two issues here.

  1. The default empty dtype differs between Series and DataFrame. We have other issues for that: "API: Use object dtype for empty Series" (#17261). The correct comparison is to pd.Series([], dtype='object').sum().

  2. The result of pd.Series([], dtype='object').sum() has changed:

    • 0.20.3: 0
    • 0.21.1: nan
    • 0.22.0: False

The behavior now matches np.array([], dtype='object').sum(). @shoyer, do you know if that's the documented, correct behavior for NumPy? If so, I'd like to test and document it for pandas as well.
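The dtype mismatch in the first point can be checked directly. The sketch below is not from the thread: it compares the dtype of the empty DataFrame column with an explicitly object-dtype Series, then casts the column to float64 as one possible workaround. The cast is an assumption about what a caller wants, not a fix proposed here.

```python
import pandas as pd

# Column of a DataFrame created with column labels but no data.
frame_column = pd.DataFrame(columns=["col1"])["col1"]

# The like-for-like Series: empty, with object dtype spelled out.
object_series = pd.Series([], dtype="object")

print(frame_column.dtype)   # dtype of the empty DataFrame column
print(object_series.dtype)  # dtype of the explicit comparison Series

# Possible workaround: cast to a numeric dtype before summing, so the
# reduction follows the float rules instead of the object rules.
float_column = frame_column.astype("float64")
float_sum = float_column.sum()
print(float_sum)
```

Casting first sidesteps the object-dtype reduction entirely, at the cost of assuming the column is meant to hold numbers.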

TomAugspurger added the Dtype Conversions and Numeric Operations labels Feb 21, 2018
@shoyer
Member

shoyer commented Feb 21, 2018

This is really strange behavior for NumPy. I suspect it's a bug. See numpy/numpy#10639.

For pandas, this is slightly complicated by how we use object arrays for different types, including booleans with NA and strings as well as arbitrary Python types. On a string array (with object dtype), sum() concatenates:

In [32]: pd.Series(['foo', 'bar']).sum()
Out[32]: 'foobar'

So it's not entirely clear that the right answer is 0 here. I suspect it is, and we should encourage using .str.cat() rather than .sum() for string concatenation.
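A small sketch of the two spellings side by side, using an explicitly object-dtype Series so it matches the case described above. The `sep` call is an extra illustration that is not part of the original comment.

```python
import pandas as pd

# Strings stored with object dtype, as in the example above.
words = pd.Series(["foo", "bar"], dtype="object")

# sum() falls back to Python's + on the elements, which concatenates.
summed = words.sum()

# str.cat() with no arguments joins the elements into one string.
joined = words.str.cat()

# str.cat() also accepts a separator, which sum() cannot express.
spaced = words.str.cat(sep=" ")

print(summed, joined, spaced)
```

Using `.str.cat()` makes the intent explicit and leaves `.sum()` free to mean numeric addition.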

@TomAugspurger
Contributor

TomAugspurger commented Feb 21, 2018 via email

@TomAugspurger
Contributor

Closing, since this may be fixed upstream in NumPy 1.15: numpy/numpy#10639
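To see what a given installation does, the NumPy reduction can be run on its own. This is only a check, not a statement about which version returns what; the thread says only that the behavior may be fixed in NumPy 1.15.

```python
import numpy as np

# Sum of an empty object-dtype array, the case pandas defers to NumPy.
empty_object_sum = np.array([], dtype=object).sum()

# Print the version next to the value and its type, since False and 0
# compare equal in Python and are easy to confuse in output.
print(np.__version__, repr(empty_object_sum), type(empty_object_sum))
```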
