sum of a column of an empty dataframe #19813

therblack · 2018-02-21T13:56:44Z

Code Sample, a copy-pastable example if possible

import pandas

# The following gives (0.0, False)
print(pandas.Series([]).sum(), pandas.DataFrame(columns=['col1'])['col1'].sum())

Problem description

In pandas 0.22, the sum of a column of an empty DataFrame is False. In earlier versions (0.18.1 at least), the result was 0.

While this is consistent with the default dtype of a DataFrame being object, it isn't consistent with the 0.22 statement that the sum of an empty Series is 0.0.

Expected Output

(0.0, 0.0)

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-693.11.1.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: 3.3.0
pip: 9.0.1
setuptools: 36.5.0.post20170921
Cython: 0.27.3
numpy: 1.13.3
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.1
openpyxl: 2.4.9
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None


TomAugspurger · 2018-02-21T14:27:58Z

There are two issues here.

1. The default empty dtype differs between Series and DataFrame. We have other issues for that: "API: Use object dtype for empty Series" #17261. The correct comparison is to pd.Series([], dtype='object').sum().
2. The result of pd.Series([], dtype='object').sum() has changed:
   - 0.20.3: 0
   - 0.21.1: nan
   - 0.22.0: False

Given that the behavior now matches np.array([], dtype='object').sum(), @shoyer, do you know if that's the documented correct behavior for NumPy? If so, I'd like to test and document it for pandas as well.
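For anyone reproducing this, a minimal sketch of the like-for-like comparison described above, with every dtype spelled out so the result does not depend on the default dtype of an empty container. The object-dtype results vary across pandas and NumPy versions (the thread reports 0, nan and False for pandas), so the sketch prints them rather than assuming a value. The variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# A column of an empty DataFrame; inspect its dtype rather than assuming it.
col = pd.DataFrame(columns=["col1"])["col1"]
print("DataFrame column dtype:", col.dtype)

# Like-for-like comparison: an empty Series with an explicit object dtype.
obj_sum = pd.Series([], dtype="object").sum()

# An empty Series with an explicit float dtype, for contrast.
float_sum = pd.Series([], dtype="float64").sum()

# The NumPy reduction that the pandas result was compared against.
np_sum = np.array([], dtype="object").sum()

# Print type and value; the object-dtype results are version dependent.
for label, value in [
    ("column sum", col.sum()),
    ("object Series sum", obj_sum),
    ("float64 Series sum", float_sum),
    ("NumPy object array sum", np_sum),
]:
    print(f"{label}: {value!r} ({type(value).__name__})")
```

Passing an explicit numeric dtype when building the empty container, as in the float64 line, is one way to sidestep the object-dtype question entirely.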

shoyer · 2018-02-21T16:05:31Z

This is really strange behavior for NumPy. I suspect it's a bug. See numpy/numpy#10639.

For pandas, this is slightly complicated by how we use object arrays for different types, including booleans with NA and strings as well as arbitrary Python types. On a string array (with object dtype), sum() concatenates:

In [32]: pd.Series(['foo', 'bar']).sum()
Out[32]: 'foobar'

So it's not entirely clear that the right answer is 0 here. I suspect it is, and we should encourage using .str.cat() for string concatenation in favor of .sum().
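A small sketch of the two spellings mentioned above, assuming a Series of plain Python strings held in object dtype (set explicitly here so the example does not depend on the default string dtype of the installed pandas):

```python
import pandas as pd

# Strings stored in an object-dtype Series, as in the example above.
s = pd.Series(["foo", "bar"], dtype="object")

# sum() on object dtype adds the elements, which concatenates strings.
via_sum = s.sum()

# .str.cat() states the intent explicitly and accepts a separator.
via_cat = s.str.cat()
via_cat_sep = s.str.cat(sep=", ")

print(via_sum)      # foobar
print(via_cat)      # foobar
print(via_cat_sep)  # foo, bar
```

Using .str.cat() keeps .sum() free to mean a numeric reduction, which is the direction suggested in this comment.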

TomAugspurger · 2018-02-21T16:25:41Z

Thanks. Returning a bool did seem a little strange. In that case, I'm not sure what the thing to do here is... Maybe a new subsection in http://pandas-docs.github.io/pandas-docs-travis/basics.html#dtypes specific to empty containers would be helpful, but we have some work to do making those consistent first.

TomAugspurger · 2018-07-06T22:51:32Z

Closing, since this may be fixed upstream in NumPy 1.15: numpy/numpy#10639

TomAugspurger added the Dtype Conversions and Numeric Operations labels Feb 21, 2018

shoyer mentioned this issue Feb 21, 2018

Sum/product of empty object array is False/True numpy/numpy#10639

Closed

jorisvandenbossche mentioned this issue Jul 6, 2018

Series.sum has inconsistent return type #9733

Closed

TomAugspurger closed this as completed Jul 6, 2018

jschendel mentioned this issue Mar 22, 2019

MAC OS: sum over empy series with object dtype gives False #25835

Closed
