
PERF: Calling df.agg([function]) is much slower than df.agg(function) when there are many columns and few rows. #45658

Open
yakymp opened this issue Jan 27, 2022 · 7 comments
Labels: Apply (Apply, Aggregate, Transform, Map), Performance (Memory or execution speed performance)

Comments


yakymp commented Jan 27, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

Calling df.agg([function]) is much slower than df.agg(function) when there are many columns and few rows.

I apologize if this is a known issue; I could not find an existing report using the keywords that came to mind. Inspired by this SO question.

from timeit import timeit

import pandas as pd  # version 1.4.0
import numpy as np   # version 1.22.1

np.random.seed(0)

num_cols = 1000
df_ten_rows = pd.DataFrame(np.random.randint(0, 10, size=(10, num_cols)))
df_10k_rows = pd.DataFrame(np.random.randint(0, 10, size=(10_000, num_cols)))

assert pd.DataFrame(df_ten_rows.agg("sum").rename("sum")).T.equals(df_ten_rows.agg(["sum"]))
assert pd.DataFrame(df_10k_rows.agg("sum").rename("sum")).T.equals(df_10k_rows.agg(["sum"]))

setup = """
import pandas as pd  # version 1.4.0
import numpy as np   # version 1.22.1

np.random.seed(0)

num_cols = 1000
df_ten_rows = pd.DataFrame(np.random.randint(0, 10, size=(10, num_cols)))
df_10k_rows = pd.DataFrame(np.random.randint(0, 10, size=(10_000, num_cols)))
"""
number = 10  # number of repetitions

codes = {
  "10 rows, agg on sum": 'pd.DataFrame(df_ten_rows.agg("sum").rename("sum")).T',
  "10 rows, agg on [sum]": 'df_ten_rows.agg(["sum"])',
  "10k rows, agg on sum": 'pd.DataFrame(df_10k_rows.agg("sum").rename("sum")).T',
  "10k rows, agg on [sum]": 'df_10k_rows.agg(["sum"])',
}

times = {
  description: timeit(code, setup=setup, number=number) / number
  for description, code in codes.items()
}

# {
#   '10 rows, agg on sum':    0.005952695199812297, 
#   '10 rows, agg on [sum]':  2.789351126600013, 
#   '10k rows, agg on sum':   0.018355781599893817, 
#   '10k rows, agg on [sum]': 2.341517233000195
# }

Installed Versions

INSTALLED VERSIONS

commit : bb1f651
python : 3.8.12.final.0
python-bits : 64
OS : Linux
OS-release : 5.11.0-1028-gcp
Version : #32~20.04.1-Ubuntu SMP Wed Jan 12 20:08:27 UTC 2022
machine : x86_64
processor :
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.4.0
numpy : 1.22.1
pytz : 2021.3
dateutil : 2.8.2
pip : 21.2.dev0
setuptools : 56.0.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.4.3
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.4.23
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None

Prior Performance

No response

@yakymp added the Needs Triage and Performance labels on Jan 27, 2022

phofl (Member) commented Jan 27, 2022

Please try on newer pandas, e.g. 1.4

@phofl added the Needs Info label on Jan 27, 2022

yakymp (Author) commented Jan 28, 2022

Updated example using pandas 1.4

@yakymp changed the title from "PERF:" to "PERF: Calling df.agg([function]) is much slower than df.agg(function) when there are many columns and few rows." on Jan 28, 2022
@phofl added the Apply label and removed the Needs Info and Needs Triage labels on Jan 28, 2022

rhshadrach (Member) commented Jan 29, 2022

Thanks for the report!

One may think of DataFrame.agg(['sum']) as an alias for DataFrame.sum() (the former returning a DataFrame, the latter a Series), but it is not. Internally, supplying a list of aggregations breaks up the DataFrame into individual Series, and aggregates each of these. This is the reason DataFrame.agg(['sum']) gives a transpose of DataFrame.sum(), and why it is so much less performant on wide frames. I think this should be fixed.

The trouble is that because agg allows for partial failures, you can't call DataFrame.sum() under the hood and retain the current behavior. Allowing partial failure is deprecated; once that deprecation is enforced, we can move forward. To maintain the current behavior, a hefty amount of reshaping would have to be done for the transpose (especially to get the index levels in the correct order when dealing with a MultiIndex), and I don't think it is possible to handle all dtypes correctly. Instead, I think we should stop transposing the result, so that .agg('sum') and .agg(['sum']) agree in orientation.
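To illustrate the per-column path described above, here is a rough sketch (not the actual internal code, which lives in pandas' apply/aggregation machinery) of what agg(['sum']) effectively does: aggregate each column as its own Series, then concatenate the one-element results side by side.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})

# Sketch of the per-column Series path: aggregate each column separately,
# then concatenate the one-element results into a one-row frame.
per_column = {col: pd.Series({"sum": df[col].sum()}) for col in df.columns}
emulated = pd.concat(per_column, axis=1)

assert emulated.equals(df.agg(["sum"]))
```

This Python-level loop over columns is why the cost scales with the number of columns rather than the number of rows, and why wide, short frames are hit hardest.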

However we run into more trouble because the code paths for agg and apply are intertwined - sometimes agg internally calls apply and vice-versa (#41112 (comment)). Because each of these can accept a UDF, it is hard to understand how we might be impacting user code. This is why I think an experimental option as proposed in #41112 is the right way to go.

This performance would be improved by #45557, when using the experimental option.

jbrockmendel (Member) commented

Now that the deprecation @rhshadrach referred to is enforced, it may be viable to improve this.

rhshadrach (Member) commented

I've looked into this some more; the issue is that using DataFrame reductions (instead of breaking the frame up into Series and using the Series reductions) requires taking a transpose. E.g. if you have a frame with a mixture of int/float dtypes, DataFrame.sum results in a float Series, and transposing that into a 1-row frame produces the wrong dtypes. Since this needs to work for an arbitrary UDF, we don't have the information needed to cast back afterwards.

I'd be in favor of not transposing the result, so that df.agg('sum') and df.agg(['sum']) return the same result except that the former is a Series and the latter a DataFrame. But this would require a deprecation, and I don't see how to do that without adding an argument to agg (e.g. transpose_result) to allow adoption of the future behavior.
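A small demonstration of the dtype problem described above (a sketch: `to_frame(...).T` here stands in for the hypothetical DataFrame-reduction-plus-transpose path, which is not how pandas actually implements agg):

```python
import pandas as pd

df = pd.DataFrame({"i": [1, 2], "f": [0.5, 1.5]})

# Per-column path (current behavior): each column keeps its own dtype.
via_list = df.agg(["sum"])
assert via_list["i"].dtype.kind == "i"
assert via_list["f"].dtype.kind == "f"

# DataFrame reduction plus transpose: df.sum() returns a single float64
# Series for mixed int/float columns, so the int dtype is lost.
via_reduction = df.sum().to_frame(name="sum").T
assert via_reduction["i"].dtype.kind == "f"
```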

jbrockmendel (Member) commented

DataFrame.agg(['sum']) gives a transpose of DataFrame.sum()

I would not have guessed this. Am I the only one who finds this weird?
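For anyone equally surprised, the relationship is easy to verify (`to_frame`/`.T` below just spell out the transpose explicitly):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

s = df.agg("sum")        # Series indexed by column label
frame = df.agg(["sum"])  # one-row DataFrame, columns preserved

# The list form is the transpose of the scalar form.
assert frame.equals(s.to_frame(name="sum").T)
```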

But this would require a deprecation and I don't see how to do that with without adding an argument [...]

I've been thinking lately that a pattern like pd.options.set_option("future.agg_behavior") might be useful for opting in to future behavior.

rhshadrach (Member) commented

I would not have guessed this. Am I the only one who finds this weird?

I think it goes back to allowing partial failure: that can only be done by first breaking the frame up into Series and operating on each one. Once you've done that, the transposed orientation is a natural result of using concat.

I've been thinking lately that a pattern like pd.options.set_option("future.agg_behavior") might be useful for opting in to future behavior.

+1
