
PERF: Calling df.agg([function]) is much slower than df.agg(function) when there are many columns and few rows. #45658

Open
yakymp opened this issue Jan 27, 2022 · 7 comments
Labels: Apply (Apply, Aggregate, Transform, Map), Performance (Memory or execution speed performance)

Comments


yakymp commented Jan 27, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

Calling df.agg([function]) is much slower than df.agg(function) when there are many columns and few rows.

I apologize if this is a known issue; I could not find an existing report using the keywords that came to mind. Inspired by this SO question.

from timeit import timeit

import pandas as pd  # version 1.4.0
import numpy as np   # version 1.22.1

np.random.seed(0)

num_cols = 1000
df_ten_rows = pd.DataFrame(np.random.randint(0, 10, size=(10, num_cols)))
df_10k_rows = pd.DataFrame(np.random.randint(0, 10, size=(10_000, num_cols)))

assert pd.DataFrame(df_ten_rows.agg("sum").rename("sum")).T.equals(df_ten_rows.agg(["sum"]))
assert pd.DataFrame(df_10k_rows.agg("sum").rename("sum")).T.equals(df_10k_rows.agg(["sum"]))

setup = """
import pandas as pd  # version 1.4.0
import numpy as np   # version 1.22.1

np.random.seed(0)

num_cols = 1000
df_ten_rows = pd.DataFrame(np.random.randint(0, 10, size=(10, num_cols)))
df_10k_rows = pd.DataFrame(np.random.randint(0, 10, size=(10_000, num_cols)))
"""
number = 10  # number of repetitions

codes = {
  "10 rows, agg on sum": 'pd.DataFrame(df_ten_rows.agg("sum").rename("sum")).T',
  "10 rows, agg on [sum]": 'df_ten_rows.agg(["sum"])',
  "10k rows, agg on sum": 'pd.DataFrame(df_10k_rows.agg("sum").rename("sum")).T',
  "10k rows, agg on [sum]": 'df_10k_rows.agg(["sum"])',
}

times = {
  description: timeit(code, setup=setup, number=number) / number
  for description, code in codes.items()
}

# {
#   '10 rows, agg on sum':    0.005952695199812297, 
#   '10 rows, agg on [sum]':  2.789351126600013, 
#   '10k rows, agg on sum':   0.018355781599893817, 
#   '10k rows, agg on [sum]': 2.341517233000195
# }

Installed Versions

INSTALLED VERSIONS

commit : bb1f651
python : 3.8.12.final.0
python-bits : 64
OS : Linux
OS-release : 5.11.0-1028-gcp
Version : #32~20.04.1-Ubuntu SMP Wed Jan 12 20:08:27 UTC 2022
machine : x86_64
processor :
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.4.0
numpy : 1.22.1
pytz : 2021.3
dateutil : 2.8.2
pip : 21.2.dev0
setuptools : 56.0.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.4.3
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.4.23
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None

Prior Performance

No response

@yakymp added the Needs Triage and Performance labels on Jan 27, 2022

phofl (Member) commented Jan 27, 2022

Please try on newer pandas, e.g. 1.4

@phofl added the Needs Info label on Jan 27, 2022

yakymp (Author) commented Jan 28, 2022

Updated example using pandas 1.4

@yakymp changed the title from "PERF:" to "PERF: Calling df.agg([function]) is much slower than df.agg(function) when there are many columns and few rows." on Jan 28, 2022
@phofl added the Apply label and removed the Needs Info and Needs Triage labels on Jan 28, 2022

rhshadrach (Member) commented Jan 29, 2022

Thanks for the report!

One may think of DataFrame.agg(['sum']) as an alias for DataFrame.sum() (the former returning a DataFrame, the latter a Series), but it is not. Internally, supplying a list of aggregations breaks up the DataFrame into individual Series, and aggregates each of these. This is the reason DataFrame.agg(['sum']) gives a transpose of DataFrame.sum(), and why it is so much less performant on wide frames. I think this should be fixed.

The trouble is that because agg allows for partial failures, you can't call DataFrame.sum() under the hood and retain the current behavior. Allowing partial failure is deprecated; once that deprecation is enforced, we can move forward. To maintain the current behavior, a hefty amount of reshaping would have to be done for the transpose (especially to get the index levels in the correct order when dealing with a MultiIndex), and I don't think it is possible to handle all dtypes correctly. Instead, I think we should stop transposing the result, so that .agg('sum') and .agg(['sum']) agree in orientation.
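To illustrate the per-column path described above, here is a rough sketch (not the actual internal code, which lives in pandas' apply/aggregation machinery) of what agg(['sum']) effectively does: aggregate each column as its own Series, then concatenate the one-element results side by side.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})

# Sketch of the per-column Series path: aggregate each column separately,
# then concatenate the one-element results into a one-row frame.
per_column = {col: pd.Series({"sum": df[col].sum()}) for col in df.columns}
emulated = pd.concat(per_column, axis=1)

assert emulated.equals(df.agg(["sum"]))
```

This Python-level loop over columns is why the cost scales with the number of columns rather than the number of rows, and why wide, short frames are hit hardest.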

However we run into more trouble because the code paths for agg and apply are intertwined - sometimes agg internally calls apply and vice-versa (#41112 (comment)). Because each of these can accept a UDF, it is hard to understand how we might be impacting user code. This is why I think an experimental option as proposed in #41112 is the right way to go.

This performance would be improved by #45557, when using the experimental option.

jbrockmendel (Member) commented

Now that the deprecation @rhshadrach referred to is enforced, it may be viable to improve this.

rhshadrach (Member) commented

I've looked into this some more; the issue is that using DataFrame reductions (instead of breaking the frame up into Series and using the Series reductions) requires taking a transpose. E.g. if you have a frame with a mixture of int/float dtypes, DataFrame.sum results in a float Series, and transposing that into a 1-row frame produces the wrong dtypes. Since this needs to work for an arbitrary UDF, we don't have the information needed to cast back afterwards.

I'd be in favor of not transposing the result, so that df.agg('sum') and df.agg(['sum']) return the same result except that the former is a Series and the latter a DataFrame. But this would require a deprecation, and I don't see how to do that without adding an argument to agg (e.g. transpose_result) to allow adoption of the future behavior.
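A small demonstration of the dtype problem described above (a sketch: `to_frame(...).T` here stands in for the hypothetical DataFrame-reduction-plus-transpose path, which is not how pandas actually implements agg):

```python
import pandas as pd

df = pd.DataFrame({"i": [1, 2], "f": [0.5, 1.5]})

# Per-column path (current behavior): each column keeps its own dtype.
via_list = df.agg(["sum"])
assert via_list["i"].dtype.kind == "i"
assert via_list["f"].dtype.kind == "f"

# DataFrame reduction plus transpose: df.sum() returns a single float64
# Series for mixed int/float columns, so the int dtype is lost.
via_reduction = df.sum().to_frame(name="sum").T
assert via_reduction["i"].dtype.kind == "f"
```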

jbrockmendel (Member) commented

DataFrame.agg(['sum']) gives a transpose of DataFrame.sum()

I would not have guessed this. Am I the only one who finds this weird?
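For anyone equally surprised, the relationship is easy to verify (`to_frame`/`.T` below just spell out the transpose explicitly):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

s = df.agg("sum")        # Series indexed by column label
frame = df.agg(["sum"])  # one-row DataFrame, columns preserved

# The list form is the transpose of the scalar form.
assert frame.equals(s.to_frame(name="sum").T)
```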

But this would require a deprecation and I don't see how to do that with without adding an argument [...]

I've been thinking lately that a pattern like pd.options.set_option("future.agg_behavior") might be useful for opting in to future behavior.

rhshadrach (Member) commented

I would not have guessed this. Am I the only one who finds this weird?

I think it goes back to allowing partial failure: that can only be done by first breaking the frame up into Series and operating on each one. Once you've done that, the transposed orientation is a natural result of using concat.

I've been thinking lately that a pattern like pd.options.set_option("future.agg_behavior") might be useful for opting in to future behavior.

+1
