PERF: df.astype("float64[pyarrow]") is slow, df.astype("Float64") is super slow #60066

auderson · 2024-10-18T03:03:51Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this issue exists on the latest version of pandas.
I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np

df = pd.DataFrame(np.zeros((5000, 5000)))

%%time
_ = df.astype("Float64")

CPU times: user 3.87 s, sys: 185 ms, total: 4.05 s
Wall time: 4.05 s

%%timeit
_ = df.astype("float64[pyarrow]")

566 ms ± 15.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
_ = df.astype("float64", copy=True)

148 ms ± 984 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Converting a mid-sized 5000 x 5000 DataFrame to Float64 takes over 4 s, while converting to float64[pyarrow] is about 7x faster (roughly 570 ms). Plain float64 is the fastest at about 150 ms, even with copy=True.
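For anyone who wants to reproduce the comparison outside IPython, here is a minimal sketch that times the same three conversions with `time.perf_counter`. The helper names `time_astype` and `compare_astype` are hypothetical (not part of pandas), and the pyarrow-backed dtype is skipped when pyarrow is not installed. Absolute numbers will differ by machine and pandas version.

```python
import importlib.util
import time

import numpy as np
import pandas as pd


def time_astype(df, dtype, repeats=3):
    """Best wall-clock time in seconds over `repeats` calls to df.astype(dtype)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        df.astype(dtype)  # result is discarded; only the conversion is timed
        best = min(best, time.perf_counter() - start)
    return best


def compare_astype(shape, dtypes=("float64", "Float64", "float64[pyarrow]"), repeats=3):
    """Time df.astype for each dtype on an all-zeros float64 frame of `shape`."""
    have_pyarrow = importlib.util.find_spec("pyarrow") is not None
    df = pd.DataFrame(np.zeros(shape))
    results = {}
    for dtype in dtypes:
        if dtype.endswith("[pyarrow]") and not have_pyarrow:
            continue  # pyarrow-backed dtypes need the optional pyarrow package
        results[dtype] = time_astype(df, dtype, repeats)
    return results
```

Calling `compare_astype((5000, 5000))` uses the same frame size as the report above; a smaller shape such as `(1000, 1000)` is enough to see the relative ordering without waiting several seconds per run.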

Installed Versions

INSTALLED VERSIONS

commit : 0691c5c
python : 3.10.14
python-bits : 64
OS : Linux
OS-release : 5.15.0-122-generic
Version : #132-Ubuntu SMP Thu Aug 29 13:45:52 UTC 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.2.3
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0
pip : 24.0
Cython : 3.0.7
sphinx : 7.3.7
IPython : 8.25.0
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
blosc : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.6.0
html5lib : None
hypothesis : None
gcsfs : None
jinja2 : 3.1.4
lxml.etree : None
matplotlib : 3.9.2
numba : 0.60.0
numexpr : 2.10.0
odfpy : None
openpyxl : None
pandas_gbq : None
psycopg2 : 2.9.9
pymysql : 1.4.6
pyarrow : 16.1.0
pyreadstat : None
pytest : 8.2.2
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.14.0
sqlalchemy : 2.0.31
tables : 3.9.2
tabulate : None
xarray : None
xlrd : None
xlsxwriter : None
zstandard : 0.22.0
tzdata : 2024.1
qtpy : 2.4.1
pyqt5 : None

Prior Performance

No response


auderson added the Needs Triage and Performance labels on Oct 18, 2024
