
Huge performance penalty with pd.arrays.SparseArray #228

Closed
mflova opened this issue May 6, 2024 · 5 comments

Comments


mflova commented May 6, 2024

I would like to use this tool, but such a big performance penalty makes it unusable with big sparse arrays. Below is the code I use to benchmark the issue. I usually work with dataframes with over 1M columns, but the benchmark uses just 100k. Although the same tendency can also be seen with pd.Series (as expected), sparse arrays suffer a much bigger penalty:

import numpy as np
import pandas as pd
import pint_pandas


M = 100_000
pd.arrays.SparseArray([1,2,3]*M, fill_value=np.nan, dtype="pint[rpm]")  # 3.7s
pd.arrays.SparseArray([1,2,3]*M, fill_value=np.nan, dtype=np.float64)   # 0.043s

pd.Series([1,2,3]*M, dtype="pint[rpm]")  # 0.336s
pd.Series([1,2,3]*M, dtype=np.float64)  # 0.04s

Here is the output of pyinstrument:

3.657 <module>  delete.py:1
├─ 3.650 SparseArray.__init__  pandas\core\arrays\sparse\array.py:364
│  ├─ 3.320 _make_sparse  pandas\core\arrays\sparse\array.py:1848
│  │  ├─ 3.287 PintArray.__array__  pint_pandas\pint_array.py:761
│  │  │  ├─ 3.285 PintArray._to_array_of_quantity  pint_pandas\pint_array.py:768
│  │  │  │  ├─ 2.228 <listcomp>  pint_pandas\pint_array.py:769
│  │  │  │  │  ├─ 1.822 Quantity.__new__  pint\facets\plain\quantity.py:194
│  │  │  │  │  │  ├─ 0.669 SharedRegistryObject.__new__  pint\util.py:958
│  │  │  │  │  │  │  ├─ 0.307 [self]  pint\util.py
│  │  │  │  │  │  │  ├─ 0.110 _handle_fromlist  <frozen importlib._bootstrap>:1053
│  │  │  │  │  │  │  │  ├─ 0.081 [self]  <frozen importlib._bootstrap>
│  │  │  │  │  │  │  │  ├─ 0.019 hasattr  <built-in>
│  │  │  │  │  │  │  │  └─ 0.010 isinstance  <built-in>
│  │  │  │  │  │  │  ├─ 0.086 hasattr  <built-in>
│  │  │  │  │  │  │  ├─ 0.086 ModuleSpec.parent  <frozen importlib._bootstrap>:404
│  │  │  │  │  │  │  │  ├─ 0.057 [self]  <frozen importlib._bootstrap>
│  │  │  │  │  │  │  │  └─ 0.029 str.rpartition  <built-in>
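
For reference, a profile like the one above can be reproduced with pyinstrument's Python API (a minimal sketch of the same measurement; the CLI form pyinstrument delete.py, matching the script name in the profile header, works as well):

import numpy as np
import pandas as pd
import pint_pandas  # registers the pint[...] extension dtype
from pyinstrument import Profiler

M = 100_000
profiler = Profiler()
profiler.start()
pd.arrays.SparseArray([1, 2, 3] * M, fill_value=np.nan, dtype="pint[rpm]")
profiler.stop()
print(profiler.output_text())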

It seems the main penalty comes from creating a list of N Quantity objects. I tried some quick alternatives but did not manage to find anything that does not break. Any ideas? Wouldn't it be possible to create a numpy array wrapped in a single Quantity instead of N Quantity objects? Thanks!
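
For illustration, a minimal sketch of the idea using plain pint (not pint-pandas internals): a single Quantity can wrap a whole numpy array, avoiding per-element Quantity objects:

import numpy as np
import pint

ureg = pint.UnitRegistry()
values = np.array([1, 2, 3] * 100_000, dtype=np.float64)

# One Quantity wrapping the whole array: a single Python object
q = ureg.Quantity(values, "rpm")

# Versus the per-element pattern the profile shows (slow):
# quantities = [ureg.Quantity(v, "rpm") for v in values]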


mflova commented May 6, 2024

On the other hand, I noticed that of those 0.336s for the pd.Series, 0.1s (30%) was spent just parsing the file containing the unit definitions. This file is not expected to change during the Python runtime, so wouldn't it be possible to cache this part? This happens inside pint itself, though...

andrewgsavage (Collaborator) commented

Reading the definitions file is cached in the latest pint release.

Can you try the different methods in:
https://pint-pandas.readthedocs.io/en/latest/user/initializing.html

It would be good to add a note in the docs to suggest the most performant method.
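
As a minimal sketch of the caching mentioned above (assuming a pint version recent enough to support the cache_folder argument; check your installed release):

import pint

# ":auto:" lets pint pick a platform-appropriate cache directory, so the
# parsed definitions file is reused by later UnitRegistry constructions
ureg = pint.UnitRegistry(cache_folder=":auto:")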

andrewgsavage (Collaborator) commented

You can also use the SparseArray as the magnitude of the PintArray:

M = 100_000
sa = pd.arrays.SparseArray([1, 2, 3] * M, fill_value=np.nan, dtype=np.float64)
pa = pint_pandas.PintArray(sa, dtype="pint[rpm]")
type(pa.data)  # the SparseArray is kept as the magnitude

If you want better support for storing data in SparseArrays or other arrays, do comment in #192.
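
As a usage note (standard pandas API, nothing pint-specific assumed), the resulting PintArray drops into a DataFrame like any other extension array:

df = pd.DataFrame({"speed": pa})
df.dtypes  # the column keeps the pint[rpm] dtype, backed by the SparseArray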


mflova commented May 6, 2024

> Reading the definitions file is cached in the latest pint release.
>
> Can you try the different methods in:
> https://pint-pandas.readthedocs.io/en/latest/user/initializing.html
>
> It would be good to add a note in the docs to suggest the most performant method.

Sure, I ran a quick benchmark. Ordered from quickest to slowest (test_series is the standard pd.Series with np.float64; the number in parentheses is the slowdown factor relative to the best result):
[image: benchmark results table, ordered from quickest to slowest]

Code used:

# Requires pytest, pytest-benchmark and the pint-related dependencies
import numpy as np
import pandas as pd
import pint_pandas
import pytest

PA_ = pint_pandas.PintArray
ureg = pint_pandas.PintType.ureg
Q_ = ureg.Quantity

@pytest.fixture
def M() -> int:
    return 1_000

def test_series(M: int, benchmark) -> None:
    # Baseline: plain pd.Series with np.float64 (no units)
    benchmark(lambda: pd.DataFrame({"A": pd.Series([0]*M, dtype=np.float64)}))

def test_A(M: int, benchmark) -> None:
    # pd.Series built directly with a pint dtype string
    benchmark(lambda: pd.DataFrame({"A": pd.Series([0]*M, dtype="pint[m]")}))

def test_B(M: int, benchmark) -> None:
    # plain pd.Series cast to a pint dtype afterwards
    benchmark(lambda: pd.DataFrame({"B": pd.Series([0]*M).astype("pint[m]")}))

def test_C(M: int, benchmark) -> None:
    # PintArray from a list, with a pint dtype string
    benchmark(lambda: pd.DataFrame({"C": PA_([0]*M, dtype="pint[m]")}))

def test_D(M: int, benchmark) -> None:
    # PintArray from a list, with a bare unit string
    benchmark(lambda: pd.DataFrame({"D": PA_([0]*M, dtype="m")}))

def test_E(M: int, benchmark) -> None:
    # PintArray from a list, with a pint Unit object
    benchmark(lambda: pd.DataFrame({"E": PA_([0]*M, dtype=ureg.m)}))

def test_F(M: int, benchmark) -> None:
    # PintArray from a 1-D Quantity via the dedicated constructor
    benchmark(lambda: pd.DataFrame({"F": PA_.from_1darray_quantity(Q_([0]*M, ureg.m))}))

def test_G(M: int, benchmark) -> None:
    # PintArray from a 1-D Quantity via the default constructor
    benchmark(lambda: pd.DataFrame({"G": PA_(Q_([0]*M, ureg.m))}))


mflova commented May 6, 2024

I also benchmarked the time penalties for both pd.Series and pd.arrays.SparseArray in a more realistic way. To be honest, compared to the built-in pandas implementation, it is not that bad for moderate amounts of data:

Sparse data:
[image: benchmark results for sparse data]

Dense data ("quick" is the quickest method found in the previous comment; the other is the "standard" way):
[image: benchmark results for dense data]

In this case, the pint implementation is just 1.35 times slower than the built-in pandas one for the SparseArray. Looking at these numbers, I am not sure there is much room for improvement, so I will close the issue and re-open it if this becomes a problem :)

mflova closed this as completed May 7, 2024