BUG: SparseDataFrame coerces input to dense matrix if string-type index is given #22630
Comments
A maybe related issue:
If you (or someone) could profile the SparseDataFrame constructor to see where time is spent, it'd be most welcome. FYI, if you're using pandas' sparse stuff you may be interested in following #22325 and giving feedback once that's merged (it doesn't fix this performance problem).
Here it is for the constructor (I reduced the dimension so it didn't take 400GB):
I don't really know how to read this, but it looks like half the time is spent calling
And for the sum
Thanks for the FYI. I'll look into it.
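For anyone else trying to read a profile like the one requested above, here is a minimal sketch using only the standard library's `cProfile` and `pstats`. The helper `profile_call` and the stand-in `slow_constructor` are hypothetical names introduced for illustration; swap in the real constructor call you want to measure. In the report, `tottime` is time spent in a function itself, while `cumtime` also includes everything it calls, so sorting by cumulative time usually points at the expensive call path.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)

    # Collect the report in a string instead of printing to stdout.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time and keep only the `top` most expensive entries.
    stats.strip_dirs().sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def slow_constructor(n):
    """Hypothetical stand-in for the constructor being profiled."""
    return [str(k) for k in range(n)]


result, report = profile_call(slow_constructor, 10000)
print(report)
```

Keeping `top` small makes the output short enough to paste into an issue comment.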
Not sure if this is related, but taking the transpose of a Here's a demonstration, mostly copied from @scottgigante's code above:
import scipy.sparse as sp
import pandas as pd
import numpy as np
import cProfile
shape = (50000, 50000)
data = np.repeat(1, 10000)
i = np.random.choice(shape[0], 10000, replace=False)
j = np.random.choice(shape[1], 10000, replace=False)
X = sp.coo_matrix((data, (i, j)), shape=shape)
df = pd.SparseDataFrame(X)
# This executes almost immediately (Using cProfile on this shows that it takes
# "70 function calls in 0.000 seconds")
X.T
# As of writing, this has been running for around an hour on my computer
df.T
This was done with Pandas 0.24.2 on a 2012 MacBook Pro. (Since it seems like Pandas 0.25 will change a lot re: sparse data structures, this problem might go away with that new version, but I figured I should document it here, since I haven't seen any other mentions of this.)
I don't expect this to change with 0.25. I also don't see how pandas SparseArray would behave well with a transpose (or a DataFrame with many SparseArrays). Every column's sparse index is stored separately.
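To make the per-column storage point concrete, here is a small sketch. It assumes a pandas version that provides the accessor-based constructor `pd.DataFrame.sparse.from_spmatrix` (the thread itself used `SparseDataFrame`), and the tiny matrix is made up for illustration. Each column reports its own count of stored points, whereas the scipy matrix keeps a single set of coordinates and can transpose by relabelling them.

```python
import numpy as np
import pandas as pd
import scipy.sparse as sp

# Illustrative 3x4 matrix with three stored values.
data = np.array([1.0, 2.0, 3.0])
rows = np.array([0, 2, 1])
cols = np.array([0, 0, 3])
X = sp.coo_matrix((data, (rows, cols)), shape=(3, 4))

# Assumption: accessor-based constructor from newer pandas releases.
df = pd.DataFrame.sparse.from_spmatrix(X)

# Each column is its own sparse array with its own stored-point count.
per_column_points = [int(df[c].sparse.npoints) for c in df.columns]
print(per_column_points)

# The scipy transpose only swaps the row and column coordinates.
Xt = X.T
print(Xt.shape)
```

A transpose on the pandas side has to regroup the stored values by row, so it has to touch every column's separate index.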
I did some profiling and it looks like the difference between constructing the SparseDataFrame with an int vs. a str comes from the construction of the new index (one per column) in the reindexing process:
This seems to be no longer a problem with the new
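For readers on pandas 1.0 or later, where `SparseDataFrame` has been removed, here is a minimal sketch of how the same check could be run with the accessor-based constructor `pd.DataFrame.sparse.from_spmatrix`. That this is the API a reader would reach for today is an assumption, and it is not necessarily what the comment above refers to. The matrix size and the `row_` labels are illustrative.

```python
import numpy as np
import pandas as pd
import scipy.sparse as sp

# Small, mostly empty matrix so the example runs quickly (sizes are illustrative).
shape = (1000, 1000)
rng = np.random.default_rng(0)
rows = rng.choice(shape[0], 100, replace=False)
cols = rng.choice(shape[1], 100, replace=False)
X = sp.coo_matrix((np.ones(100), (rows, cols)), shape=shape)

# String labels, the case reported as triggering densification.
index = [f"row_{k}" for k in range(shape[0])]

# Assumption: accessor-based constructor from newer pandas releases.
df = pd.DataFrame.sparse.from_spmatrix(X, index=index)

# Check that every column is still sparse and that few values are stored.
all_sparse = all(isinstance(dtype, pd.SparseDtype) for dtype in df.dtypes)
density = df.sparse.density  # fraction of explicitly stored values
print(all_sparse, density)
```

If the frame had been densified, the column dtypes would no longer be `SparseDtype`, so the `all_sparse` check is a quick way to confirm the input stayed sparse.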
Code Sample, a copy-pastable example if possible
Problem description
pd.SparseDataFrame densifies its input if it is handed a string index. This is extremely undesirable and very confusing for the user.
Expected Output
The data frame should be created in a matter of seconds, without coercing to a dense matrix.
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Linux
OS-release: 4.18.3-arch1-1-ARCH
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.23.4
pytest: 3.7.3
pip: 18.0
setuptools: 40.0.0
Cython: 0.28.5
numpy: 1.15.0
scipy: 1.1.0
pyarrow: 0.10.0
xarray: None
IPython: 6.5.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.7
feather: 0.4.0
matplotlib: 2.2.3
openpyxl: 2.5.5
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: None
lxml: 4.2.4
bs4: None
html5lib: 1.0.1
sqlalchemy: 1.2.10
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.6.0