BUG: SparseDataFrame coerces input to dense matrix if string-type index is given #22630

Closed
scottgigante opened this issue Sep 7, 2018 · 7 comments · Fixed by #28425
Labels: Performance (Memory or execution speed performance), Sparse (Sparse Data Type)

Comments

@scottgigante
Contributor

scottgigante commented Sep 7, 2018

Code Sample, a copy-pastable example if possible

import scipy.sparse as sp
import pandas as pd
import numpy as np
shape = (500000, 50000)
data = np.repeat(1, 10000)
i = np.random.choice(shape[0], 10000, replace=False)
j = np.random.choice(shape[1], 10000, replace=False)
X = sp.coo_matrix((data, (i, j)), shape=shape)

# this works fine
df = pd.SparseDataFrame(X, index=np.arange(shape[0]))
df.index = np.arange(shape[0]).astype(str)
# this requires 400GB of memory and takes an hour
df = pd.SparseDataFrame(X, index=np.arange(shape[0]).astype(str))

Problem description

pd.SparseDataFrame densifies its input if it is handed a string index. This is extremely undesirable and very confusing for the user.
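
For scale, here is a rough estimate of what densifying a matrix of this shape costs compared with keeping it in COO form. This is only a back-of-the-envelope sketch: the 8-byte value size and 4-byte index size are assumptions, not measurements from the report.

```python
# Back-of-the-envelope comparison of dense vs. COO storage for the matrix above.
# Assumptions (not from the report): 8-byte values, 4-byte row/column indices.
rows, cols = 500_000, 50_000
nnz = 10_000          # number of stored (non-zero) entries
value_bytes = 8       # assumed, e.g. int64 or float64
index_bytes = 4       # assumed, e.g. int32 row and column indices

# A dense array stores every cell, zero or not.
dense_bytes = rows * cols * value_bytes
# COO stores one value plus one row index and one column index per entry.
coo_bytes = nnz * (value_bytes + 2 * index_bytes)

print(f"dense: {dense_bytes / 1e9:.0f} GB")
print(f"COO:   {coo_bytes / 1e3:.0f} kB")
```

Under these assumptions a single dense copy is about 200 GB, the same order of magnitude as the 400GB reported, while the sparse representation needs about 160 kB.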

Expected Output

The data frame should be created in a matter of seconds, without coercing to a dense matrix.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Linux
OS-release: 4.18.3-arch1-1-ARCH
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: 3.7.3
pip: 18.0
setuptools: 40.0.0
Cython: 0.28.5
numpy: 1.15.0
scipy: 1.1.0
pyarrow: 0.10.0
xarray: None
IPython: 6.5.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.7
feather: 0.4.0
matplotlib: 2.2.3
openpyxl: 2.5.5
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: None
lxml: 4.2.4
bs4: None
html5lib: 1.0.1
sqlalchemy: 1.2.10
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.6.0

scottgigante changed the title from "SparseDataFrame coerces input to dense matrix if string-type index is given" to "BUG: SparseDataFrame coerces input to dense matrix if string-type index is given" on Sep 7, 2018
@scottgigante
Contributor Author

A possibly related issue:

import scipy.sparse as sp
import pandas as pd
import numpy as np
shape = (500000, 50000)
data = np.repeat(1, 10000)
i = np.random.choice(shape[0], 10000, replace=False)
j = np.random.choice(shape[1], 10000, replace=False)
X = sp.coo_matrix((data, (i, j)), shape=shape)

df = pd.SparseDataFrame(X)

# this works fine
df.to_coo().sum(axis=0)
# this takes 400GB of memory and an hour
df.sum(axis=0)

@TomAugspurger
Contributor

If you (or someone) could profile the SparseDataFrame constructor to see where time is spent, it'd be most welcome.

FYI, if you're using pandas' sparse stuff you may be interested in following #22325 and giving feedback once that's merged (it doesn't fix this performance problem).

TomAugspurger added the Performance (Memory or execution speed performance) and Sparse (Sparse Data Type) labels on Sep 8, 2018
TomAugspurger added this to the Contributions Welcome milestone on Sep 8, 2018
@scottgigante
Contributor Author

scottgigante commented Sep 8, 2018

Here it is for the constructor (I reduced the dimension so it didn't take 400GB):

>>> import scipy.sparse as sp
>>> import pandas as pd
>>> import numpy as np
>>> import cProfile
>>> shape = (50000, 50000)
>>> data = np.repeat(1, 10000)
>>> i = np.random.choice(shape[0], 10000, replace=False)
>>> j = np.random.choice(shape[1], 10000, replace=False)
>>> X = sp.coo_matrix((data, (i, j)), shape=shape)
>>>
>>> cProfile.run('pd.SparseDataFrame(X, index=np.arange(shape[0]).astype(str))')
         37201535 function calls (36931514 primitive calls) in 599.169 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   350027    0.110    0.000    0.490    0.000 <frozen importlib._bootstrap>:1009(_handle_fromlist)
   120007    0.120    0.000    0.208    0.000 <frozen importlib._bootstrap>:416(parent)
        1    0.436    0.436  599.169  599.169 <string>:1(<module>)
        3    0.004    0.001    0.004    0.001 __init__.py:124(lrange)
   220007    0.182    0.000    0.344    0.000 __init__.py:205(iteritems)
        1    0.000    0.000    0.003    0.003 _decorators.py:136(wrapper)
    50001    0.023    0.000    0.505    0.000 _methods.py:34(_sum)
        2    0.000    0.000    0.000    0.000 _methods.py:42(_any)
        1    0.000    0.000    0.000    0.000 _validators.py:114(_check_for_invalid_keys)
        1    0.000    0.000    0.000    0.000 _validators.py:130(validate_kwargs)
    10000    0.005    0.000    0.007    0.000 _validators.py:221(validate_bool_kwarg)
        1    0.000    0.000    0.000    0.000 _validators.py:32(_check_for_default_values)
    50003    0.019    0.000    0.054    0.000 abc.py:137(__instancecheck__)
        1    0.000    0.000    0.000    0.000 algorithms.py:141(_reconstruct_data)
        2    0.000    0.000    0.000    0.000 algorithms.py:1421(_get_take_nd_function)
        2    0.000    0.000    0.000    0.000 algorithms.py:1548(take_nd)
        1    0.000    0.000    0.000    0.000 algorithms.py:172(_ensure_arraylike)
        1    0.000    0.000    0.000    0.000 algorithms.py:224(_get_data_algo)
        1    0.000    0.000    0.001    0.001 algorithms.py:449(_factorize_array)
        2    0.000    0.000    0.000    0.000 algorithms.py:48(_ensure_data)
        1    0.000    0.000    0.003    0.003 algorithms.py:576(factorize)
   150001    0.439    0.000    2.800    0.000 array.py:156(__new__)
   150001    0.237    0.000    1.127    0.000 array.py:200(_simple_new)
    50000    0.024    0.000    0.030    0.000 array.py:234(kind)
   150001    0.123    0.000    0.161    0.000 array.py:281(__array_finalize__)
   100000    0.048    0.000    0.151    0.000 array.py:347(sp_values)
   100000    0.022    0.000    0.022    0.000 array.py:352(fill_value)
    50000    0.121    0.000    0.889    0.000 array.py:547(copy)
    50000    0.055    0.000    1.109    0.000 array.py:751(_maybe_to_sparse)
   100001    0.152    0.000    0.640    0.000 array.py:758(_sanitize_values)
        1    0.000    0.000    0.000    0.000 array.py:785(make_sparse)
        1    0.000    0.000    0.000    0.000 array.py:837(_make_index)
        1    0.000    0.000    0.000    0.000 arrayprint.py:1110(__init__)
        6    0.000    0.000    0.000    0.000 arrayprint.py:1118(__call__)
        1    0.000    0.000    0.000    0.000 arrayprint.py:1457(array_str)
      3/1    0.000    0.000    0.000    0.000 arrayprint.py:314(_leading_trailing)
        1    0.000    0.000    0.000    0.000 arrayprint.py:348(_get_formatdict)
        1    0.000    0.000    0.000    0.000 arrayprint.py:356(<lambda>)
        1    0.000    0.000    0.000    0.000 arrayprint.py:401(_get_format_function)
        1    0.000    0.000    0.000    0.000 arrayprint.py:453(wrapper)
        1    0.000    0.000    0.000    0.000 arrayprint.py:470(_array2string)
        1    0.000    0.000    0.000    0.000 arrayprint.py:499(array2string)
        1    0.000    0.000    0.000    0.000 arrayprint.py:67(_make_options_dict)
        7    0.000    0.000    0.000    0.000 arrayprint.py:671(_extendLine)
        1    0.000    0.000    0.000    0.000 arrayprint.py:685(_formatArray)
      7/1    0.000    0.000    0.000    0.000 arrayprint.py:694(recurser)
        1    0.000    0.000    0.000    0.000 arrayprint.py:72(<dictcomp>)
        1    0.000    0.000    0.000    0.000 base.py:1187(isspmatrix)
    50000    0.008    0.000    0.008    0.000 base.py:1325(nlevels)
    10000    0.083    0.000    0.215    0.000 base.py:1500(is_monotonic_increasing)
    10002    0.020    0.000    0.029    0.000 base.py:1935(_engine)
    10001    0.007    0.000    0.103    0.000 base.py:1938(<lambda>)
   100000    0.169    0.000    0.581    0.000 base.py:1976(is_all_dates)
    50000    0.013    0.000    0.016    0.000 base.py:2033(__contains__)
    10000    0.034    0.000    0.622    0.000 base.py:2067(__getitem__)
        1    0.000    0.000    0.000    0.000 base.py:2179(take)
   150000    0.247    0.000   68.974    0.000 base.py:2445(equals)
    50000    0.021    0.000    0.082    0.000 base.py:2465(identical)
   110004    1.683    0.000  339.979    0.003 base.py:255(__new__)
        1    0.000    0.000    0.000    0.000 base.py:3071(get_loc)
    50000    0.243    0.000  244.124    0.005 base.py:3578(reindex)
   220004    2.034    0.000    2.674    0.000 base.py:473(_simple_new)
   510007    0.395    0.000  340.068    0.001 base.py:4914(_ensure_index)
    50000    0.019    0.000    0.032    0.000 base.py:4977(_ensure_has_len)
   110000    0.317    0.000    2.571    0.000 base.py:510(_shallow_copy)
    10001    0.033    0.000    0.548    0.000 base.py:520(_shallow_copy_with_infer)
   390036    0.205    0.000    0.298    0.000 base.py:61(is_dtype)
   150000    0.082    0.000    0.107    0.000 base.py:615(is_)
   220005    0.137    0.000    0.137    0.000 base.py:635(_reset_identity)
   270012    0.080    0.000    0.118    0.000 base.py:641(__len__)
   100000    0.028    0.000    0.028    0.000 base.py:662(dtype)
   340006    0.187    0.000    0.480    0.000 base.py:672(values)
    10004    0.005    0.000    0.025    0.000 base.py:677(_values)
   100000    0.043    0.000    0.187    0.000 base.py:711(get_values)
    10001    0.013    0.000    0.096    0.000 base.py:789(_ndarray_values)
        1    0.000    0.000    0.000    0.000 base.py:86(get_shape)
        1    0.000    0.000    0.001    0.001 base.py:893(tolist)
    10003    0.005    0.000    0.006    0.000 base.py:904(_coerce_to_ndarray)
        4    0.000    0.000    0.005    0.001 base.py:912(__iter__)
   120001    0.113    0.000    0.218    0.000 base.py:920(_get_attributes_dict)
   120001    0.081    0.000    0.105    0.000 base.py:922(<dictcomp>)
   110000    0.139    0.000    2.744    0.000 base.py:924(view)
        1    0.000    0.000    0.000    0.000 cast.py:1232(construct_1d_ndarray_preserving_na)
        1    0.000    0.000    0.000    0.000 cast.py:257(maybe_promote)
        1    0.000    0.000    0.000    0.000 cast.py:853(maybe_castable)
        1    0.000    0.000    0.000    0.000 cast.py:867(maybe_infer_to_datetimelike)
        1    0.000    0.000    0.000    0.000 cast.py:971(maybe_cast_to_datetime)
   110005    0.102    0.000    0.563    0.000 common.py:1043(is_datetime64_any_dtype)
        1    0.000    0.000    0.000    0.000 common.py:1170(is_datetime_or_timedelta_dtype)
   160005    0.055    0.000    0.232    0.000 common.py:122(is_sparse)
        1    0.000    0.000    0.000    0.000 common.py:123(_default_index)
        1    0.000    0.000    0.000    0.000 common.py:1405(needs_i8_conversion)
   100001    0.055    0.000    0.094    0.000 common.py:1527(is_float_dtype)
        1    0.000    0.000    0.000    0.000 common.py:154(_all_none)
        1    0.000    0.000    0.000    0.000 common.py:155(is_scipy_sparse)
   100006    0.081    0.000    0.411    0.000 common.py:1578(is_bool_dtype)
        1    0.000    0.000    0.000    0.000 common.py:1629(is_extension_type)
    10009    0.021    0.000    0.078    0.000 common.py:1688(is_extension_array_dtype)
    70007    0.120    0.000    0.288    0.000 common.py:1784(_get_dtype)
880030/880029    0.936    0.000    1.374    0.000 common.py:1835(_get_dtype_type)
        1    0.000    0.000    0.000    0.000 common.py:195(is_categorical)
    10008    0.005    0.000    0.026    0.000 common.py:227(is_datetimetz)
   100001    0.498    0.000  289.411    0.003 common.py:301(_asarray_tuplesafe)
   110007    0.108    0.000    0.347    0.000 common.py:332(is_datetime64_dtype)
   120017    0.054    0.000    0.126    0.000 common.py:369(is_datetime64tz_dtype)
   110007    0.067    0.000    0.255    0.000 common.py:407(is_timedelta64_dtype)
    50004    0.047    0.000    0.191    0.000 common.py:444(is_period_dtype)
   220012    0.098    0.000    0.376    0.000 common.py:477(is_interval_dtype)
   220014    0.111    0.000    0.229    0.000 common.py:513(is_categorical_dtype)
    50001    0.065    0.000    0.524    0.000 common.py:546(is_string_dtype)
        1    0.000    0.000    0.000    0.000 common.py:647(is_datetimelike)
    10003    0.009    0.000    0.029    0.000 common.py:692(is_dtype_equal)
   150001    0.139    0.000    0.493    0.000 common.py:811(is_integer_dtype)
   110006    0.080    0.000    0.136    0.000 common.py:858(is_signed_integer_dtype)
   100004    0.066    0.000    0.468    0.000 common.py:89(is_object_dtype)
   100001    0.054    0.000    0.095    0.000 common.py:907(is_unsigned_integer_dtype)
        1    0.000    0.000    0.000    0.000 coo.py:403(tocoo)
        6    0.000    0.000    0.000    0.000 cycler.py:227(<genexpr>)
        1    0.000    0.000    0.000    0.000 dtypes.py:266(construct_from_string)
        2    0.000    0.000    0.000    0.000 dtypes.py:401(__new__)
        2    0.000    0.000    0.000    0.000 dtypes.py:459(construct_from_string)
    50004    0.096    0.000    0.144    0.000 dtypes.py:584(is_dtype)
   110008    0.192    0.000    0.277    0.000 dtypes.py:707(is_dtype)
        1    0.259    0.259  257.365  257.365 frame.py:139(_init_dict)
        1    0.021    0.021    0.037    0.037 frame.py:143(<dictcomp>)
        1    0.000    0.000    0.001    0.001 frame.py:151(<lambda>)
        1    0.003    0.003    0.003    0.003 frame.py:177(<genexpr>)
        1  104.736  104.736  535.611  535.611 frame.py:188(_init_spmatrix)
        1    0.551    0.551  136.909  136.909 frame.py:210(<dictcomp>)
        1    0.000    0.000    0.000    0.000 frame.py:218(_prep_index)
        1   63.101   63.101  598.711  598.711 frame.py:57(__init__)
        1    0.004    0.004    2.953    2.953 frame.py:933(to_manager)
        1    0.029    0.029    0.029    0.029 frame.py:942(<listcomp>)
        1    0.000    0.000    0.000    0.000 fromnumeric.py:2227(amax)
        1    0.000    0.000    0.000    0.000 fromnumeric.py:2337(amin)
        2    0.000    0.000    0.000    0.000 fromnumeric.py:64(_wrapreduction)
   120003    0.235    0.000    0.235    0.000 generic.py:124(__init__)
    10000    0.001    0.000    0.001    0.000 generic.py:1620(to_dense)
    50000    0.118    0.000    0.173    0.000 generic.py:317(_construct_axes_from_arguments)
    50000    0.027    0.000    0.044    0.000 generic.py:349(<dictcomp>)
   110002    0.158    0.000    0.199    0.000 generic.py:364(_get_axis_number)
    50000    0.378    0.000  253.727    0.005 generic.py:3647(reindex)
   100000    0.100    0.000    0.296    0.000 generic.py:3674(<genexpr>)
    50000    0.244    0.000  252.433    0.005 generic.py:3691(_reindex_axes)
   100001    0.075    0.000    0.102    0.000 generic.py:377(_get_axis_name)
    50000    0.336    0.000    7.879    0.000 generic.py:3789(_reindex_with_indexers)
   100001    0.062    0.000    0.193    0.000 generic.py:390(_get_axis)
    50000    0.027    0.000    0.100    0.000 generic.py:394(_get_block_manager_axis)
   120001    0.169    0.000    0.286    0.000 generic.py:4345(__finalize__)
   120002    0.155    0.000    0.155    0.000 generic.py:4362(__getattr__)
   270005    1.422    0.000    3.275    0.000 generic.py:4378(__setattr__)
    50001    0.064    0.000    0.197    0.000 generic.py:4423(_protect_consolidate)
    50001    0.035    0.000    0.232    0.000 generic.py:4433(_consolidate_inplace)
    50001    0.049    0.000    0.122    0.000 generic.py:4436(f)
    10000    0.030    0.000    0.632    0.000 generic.py:5009(copy)
        1    0.000    0.000    0.001    0.001 generic.py:6592(groupby)
   740054    0.293    0.000    0.463    0.000 generic.py:7(_check)
        1    0.000    0.000    0.000    0.000 groupby.py:2143(groupby)
        1    0.000    0.000    0.000    0.000 groupby.py:2196(__init__)
    10001    0.031    0.000    1.172    0.000 groupby.py:2217(get_iterator)
        1    0.000    0.000    0.003    0.003 groupby.py:2231(_get_splitter)
        1    0.000    0.000    0.000    0.000 groupby.py:2235(_get_group_keys)
        1    0.000    0.000    0.000    0.000 groupby.py:2295(levels)
        1    0.000    0.000    0.000    0.000 groupby.py:2297(<listcomp>)
        1    0.000    0.000    0.003    0.003 groupby.py:2333(group_info)
        1    0.000    0.000    0.003    0.003 groupby.py:2350(_get_compressed_labels)
        1    0.000    0.000    0.003    0.003 groupby.py:2351(<listcomp>)
        1    0.000    0.000    0.000    0.000 groupby.py:2939(__init__)
        2    0.000    0.000    0.003    0.001 groupby.py:3067(labels)
        2    0.000    0.000    0.000    0.000 groupby.py:3089(group_index)
        1    0.000    0.000    0.003    0.003 groupby.py:3095(_make_labels)
        1    0.000    0.000    0.000    0.000 groupby.py:3114(_get_grouper)
        2    0.000    0.000    0.000    0.000 groupby.py:3228(<genexpr>)
        2    0.000    0.000    0.000    0.000 groupby.py:3229(<genexpr>)
        2    0.000    0.000    0.000    0.000 groupby.py:3230(<genexpr>)
        1    0.000    0.000    0.000    0.000 groupby.py:3258(is_in_axis)
        1    0.000    0.000    0.000    0.000 groupby.py:3268(is_in_obj)
        1    0.000    0.000    0.000    0.000 groupby.py:3327(_is_label_like)
        1    0.000    0.000    0.000    0.000 groupby.py:3332(_convert_grouper)
        1    0.000    0.000    0.000    0.000 groupby.py:5021(__init__)
        1    0.000    0.000    0.000    0.000 groupby.py:5028(slabels)
        1    0.000    0.000    0.000    0.000 groupby.py:5033(sort_idx)
    10001    0.026    0.000    1.138    0.000 groupby.py:5038(__iter__)
        1    0.000    0.000    0.000    0.000 groupby.py:5057(_get_sorted_data)
    10000    0.020    0.000    1.111    0.000 groupby.py:5069(_chop)
        1    0.000    0.000    0.000    0.000 groupby.py:5120(get_splitter)
        1    0.000    0.000    0.000    0.000 groupby.py:567(__init__)
        1    0.000    0.000    0.000    0.000 groupby.py:881(__iter__)
        2    0.000    0.000    0.000    0.000 index_tricks.py:656(__getitem__)
   100002    0.057    0.000    0.103    0.000 inference.py:119(is_iterator)
    50003    0.027    0.000    0.132    0.000 inference.py:251(is_list_like)
        1    0.000    0.000    0.000    0.000 inference.py:287(is_array_like)
   170002    0.214    0.000    0.364    0.000 internals.py:116(__init__)
   170002    0.044    0.000    0.044    0.000 internals.py:127(_check_ndim)
   100000    0.030    0.000    0.041    0.000 internals.py:166(_consolidate_key)
   150000    0.507    0.000    1.307    0.000 internals.py:1723(__init__)
    10000    0.002    0.000    0.002    0.000 internals.py:199(external_values)
        1    0.000    0.000    0.000    0.000 internals.py:203(internal_values)
   280002    0.036    0.000    0.036    0.000 internals.py:233(mgr_locs)
   170002    0.075    0.000    0.091    0.000 internals.py:237(mgr_locs)
    50000    0.038    0.000    0.550    0.000 internals.py:251(make_block)
    10000    0.013    0.000    0.066    0.000 internals.py:269(make_block_same_class)
   150000    0.243    0.000    1.700    0.000 internals.py:3000(__init__)
    50000    0.031    0.000    0.031    0.000 internals.py:3039(sp_index)
    50000    0.038    0.000    0.068    0.000 internals.py:3043(kind)
    50000    0.124    0.000    1.925    0.000 internals.py:3061(copy)
    50000    0.202    0.000    1.686    0.000 internals.py:3067(make_block_same_class)
    10000    0.015    0.000    0.015    0.000 internals.py:310(_slice)
   160002    0.148    0.000    0.490    0.000 internals.py:3148(get_block_type)
   170002    0.287    0.000    2.401    0.000 internals.py:3191(make_block)
        1    0.012    0.012    0.197    0.197 internals.py:3265(__init__)
        1    0.000    0.000    0.000    0.000 internals.py:3266(<listcomp>)
        5    0.000    0.000    0.000    0.000 internals.py:3307(shape)
       15    0.000    0.000    0.000    0.000 internals.py:3309(<genexpr>)
   100000    0.133    0.000    0.326    0.000 internals.py:3315(set_axis)
        2    0.128    0.064    0.192    0.096 internals.py:3363(_rebuild_blknos_and_blklocs)
        2    0.000    0.000    0.000    0.000 internals.py:3384(_get_items)
        1    0.001    0.001    0.023    0.023 internals.py:3488(_verify_integrity)
    50001    0.011    0.000    0.019    0.000 internals.py:3490(<genexpr>)
    60000    0.436    0.000    2.805    0.000 internals.py:3500(apply)
   200001    0.030    0.000    0.030    0.000 internals.py:352(dtype)
    50000    0.014    0.000    0.048    0.000 internals.py:356(ftype)
    60000    0.009    0.000    0.009    0.000 internals.py:3561(<genexpr>)
        1    0.000    0.000    0.000    0.000 internals.py:3776(is_consolidated)
        1    0.002    0.002    0.058    0.058 internals.py:3784(_consolidate_check)
        1    0.008    0.008    0.056    0.056 internals.py:3785(<listcomp>)
    60000    0.185    0.000    5.267    0.000 internals.py:3895(copy)
    60000    0.041    0.000    2.228    0.000 internals.py:3915(<lambda>)
    60000    0.048    0.000    2.276    0.000 internals.py:3916(<listcomp>)
    50001    0.027    0.000    0.034    0.000 internals.py:4085(consolidate)
        1    0.001    0.001    0.190    0.190 internals.py:4101(_consolidate_inplace)
    50000    0.142    0.000    5.003    0.000 internals.py:4388(reindex_indexer)
   120002    0.345    0.000    1.457    0.000 internals.py:4639(__init__)
   120002    0.053    0.000    0.053    0.000 internals.py:4684(_block)
    10000    0.036    0.000    0.900    0.000 internals.py:4702(get_slice)
    80000    0.019    0.000    0.019    0.000 internals.py:4709(index)
    50001    0.022    0.000    0.051    0.000 internals.py:4718(dtype)
    10000    0.010    0.000    0.015    0.000 internals.py:4742(external_values)
        1    0.000    0.000    0.000    0.000 internals.py:4745(internal_values)
    50001    0.007    0.000    0.007    0.000 internals.py:4768(is_consolidated)
   120000    0.015    0.000    0.015    0.000 internals.py:4774(_consolidate_inplace)
        1    0.013    0.013    2.916    2.916 internals.py:4869(create_block_manager_from_arrays)
        1    0.067    0.067    2.514    2.514 internals.py:4880(form_blocks)
        1    0.138    0.138    2.163    2.163 internals.py:5003(_sparse_blockify)
        1    0.011    0.011    0.093    0.093 internals.py:5057(_consolidate)
   100000    0.016    0.000    0.057    0.000 internals.py:5063(<lambda>)
        1    0.000    0.000    0.000    0.000 internals.py:5074(_merge_blocks)
    60001    0.070    0.000    0.110    0.000 internals.py:5101(_extend_blocks)
    10000    0.011    0.000    0.096    0.000 internals.py:774(copy)
        2    0.000    0.000    0.000    0.000 missing.py:112(_isna_new)
        1    0.000    0.000    0.000    0.000 missing.py:189(_isna_ndarraylike)
        1    0.000    0.000    0.000    0.000 missing.py:259(notna)
        2    0.000    0.000    0.000    0.000 missing.py:32(isna)
    50000    0.320    0.000   67.865    0.001 missing.py:376(array_equivalent)
    50000    0.030    0.000    0.045    0.000 missing.py:596(clean_reindex_fill_method)
    50000    0.015    0.000    0.015    0.000 missing.py:74(clean_fill_method)
    20002    0.003    0.000    0.003    0.000 numeric.py:110(is_all_dates)
        1    0.000    0.000    0.000    0.000 numeric.py:193(_assert_safe_casting)
    10003    0.029    0.000    0.147    0.000 numeric.py:35(__new__)
   340006    0.160    0.000  289.007    0.001 numeric.py:433(asarray)
   100002    0.035    0.000    0.308    0.000 numeric.py:556(ascontiguousarray)
    50000    0.180    0.000    0.378    0.000 numeric.py:630(require)
    20001    0.042    0.000    0.699    0.000 numeric.py:64(_shallow_copy)
   100000    0.034    0.000    0.045    0.000 numeric.py:701(<genexpr>)
        1    0.000    0.000    0.000    0.000 range.py:131(_simple_new)
        1    0.000    0.000    0.000    0.000 range.py:158(_validate_dtype)
        1    0.000    0.000    0.000    0.000 range.py:169(_data)
        3    0.000    0.000    0.004    0.001 range.py:257(tolist)
        1    0.000    0.000    0.000    0.000 range.py:315(equals)
    60011    0.034    0.000    0.054    0.000 range.py:481(__len__)
    60000    0.100    0.000    0.178    0.000 range.py:491(__getitem__)
        1    0.000    0.000    0.000    0.000 range.py:68(__new__)
        2    0.000    0.000    0.000    0.000 range.py:84(_ensure_int)
    20002    0.076    0.000    0.306    0.000 series.py:166(__init__)
    50000    0.028    0.000    0.089    0.000 series.py:175(values)
    50000    0.031    0.000    0.061    0.000 series.py:188(block)
    50000    0.012    0.000    0.012    0.000 series.py:226(_constructor)
    10000    0.053    0.000    0.941    0.000 series.py:2503(sort_index)
    50000    0.007    0.000    0.007    0.000 series.py:3237(_needs_reindex_multi)
   100000    0.067    0.000    0.067    0.000 series.py:332(_set_subtyp)
    50000    0.116    0.000  253.843    0.005 series.py:3323(reindex)
    20001    0.005    0.000    0.005    0.000 series.py:349(_constructor)
        1    0.000    0.000    0.000    0.000 series.py:3508(_take)
   120002    0.472    0.000    1.494    0.000 series.py:365(_set_axis)
    20002    0.008    0.000    0.008    0.000 series.py:391(_set_subtyp)
   240003    0.209    0.000    0.363    0.000 series.py:401(name)
        1    0.000    0.000    0.000    0.000 series.py:4019(_sanitize_array)
        1    0.000    0.000    0.000    0.000 series.py:4036(_try_cast)
   240003    0.091    0.000    0.091    0.000 series.py:405(name)
    50001    0.029    0.000    0.080    0.000 series.py:412(dtype)
    10000    0.008    0.000    0.023    0.000 series.py:432(values)
        1    0.000    0.000    0.000    0.000 series.py:465(_values)
    50000    0.150    0.000  253.993    0.005 series.py:565(reindex)
   100000    0.735    0.000  171.894    0.002 series.py:64(__init__)
    10000    0.028    0.000    1.090    0.000 series.py:875(_get_values)
        1    0.000    0.000    0.000    0.000 sorting.py:321(get_group_index_sorter)
   220005    0.095    0.000    0.095    0.000 {built-in method __new__ of type object at 0x7ff7181d1be0}
    50003    0.035    0.000    0.035    0.000 {built-in method _abc._abc_instancecheck}
        1    0.000    0.000    0.000    0.000 {built-in method _thread.get_ident}
    50000    0.025    0.000    0.303    0.000 {built-in method builtins.all}
        3    0.000    0.000    0.000    0.000 {built-in method builtins.any}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.callable}
        1    0.000    0.000  599.169  599.169 {built-in method builtins.exec}
  1980098    0.409    0.000    0.429    0.000 {built-in method builtins.getattr}
  1300057    0.577    0.000    0.577    0.000 {built-in method builtins.hasattr}
    50000    0.003    0.000    0.003    0.000 {built-in method builtins.hash}
        2    0.000    0.000    0.000    0.000 {built-in method builtins.id}
 10460382    1.555    0.000    2.071    0.000 {built-in method builtins.isinstance}
  1160059    0.161    0.000    0.161    0.000 {built-in method builtins.issubclass}
   220011    0.109    0.000    0.109    0.000 {built-in method builtins.iter}
1640121/1370109    0.441    0.000    0.574    0.000 {built-in method builtins.len}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.locals}
    60013    0.019    0.000    0.019    0.000 {built-in method builtins.max}
    50001    0.092    0.000    0.123    0.000 {built-in method builtins.sorted}
        1    0.003    0.003    0.022    0.022 {built-in method builtins.sum}
   100004    0.043    0.000    0.043    0.000 {built-in method numpy.core.multiarray.arange}
   540012  289.452    0.001  289.452    0.001 {built-in method numpy.core.multiarray.array}
        1    0.000    0.000    0.000    0.000 {built-in method numpy.core.multiarray.concatenate}
        7    0.000    0.000    0.000    0.000 {built-in method numpy.core.multiarray.empty}
        7    0.000    0.000    0.000    0.000 {built-in method pandas._libs.algos.ensure_int64}
   200000    0.034    0.000    0.034    0.000 {built-in method pandas._libs.algos.ensure_object}
        4    0.000    0.000    0.000    0.000 {built-in method pandas._libs.algos.ensure_platform_int}
    50000   66.855    0.001   66.855    0.001 {built-in method pandas._libs.lib.array_equivalent_object}
    10001    0.002    0.000    0.002    0.000 {built-in method pandas._libs.lib.is_bool}
   100000    0.242    0.000    0.242    0.000 {built-in method pandas._libs.lib.is_datetime_array}
    10001    0.002    0.000    0.002    0.000 {built-in method pandas._libs.lib.is_float}
   110004    0.018    0.000    0.018    0.000 {built-in method pandas._libs.lib.is_integer}
   170007    0.203    0.000    0.203    0.000 {built-in method pandas._libs.lib.is_scalar}
        1    0.000    0.000    0.000    0.000 {built-in method pandas._libs.missing.checknull}
    10001    0.140    0.000    0.140    0.000 {built-in method pandas._libs.sparse.get_blocks}
        1    0.000    0.000    0.000    0.000 {method 'add' of 'set' objects}
        2    0.000    0.000    0.000    0.000 {method 'any' of 'numpy.ndarray' objects}
   210001    0.028    0.000    0.028    0.000 {method 'append' of 'list' objects}
        2    0.001    0.001    0.001    0.001 {method 'argsort' of 'numpy.ndarray' objects}
    20004    0.067    0.000    0.067    0.000 {method 'astype' of 'numpy.ndarray' objects}
        2    0.000    0.000    0.000    0.000 {method 'copy' of 'dict' objects}
    60000    0.080    0.000    0.080    0.000 {method 'copy' of 'numpy.ndarray' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {method 'discard' of 'set' objects}
        1    0.001    0.001    0.001    0.001 {method 'extend' of 'list' objects}
        5    0.000    0.000    0.000    0.000 {method 'fill' of 'numpy.ndarray' objects}
    50001    0.029    0.000    0.029    0.000 {method 'format' of 'str' objects}
   260006    0.048    0.000    0.048    0.000 {method 'get' of 'dict' objects}
        1    0.001    0.001    0.001    0.001 {method 'get_labels' of 'pandas._libs.hashtable.Int64HashTable' objects}
        1    0.000    0.000    0.000    0.000 {method 'get_loc' of 'pandas._libs.index.IndexEngine' objects}
   270010    0.061    0.000    0.061    0.000 {method 'items' of 'dict' objects}
    50000    0.016    0.000    0.016    0.000 {method 'keys' of 'dict' objects}
        1    0.000    0.000    0.000    0.000 {method 'lower' of 'str' objects}
        1    0.000    0.000    0.000    0.000 {method 'nonzero' of 'numpy.ndarray' objects}
   400003    0.056    0.000    0.056    0.000 {method 'pop' of 'dict' objects}
   100000    0.078    0.000    0.078    0.000 {method 'ravel' of 'numpy.ndarray' objects}
    50005    0.482    0.000    0.482    0.000 {method 'reduce' of 'numpy.ufunc' objects}
   120007    0.088    0.000    0.088    0.000 {method 'rpartition' of 'str' objects}
        1    0.000    0.000    0.000    0.000 {method 'rstrip' of 'str' objects}
        2    0.000    0.000    0.000    0.000 {method 'search' of 're.Pattern' objects}
        5    0.000    0.000    0.000    0.000 {method 'startswith' of 'str' objects}
        3    0.000    0.000    0.000    0.000 {method 'take' of 'numpy.ndarray' objects}
        1    0.000    0.000    0.000    0.000 {method 'to_array' of 'pandas._libs.hashtable.Int64Vector' objects}
        1    0.001    0.001    0.001    0.001 {method 'tolist' of 'numpy.ndarray' objects}
   120004    0.063    0.000    0.066    0.000 {method 'update' of 'dict' objects}
    50000    0.011    0.000    0.011    0.000 {method 'upper' of 'str' objects}
   590007    0.568    0.000    0.729    0.000 {method 'view' of 'numpy.ndarray' objects}
        1    0.000    0.000    0.000    0.000 {pandas._libs.algos.groupsort_indexer}
        2    0.000    0.000    0.000    0.000 {pandas._libs.algos.take_1d_int64_int64}
        1    0.000    0.000    0.000    0.000 {pandas._libs.lib.generate_slices}
   100001   45.236    0.000   45.236    0.000 {pandas._libs.lib.infer_dtype}
   100000    0.060    0.000    0.247    0.000 {pandas._libs.lib.values_from_object}

I don't really know how to read this, but it looks like half the time is spent calling np.asarray, which is consistent with densifying the input matrix.
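A listing ordered by "standard name" is hard to scan for hot spots. As a minimal sketch (the `workload`, `slow_part`, and `fast_part` functions are hypothetical stand-ins, not the pandas call profiled above), the same profiler data can be sorted by cumulative time and truncated to the top few entries with the standard-library `pstats` module:

```python
import cProfile
import io
import pstats


def slow_part():
    # Hypothetical stand-in for the expensive call being investigated.
    return sum(i * i for i in range(200_000))


def fast_part():
    # Hypothetical cheap call, included only for contrast.
    return len("abc")


def workload():
    fast_part()
    return slow_part()


profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Sort by cumulative time and keep only the five most expensive entries.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

Sorting by `"tottime"` instead highlights functions that are expensive in their own body rather than through the functions they call.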

And for the sum

>>> df = pd.SparseDataFrame(X)
>>> cProfile.run('df.sum(axis=0)')
         1050275 function calls in 95.012 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:1009(_handle_fromlist)
        1    0.000    0.000   95.012   95.012 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 __init__.py:211(itervalues)
        2    0.000    0.000    5.221    2.610 _methods.py:34(_sum)
        1    0.000    0.000    0.000    0.000 _methods.py:45(_all)
        1    0.000    0.000    0.000    0.000 abc.py:137(__instancecheck__)
    50000    0.027    0.000    0.027    0.000 array.py:306(__len__)
    50000    0.123    0.000    1.736    0.000 array.py:332(values)
    50000    0.011    0.000    0.011    0.000 array.py:352(fill_value)
    50000    0.059    0.000    1.795    0.000 array.py:372(to_dense)
        1    0.000    0.000    0.000    0.000 base.py:4914(_ensure_index)
        3    0.000    0.000    0.000    0.000 base.py:61(is_dtype)
        1    0.000    0.000    0.047    0.047 cast.py:1093(find_common_type)
    50000    0.009    0.000    0.044    0.000 cast.py:1118(<genexpr>)
        1    0.000    0.000    0.000    0.000 cast.py:853(maybe_castable)
        2    0.000    0.000    0.000    0.000 common.py:1170(is_datetime_or_timedelta_dtype)
        1    0.000    0.000    0.000    0.000 common.py:122(is_sparse)
        1    0.000    0.000    0.000    0.000 common.py:1405(needs_i8_conversion)
        2    0.000    0.000    0.000    0.000 common.py:1527(is_float_dtype)
        1    0.000    0.000    0.000    0.000 common.py:1578(is_bool_dtype)
        1    0.000    0.000    0.000    0.000 common.py:1688(is_extension_array_dtype)
    99999    0.016    0.000    0.021    0.000 common.py:1784(_get_dtype)
       10    0.000    0.000    0.000    0.000 common.py:1835(_get_dtype_type)
        1    0.000    0.000    0.000    0.000 common.py:332(is_datetime64_dtype)
        1    0.000    0.000    0.000    0.000 common.py:369(is_datetime64tz_dtype)
        1    0.000    0.000    0.000    0.000 common.py:407(is_timedelta64_dtype)
        1    0.000    0.000    0.000    0.000 common.py:444(is_period_dtype)
        1    0.000    0.000    0.000    0.000 common.py:477(is_interval_dtype)
        1    0.000    0.000    0.000    0.000 common.py:546(is_string_dtype)
    49999    0.014    0.000    0.035    0.000 common.py:692(is_dtype_equal)
        1    0.000    0.000    0.000    0.000 common.py:811(is_integer_dtype)
        1    0.000    0.000    0.000    0.000 common.py:89(is_object_dtype)
        1    0.000    0.000    0.000    0.000 common.py:995(is_int_or_datetime_dtype)
        1    0.000    0.000    0.000    0.000 dtypes.py:584(is_dtype)
        1    0.000    0.000    0.000    0.000 dtypes.py:707(is_dtype)
        1    0.000    0.000    0.000    0.000 frame.py:133(_constructor)
        1    0.000    0.000   95.012   95.012 frame.py:6845(_reduce)
        1    0.000    0.000   86.265   86.265 frame.py:6856(f)
        1    0.000    0.000    0.000    0.000 frame.py:7047(_get_agg_axis)
        1    0.000    0.000    0.000    0.000 fromnumeric.py:1933(any)
        1    0.000    0.000    0.000    0.000 fromnumeric.py:64(_wrapreduction)
        1    0.000    0.000    0.000    0.000 function.py:38(__call__)
        1    0.000    0.000    0.000    0.000 generic.py:124(__init__)
        1    0.000    0.000    0.000    0.000 generic.py:364(_get_axis_number)
        1    0.000    0.000    0.000    0.000 generic.py:4362(__getattr__)
        2    0.000    0.000    0.000    0.000 generic.py:4378(__setattr__)
        1    0.000    0.000    0.000    0.000 generic.py:4423(_protect_consolidate)
        1    0.000    0.000    0.000    0.000 generic.py:4433(_consolidate_inplace)
        1    0.000    0.000    0.000    0.000 generic.py:4436(f)
        1    0.000    0.000    8.747    8.747 generic.py:4563(values)
        8    0.000    0.000    0.000    0.000 generic.py:7(_check)
        1    0.000    0.000   95.012   95.012 generic.py:9577(stat_func)
        1    0.000    0.000    0.000    0.000 inference.py:251(is_list_like)
        1    0.000    0.000    0.000    0.000 internals.py:116(__init__)
        1    0.000    0.000    0.000    0.000 internals.py:127(_check_ndim)
    50000    0.100    0.000    1.934    0.000 internals.py:1751(get_values)
    50001    0.017    0.000    0.017    0.000 internals.py:233(mgr_locs)
        1    0.000    0.000    0.000    0.000 internals.py:237(mgr_locs)
        1    0.000    0.000    0.000    0.000 internals.py:3148(get_block_type)
        1    0.000    0.000    0.000    0.000 internals.py:3191(make_block)
        2    0.000    0.000    0.000    0.000 internals.py:3307(shape)
        6    0.000    0.000    0.000    0.000 internals.py:3309(<genexpr>)
        1    0.000    0.000    0.000    0.000 internals.py:3311(ndim)
        1    0.000    0.000    0.000    0.000 internals.py:3351(_is_single_block)
    50000    0.004    0.000    0.004    0.000 internals.py:352(dtype)
        2    0.000    0.000    0.000    0.000 internals.py:3776(is_consolidated)
        1    0.000    0.000    0.000    0.000 internals.py:3789(is_mixed_type)
        1    0.000    0.000    8.747    8.747 internals.py:3922(as_array)
        1    6.736    6.736    8.747    8.747 internals.py:3953(_interleave)
        1    0.000    0.000    0.000    0.000 internals.py:4085(consolidate)
        1    0.000    0.000    0.000    0.000 internals.py:4101(_consolidate_inplace)
        1    0.000    0.000    0.000    0.000 internals.py:4639(__init__)
        1    0.000    0.000    0.059    0.059 internals.py:5044(_interleaved_dtype)
        1    0.008    0.008    0.012    0.012 internals.py:5048(<listcomp>)
        1    0.000    0.000    2.620    2.620 missing.py:112(_isna_new)
        1    2.620    2.620    2.620    2.620 missing.py:189(_isna_ndarraylike)
        1    0.000    0.000    2.620    2.620 missing.py:32(isna)
        1    0.000    0.000    0.000    0.000 nanops.py:179(_get_fill_value)
        1    0.000    0.000   80.090   80.090 nanops.py:202(_get_values)
        1    0.000    0.000    0.000    0.000 nanops.py:256(_na_ok_dtype)
        1    0.000    0.000    0.000    0.000 nanops.py:260(_view_if_needed)
        1    0.000    0.000    0.000    0.000 nanops.py:266(_wrap_results)
        1    0.000    0.000   85.312   85.312 nanops.py:328(nansum)
        4    0.000    0.000    0.000    0.000 nanops.py:64(check)
        1    0.952    0.952   86.264   86.264 nanops.py:69(_f)
        1    0.001    0.001    2.475    2.475 nanops.py:712(_maybe_null_out)
        5    0.000    0.000    0.000    0.000 nanops.py:72(<genexpr>)
        1    0.000    0.000    0.000    0.000 numeric.py:110(is_all_dates)
        2    0.000    0.000    0.000    0.000 numeric.py:2491(seterr)
        2    0.000    0.000    0.000    0.000 numeric.py:2592(geterr)
        1    0.000    0.000    0.000    0.000 numeric.py:2887(__init__)
        1    0.000    0.000    0.000    0.000 numeric.py:2891(__enter__)
        1    0.000    0.000    0.000    0.000 numeric.py:2896(__exit__)
    50000    0.018    0.000    0.078    0.000 numeric.py:556(ascontiguousarray)
        7    0.000    0.000    0.000    0.000 range.py:481(__len__)
        1    0.000    0.000    0.000    0.000 series.py:166(__init__)
        1    0.000    0.000    0.000    0.000 series.py:365(_set_axis)
        1    0.000    0.000    0.000    0.000 series.py:391(_set_subtyp)
        1    0.000    0.000    0.000    0.000 series.py:401(name)
        1    0.000    0.000    0.000    0.000 series.py:4019(_sanitize_array)
        1    0.000    0.000    0.000    0.000 series.py:4036(_try_cast)
        1    0.000    0.000    0.000    0.000 series.py:405(name)
        1    0.000    0.000    0.000    0.000 {built-in method _abc._abc_instancecheck}
        1    0.003    0.003    0.047    0.047 {built-in method builtins.all}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.any}
        1    0.000    0.000   95.012   95.012 {built-in method builtins.exec}
       14    0.000    0.000    0.000    0.000 {built-in method builtins.getattr}
        7    0.000    0.000    0.000    0.000 {built-in method builtins.hasattr}
   100049    0.005    0.000    0.005    0.000 {built-in method builtins.isinstance}
       14    0.000    0.000    0.000    0.000 {built-in method builtins.issubclass}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.iter}
    50019    0.031    0.000    0.058    0.000 {built-in method builtins.len}
        7    0.000    0.000    0.000    0.000 {built-in method builtins.max}
    50000    0.061    0.000    0.061    0.000 {built-in method numpy.core.multiarray.array}
    50001    0.179    0.000    0.179    0.000 {built-in method numpy.core.multiarray.empty}
        1   31.546   31.546   31.546   31.546 {built-in method numpy.core.multiarray.putmask}
        1    0.000    0.000    0.000    0.000 {built-in method numpy.core.multiarray.zeros}
        4    0.000    0.000    0.000    0.000 {built-in method numpy.core.umath.geterrobj}
        2    0.000    0.000    0.000    0.000 {built-in method numpy.core.umath.seterrobj}
        1    0.000    0.000    0.000    0.000 {built-in method pandas._libs.lib.is_integer}
        1    0.000    0.000    0.000    0.000 {built-in method pandas._libs.lib.is_scalar}
        1    0.000    0.000    0.000    0.000 {method 'all' of 'numpy.ndarray' objects}
        1   45.924   45.924   45.924   45.924 {method 'copy' of 'numpy.ndarray' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
    50000    0.861    0.000    0.861    0.000 {method 'fill' of 'numpy.ndarray' objects}
        1    0.000    0.000    0.000    0.000 {method 'get' of 'dict' objects}
        1    0.000    0.000    0.000    0.000 {method 'items' of 'dict' objects}
        1    0.000    0.000    0.000    0.000 {method 'pop' of 'dict' objects}
    50000    0.099    0.000    0.099    0.000 {method 'put' of 'numpy.ndarray' objects}
        4    5.221    1.305    5.221    1.305 {method 'reduce' of 'numpy.ufunc' objects}
    50000    0.039    0.000    0.039    0.000 {method 'reshape' of 'numpy.ndarray' objects}
        2    0.000    0.000    5.221    2.610 {method 'sum' of 'numpy.ndarray' objects}
    50000    0.327    0.000    0.406    0.000 {method 'to_int_index' of 'pandas._libs.sparse.BlockIndex' objects}
        1    0.000    0.000    0.000    0.000 {method 'transpose' of 'numpy.ndarray' objects}
        1    0.000    0.000    0.000    0.000 {method 'values' of 'dict' objects}
        1    0.000    0.000    0.000    0.000 {pandas._libs.lib.values_from_object}
Not quite sure what that one's doing; I would benefit from some expert assistance here.

Thanks for the FYI. I'll look into it.

@fedarko

fedarko commented Jul 4, 2019

Not sure if this is related, but taking the transpose of a SparseDataFrame also seems to take more resources than it should. This may be an artifact of somehow reverting to a dense representation.

Here's a demonstration, mostly copied from @scottgigante's code above:

import scipy.sparse as sp
import pandas as pd
import numpy as np
import cProfile
shape = (50000, 50000)
data = np.repeat(1, 10000)
i = np.random.choice(shape[0], 10000, replace=False)
j = np.random.choice(shape[1], 10000, replace=False)
X = sp.coo_matrix((data, (i, j)), shape=shape)
df = pd.SparseDataFrame(X)
# This executes almost immediately (Using cProfile on this shows that it takes
# "70 function calls in 0.000 seconds")
X.T
# As of writing, this has been running for around an hour on my computer
df.T

This was done with Pandas 0.24.2 on a 2012 MacBook Pro.

(Since it seems like Pandas 0.25 will change a lot re: sparse data structures, this problem might go away with that new version, but I figured I should document it here, since I haven't seen any other mentions of this.)

@TomAugspurger
Contributor

I don't expect this to change with 0.25.

I also don't see how pandas SparseArray would behave well with a transpose (or a DataFrame with many SparseArrays). Every column's sparse index is stored separately.
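To illustrate the point about per-column storage, here is a small sketch using toy data (not taken from the issue). It assumes a pandas version where `pd.arrays.SparseArray` is available, and it relies on the `sp_index` attribute of `SparseArray`, which exposes the positions of the explicitly stored values:

```python
import pandas as pd

# Two small sparse columns; zeros are the implicit fill value for integers.
df = pd.DataFrame(
    {
        "a": pd.arrays.SparseArray([0, 0, 1, 0]),
        "b": pd.arrays.SparseArray([2, 0, 0, 3]),
    }
)

# Each column carries its own index of explicitly stored row positions.
stored_positions = {
    name: column.array.sp_index.to_int_index().indices.tolist()
    for name, column in df.items()
}
print(stored_positions)
```

Because the stored positions are kept column by column, a transpose cannot simply swap two coordinate arrays the way a scipy COO matrix can; every new column has to be assembled from entries scattered across all of the old ones.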

@rtlee9
Contributor

rtlee9 commented Jul 9, 2019

I did some profiling, and it looks like the difference between constructing the SparseDataFrame with an int index vs. a str index comes from the construction of the new index (one per column) during reindexing: an Int64Index is constructed here for an int index, whereas np.asarray is called indirectly several lines down for a str index. I'm not totally sure why df.index = np.arange(shape[0]).astype(str) is so fast, but I'm guessing the array is simply copied instead of being constructed via np.asarray for each column.

@jorisvandenbossche
Member

This seems to be no longer a problem with the new pd.DataFrame.sparse.from_spmatrix.
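For reference, a minimal sketch of that constructor with a string index, assuming pandas 0.25 or later and scipy are installed. The shape is much smaller than in the original report so that it runs quickly, and the random seed and sizes are arbitrary choices for illustration:

```python
import numpy as np
import pandas as pd
import scipy.sparse as sp

shape = (5000, 500)
n_stored = 1000
rng = np.random.default_rng(0)

# Unique row positions guarantee that no (row, column) pair is repeated.
i = rng.choice(shape[0], n_stored, replace=False)
j = rng.integers(0, shape[1], n_stored)
data = np.repeat(1, n_stored)
X = sp.coo_matrix((data, (i, j)), shape=shape)

# Pass the string index directly; the columns stay sparse.
string_index = np.arange(shape[0]).astype(str)
df = pd.DataFrame.sparse.from_spmatrix(X, index=string_index)

print(df.dtypes.iloc[0])
print(df.sparse.density)
```

Checking `df.dtypes` and `df.sparse.density` after construction is a quick way to confirm that nothing was densified along the way.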
