sparse int dataframe from dict crashes with default_fill_value=0 #25694

matthias-k · 2019-03-12T19:37:27Z

Code Sample, a copy-pastable example if possible

pd.SparseDataFrame({1: {2: 3}, 3: {3: 4}}, dtype=np.int64, default_fill_value=0)

Problem description

The command fails with ValueError: Cannot convert non-finite values (NA or inf) to integer. It seems to me that missing dict entries are always treated as np.nan instead of the fill value, and the NaN values cannot be cast to int. I haven't yet found a workaround that is feasible with larger data. For example, if I use pd.SparseDataFrame({1: {2: 3}, 3: {3: 4}}, dtype=np.int64), the fill value is np.nan, and the same holds if I use pd.SparseDataFrame({1: {2: 3}, 3: {3: 4}}, default_fill_value=0) (then only the dtype changes).
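
The same cast failure can be reproduced with a plain (dense) DataFrame, which may help isolate the cause. This is only a sketch of the suspected mechanism, not a trace of what SparseDataFrame does internally: building a frame from a dict of dicts aligns the inner keys and fills the gaps with NaN, and an integer cast then rejects the NaN values. The names `dense`, `cast_error` and `filled` are illustrative.

```python
import numpy as np
import pandas as pd

# Missing (row, column) pairs in a dict of dicts become NaN after alignment.
dense = pd.DataFrame({1: {2: 3}, 3: {3: 4}})

# Casting a column that contains NaN to an integer dtype raises ValueError.
try:
    dense.astype(np.int64)
    cast_error = None
except ValueError as exc:
    cast_error = exc
print(type(cast_error).__name__, cast_error)

# Filling the gaps first makes the cast succeed, at the cost of a dense
# intermediate, so this only suits data that fits in memory densely.
filled = dense.fillna(0).astype(np.int64)
print(filled)
```

Filling before casting gives the expected values here, but it goes through a fully dense frame, which is exactly what is impractical for larger data.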

Expected Output

    1  3
2   3  0
3   0  4

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.7.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-693.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.24.1
pytest: None
pip: 9.0.1
setuptools: 39.0.1
Cython: None
numpy: 1.16.2
scipy: None
pyarrow: None
xarray: None
IPython: 7.3.0
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None


TomAugspurger · 2019-03-12T19:41:26Z

FYI, SparseDataFrame is being deprecated soon.

In this case, I believe that

df = pd.DataFrame({
    1: pd.SparseArray([3, 0], fill_value=0),
    3: pd.SparseArray([0, 4], fill_value=0),
}, index=[2, 3])

will work for creating the DataFrame with sparse values. When assigning new columns, you'll need to assign them as SparseArrays.
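
A slightly expanded sketch of the suggestion above, including the assignment of a new sparse column. It assumes a pandas version where `pd.arrays.SparseArray` and the `DataFrame.sparse` accessor are available (the accessor arrived after 0.24, so this will not run on the version in the report). The column label `5` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

SparseArray = pd.arrays.SparseArray

# Each column is its own SparseArray with an integer fill value of 0.
df = pd.DataFrame({
    1: SparseArray(np.array([3, 0], dtype=np.int64), fill_value=0),
    3: SparseArray(np.array([0, 4], dtype=np.int64), fill_value=0),
}, index=[2, 3])

# A new column stays sparse only if it is assigned as a SparseArray.
df[5] = SparseArray(np.array([0, 7], dtype=np.int64), fill_value=0)

print(df.dtypes)
# Fraction of stored (non-fill) values, averaged over the columns.
density = df.sparse.density
print(density)
```

Because no NaN is ever introduced, the integer dtype survives and no cast is needed.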

matthias-k · 2019-03-12T19:49:50Z

@TomAugspurger thanks. The SparseArray workaround is what I'm trying right now. It's looking good but takes quite some time because I have to build all those sparse arrays in memory. But that should be doable with some help from numba.

I wasn't aware that SparseDataFrame is being deprecated. Is there any intended successor, or is sparse data handling just being removed from pandas altogether?

TomAugspurger · 2019-03-12T19:52:02Z

> because I have to build all those sparse arrays in memory.

There isn't a memory difference between SparseDataFrame and a DataFrame where all the columns are sparse. Internally, they're stored the same.

> or is sparse data handling just being removed from pandas altogether?

A DataFrame where each column is a sparse array is the successor.

matthias-k · 2019-03-15T14:33:37Z

> > because I have to build all those sparse arrays in memory.
>
> There isn't a memory difference between SparseDataFrame and a DataFrame where all the columns are sparse. Internally, they're stored the same.

Sorry, I was not precise. I first have to build each column as a dense vector that I can then hand to SparseArray. But that's doable.

> > or is sparse data handling just being removed from pandas altogether?
>
> A DataFrame where each column is a sparse array is the successor.

Ah, good to know! Thanks!
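
One possible way to avoid the dense intermediate column described above is to go through a SciPy sparse matrix. This is a sketch under assumptions, not something proposed in the thread: it requires SciPy and a pandas version that provides `DataFrame.sparse.from_spmatrix`, and the variable names (`nested`, `row_pos`, `matrix`) are illustrative.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Input in the same shape as the report: {column: {row: value}}.
nested = {1: {2: 3}, 3: {3: 4}}

columns = list(nested)
index = sorted({row for col in nested.values() for row in col})
row_pos = {label: i for i, label in enumerate(index)}

# Collect only the stored entries as COO triplets; nothing dense is built.
rows, cols, vals = [], [], []
for j, col in enumerate(columns):
    for row, value in nested[col].items():
        rows.append(row_pos[row])
        cols.append(j)
        vals.append(value)

matrix = sparse.coo_matrix(
    (np.array(vals, dtype=np.int64), (rows, cols)),
    shape=(len(index), len(columns)),
)

# Every resulting column is a sparse integer column with fill value 0.
df = pd.DataFrame.sparse.from_spmatrix(matrix, index=index, columns=columns)
print(df.dtypes)
print(df.sparse.to_dense())
```

Memory use scales with the number of stored entries rather than with rows times columns, apart from the final `to_dense()` call, which is there only to display the result.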

gfyoung added the Missing-data and Sparse labels Mar 16, 2019

simonjayhawkins mentioned this issue Apr 29, 2019

Sparse dataframe fails when astype() is called with dictionary #26227

Closed

TomAugspurger mentioned this issue Sep 16, 2019

Remove SparseSeries and SparseDataFrame #28425

Merged

TomAugspurger closed this as completed in #28425 Sep 18, 2019
