sparse int dataframe from dict crashes with default_fill_value=0 #25694

Closed
matthias-k opened this issue Mar 12, 2019 · 4 comments · Fixed by #28425
Labels
Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Sparse Sparse Data Type

Comments

@matthias-k

Code Sample, a copy-pastable example if possible

pd.SparseDataFrame({1: {2: 3}, 3: {3: 4}}, dtype=np.int64, default_fill_value=0) 

Problem description

The command fails with ValueError: Cannot convert non-finite values (NA or inf) to integer. The problem seems to be that missing dict entries are always treated as np.nan instead of the fill value, and the NaN values cannot be cast to ints. I haven't yet found a workaround that is feasible with larger data. For example, if I use pd.SparseDataFrame({1: {2: 3}, 3: {3: 4}}, dtype=np.int64), the fill value is np.nan, and the same happens if I use pd.SparseDataFrame({1: {2: 3}, 3: {3: 4}}, default_fill_value=0) (then just the dtype changes).
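A minimal sketch of the suspected cause, using a plain Series rather than SparseDataFrame (this is an assumption about where the error originates, not a trace through the SparseDataFrame constructor): a missing entry is represented as NaN, which forces a float column, and casting that column to an integer dtype raises a ValueError. Filling the gap first makes the cast valid.

```python
import numpy as np
import pandas as pd

# A column built from {2: 3} over the index [2, 3] has one missing entry,
# which pandas represents as NaN and therefore stores as float64.
column = pd.Series({2: 3}).reindex([2, 3])
print(column.dtype)  # float64

error_message = None
try:
    # NaN has no integer representation, so this cast is rejected.
    column.astype(np.int64)
except ValueError as exc:
    error_message = str(exc)
    print(error_message)

# Replacing the missing entry with the intended fill value first
# makes the integer cast valid.
filled = column.fillna(0).astype(np.int64)
print(filled.tolist())
```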

Expected Output

    1  3
2   3  0
3   0  4

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.7.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-693.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.24.1
pytest: None
pip: 9.0.1
setuptools: 39.0.1
Cython: None
numpy: 1.16.2
scipy: None
pyarrow: None
xarray: None
IPython: 7.3.0
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@TomAugspurger
Contributor

FYI, SparseDataFrame is being deprecated soon.

In this case, I believe that

df = pd.DataFrame({
    1: pd.SparseArray([3, 0], fill_value=0),
    3: pd.SparseArray([0, 4], fill_value=0),
}, index=[2, 3])

will match for creating the dataframe with sparse values. When assigning new columns you'll need to assign them as SparseArrays.
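The same idea can be generalized to an arbitrary dict of dicts. The following is only a sketch: `sparse_frame_from_dict` is a hypothetical helper, not a pandas function, and it uses `pd.arrays.SparseArray`, the namespaced spelling of `SparseArray` available in newer pandas releases. Each column is first built as a dense integer vector in which missing entries take the fill value instead of NaN.

```python
import numpy as np
import pandas as pd


def sparse_frame_from_dict(data, fill_value=0, dtype=np.int64):
    """Hypothetical helper: dict of dicts -> DataFrame with sparse columns."""
    # Union of all row labels, sorted to get a stable index.
    index = sorted({row for col in data.values() for row in col})
    columns = {}
    for name, col in data.items():
        # Dense intermediate: missing entries become fill_value, never NaN,
        # so the integer dtype can be kept.
        dense = np.array([col.get(row, fill_value) for row in index], dtype=dtype)
        columns[name] = pd.arrays.SparseArray(dense, fill_value=fill_value)
    return pd.DataFrame(columns, index=index)


df = sparse_frame_from_dict({1: {2: 3}, 3: {3: 4}})
print(df)
print(df.dtypes)
```

Columns assigned later would need to be wrapped in a SparseArray in the same way to stay sparse.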

@matthias-k
Author

@TomAugspurger thanks. The SparseArray workaround is what I'm trying right now. It's looking good but takes quite some time because I have to build all those sparse arrays in memory. But that should be doable with some help from numba.

I wasn't aware that SparseDataFrame is being deprecated. Is there any intended successor, or is sparse data handling just being removed from pandas altogether?

@TomAugspurger
Contributor

> because I have to build all those sparse arrays in memory.

There isn't a memory difference between SparseDataFrame and a DataFrame where all the columns are sparse. Internally, they're stored the same.

> or is sparse data handling just being removed from pandas altogether?

A DataFrame where each column is a sparse array is the successor.

@matthias-k
Author

> > because I have to build all those sparse arrays in memory.
>
> There isn't a memory difference between SparseDataFrame and a DataFrame where all the columns are sparse. Internally, they're stored the same.

Sorry, I was not precise. I first have to build each column as a dense vector that I can then hand to SparseArray. But that's doable.

> > or is sparse data handling just being removed from pandas altogether?
>
> A DataFrame where each column is a sparse array is the successor.

Ah, good to know! Thanks!
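One possible way to avoid the dense per-column vectors mentioned above is to collect the non-missing entries as coordinates and go through a SciPy sparse matrix. This is a sketch under two assumptions: SciPy is installed, and the pandas version provides `DataFrame.sparse.from_spmatrix` (newer than the 0.24.1 reported in this issue). `sparse_frame_from_dict_coo` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd
from scipy import sparse


def sparse_frame_from_dict_coo(data, dtype=np.int64):
    """Hypothetical helper: dict of dicts -> sparse DataFrame via a COO matrix."""
    index = sorted({row for col in data.values() for row in col})
    columns = list(data)
    row_pos = {label: i for i, label in enumerate(index)}

    # Only the entries that are actually present are stored.
    rows, cols, vals = [], [], []
    for j, name in enumerate(columns):
        for label, value in data[name].items():
            rows.append(row_pos[label])
            cols.append(j)
            vals.append(value)

    matrix = sparse.coo_matrix(
        (np.array(vals, dtype=dtype), (rows, cols)),
        shape=(len(index), len(columns)),
    )
    # Entries absent from the matrix are treated as the fill value 0.
    return pd.DataFrame.sparse.from_spmatrix(matrix, index=index, columns=columns)


df_coo = sparse_frame_from_dict_coo({1: {2: 3}, 3: {3: 4}})
print(df_coo)
print(df_coo.dtypes)
```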
