sparse int dataframe from dict crashes with default_fill_value=0 #25694
Comments
FYI, SparseDataFrame is being deprecated soon. In this case, I believe that
df = pd.DataFrame({
    1: pd.SparseArray([3, 0], fill_value=0),
    2: pd.SparseArray([0, 4], fill_value=0),
}, index=[2, 3])
will work for creating the dataframe with sparse values. When assigning new columns you'll need to assign them as SparseArrays.
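For readers on a recent pandas release, the suggestion above can be sketched as follows. This assumes the array class is reachable as `pd.arrays.SparseArray`; the extra column label `5` and its values are made up purely for illustration.

```python
import pandas as pd

# Build a regular DataFrame whose columns are sparse arrays with fill_value=0.
df = pd.DataFrame(
    {
        1: pd.arrays.SparseArray([3, 0], fill_value=0),
        2: pd.arrays.SparseArray([0, 4], fill_value=0),
    },
    index=[2, 3],
)

# New columns must also be assigned as sparse arrays to stay sparse.
# (Column 5 is a hypothetical example, not part of the original report.)
df[5] = pd.arrays.SparseArray([0, 7], fill_value=0)

print(df.dtypes)
```

Each column keeps an integer sparse dtype with fill value 0, so no NaN is introduced for the unset cells.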
@TomAugspurger thanks. The SparseArray workaround is what I'm trying right now. It's looking good but takes quite some time because I have to build all those sparse arrays in memory. But that should be doable with some help from numba. I wasn't aware that SparseDataFrame is being deprecated. Is there any intended successor, or is sparse data handling just being removed from pandas altogether?
There isn't a memory difference between SparseDataFrame and a DataFrame where all the columns are sparse. Internally, they're stored the same.
A DataFrame where each column is a sparse array is the successor.
Sorry, I was not precise. I first have to build each column as a dense vector that I can then hand to
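The column-by-column approach described in this comment might look roughly like the sketch below. The helper name `sparse_frame_from_nested` and its parameters are hypothetical, and it assumes `pd.arrays.SparseArray` is available. Only one dense column exists at a time, but each one is still materialised in full, which is the cost mentioned above.

```python
import numpy as np
import pandas as pd


def sparse_frame_from_nested(data, index, fill_value=0, dtype=np.int64):
    """Hypothetical helper: build a sparse-column DataFrame from {col: {row: value}}."""
    # Map each row label to its position in the dense buffer.
    position = {label: i for i, label in enumerate(index)}
    columns = {}
    for col, entries in data.items():
        # Temporary dense vector, pre-filled with the fill value instead of NaN.
        dense = np.full(len(index), fill_value, dtype=dtype)
        for row, value in entries.items():
            dense[position[row]] = value
        # Convert to sparse; the dense buffer can be garbage-collected afterwards.
        columns[col] = pd.arrays.SparseArray(dense, fill_value=fill_value)
    return pd.DataFrame(columns, index=index)


# The nested dict from the problem description, with an explicit row index.
frame = sparse_frame_from_nested({1: {2: 3}, 3: {3: 4}}, index=[2, 3])
print(frame)
print(frame.dtypes)
```

Because missing entries are pre-filled with the fill value, the integer dtype survives and no NaN-to-int cast is attempted.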
Code Sample, a copy-pastable example if possible
Problem description
The command fails with
ValueError: Cannot convert non-finite values (NA or inf) to integer
To me the problem seems to be that missing dict entries are always treated as np.nan instead of the fill value, but NaN values cannot be cast to ints. I haven't found a workaround yet that is feasible with larger data. E.g. if I use pd.SparseDataFrame({1: {2: 3}, 3: {3: 4}}, dtype=np.int64), the fill value is np.nan, and the same is true if I use pd.SparseDataFrame({1: {2: 3}, 3: {3: 4}}, default_fill_value=0) (then just the dtype changes).
Expected Output
Output of
pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.7.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-693.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.24.1
pytest: None
pip: 9.0.1
setuptools: 39.0.1
Cython: None
numpy: 1.16.2
scipy: None
pyarrow: None
xarray: None
IPython: 7.3.0
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None