
DEPR: SparseDataFrame and SparseSeries subclasses #19239

Closed
jreback opened this issue Jan 14, 2018 · 19 comments · Fixed by #26137
Labels
Deprecate Functionality to remove in pandas Sparse Sparse Data Type
Comments

@jreback
Contributor

jreback commented Jan 14, 2018

This is slightly useful in a single-dtype case, but enough of a headache here that we should simply remove it. Ultimately we should even remove SparseSeries and just rely on the SparseArray abstraction and extension types (#19174).

@jreback jreback added Sparse Sparse Data Type Deprecate Functionality to remove in pandas Difficulty Intermediate labels Jan 14, 2018
@jreback jreback added this to the 0.23.0 milestone Jan 14, 2018
@jreback
Contributor Author

jreback commented Jan 14, 2018

cc @jorisvandenbossche @TomAugspurger @wesm
cc @hexgnu
cc @Licht-T
cc @kernc

@datapythonista
Member

In case my use case is useful in the discussion: I mostly (probably only) use SparseDataFrame because of pandas.get_dummies(). After that, I'm mainly interested in having a scipy.sparse structure (for example, to use it in scikit-learn). I'd bet this is the case for many pandas users.

Personally I'm +1 on getting rid of it. Then, it could be useful to have a get_dummies sparse version that returns a scipy.sparse. Not sure if in pandas or as a separate project.
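(For reference, this workflow ended up being possible without the SparseDataFrame subclass: get_dummies grew a sparse=True option that returns a regular DataFrame with sparse-backed columns. A sketch with made-up data:)

```python
import pandas as pd

# hypothetical data; sparse=True returns a plain DataFrame whose
# columns are backed by SparseArray, not a SparseDataFrame subclass
s = pd.Series(["red", "blue", "red", "green"])
dummies = pd.get_dummies(s, sparse=True)

print(dummies.dtypes)  # every column has a Sparse[...] dtype
```

From there, a later-added accessor (df.sparse.to_coo() in pandas 0.25+) hands the data to scipy.sparse for scikit-learn.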

@hexgnu
Contributor

hexgnu commented Jan 29, 2018

I think a lot of the issue with SparseDataFrame is honestly a lack of test coverage, as well as a lot of bleedover into DataFrame. For instance, a lot of code assumes that something is dense and has length > 0, which isn't always true. I feel that with a flag day or two of squashing sparse bugs and writing more tests it wouldn't be so awful.

But that being said I also don't want to spend a bunch of time fixing something nobody really uses.

Also this is not obvious to me on first look:

  • DataFrame can have both Series and SparseSeries in it
  • SparseDataFrame can only really have SparseSeries inside of it

Personally I would think that SparseDataFrame is the only time you can have SparseSeries and DataFrame is always dense.

@hexgnu
Contributor

hexgnu commented Feb 2, 2018

Also I wanted to point out that Sparse bugs aren't the biggest label category in this repo

It looks to me like reshaping, indexing, and dtypes are the big hotspots. Getting rid of sparse I don't think would bring those bug counts down.

[chart: open issue counts by label]

Also I think with some TLC SparseDataFrames can be a really nice little abstraction for saving on memory consumption in certain cases.
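(The memory-saving point is easy to demonstrate with sparse-typed columns; a sketch with synthetic mostly-zero data — exact byte counts depend on the pandas version:)

```python
import numpy as np
import pandas as pd

# a mostly-zero column: sparse storage keeps only the non-fill values
dense = pd.Series(np.zeros(10_000))
dense.iloc[::1000] = 1.0  # 10 non-zero entries

sparse = dense.astype(pd.SparseDtype("float64", fill_value=0.0))

print(dense.memory_usage(index=False))   # 80000 bytes of float64
print(sparse.memory_usage(index=False))  # only the 10 stored values + their indices
```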

@jreback
Contributor Author

jreback commented Feb 2, 2018

@hexgnu it's not about the issues, rather about having another structure to use, which leads to confusion and code complexity. Yes, a pure sparse frame does lead to efficiencies, but you lose (without a lot of acrobatics) the heterogeneity which makes a DataFrame itself so useful. I think a SparseDataFrame should exist as a separate project :> (and the primitives, IOW SparseArray/Series, can still live in pandas proper).

@kernc
Contributor

kernc commented Feb 2, 2018

As long as high-level NDFrame APIs defer actions to block primitives as far as possible, having a regular DataFrame wrapping them should work fine.

@TomAugspurger
Contributor

What are the arguments for keeping SparseDataFrame around? I'm not familiar enough to say for sure whether this is possible, but ideally we would

  • Clearly document that DataFrame can hold sparse data
  • Move all current sparse-specific methods to a .sparse accessor (density, to_coo). Some of these methods would error if the DataFrame isn't entirely sparse.
  • Figure out how to handle default_fill_value
  • Deprecate SparseDataFrame in favor of a DataFrame holding sparse arrays

What can a SparseDataFrame do that a DataFrame[SparseArray] can't?
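(The accessor approach sketched above is roughly what later landed in pandas 0.25; a minimal illustration on a regular DataFrame holding sparse columns:)

```python
import pandas as pd

# a plain DataFrame whose columns are SparseArray-backed
df = pd.DataFrame({
    "A": pd.arrays.SparseArray([0, 0, 1, 0]),
    "B": pd.arrays.SparseArray([0, 1, 0, 0]),
})

print(df.dtypes)          # Sparse[int64, 0] for both columns
print(df.sparse.density)  # 0.25 — two stored values out of eight
```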

@xpe

xpe commented Mar 22, 2018

I hope we'd all agree that support for sparsity is very useful. I don't know the internals of Pandas, so I won't weigh in on the particular question of "What can a SparseDataFrame do that a DataFrame[SparseArray] cannot?"

As a user, I'll share my use case and an appreciation for clear documentation.

My use case involves load_svmlight_file from scikit-learn:

import pandas as pd
from sklearn.datasets import load_svmlight_file
features, labels = load_svmlight_file(data_filename)  # sparse data
feature_labels = ['a', 'b', 'c']  # etc
df = pd.SparseDataFrame(data=features, columns=feature_labels)
df.plot(y='c')  # fails
df.to_dense().plot(y='c')  # succeeds

I'd be open to using other ways to load the sparse LibSVM data, but I don't currently know them. Any recommendations?

Having .plot() work in this case with SparseDataFrame would be nice. :)

@jorisvandenbossche
Member

I am also not familiar enough with the sparse dataframe to answer the question "What can a SparseDataFrame do that a DataFrame[SparseArray] cannot?".
But, when we would take this route and do:

Deprecate SparseDataFrame in favor of a DataFrame holding sparse arrays

My main worry is: how much will this complicate our current DataFrame implementation? Will this mean adding special-casing for sparse dataframes in multiple places? ... If that is the case, I am not sure I am in favor of this.
E.g. in pandas/core/sparse/frame.py, there is still a lot of custom code overriding parent frame methods (next to the sparse-specific methods like to_coo or to_dense). I don't know what all this code does, but it might serve a purpose.

Of course it might be that with the new ExtensionArray interface we could make this much cleaner, and wouldn't need such amount of special casing. But I can't assess if that will be the case or not.

Figure out how to handle default_fill_value

This is a bit similar to the extra metadata attributes I also have in geopandas. It would be good if we can figure out something.
I suppose having it as an attribute on a "sparse dtype" object might be one possible way.
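(That is in fact how it later worked out: the fill value is carried on each column's sparse dtype rather than as a frame-wide default_fill_value. A small sketch with the SparseDtype that pandas 0.24+ exposes:)

```python
import pandas as pd

# the fill value is an attribute of the dtype, not of the frame
arr = pd.arrays.SparseArray([1.0, 1.0, 2.0, 1.0], fill_value=1.0)

print(arr.dtype)       # Sparse[float64, 1.0]
print(arr.fill_value)  # 1.0
print(arr.density)     # 0.25 — only the single 2.0 is stored
```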

@jreback
Contributor Author

jreback commented Mar 23, 2018

The main difference is this. A SDF of a single dtype can be very efficiently represented because it has a default_fill_value for the entire frame. Think of csr_matrix here. A DataFrame can already hold SparseSeries. So in theory you can simply replace every case of a SDF with a DataFrame. If you have lots of columns (like a lot), then this would be pretty inefficient.

My argument for removing this is simple. It causes an enormous amount of complexity in the codebase, and we have many non-implemented features (mainly indexing), so it's pretty half-baked. Simply deprecating/removing it in pandas would allow an external library to implement this properly (to allow for the efficiency argument above), while not sacrificing the occasional use of a SparseSeries (or even a small number of SparseSeries), e.g. what get_dummies returns (or could return); this is in fact the main case.

Alternatively, we could keep this but move the implementation out of main pandas to an external library that is separately maintained.

@jorisvandenbossche
Member

The main difference is this. A SDF of a single dtype can be very efficiently represented because it has a default_fill_value for the entire frame. Think of csr_matrix here. A DataFrame can already hold SparseSeries. So in theory you can simply replace every case of a SDF with a DataFrame. If you have lots of columns (like a lot), then this would be pretty inefficient.

As said above, not familiar with sparse code, but it seems the above is not true and that even a SparseDataFrame with a single dtype already stores the data column-by-column?

In [61]: pd.SparseDataFrame([[0,1],[1,0]])._data
Out[61]: 
BlockManager
Items: RangeIndex(start=0, stop=2, step=1)
Axis 1: RangeIndex(start=0, stop=2, step=1)
SparseBlock: slice(0, 1, 1), 1 x 2, dtype: int64
SparseBlock: slice(1, 2, 1), 1 x 2, dtype: int64

@jreback
Contributor Author

jreback commented Mar 23, 2018

construct by using .to_sparse() from a DF

the method above is not consolidated by default

@jorisvandenbossche
Member

jorisvandenbossche commented Mar 23, 2018

There is no difference:

In [63]: pd.DataFrame([[0,1],[1,0]]).to_sparse()._data
Out[63]: 
BlockManager
Items: RangeIndex(start=0, stop=2, step=1)
Axis 1: RangeIndex(start=0, stop=2, step=1)
SparseBlock: slice(0, 1, 1), 1 x 2, dtype: int64
SparseBlock: slice(1, 2, 1), 1 x 2, dtype: int64

And SparseArray is also limited to 1D I think (it gives an error if you try to pass the above 2D array to SparseArray).
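(A quick check confirms the one-dimensionality; sp_values exposes just the stored non-fill values:)

```python
import pandas as pd

arr = pd.arrays.SparseArray([0, 1, 0, 0])
print(arr.ndim)             # 1 — SparseArray holds one-dimensional data only
print(list(arr.sp_values))  # [1] — only the non-fill values are stored
```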

@TomAugspurger
Contributor

And SparseArray is also limited to 1D I think

Ahh, that answers it then. I was thinking SparseArray was 2D.

@jreback jreback modified the milestones: 0.23.0, 0.24.0 Apr 14, 2018
Licht-T added a commit to Licht-T/arrow that referenced this issue May 4, 2018
pandas Sparse types are planned to be deprecated in pandas future releases.
pandas-dev/pandas#19239
xhochy pushed a commit to apache/arrow that referenced this issue May 5, 2018
…es serializing

This fixes [ARROW-2273](https://issues.apache.org/jira/browse/ARROW-2273).

`pandas` Sparse types are planned to be deprecated in pandas future releases (pandas-dev/pandas#19239).
`SparseDataFrame` and `SparseSeries` are naive implementations and have many bugs. IMO, this is not the right time to support these in `pyarrow`.

Author: Licht-T <licht-t@outlook.jp>

Closes #1997 from Licht-T/add-pandas-sparse-unsupported-msg and squashes the following commits:

64e24ce <Licht-T> ENH: Raise NotImplementedError when pandas Sparse types serializing pandas Sparse types are planned to be deprecated in pandas future releases. pandas-dev/pandas#19239
@TomAugspurger
Contributor

Opened #23148 for creating a sparse accessor.

Have we identified anything SparseDataFrame can do that a regular DataFrame can't?

@xpe I assume that scikit-learn's load_svmlight_file returns a scipy.sparse matrix? In my ideal world you would be able to load that with pd.DataFrame.sparse.from_coo.

Anything else?
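(For the record, this round-trip later landed as pd.DataFrame.sparse.from_spmatrix in pandas 0.25; a sketch with a made-up matrix, assuming scipy is installed:)

```python
import pandas as pd
from scipy import sparse

# a scipy.sparse matrix, like what load_svmlight_file returns
mat = sparse.csr_matrix([[0, 1, 0], [1, 0, 0]])

# load it into a regular DataFrame backed by sparse columns
df = pd.DataFrame.sparse.from_spmatrix(mat, columns=["a", "b", "c"])
print(df.dtypes)  # a Sparse[...] dtype for each column
```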

@TomAugspurger TomAugspurger changed the title DEPR: SparseDataFrame DEPR: SparseDataFrame and SparseSeries Oct 15, 2018
@TomAugspurger
Contributor

I've updated the title to consider deprecating both SparseDataFrame and SparseSeries now that #22325 is in master. (FYI, it'd be helpful to have people testing that on real workloads.)

FWIW, when I was doing that refactor, I discovered several subtle differences between methods implemented on both Series and SparseSeries. The Series implementation was updated but the SparseSeries one lagged behind.

@jreback
Contributor Author

jreback commented Jan 4, 2019

if we can't do this for 0.24, 0.25 is ok too.

@jreback jreback modified the milestones: 0.24.0, 0.25.0 Jan 6, 2019
@frndlytm

Multi-Indexing SparseDataFrames is a nice feature you don't get from scipy.sparse arrays.

@TomAugspurger
Contributor

@frndlytm you'll still be able to do that with a DataFrame with sparse values:

In [11]: df = pd.DataFrame({"A": [1, 0], "B": [0, 1]}, dtype="Sparse[int]", index=pd.MultiIndex.from_tuples([('A', 'a'), ('A', 'b')]))

In [12]: df.dtypes
Out[12]:
A    Sparse[int64, 0]
B    Sparse[int64, 0]
dtype: object

In [13]: df
Out[13]:
     A  B
A a  1  0
  b  0  1

or do I misunderstand?
