
DEPR: SparseDataFrame and SparseSeries subclasses #19239

Closed
jreback opened this issue Jan 14, 2018 · 19 comments · Fixed by #26137
Labels
Deprecate Functionality to remove in pandas Sparse Sparse Data Type
Comments

@jreback
Contributor

jreback commented Jan 14, 2018

This is slightly useful in a single-dtype case, but enough of a headache here that we should simply remove it. Ultimately we should even remove SparseSeries and just rely on the SparseArray abstraction and extension types (#19174).

@jreback jreback added Sparse Sparse Data Type Deprecate Functionality to remove in pandas Difficulty Intermediate labels Jan 14, 2018
@jreback jreback added this to the 0.23.0 milestone Jan 14, 2018
@jreback
Contributor Author

jreback commented Jan 14, 2018

cc @jorisvandenbossche @TomAugspurger @wesm
cc @hexgnu
cc @Licht-T
cc @kernc

@datapythonista
Member

In case my use case is useful in the discussion: I mostly (probably only) use SparseDataFrame because of pandas.get_dummies(). After that, I'm mainly interested in having a scipy.sparse structure (for example, to use it in scikit-learn). I'd bet this is the case for many pandas users.

Personally I'm +1 on getting rid of it. Then, it could be useful to have a get_dummies sparse version that returns a scipy.sparse. Not sure if in pandas or as a separate project.
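(For reference, this workflow ended up being possible without the SparseDataFrame subclass: get_dummies grew a sparse=True option that returns a regular DataFrame with sparse-backed columns. A sketch with made-up data:)

```python
import pandas as pd

# hypothetical data; sparse=True returns a plain DataFrame whose
# columns are backed by SparseArray, not a SparseDataFrame subclass
s = pd.Series(["red", "blue", "red", "green"])
dummies = pd.get_dummies(s, sparse=True)

print(dummies.dtypes)  # every column has a Sparse[...] dtype
```

From there, a later-added accessor (df.sparse.to_coo() in pandas 0.25+) hands the data to scipy.sparse for scikit-learn.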

@hexgnu
Contributor

hexgnu commented Jan 29, 2018

I think a lot of the issue with SparseDataFrame is honestly a lack of test coverage, as well as a lot of bleedover into DataFrame. For instance, a lot of code assumes that something is dense and has length > 0, which isn't always true. I feel that with a flag day or two of squashing sparse bugs and writing more tests it wouldn't be so awful.

But that being said I also don't want to spend a bunch of time fixing something nobody really uses.

Also this is not obvious to me on first look:

  • DataFrame can have both Series and SparseSeries in it
  • SparseDataFrame can only really have SparseSeries inside of it

Personally I would think that SparseDataFrame is the only time you can have SparseSeries and DataFrame is always dense.

@hexgnu
Contributor

hexgnu commented Feb 2, 2018

Also I wanted to point out that Sparse bugs aren't the biggest label category in this repo

It looks to me like reshaping, indexing, and dtypes are the big hotspots. Getting rid of sparse I don't think would bring those bug counts down.

[chart: open issue counts by label]

Also I think with some TLC SparseDataFrames can be a really nice little abstraction for saving on memory consumption in certain cases.
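(The memory-saving point is easy to demonstrate with sparse-typed columns; a sketch with synthetic mostly-zero data — exact byte counts depend on the pandas version:)

```python
import numpy as np
import pandas as pd

# a mostly-zero column: sparse storage keeps only the non-fill values
dense = pd.Series(np.zeros(10_000))
dense.iloc[::1000] = 1.0  # 10 non-zero entries

sparse = dense.astype(pd.SparseDtype("float64", fill_value=0.0))

print(dense.memory_usage(index=False))   # 80000 bytes of float64
print(sparse.memory_usage(index=False))  # only the 10 stored values + their indices
```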

@jreback
Contributor Author

jreback commented Feb 2, 2018

@hexgnu it's not about the issues, rather about having another structure to use, which leads to confusion and code complexity. Yes, a pure sparse frame does lead to efficiencies, but you lose (without a lot of acrobatics) the heterogeneity which makes a DataFrame itself so useful. I think a SparseDataFrame should exist as a separate project :> (and the primitives, IOW SparseArray/Series, can still live in pandas proper).

@kernc
Contributor

kernc commented Feb 2, 2018

As long as high-level NDFrame APIs defer actions to block primitives as far as possible, having a regular DataFrame wrapping them should work fine.

@TomAugspurger
Contributor

What are the arguments for keeping SparseDataFrame around? I'm not familiar enough to say for sure whether this is possible, but ideally we would

  • Clearly document that DataFrame can hold sparse data
  • Move all current sparse-specific methods to a .sparse accessor (density, to_coo). Some of these methods would error if the DataFrame isn't entirely sparse.
  • Figure out how to handle default_fill_value
  • Deprecate SparseDataFrame in favor of a DataFrame holding sparse arrays

What can a SparseDataFrame do that a DataFrame[SparseArray] can't?
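(The accessor approach sketched above is roughly what later landed in pandas 0.25; a minimal illustration on a regular DataFrame holding sparse columns:)

```python
import pandas as pd

# a plain DataFrame whose columns are SparseArray-backed
df = pd.DataFrame({
    "A": pd.arrays.SparseArray([0, 0, 1, 0]),
    "B": pd.arrays.SparseArray([0, 1, 0, 0]),
})

print(df.dtypes)          # Sparse[int64, 0] for both columns
print(df.sparse.density)  # 0.25 — two stored values out of eight
```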

@xpe

xpe commented Mar 22, 2018

I hope we'd all agree that support for sparsity is very useful. I don't know the internals of Pandas, so I won't weigh in on the particular question of "What can a SparseDataFrame do that a DataFrame[SparseArray] cannot?"

As a user, I'll share my use case and an appreciation for clear documentation.

My use case involves load_svmlight_file from scikit-learn:

import pandas as pd
from sklearn.datasets import load_svmlight_file
features, labels = load_svmlight_file(data_filename)  # sparse data
feature_labels = ['a', 'b', 'c']  # etc
df = pd.SparseDataFrame(data=features, columns=feature_labels)
df.plot(y='c')  # fails
df.to_dense().plot(y='c')  # succeeds

I'd be open to using other ways to load the sparse LibSVM data, but I don't currently know them. Any recommendations?

Having .plot() work in this case with SparseDataFrame would be nice. :)

@jorisvandenbossche
Member

I am also not familiar enough with the sparse dataframe to answer the question "What can a SparseDataFrame do that a DataFrame[SparseArray] cannot?".
But, when we would take this route and do:

Deprecate SparseDataFrame in favor of a DataFrame holding sparse arrays

My main worry is: how much will this complicate our current DataFrame implementation? Will this mean adding special-casing for sparse dataframes in multiple places? ... If that is the case, I am not sure I am in favor of this.
E.g. in pandas/core/sparse/frame.py, there is still a lot of custom code overriding parent frame methods (next to the sparse-specific methods like to_coo or to_dense). I don't know what all this code does, but it might serve a purpose.

Of course it might be that with the new ExtensionArray interface we could make this much cleaner, and wouldn't need such amount of special casing. But I can't assess if that will be the case or not.

Figure out how to handle default_fill_value

This is a bit similar to the extra metadata attributes I also have in geopandas. It would be good if we can figure out something.
I suppose having it as an attribute on a "sparse dtype" object might be one possible way.
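(That is in fact how it later worked out: the fill value is carried on each column's sparse dtype rather than as a frame-wide default_fill_value. A small sketch with the SparseDtype that pandas 0.24+ exposes:)

```python
import pandas as pd

# the fill value is an attribute of the dtype, not of the frame
arr = pd.arrays.SparseArray([1.0, 1.0, 2.0, 1.0], fill_value=1.0)

print(arr.dtype)       # Sparse[float64, 1.0]
print(arr.fill_value)  # 1.0
print(arr.density)     # 0.25 — only the single 2.0 is stored
```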

@jreback
Contributor Author

jreback commented Mar 23, 2018

The main difference is this. A SDF of a single dtype can be very efficiently represented because it has a default_fill_value for the entire frame. Think of csr_matrix here. A DataFrame can already hold SparseSeries. So in theory you can simply replace every case of a SDF with a DataFrame. If you have lots of columns (like a lot), then this would be pretty inefficient.

My argument for removing this is simple. It causes an enormous amount of complexity in the codebase, and we have many non-implemented features (mainly indexing), so it's pretty half-baked. Simply deprecating/removing it in pandas would allow an external library to implement this properly (to allow for the efficiency argument above), while not sacrificing the occasional use of a SparseSeries (or even a small number of SparseSeries), e.g. what get_dummies returns (or could return); this is in fact the main case.

Alternatively, we could keep this but move the implementation out of main pandas to an external library that is separately maintained.

@jorisvandenbossche
Member

The main difference is this. A SDF of a single dtype can be very efficiently represented because it has a default_fill_value for the entire frame. Think of csr_matrix here. A DataFrame can already hold SparseSeries. So in theory you can simply replace every case of a SDF with a DataFrame. If you have lots of columns (like a lot), then this would be pretty inefficient.

As said above, not familiar with sparse code, but it seems the above is not true and that even a SparseDataFrame with a single dtype already stores the data column-by-column?

In [61]: pd.SparseDataFrame([[0,1],[1,0]])._data
Out[61]: 
BlockManager
Items: RangeIndex(start=0, stop=2, step=1)
Axis 1: RangeIndex(start=0, stop=2, step=1)
SparseBlock: slice(0, 1, 1), 1 x 2, dtype: int64
SparseBlock: slice(1, 2, 1), 1 x 2, dtype: int64

@jreback
Contributor Author

jreback commented Mar 23, 2018

construct by using .to_sparse() from a DF

the method above is not consolidated by default

@jorisvandenbossche
Member

jorisvandenbossche commented Mar 23, 2018

There is no difference:

In [63]: pd.DataFrame([[0,1],[1,0]]).to_sparse()._data
Out[63]: 
BlockManager
Items: RangeIndex(start=0, stop=2, step=1)
Axis 1: RangeIndex(start=0, stop=2, step=1)
SparseBlock: slice(0, 1, 1), 1 x 2, dtype: int64
SparseBlock: slice(1, 2, 1), 1 x 2, dtype: int64

And SparseArray is also limited to 1D I think (it gives an error if you try to pass the above 2D array to SparseArray).
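(A quick check confirms the one-dimensionality; sp_values exposes just the stored non-fill values:)

```python
import pandas as pd

arr = pd.arrays.SparseArray([0, 1, 0, 0])
print(arr.ndim)             # 1 — SparseArray holds one-dimensional data only
print(list(arr.sp_values))  # [1] — only the non-fill values are stored
```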

@TomAugspurger
Contributor

And SparseArray is also limited to 1D I think

Ahh, that answers it then. I was thinking SparseArray was 2D.

@jreback jreback modified the milestones: 0.23.0, 0.24.0 Apr 14, 2018
Licht-T added a commit to Licht-T/arrow that referenced this issue May 4, 2018
pandas Sparse types are planned to be deprecated in pandas future releases.
pandas-dev/pandas#19239
xhochy pushed a commit to apache/arrow that referenced this issue May 5, 2018
…es serializing

This fixes [ARROW-2273](https://issues.apache.org/jira/browse/ARROW-2273).

`pandas` Sparse types are planned to be deprecated in pandas future releases (pandas-dev/pandas#19239).
`SparseDataFrame` and `SparseSeries` are naive implementations and have many bugs. IMO, this is not the right time to support these in `pyarrow`.

Author: Licht-T <licht-t@outlook.jp>

Closes #1997 from Licht-T/add-pandas-sparse-unsupported-msg and squashes the following commits:

64e24ce <Licht-T> ENH: Raise NotImplementedError when pandas Sparse types serializing pandas Sparse types are planned to be deprecated in pandas future releases. pandas-dev/pandas#19239
@TomAugspurger
Contributor

Opened #23148 for creating a sparse accessor.

Have we identified anything SparseDataFrame can do that a regular DataFrame can't?

@xpe I assume that scikit-learn's load_svmlight_file returns a scipy.sparse matrix? In my ideal world you would be able to load that with pd.DataFrame.sparse.from_coo.

Anything else?
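(For the record, this round-trip later landed as pd.DataFrame.sparse.from_spmatrix in pandas 0.25; a sketch with a made-up matrix, assuming scipy is installed:)

```python
import pandas as pd
from scipy import sparse

# a scipy.sparse matrix, like what load_svmlight_file returns
mat = sparse.csr_matrix([[0, 1, 0], [1, 0, 0]])

# load it into a regular DataFrame backed by sparse columns
df = pd.DataFrame.sparse.from_spmatrix(mat, columns=["a", "b", "c"])
print(df.dtypes)  # a Sparse[...] dtype for each column
```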

@TomAugspurger TomAugspurger changed the title DEPR: SparseDataFrame DEPR: SparseDataFrame and SparseSeries Oct 15, 2018
@TomAugspurger
Contributor

I've updated the title to consider deprecating both SparseDataFrame and SparseSeries now that #22325 is in master. (FYI, it'd be helpful to have people testing that on real workloads.)

FWIW, when I was doing that refactor, I discovered several subtle differences between methods implemented on both Series and SparseSeries. The Series implementation was updated but the SparseSeries one lagged behind.

@jreback
Contributor Author

jreback commented Jan 4, 2019

if we can't do this for 0.24, 0.25 is ok too.

@jreback jreback modified the milestones: 0.24.0, 0.25.0 Jan 6, 2019
@frndlytm

Multi-Indexing SparseDataFrames is a nice feature you don't get from scipy.sparse arrays.

@TomAugspurger
Contributor

@frndlytm you'll still be able to do that with a DataFrame with sparse values:

In [11]: df = pd.DataFrame({"A": [1, 0], "B": [0, 1]}, dtype="Sparse[int]", index=pd.MultiIndex.from_tuples([('A', 'a'), ('A', 'b')]))

In [12]: df.dtypes
Out[12]:
A    Sparse[int64, 0]
B    Sparse[int64, 0]
dtype: object

In [13]: df
Out[13]:
     A  B
A a  1  0
  b  0  1

or do I misunderstand?
