DEPR: SparseDataFrame and SparseSeries subclasses #19239
Comments
cc @jorisvandenbossche @TomAugspurger @wesm |
In case my use case is useful in the discussion: I mostly (probably only) use Personally I'm +1 on getting rid of it. Then, it could be useful to have a |
I think a lot of the issue with SparseDataFrame is honestly a lack of test coverage as well as a lot of bleedover into DataFrame. For instance a lot of code assumes that something is dense and that it has length > 0. This isn't the real case, and I feel that with a flag day or two of squashing sparse bugs and writing more tests it wouldn't be so awful. But that being said I also don't want to spend a bunch of time fixing something nobody really uses. Also this is not obvious to me on first look:
Personally I would think that SparseDataFrame is the only time you can have SparseSeries and DataFrame is always dense. |
Also I wanted to point out that Sparse bugs aren't the biggest label category in this repo. It looks to me like reshaping, indexing, and dtypes are the big hotspots, and I don't think getting rid of sparse would bring those bug counts down. Also, with some TLC, SparseDataFrames can be a really nice little abstraction for saving on memory consumption in certain cases. |
@hexgnu it's not about the issues, rather about having another structure to use, which leads to confusion and code complexity. Yes, a pure sparse frame does lead to efficiencies, but you lose (without a lot of acrobatics) the heterogeneity which makes a DataFrame itself so useful. I think a SparseDataFrame should exist as a separate project :> (and the primitives, IOW SparseArray/Series, can still live in pandas proper). |
As long as high-level NDFrame APIs defer actions to block primitives as far as possible, having a regular DataFrame wrapping them should work fine. |
What are the arguments for keeping SparseDataFrame around? I'm not familiar enough to say for sure whether this is possible, but ideally we would
What can a SparseDataFrame do that a DataFrame[SparseArray] cannot? |
I hope we'd all agree that support for sparsity is very useful. I don't know the internals of Pandas, so I won't weigh in on the particular question of "What can a SparseDataFrame do that a DataFrame[SparseArray] cannot?" As a user, I'll share my use case and an appreciation for clear documentation. My use case involves loading sparse LibSVM data.
I'd be open to using other ways to load the sparse LibSVM data, but I don't currently know them. Any recommendations? Having |
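As a concrete illustration (editorial, not from the thread): scikit-learn's load_svmlight_file returns a SciPy CSR matrix that estimators consume directly, so pandas only enters the picture when labelled axes are wanted on top. A minimal sketch, assuming scikit-learn is installed, a hypothetical file data.libsvm, and pandas >= 0.25 for the from_spmatrix constructor:

import pandas as pd
from sklearn.datasets import load_svmlight_file
from sklearn.linear_model import LogisticRegression

# "data.libsvm" is a hypothetical path; load_svmlight_file returns
# (X, y) where X is a scipy.sparse CSR matrix and y is a dense ndarray.
X, y = load_svmlight_file("data.libsvm")

# scikit-learn estimators accept the CSR matrix as-is, no DataFrame needed.
clf = LogisticRegression().fit(X, y)

# If labelled axes are wanted, a regular DataFrame with sparse values
# can wrap the same matrix (pandas >= 0.25).
df = pd.DataFrame.sparse.from_spmatrix(X)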
I am also not familiar enough with the sparse dataframe to answer the question "What can a SparseDataFrame do that a DataFrame[SparseArray] cannot?".
My main worry is: how much will this complicate our current DataFrame implementation? Will this mean adding special-casing for sparse dataframes in multiple places? ... If that is the case, I am not sure I am in favor of this. Of course it might be that with the new ExtensionArray interface we could make this much cleaner and wouldn't need that amount of special-casing. But I can't assess whether that will be the case or not.
This is a bit similar to the extra metadata attributes I also have in geopandas. It would be good if we can figure something out. |
The main difference is this. A SDF of a single dtype can be very efficiently represented because it has a default_fill_value for the entire frame. Think of csr_matrix here. A DataFrame can already hold SparseSeries, so in theory you can simply replace every case of a SDF with a DataFrame; if you have lots of columns (like a lot), then this would be pretty inefficient. My argument for removing this is simple. This causes an enormous amount of complexity in the codebase, and we have many non-implemented features (mainly indexing), so it's pretty half-baked. Simply deprecating / removing it in pandas would allow an external library to implement this properly (to allow for the efficiency argument above), while not sacrificing the occasional use of a SparseSeries (or even a small number of SparseSeries), e.g. what get_dummies returns (or could return); this is in fact the main case. We could also leave this, but move the implementation out of main pandas to an external library that is separately maintained. |
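For concreteness, a minimal sketch (editorial, not from the thread) of the "regular DataFrame holding sparse columns" alternative and the memory it saves on mostly-zero data. It assumes a pandas version with the Sparse extension dtype (>= 0.24) and an explicit fill_value of 0, since the default fill for float sparse dtypes is NaN:

import numpy as np
import pandas as pd

# Mostly-zero data: 100,000 floats with one non-zero value per 1,000 rows.
data = np.zeros(100_000)
data[::1000] = 1.0

dense = pd.DataFrame({"x": data})
# fill_value=0.0 so the zeros are the values dropped from storage
# (with the default NaN fill, every value here would still be stored).
sparse = dense.astype(pd.SparseDtype("float", fill_value=0.0))

print(dense.memory_usage(deep=True).sum())   # ~800,000 bytes for the dense column
print(sparse.memory_usage(deep=True).sum())  # only the 100 stored values plus their index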
As said above, I'm not familiar with the sparse code, but it seems the above is not true, and that even a SparseDataFrame with a single dtype already stores the data column-by-column?
|
Construct by using .to_sparse() from a DF; the method above is not consolidated by default.
There is no difference:
And SparseArray is also limited to 1D I think (it gives an error if you try to pass the above 2D array to SparseArray). |
Ahh, that answers it then. I was thinking SparseArray was 2D. |
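To make the primitive concrete, a minimal sketch (editorial) of the 1-D SparseArray and of how a "2-D" sparse structure is just one such array per column; it assumes pandas >= 0.24, where SparseArray is exposed under pd.arrays:

import pandas as pd

# The 1-D primitive: only the non-fill values and their positions are stored.
arr = pd.arrays.SparseArray([0, 0, 1, 2, 0], fill_value=0)
print(arr.sp_values)   # array([1, 2])
print(arr.sp_index)    # integer positions of the stored values

# A "2-D" sparse structure is just one 1-D SparseArray per column
# inside an ordinary DataFrame.
df = pd.DataFrame({
    "A": arr,
    "B": pd.arrays.SparseArray([0, 1, 0, 0, 0], fill_value=0),
})
print(df.dtypes)       # each column reports Sparse[int64, 0]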
pandas Sparse types are planned to be deprecated in pandas future releases. pandas-dev/pandas#19239
…es serializing This fixes [ARROW-2273](https://issues.apache.org/jira/browse/ARROW-2273). `pandas` Sparse types are planned to be deprecated in pandas future releases (pandas-dev/pandas#19239). `SparseDataFrame` and `SparseSeries` are naive implementation and have many bugs. IMO, this is not the right time to support these in `pyarrow`. Author: Licht-T <licht-t@outlook.jp> Closes #1997 from Licht-T/add-pandas-sparse-unsupported-msg and squashes the following commits: 64e24ce <Licht-T> ENH: Raise NotImplementedError when pandas Sparse types serializing pandas Sparse types are planned to be deprecated in pandas future releases. pandas-dev/pandas#19239
Opened #23148 for creating a Have we identified anything SparseDataFrame can do that a regular DataFrame can't? @xpe I assume that scikit-learn Anything else? |
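One thing a regular DataFrame with sparse values does cover is the scipy.sparse round trip. A minimal sketch (editorial, not from the thread), assuming pandas >= 0.25 where the .sparse accessor and DataFrame.sparse.from_spmatrix landed:

import pandas as pd
import scipy.sparse as sp

# A random CSR matrix standing in for real sparse input data.
mat = sp.random(1000, 50, density=0.01, format="csr")

# Regular DataFrame whose columns are Sparse[float64] arrays.
df = pd.DataFrame.sparse.from_spmatrix(mat)

print(df.sparse.density)   # fraction of stored (non-fill) values, ~0.01

# And back out to scipy.sparse when an estimator or solver wants a matrix.
coo = df.sparse.to_coo()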
I've updated the title to consider deprecating both SparseDataFrame and SparseSeries now that #22325 is in master (FYI, it'd be helpful to have people test that on real workloads). FWIW, when I was doing that refactor, I discovered several subtle differences between methods implemented on both Series and SparseSeries. The Series implementation was updated but the SparseSeries one lagged behind. |
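A minimal sketch (editorial) of the post-#22325 shape of things: a plain Series holding a SparseArray goes through the ordinary Series code paths instead of the lagging SparseSeries overrides; assumes pandas >= 0.24:

import pandas as pd

# A plain Series backed by a SparseArray: no SparseSeries subclass involved.
s = pd.Series(pd.arrays.SparseArray([0, 0, 1, 3], fill_value=0))

print(s.dtype)           # Sparse[int64, 0]
print(s.sum())           # ordinary Series reduction: 4
print(s.sparse.density)  # sparse-specific bits live on the .sparse accessor: 0.5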
if we can't do this for 0.24, 0.25 is ok too. |
Multi-Indexing SparseDataFrames is a nice feature you don't get from scipy.sparse arrays. |
@frndlytm you'll still be able to do that with a DataFrame with sparse values:

In [11]: df = pd.DataFrame({"A": [1, 0], "B": [0, 1]}, dtype="Sparse[int]", index=pd.MultiIndex.from_tuples([('A', 'a'), ('A', 'b')]))

In [12]: df.dtypes
Out[12]:
A    Sparse[int64, 0]
B    Sparse[int64, 0]
dtype: object

In [13]: df
Out[13]:
     A  B
A a  1  0
  b  0  1

or do I misunderstand? |
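As a small follow-up (editorial), standard MultiIndex selection keeps working on that sparse-valued frame, so nothing in this pattern depends on SparseDataFrame itself:

import pandas as pd

df = pd.DataFrame(
    {"A": [1, 0], "B": [0, 1]},
    dtype="Sparse[int]",
    index=pd.MultiIndex.from_tuples([("A", "a"), ("A", "b")]),
)

print(df.loc["A"])              # partial selection on the outer level
print(df.loc[("A", "b"), "B"])  # scalar lookup through the MultiIndex: 1
print(df.dtypes)                # columns are still Sparse[int64, 0]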
This is slightly useful in the single-dtype case, but enough of a headache here that we should simply remove it. Ultimately we should even remove SparseSeries and just rely on the SparseArray abstraction and extension types (#19174).