-
-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Deprecate SparseDataFrame and SparseSeries #26137
Conversation
commit 8b136bf Merge: 3005aed 01d3dc2 Author: Tom Augspurger <tom.w.augspurger@gmail.com> Date: Fri Mar 15 16:03:23 2019 -0500 Merge remote-tracking branch 'upstream/master' into sparse-frame-accessor commit 3005aed Author: Tom Augspurger <tom.w.augspurger@gmail.com> Date: Thu Mar 14 06:26:32 2019 -0500 isort? commit 318c06f Merge: 0922296 79205ea Author: Tom Augspurger <tom.w.augspurger@gmail.com> Date: Thu Mar 14 06:25:45 2019 -0500 Merge remote-tracking branch 'upstream/master' into sparse-frame-accessor commit 0922296 Author: Tom Augspurger <tom.w.augspurger@gmail.com> Date: Wed Mar 13 21:35:51 2019 -0500 updates commit f433be8 Author: Tom Augspurger <tom.w.augspurger@gmail.com> Date: Wed Mar 13 20:54:07 2019 -0500 lint commit 6696f28 Merge: 534a379 1017382 Author: Tom Augspurger <tom.w.augspurger@gmail.com> Date: Wed Mar 13 20:53:13 2019 -0500 Merge remote-tracking branch 'upstream/master' into sparse-frame-accessor commit 534a379 Merge: 94a7baf 5c341dc Author: Tom Augspurger <tom.w.augspurger@gmail.com> Date: Tue Mar 12 14:37:27 2019 -0500 Merge remote-tracking branch 'upstream/master' into sparse-frame-accessor commit 94a7baf Author: Tom Augspurger <tom.w.augspurger@gmail.com> Date: Tue Mar 12 14:22:48 2019 -0500 fixups commit 6f619b5 Author: Tom Augspurger <tom.w.augspurger@gmail.com> Date: Tue Mar 12 13:38:48 2019 -0500 32-bit compat commit 24f48c3 Author: Tom Augspurger <tom.w.augspurger@gmail.com> Date: Mon Mar 11 22:05:46 2019 -0500 API: DataFrame.sparse accessor Closes pandas-dev#25681
Codecov Report
@@ Coverage Diff @@
## master #26137 +/- ##
==========================================
- Coverage 41.31% 40.73% -0.58%
==========================================
Files 174 175 +1
Lines 50749 52432 +1683
==========================================
+ Hits 20968 21360 +392
- Misses 29781 31072 +1291
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #26137 +/- ##
==========================================
- Coverage 91.77% 91.76% -0.01%
==========================================
Files 174 174
Lines 50639 50646 +7
==========================================
+ Hits 46473 46476 +3
- Misses 4166 4170 +4
Continue to review full report at Codecov.
|
@@ -6,27 +6,28 @@ | |||
Sparse data structures | |||
********************** | |||
|
|||
We have implemented "sparse" versions of ``Series`` and ``DataFrame``. These are not sparse |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the diff here is all that informative. I'd recommend just viewing the new file. The basic flow is
- short intro
- SparseArray / SparseDtype
- Sparse Accessors
- SparseIndex / computation
- Migration Guide
- SparseSeries / SparseDataFrame.
Note: we still have some warnings leaking through on some of the CI jobs (just not numpydev). Trying to track those down. |
I think I got all the warnings... I added a global |
will have a look soon |
doc/source/user_guide/sparse.rst
Outdated
|
||
.. code-block:: python | ||
|
||
# Old way |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use *Previous*
and *New*
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will change old to previous. I think I'll keep them as comments, rather than **-style headings, since we're using **
for the subtopic (e.g. construction).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TomAugspurger thanks a lot for this!
Did a first pass, and some high level comments:
- in some older whatsnew files, we will need add some
:okwarnings:
for now (see the doc build on travis) - In the migration section, I think we also need to state some differences between old SparseDataFrame/Series and the new way. Eg:
- It is no longer guaranteed that all columns are sparse. You can have a mixture.
- Practical consequence of the above: assigning values to a new column of a "sparse" dataframe no longer automatically sparsifies it, you need to do that yourself
- also related: no more a
default_fill_value
(but if you can't assign values with automatic sparsification, this default fill value also has no use, I think, so this is not really a problem given the above)
- might be for a different issue, but noted this while reviewing: when having mixed sparse and non-sparse columns in a dataframe, the
sparse
accessor should either give a better error message (indicating that not all columns are sparse) or either work (egdensity
could in principle work for a mixture)- related to that: how to convert to dense if you have a mixture?
@@ -35,21 +36,64 @@ large, mostly NA ``DataFrame``: | |||
|
|||
df = pd.DataFrame(np.random.randn(10000, 4)) | |||
df.iloc[:9998] = np.nan | |||
sdf = df.to_sparse() | |||
sdf = df.astype(pd.SparseDtype("float", np.nan)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For such a purpose, I was thinking we could also provide df.sparse.to_sparse()
to convert a full DataFrame to sparse?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense to me, though perhaps as a followup? I don't plan to put more time into sparse personally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this would have to be a DataFrame.astype('sparse')
? though I think, IOW, or is .sparse
allowed on any DataFrame? the semantics are a bit odd on this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TomAugspurger can you respond to this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think, IOW, or is .sparse allowed on any DataFrame? the semantics are a bit odd on this
Kinda. If you just do df.sparse
on a dataframe without all-sparse values, we raise
In [6]: pd.DataFrame({"A": [1, 2]}).sparse
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-6-ab0fb67ed650> in <module>
----> 1 pd.DataFrame({"A": [1, 2]}).sparse
...
~/sandbox/pandas/pandas/core/arrays/sparse.py in _validate(self, data)
2119 dtypes = data.dtypes
2120 if not all(isinstance(t, SparseDtype) for t in dtypes):
-> 2121 raise AttributeError(self._validation_msg)
2122
2123 @classmethod
AttributeError: Can only use the '.sparse' accessor with Sparse data.
But we also allow for pd.DataFrame.sparse.from_spmatrix
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this would have to be a DataFrame.astype('sparse')
It would be .astype('Sparse')
which is shorthand for .astype(SparseDtype(float64, nan))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok this is all fine; is there a test for using .sparse on non-any-sparse df?
arr[2:5] = np.nan | ||
arr[7:8] = np.nan | ||
sparr = pd.SparseArray(arr) | ||
sparr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not important for this PR, but we should actually improve the repr of SparseArray. Currently the example gives
[-2.329703982704994, -0.7776235464173905, nan, nan, nan, -0.07270483900887693, 0.4093257484722553, nan, -0.33749585746785415, 1.884146289689117]
Fill: nan
IntIndex
Indices: array([0, 1, 5, 6, 8, 9], dtype=int32)
(so way to wide, and showing too much detail of the random numbers)
doc/source/user_guide/sparse.rst
Outdated
|
||
A ``.sparse`` accessor has been added for :class:`DataFrame` as well. | ||
See :ref:`api.dataframe.sparse` for more. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See :ref:`api.dataframe.sparse` for more. | |
See :ref:`api.frame.sparse` for more. |
I've updated the doc examples to all use Series[sparse], rather than SparseSeries. I've just left a note that the sparse subclasses are deprecated. |
c5fa3fb also has a change to Series.sparse.from_coo. Previously that was using SparseSeries internally, and so a warning was being raised. I (lazily) applied the warnings filter to the class so it was being ignored in the test. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good. just a couple of points.
@@ -116,14 +116,19 @@ def _sparse_series_to_coo(ss, row_levels=(0, ), column_levels=(1, ), | |||
return sparse_matrix, rows, columns | |||
|
|||
|
|||
def _coo_to_sparse_series(A, dense_index=False): | |||
def _coo_to_sparse_series(A, dense_index=False, sparse_series=True): | |||
""" | |||
Convert a scipy.sparse.coo_matrix to a SparseSeries. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a doc-string here (types too if you can!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. I'm not really sure on two things
- The type for A is
'scipy.sparse.coo.coo_matrix'
, but we can't import sparse. - The return type is
Union[Series, SparseSeries]
but importing SparseSeries would cause a circular import
so I left types off for those.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- can't you just use the string? (I think that works)
- same use the string
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you? Are these types actually checked in our CI? I'd rather not introduce invalid types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes they should be
s = Series(A.data, MultiIndex.from_arrays((A.row, A.col))) | ||
s = s.sort_index() | ||
s = s.to_sparse() # TODO: specify kind? | ||
if sparse_series: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why exactly do you need sparse_series flag? why can't we just do the astype after calling this routine?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is called from both Series.sparse and SparseSeries.
Previously, this went coo_matrix -> SparseSeries -> Series[sparse], which caused an undesired warning for Series.sparse.from_coo()
. Once SparseSeries is gone we can remove the keyword.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok can you add a todo about this then, this is not obvious at all
doc/source/user_guide/sparse.rst
Outdated
|
||
You can apply NumPy *ufuncs* to ``SparseArray`` and get a ``SparseArray`` as a result. | ||
Sparse-specific properties, like ``density``, are available on the ``.sparse`` accssor. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
accssor
typo.
@@ -35,21 +36,64 @@ large, mostly NA ``DataFrame``: | |||
|
|||
df = pd.DataFrame(np.random.randn(10000, 4)) | |||
df.iloc[:9998] = np.nan | |||
sdf = df.to_sparse() | |||
sdf = df.astype(pd.SparseDtype("float", np.nan)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TomAugspurger can you respond to this
Sparse Calculation | ||
------------------ | ||
|
||
You can apply NumPy `ufuncs <https://docs.scipy.org/doc/numpy/reference/ufuncs.html>`_ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a reason we are recommending people work directly with SparseArray? the unit of computation is generally the Series, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was here before, just moved. Whether or not it makes sense, I dunno. Depends on whether or not you need / want an index I suppose.
doc/source/user_guide/sparse.rst
Outdated
~~~~~~~~~~~~ | ||
|
||
A :meth:`SparseSeries.to_coo` method is implemented for transforming a ``SparseSeries`` indexed by a ``MultiIndex`` to a ``scipy.sparse.coo_matrix``. | ||
:meth:`Series.sparse.to_coo` is implemented for transforming a ``Series`` with sparse values indexed by a ``MultiIndex`` to a ``scipy.sparse.coo_matrix``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:class:`MultiIndex`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we have the doc inventory for scipy? can you add a refernce to coo_matrix?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do have SciPy in our intersphinx.
@@ -116,14 +116,19 @@ def _sparse_series_to_coo(ss, row_levels=(0, ), column_levels=(1, ), | |||
return sparse_matrix, rows, columns | |||
|
|||
|
|||
def _coo_to_sparse_series(A, dense_index=False): | |||
def _coo_to_sparse_series(A, dense_index=False, sparse_series=True): | |||
""" | |||
Convert a scipy.sparse.coo_matrix to a SparseSeries. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- can't you just use the string? (I think that works)
- same use the string
s = Series(A.data, MultiIndex.from_arrays((A.row, A.col))) | ||
s = s.sort_index() | ||
s = s.to_sparse() # TODO: specify kind? | ||
if sparse_series: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok can you add a todo about this then, this is not obvious at all
@@ -215,6 +215,7 @@ def test_scalar_with_index_infer_dtype(self, scalar, dtype): | |||
assert exp.dtype == dtype | |||
|
|||
@pytest.mark.parametrize("fill", [1, np.nan, 0]) | |||
@pytest.mark.filterwarnings("ignore:Sparse:FutureWarning") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you don't need these as a prior PR added this to setup.cfg
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The setup.cfg has an error:::
config to elevate unhandled warnings to errors. We still need these otherwise the tests would fail.
We have a single test asserting that SparseSeries.__init__
warns, and explicitly ignore the warnings elsewhere.
thanks @TomAugspurger |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are still a bunch of :okwarnings:
needed in older whatsnew files.
~~~~~~~~~~~~~~~ | ||
|
||
.. versionadded:: 0.20.0 | ||
Use :meth:`DataFrame.sparse.from_coo` to create a ``DataFrame`` with sparse values from a sparse matrix. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be from_spmatrix
?
class SparseSeries(Series): | ||
"""Data structure for labeled, sparse floating point data | ||
|
||
.. deprectaed:: 0.25.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo
Thanks. I'll make a PR with these doc updates.
…On Wed, May 29, 2019 at 8:51 AM Simon Hawkins ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In pandas/core/sparse/series.py
<#26137 (comment)>:
> class SparseSeries(Series):
"""Data structure for labeled, sparse floating point data
+ .. deprectaed:: 0.25.0
typo
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#26137?email_source=notifications&email_token=AAKAOITMLOLJFTRQDF54URLPX2C5BA5CNFSM4HG7LDP2YY3PNVWWK3TUL52HS4DFWFIHK3DMKJSXC5LFON2FEZLWNFSXPKTDN5WW2ZLOORPWSZGOB2ABB5A#pullrequestreview-243273972>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAKAOIS7PLLTJDVHORMM4YDPX2C5BANCNFSM4HG7LDPQ>
.
|
|
||
|
||
class SparseDataFrame(DataFrame): | ||
""" | ||
DataFrame containing sparse floating point data in the form of SparseSeries | ||
objects | ||
|
||
.. deprectaed:: 0.25.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here
Closes #19239
This currently includes the changes from #25682, which I think is mergeable.
I think this would be good to have for 0.25.0. I think it's close, but I may not have time to push this across the finish line. Anyone interested in finishing it off?