Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deprecate Series._from_array ? #19883

Closed
jaumebonet opened this issue Feb 24, 2018 · 6 comments · Fixed by #19893
Closed

Deprecate Series._from_array ? #19883

jaumebonet opened this issue Feb 24, 2018 · 6 comments · Fixed by #19893
Labels
Milestone

Comments

@jaumebonet
Copy link
Contributor

jaumebonet commented Feb 24, 2018

I open this suggestion as per @jorisvandenbossche's recommendation.

This issue follows in the steps of #18213 and #19850.

As it is commented in #18213, _from_array has a single difference with the Series constructor, how it handles SparseArrays:

        # return a sparse series here
        if isinstance(arr, ABCSparseArray):
            from pandas.core.sparse.series import SparseSeries
            cls = SparseSeries

This process could be achieved in a similar way in Series.__new__; something on the lines of:

def __new__( cls, *args, **kwargs ):
    # arr is mandatory, first argument or key `arr`.
    if isinstance(kwargs.get('arr', args[0]), ABCSparseArray):
        from pandas.core.sparse.series import SparseSeries
        cls = SparseSeries
    obj = object.__new__(cls)
    obj.__init__(*args, **kwargs)
    return obj

What's the issue?

As @jorisvandenbossche pointed out, a change like this will result in a change of the API, as this:

>>> s = pd.Series(pd.SparseArray([1, 0, 0, 2, 0]))
>>> type(s)
<class 'pandas.core.series.Series'>

will become this:

>>> s = pd.Series.from_array(pd.SparseArray([1, 0, 0, 2, 0]))
>>> type(s)
<class 'pandas.core.sparse.series.SparseSeries'>

I'm not familiar with sparse data structures, but according to the docs all functionality is kept between Series and SparseSeries. Furthermore, a simple

>>> s = s.to_dense()
>>> type(s)
<class 'pandas.core.series.Series'>

should do it to go back to Series.

Why change it, then?

Currently, Series._from_array is called only inside two functions: DataFrame._idxand DataFrame. _box_col_values. With the proposed change, those calls could be substituted by the default constructor.
Being that the case, when working with panda's subclassing, one would be able to declare complex _constructor_slice such as this:

    @property
    def _constructor_sliced(self):
        def f(*args, **kwargs):
            # adapted from https://github.com/pandas-dev/pandas/issues/13208#issuecomment-326556232
            return DerivedSeries(*args, **kwargs).__finalize__(self, method='inherit')
        return f

, which would allow for a more complex relationship between the subclassed DataFrame and its sliced version, including the transfer of metadata according to the user's specification in __finalize__.

@jreback
Copy link
Contributor

jreback commented Feb 24, 2018

we don't need to deprecate this, rather remove it. Its almost there. If you can do it w/o breaking things great.

@jreback jreback added this to the Next Major Release milestone Feb 24, 2018
@jaumebonet
Copy link
Contributor Author

Sure! If you guys think is worth pursuing, I'll give it a try and see how the test complain!

@jreback
Copy link
Contributor

jreback commented Feb 24, 2018

great!

@jreback
Copy link
Contributor

jreback commented Feb 24, 2018

also be sure to check perf on Series creation (using asv)

@jorisvandenbossche
Copy link
Member

Deprecating or removing is a semantic discussion ;), the question is if we can remove it, and how to remove it.

Are there differences in behaviour between seies with sparse values and sparse series that would make this backwards incompatible?
Those places where from_array is used now, can we just remove it there without changing the Series constructor? (so the change of behaviour is there and not in Series())

s = s.to_dense()

Note that this is not the same (it gives a series with dense values, not a series with sparse values)

@jaumebonet
Copy link
Contributor Author

After going through the code a bit, I've found a simplified way to fix this that would not imply going into Series.__new__. This is, going to the two places in which Series._from_array are called in Dataframe and change each appearence of:

result = self._constructor_sliced._from_array(
    values, index=self.index, name=label, fastpath=True)

to

tis_constructor_sliced = SparseSeries if isinstance(arr, ABCSparseArray) else self._constructor_sliced
result = tis_constructor_sliced(
    values, index=self.index, name=label, fastpath=True)

this basically takes that single check out of its wrapping function and:

  1. Stills eliminates the need for _from_array.

  2. Keeps the original behaviour of the Series instantiation.

  3. Avoids the need of this extra if check that would happen each time Series would be instantiated, which it has to affect performance somehow.

does that makes more sense/feels safer?

@jreback jreback modified the milestones: Next Major Release, 0.23.0 Feb 25, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants