API/ENH/DEPR: Series.unique returns Series #24108

h-vetinari · 2018-12-05T07:38:06Z

I know this PR is too big, but I need it to initiate discussion before v.0.24 cutoff.

I've been working on adding a return_inverse to unique since mid June (as fast as possible). I know that changing the return type of Series.unique is potentially a controversial issue, but I truly believe that this should be done for v.0.24 (as otherwise the behavior is locked in past 1.0 and who knows if it ever changes then).

IMO, it's even a necessity if pandas ever wants to support return_inverse for unique. Here are a few arguments from working on this for a while:

This is both a long-standing issue (API: provide a better way of doing np.unique(return_inverses=True) #4087 milestoned since 0.14) and very hard for a user to do any other way.
.unique is the obvious place for an inverse (as opposed to .duplicated, which I was directed to work on first, ENH: add return_inverse to duplicated for DataFrame/Series/Index/MultiIndex #21645)
An inverse for Series only works well if the return type of Series.unique is a Series, see API/ENH: overhaul/unify/improve .unique #22824 (comment). This is not even the strongest reason IMO to change the type, as .unique is a patchwork currently (see rest OP of API/ENH: overhaul/unify/improve .unique #22824):
- Index.unique is already an Index (and changing that back to ndarray would be equally disruptive, and wouldn't lead anywhere re: reconstruction)
- The current Series.unique already special-cases Categorical and (effectively) DatetimeArray.
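
For reference, the NumPy behaviour that #4087 asks to mirror can be sketched as follows. This uses only existing NumPy and pandas calls and is not the API proposed in this PR; the `animals` data is made up for illustration.

```python
import numpy as np
import pandas as pd

animals = pd.Series(["cat", "dog", "cat", "bird", "dog"], dtype=object)

# np.unique sorts the unique values and can also return the inverse mapping.
uniques, inverse = np.unique(animals.to_numpy(), return_inverse=True)

# Indexing the uniques with the inverse reproduces the original values.
reconstructed = uniques[inverse]

# Series.unique keeps the order of first appearance but offers no inverse.
pandas_uniques = animals.unique()
```

Note that the two results differ in ordering: NumPy sorts, while pandas preserves the order of first appearance.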

The reason I'm pushing this WIP for discussion is that this obviously needs a deprecation cycle, and I really think it should be part of v.0.24. I'm sorry for the late timing, but I've been working as fast as feedback speed allowed (and pinging as often as politeness allowed): #21645 was lying around mostly finished for 2 months, the cython backend (#22986 / #23400) took about 2 months, and I haven't been able to get an answer at #22824 for about 5 weeks (e.g. @jorisvandenbossche seems to be very busy or simply not available).

As for the PR itself, I only wanted to change the return-type in this PR, but Series.unique touches several important paths:

Series.unique
IndexOpsMixin.unique and hence Index.unique
Categorical.unique
EA.unique

So, as a demo here, I've adapted the first three, and it's a separate issue that the EA contract should support return_inverse. This is also something that IMO needs to make it to v.0.24.

pep8speaks · 2018-12-05T07:38:27Z

Hello @h-vetinari! Thanks for submitting the PR.

There are no PEP8 issues in the file pandas/core/algorithms.py !
There are no PEP8 issues in the file pandas/core/arrays/categorical.py !
There are no PEP8 issues in the file pandas/core/base.py !
There are no PEP8 issues in the file pandas/core/indexes/base.py !
There are no PEP8 issues in the file pandas/core/series.py !
There are no PEP8 issues in the file pandas/tests/arrays/categorical/test_analytics.py !
There are no PEP8 issues in the file pandas/tests/arrays/sparse/test_array.py !
There are no PEP8 issues in the file pandas/tests/extension/base/methods.py !
There are no PEP8 issues in the file pandas/tests/frame/test_indexing.py !
There are no PEP8 issues in the file pandas/tests/plotting/common.py !
There are no PEP8 issues in the file pandas/tests/reshape/merge/test_merge.py !
There are no PEP8 issues in the file pandas/tests/series/test_duplicates.py !
There are no PEP8 issues in the file pandas/tests/test_algos.py !
There are no PEP8 issues in the file pandas/tests/test_base.py !

h-vetinari

Some inline notes

h-vetinari · 2018-12-05T07:48:49Z

pandas/core/base.py

+        if isinstance(self, ABCSeries):
+            uniqs = self.unique(raw=True)
+        else:
+            uniqs = self.unique()


Here, raw=True is unfortunately not compatible with the Index case

h-vetinari · 2018-12-05T07:49:15Z

pandas/core/series.py

+        >>> pd.Series([pd.Timestamp('2016-01-01')
+        ...            for _ in range(3)]).unique(raw=False)
+        0   2016-01-01
+        dtype: datetime64[ns]


I think this is clearly superior output...

h-vetinari · 2018-12-05T07:50:04Z

pandas/core/series.py

        Categories (3, object): [a < b < c]
-        """
-        result = super(Series, self).unique()


The branch for raw=True is exactly the same as before, but the diff is messed up because of the changed indentation.

h-vetinari · 2018-12-05T07:50:56Z

pandas/tests/extension/base/methods.py

+        if isinstance(duplicated, ABCSeries) and method != pd.unique:
+            result = method(duplicated, raw=True)
+        else:
+            result = method(duplicated)


This is a bit awkward, I'll admit, but it's the only way I found of keeping the parametrisation without raising the FutureWarning.
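
For readers unfamiliar with the pattern, a deprecation shim that produces such a warning might look roughly like the sketch below. Everything here is an assumption for illustration: `unique_compat` is a hypothetical helper, and the `raw=None` default, the warning text, and the use of `drop_duplicates` for the Series branch are not taken from this PR's code.

```python
import warnings

import pandas as pd


def unique_compat(series, raw=None):
    """Hypothetical shim illustrating a return-type deprecation cycle."""
    if raw is None:
        # Callers who did not choose explicitly get the old behaviour plus a warning.
        warnings.warn(
            "the return type of unique will change to Series; pass raw=True "
            "to keep the array or raw=False to get a Series",
            FutureWarning,
            stacklevel=2,
        )
        raw = True
    if raw:
        # Old behaviour: plain array of unique values.
        return series.unique()
    # Assumed new behaviour: a Series indexed by the first occurrences.
    return series.drop_duplicates()
```

Passing `raw=True` explicitly, as the test above does, takes the old code path without triggering the warning.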

h-vetinari · 2018-12-05T07:59:54Z

pandas/core/series.py

+
+        We see that the values of `animals` get reconstructed correctly, but
+        the index does not match yet  -- consequently, the last step is to
+        correctly set the index.


At this point, it'd be really neat to have #22225 available to use .set_index...

codecov · 2018-12-05T08:14:04Z

Codecov Report

Merging #24108 into master will decrease coverage by 0.04%.
The diff coverage is 60%.

@@            Coverage Diff             @@
##           master   #24108      +/-   ##
==========================================
- Coverage   92.21%   92.16%   -0.05%     
==========================================
  Files         161      161              
  Lines       51684    51723      +39     
==========================================
+ Hits        47658    47673      +15     
- Misses       4026     4050      +24

Flag	Coverage Δ
#multiple	`90.57% <60%> (-0.04%)`	⬇️
#single	`42.98% <23.63%> (-0.03%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/series.py	`92.59% <45.45%> (-1.09%)`	⬇️
pandas/core/indexes/base.py	`96.22% <50%> (-0.11%)`	⬇️
pandas/core/base.py	`96.75% <66.66%> (-0.89%)`	⬇️
pandas/core/arrays/categorical.py	`95.17% <75%> (-0.23%)`	⬇️
pandas/core/algorithms.py	`94.68% <77.77%> (-0.43%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4b5f4d1...6fd279a. Read the comment docs.

jreback

you are right this is way too big.

the question about the return value of .unique needs to be answered first; return an .array of the result (for Index and Series) is probably the most reasonable change and is mostly backward compatible

however better to bring this up on the issue (for unique return value)

h-vetinari · 2018-12-05T19:07:40Z

@jreback:

the question about the return value of .unique needs to be answered first;

This is what this PR is about, since the issue had stalled for >1 month despite several pings.

return an .array of the result (for Index and Series) is probably the most reasonable change and is mostly backward compatible

I disagree with this quite strongly:

Changing Index.unique from Index->ndarray is as much of a breaking change as changing Series.unique from ndarray->Series (but has no benefits for reconstruction)
Series.unique already special-cases Categorical and EA. An ndarray fits even less as the return of a Series method for where pandas is heading.
Since .unique strongly advertises that it does not sort, there's an implicit index mapping happening already, only that it's very hard to coax out.
If it were to keep returning ndarray, having an inverse is basically impossible without running into several antipatterns.
etc.
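
The implicit index mapping mentioned in the third point can be made visible with existing pandas calls. This is only a sketch with made-up data, not the API proposed here, and it recovers the first-occurrence labels only, not the inverse.

```python
import pandas as pd

s = pd.Series(["b", "a", "b", "c", "a"], index=[10, 11, 12, 13, 14], dtype=object)

# Series.unique: values in order of first appearance, index information dropped.
values_only = s.unique()

# drop_duplicates keeps the label of each first occurrence alongside the value.
firsts = s.drop_duplicates()
```

Here `firsts` carries the same values as `values_only`, but it also records which rows they came from.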

however better to bring this up on the issue (for unique return value)

Would you mind chiming in there then?

jreback · 2019-05-12T21:24:20Z

good idea, but closing as stale

h-vetinari · 2019-05-13T05:36:07Z

@jreback: good idea, but closing as stale

Not stale, but will need precursors, like #24119.

h-vetinari · 2019-10-07T06:18:36Z

@jreback @gfyoung @simonjayhawkins
Can we please reopen this PR, or at least the precursor #24119?

jreback · 2019-10-07T07:23:01Z

i am not averse to changing the return type of .unique, but that cannot be linked to return_inverse, which likely has less support

PRs need to do exactly 1 thing, as has been stated many times
this PR is too complex

h-vetinari · 2019-10-07T07:57:20Z

@jreback
Thanks for the response. I know this PR is too big (see OP), but it is the goal towards which I'd chip off smaller PRs (e.g. #24119).

Unfortunately, the return_inverse parts will have to come first, because it is not possible to return a Series with the correct index without tracking the inverse internally. In fact, I would argue that this is the main reason that Series.unique historically returned an array, because the cython-backend did not support calculating the inverse (luckily that's already been added in #22986 / #23400; the rest is a relatively easy change - see #24119).

jreback · 2019-10-07T08:11:07Z

@h-vetinari maybe i still don’t see it

why is returning a Series from .unique() actually useful?
why not an .array? then doesn’t .factorize() exactly provide the ability to reconstruct (iirc your main goal); speaking of reconstruction; how is this useful to a user? what is the use case?
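
For context, the `.factorize()` route mentioned here looks roughly like this with the existing API. It is a sketch with made-up data; note that the original index has to be reattached by hand.

```python
import pandas as pd

s = pd.Series(["b", "a", "b", "c", "a"], index=[10, 11, 12, 13, 14], dtype=object)

# codes is an integer array; uniques holds values in order of first appearance.
codes, uniques = pd.factorize(s)

# Taking the uniques at the codes rebuilds the values; the index is restored manually.
rebuilt = pd.Series(uniques.take(codes), index=s.index)
```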

jorisvandenbossche · 2019-10-07T08:24:10Z

@jreback many of those questions have already seen some discussion in #22824 (including example use cases), so I would suggest we keep the general discussion on overhauling unique there. Can you repeat your comment there?

h-vetinari added 5 commits December 5, 2018 02:00

API/ENH/DEPR: Series.unique returns Series; .unique gets return_inverse

b61ac0e

Fixes for tests

7292921

TST: first pass at tests

10432d4

Add kwarg to Index

9601d6b

Whatsnew

6fd279a

jreback requested changes Dec 5, 2018

This was referenced Dec 5, 2018

API/ENH: overhaul/unify/improve .unique #22824

Open

API: add return_inverse to pd.unique #24119

Closed

gfyoung added the Dtype Conversions and Algos labels Dec 6, 2018

h-vetinari mentioned this pull request Dec 6, 2018

RLS: 0.24.0 #24060

Closed

jreback closed this May 12, 2019

h-vetinari mentioned this pull request Dec 12, 2021

RLS: 1.4 #41957

Closed
