-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API/ENH/DEPR: Series.unique returns Series #24108
Conversation
Hello @h-vetinari! Thanks for submitting the PR.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some inline notes
if isinstance(self, ABCSeries): | ||
uniqs = self.unique(raw=True) | ||
else: | ||
uniqs = self.unique() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, raw=True
is unfortunately not compatible with the Index case
>>> pd.Series([pd.Timestamp('2016-01-01') | ||
... for _ in range(3)]).unique(raw=False) | ||
0 2016-01-01 | ||
dtype: datetime64[ns] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is clearly superior output...
Categories (3, object): [a < b < c] | ||
""" | ||
result = super(Series, self).unique() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The branch for raw=True
is exactly the same as before, but the diff is messed up because of the changed indentation.
if isinstance(duplicated, ABCSeries) and method != pd.unique: | ||
result = method(duplicated, raw=True) | ||
else: | ||
result = method(duplicated) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit awkward, I'll admit, but it's the only way I found of keeping the parametrisation while avoiding to raise the FutureWarning
.
|
||
We see that the values of `animals` get reconstructed correctly, but | ||
the index does not match yet -- consequently, the last step is to | ||
correctly set the index. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At this point, it's be really neat to have #22225 available to use .set_index
...
Codecov Report
@@ Coverage Diff @@
## master #24108 +/- ##
==========================================
- Coverage 92.21% 92.16% -0.05%
==========================================
Files 161 161
Lines 51684 51723 +39
==========================================
+ Hits 47658 47673 +15
- Misses 4026 4050 +24
Continue to review full report at Codecov.
|
1 similar comment
Codecov Report
@@ Coverage Diff @@
## master #24108 +/- ##
==========================================
- Coverage 92.21% 92.16% -0.05%
==========================================
Files 161 161
Lines 51684 51723 +39
==========================================
+ Hits 47658 47673 +15
- Misses 4026 4050 +24
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are right this is way too big.
the question about the return value of .unique needs to be answered first; return an .array of the result (for Index and Series) is probably the most reasonable change and is mostly backward compatible
however better to bring this up on the issue (for unique return value)
This is what this PR is about, since the issue had stalled for >1 months despite several pings.
I disagree with this quite strongly:
Would you mind chiming in there then? |
good idea, but closing as stale |
@jreback @gfyoung @simonjayhawkins |
i am not averse to changing the return type of .unique but cannot be linked to return_inverse which likely has less support PRs need to do exactly 1 thing as have been stated many times |
@jreback Unfortunately, the |
@h-vetinari maybe i still don’t see it why is returning a Series from .unique() actually useful? |
I know this PR is too big, but I need it to initiate discussion before
v.0.24
cutoff.I've been working on adding a
return_inverse
tounique
since mid June (as fast as possible). I know that changing the return type ofSeries.unique
is potentially a controversial issue, but I truly believe that this should be done for v.0.24 (as otherwise the behavior is locked in past 1.0 and who knows if it ever changes then).IMO, it's even a necessity if pandas ever wants to support
return_inverse
forunique
. Here's a few arguments from working on this for a while:.unique
is the obvious place for an inverse (as opposed to.duplicated
, which I was directed to work on first, ENH: add return_inverse to duplicated for DataFrame/Series/Index/MultiIndex #21645)Series
only works well if the return type ofSeries.unique
is aSeries
, see API/ENH: overhaul/unify/improve .unique #22824 (comment). This is not even the strongest reason IMO to change the type, as.unique
is a patchwork currently (see rest OP of API/ENH: overhaul/unify/improve .unique #22824):Index.unique
is already anIndex
(and changing that back tondarray
would be equally disruptive, and wouldn't lead anyhwere re:reconstruction)Series.unique
already special-casesCategorical
and (effectively)DatetimeArray
.The reason I'm pushing this WIP for discussion is that this is obviously needs a deprecation cycle, and I really think this should be part of
v.0.24
. I'm sorry for the late timing, but I've been working as fast as feedback speed allowed (as often as politeness allowed me to ping) - #21645 was lying around mostly finished for 2 month, the cython backend (#22986 / #23400) took about 2 months, and I haven't been able to get an answer at #22824 for about 5 weeks (e.g. @jorisvandenbossche seems to be very busy or simply not available)As for the PR itself, I only wanted to change the return-type in this PR, but
Series.unique
touches several important paths:Series.unique
IndexOpsMixin.unique
and henceIndex.unique
Categorical.unique
EA.unique
So, as a demo here, I've adapted the first three, and it's a separate issue that the EA contract should support the possibility for
return_inverse
. This is also something that should IMO needs to make it tov.0.24
.git diff upstream/master -u -- "*.py" | flake8 --diff
return_inverse
topd.unique
return_inverse
toIndex.unique
return_inverse
toCategorical.unique
raw
toSeries.unique
unique
-calls through class hierarchy and just directly call the cython methods fromSeries.unique
, which would allow introducing the deprecation without introducing thereturn_inverse
kwarg yet.