-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: fix usage of na_sentinel with sort=True in factorize() #25592
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -619,13 +619,19 @@ def factorize(values, sort=False, order=None, na_sentinel=-1, size_hint=None): | |
|
||
if sort and len(uniques) > 0: | ||
from pandas.core.sorting import safe_sort | ||
try: | ||
order = uniques.argsort() | ||
order2 = order.argsort() | ||
labels = take_1d(order2, labels, fill_value=na_sentinel) | ||
uniques = uniques.take(order) | ||
except TypeError: | ||
# Mixed types, where uniques.argsort fails. | ||
if na_sentinel == -1: | ||
# GH-25409 take_1d only works for na_sentinels of -1 | ||
try: | ||
order = uniques.argsort() | ||
order2 = order.argsort() | ||
labels = take_1d(order2, labels, fill_value=na_sentinel) | ||
uniques = uniques.take(order) | ||
except TypeError: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. woa? this is way repetitive. why is what you have in the try not enough here? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I read the issue. this is fix is way too hacky. A custom sentinel is not a blocking issue for 0.24.2 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Well, as I said above, I can also simply remove the if/else and try/except completely, by not using But personally, I also have no problem with just merging this PR as is. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this is just adding technical debt and am -1 in merging as is There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Some reasoning is explained here: #19938 (comment) (mainly performance I think) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ok, then let's just defer this change to 0.25.0 and do this w/o the multiple repetitive try/excepts. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If you don't like the extra if/else, but are OK with reverting to only There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @jreback OK? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This would be good for 0.24.2. What's the duplicate code / tech debt here? The call to |
||
# Mixed types, where uniques.argsort fails. | ||
uniques, labels = safe_sort(uniques, labels, | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't like the way this is written because it repeats the call to safe_sort. if you can avoid that (easy enough, just use a pass in the except), then call safe_sort if needed (assign to uniques, lables as NOne, None initially), then this would be better There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @jorisvandenbossche do you find this clearer? Personally, I find these dummy variables just for flow control a bit confusing (you have to verify that things are eventually set somewhere). With the way things are written, it's clear that each of the three branches has a If we're really concerned about repeating the call to safe_sort, you can make a closer with the values
and call the lambda, but again, that's just unnecessary indirection IMO. In this case I think just do whatever. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree with Tom that this is not necessarily clearer as what it is now. |
||
na_sentinel=na_sentinel, | ||
assume_unique=True) | ||
else: | ||
uniques, labels = safe_sort(uniques, labels, | ||
na_sentinel=na_sentinel, | ||
assume_unique=True) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding this if/else check to keep the usage of
take_1d
when possible. It was added some time ago to improve performance compared tosafe_sort
, but thus does not handle non-default na_sentinels.If we want to keep the code simpler, I can also remove the
take_1d
alltogether, but that would give a performance degradation compared to the last release.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems reasonable. Alternatively,
safe_sort
could do this try / except?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean the if / else? (as it would also have to deal with the non-default na_sentinels)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm yeah, it would need that as well. But I think it's probably fine here, unless we see this pattern coming up often when using safe_sort.