BUG: fix usage of na_sentinel with sort=True in factorize() #25592
except TypeError:
# Mixed types, where uniques.argsort fails.
if na_sentinel == -1:
# GH-25409 take_1d only works for na_sentinels of -1
Adding this if/else check to keep the usage of take_1d when possible. It was added some time ago to improve performance compared to safe_sort, but it does not handle non-default na_sentinels.
If we want to keep the code simpler, I can also remove the take_1d altogether, but that would give a performance degradation compared to the last release.
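To illustrate the limitation described in this comment, here is a minimal pure-Python sketch. It is not the pandas implementation: `take_minus_one_only` is a hypothetical stand-in that, like the behaviour described for take_1d, treats only -1 as the missing marker, and `argsort`, `relabel_fast` and `relabel_checked` are names invented for this illustration.

```python
def argsort(values):
    """Return the indices that would sort ``values``."""
    return sorted(range(len(values)), key=values.__getitem__)


def take_minus_one_only(source, indexer, fill_value):
    """Hypothetical stand-in for a take that only understands -1 as missing.

    Any other negative index is used as a plain (wrap-around) position.
    """
    return [fill_value if i == -1 else source[i] for i in indexer]


def relabel_fast(uniques, labels, na_sentinel=-1):
    """Sort ``uniques`` and remap ``labels`` through the fast take path."""
    order = argsort(uniques)
    order2 = argsort(order)  # inverse permutation: old code -> new code
    new_labels = take_minus_one_only(order2, labels, na_sentinel)
    return [uniques[i] for i in order], new_labels


def relabel_checked(uniques, labels, na_sentinel=-1):
    """Same remapping, but compare against ``na_sentinel`` explicitly."""
    order = argsort(uniques)
    order2 = argsort(order)
    new_labels = [na_sentinel if lab == na_sentinel else order2[lab]
                  for lab in labels]
    return [uniques[i] for i in order], new_labels


uniques = ["b", "a", "c"]

# Default sentinel: both paths agree.
default_fast = relabel_fast(uniques, [0, 1, -1, 2])
default_checked = relabel_checked(uniques, [0, 1, -1, 2])

# Custom sentinel -2: the fast path silently treats -2 as a position,
# so the missing entry ends up pointing at a real value.
custom_fast = relabel_fast(uniques, [0, 1, -2, 2], na_sentinel=-2)
custom_checked = relabel_checked(uniques, [0, 1, -2, 2], na_sentinel=-2)
```

In this toy model the custom-sentinel case goes wrong without raising an error, which is why the fast path is only safe when the sentinel is -1.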
This seems reasonable. Alternatively, safe_sort could do this try / except?
You mean the if / else? (as it would also have to deal with the non-default na_sentinels)
Hmm yeah, it would need that as well. But I think it's probably fine here, unless we see this pattern coming up often when using safe_sort.
Codecov Report
@@ Coverage Diff @@
## master #25592 +/- ##
==========================================
+ Coverage 91.26% 91.26% +<.01%
==========================================
Files 173 173
Lines 52966 52968 +2
==========================================
+ Hits 48338 48340 +2
Misses 4628 4628
Codecov Report
@@ Coverage Diff @@
## master #25592 +/- ##
==========================================
- Coverage 91.29% 91.29% -0.01%
==========================================
Files 173 173
Lines 52961 52963 +2
==========================================
+ Hits 48350 48351 +1
- Misses 4611 4612 +1
order2 = order.argsort()
labels = take_1d(order2, labels, fill_value=na_sentinel)
uniques = uniques.take(order)
except TypeError:
woa? this is way repetitive. why is what you have in the try not enough here?
take_1d cannot handle custom na_sentinels. See the discussion in the issue.
I read the issue. this fix is way too hacky. A custom sentinel is not a blocking issue for 0.24.2
Well, as I said above, I can also simply remove the if/else and try/except completely, by not using take_1d, if that is preferred. A small performance gain is not worth introducing (now knowingly) a regression in behaviour.
But personally, I also have no problem with just merging this PR as is.
this is just adding technical debt and am -1 in merging as is
Some reasoning is explained here: #19938 (comment) (mainly performance I think)
ok, then let's just defer this change to 0.25.0 and do this w/o the multiple repetitive try/excepts.
If you don't like the extra if/else, but are OK with reverting to only safe_sort, but only for 0.25.0: would you be OK with the current PR for 0.24.2, if I directly do a follow-up PR to remove the if/else try/except on master / for 0.25.0?
@jreback OK?
This would be good for 0.24.2.
What's the duplicate code / tech debt here? The call to safe_sort?
not sure why this is that important for 0.24.2. fixing this properly would block the release which doesn't make much sense.
I think it's essentially ready though, right?
this is just adding technical debt and am -1 on merging as is. Also the idea of merging a patch then immediately 'fixing' it in master is better, but still, why the urgency. This is a very small edge case, yet adding a horrible hack.
If you say it is better, do you also mean you are OK with that? For sure it is not the most important bug fix, but it is a very clear error (the output is simply total nonsense), for which we have a fix that we can just merge, so why not?
Can you explain the tech-debt comment? I still don't quite understand that. Would you prefer the [...]
no
well this is adding multiple try/excepts with the same code. This is horrible.
again explain the urgency. this is NOT in any way important at all.
As I said above, it is indeed not urgent. But it is a ready fix for a known regression. So again, why not? You say you want a proper solution. Can you then explain what you mean with this? What do you propose?
It is not adding multiple try/excepts. There was one, and there is still one. I only added one if/else case to check for a keyword argument depending on which the try/except should never be tried at all.
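The control flow being debated (one if/else around a single try/except) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the code from the diff: `slow_sort` is a hypothetical stand-in for the general fallback, `argsort` and `sort_uniques` are invented names, and the returned branch label exists only to make the path taken visible.

```python
def argsort(values):
    """Return the indices that would sort ``values``."""
    return sorted(range(len(values)), key=values.__getitem__)


def slow_sort(uniques, labels, na_sentinel):
    """Hypothetical general fallback (plays the role of safe_sort here).

    Orders numbers before strings so that mixed types can be sorted,
    and honours any ``na_sentinel``.
    """
    order = sorted(range(len(uniques)),
                   key=lambda i: (isinstance(uniques[i], str), uniques[i]))
    order2 = argsort(order)
    new_labels = [na_sentinel if lab == na_sentinel else order2[lab]
                  for lab in labels]
    return [uniques[i] for i in order], new_labels


def sort_uniques(uniques, labels, na_sentinel=-1):
    """One if/else on the sentinel, one try/except inside the fast branch."""
    if na_sentinel == -1:
        try:
            order = argsort(uniques)  # raises TypeError on mixed types
            order2 = argsort(order)
            labels = [-1 if lab == -1 else order2[lab] for lab in labels]
            uniques = [uniques[i] for i in order]
            branch = "fast"
        except TypeError:
            uniques, labels = slow_sort(uniques, labels, na_sentinel)
            branch = "fallback"
    else:
        # Custom sentinel: the fast path is never attempted.
        uniques, labels = slow_sort(uniques, labels, na_sentinel)
        branch = "custom-sentinel"
    return uniques, labels, branch
```

The fallback call appears twice in this layout, which is the repetition the thread is arguing about; there is still only one try/except.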
How should we proceed then?
On Mar 12, 2019, at 08:12, Jeff Reback wrote:
again explain the urgency. this is NOT in any way important at all.
i see no reason to merge this; this is making already complicated code
Our current code is simply wrong. If you don't want the proposed solution here, can you propose an alternative solution?
Joris, maybe try moving the if condition into safe_sort itself? It’ll be the same code, but maybe a new diff would be refreshing.
Actually, doing the [...] I'm still not sure where the "way repetitive" code is. The only repetition I see is [...]
Am I missing something?
uniques = uniques.take(order)
except TypeError:
# Mixed types, where uniques.argsort fails.
uniques, labels = safe_sort(uniques, labels,
I don't like the way this is written because it repeats the call to safe_sort. If you can avoid that (easy enough, just use a pass in the except), then call safe_sort if needed (assign uniques, labels as None, None initially), then this would be better.
@jorisvandenbossche do you find this clearer? Personally, I find these dummy variables just for flow control a bit confusing (you have to verify that things are eventually set somewhere). With the way things are written, it's clear that each of the three branches has a uniques and labels, and it's clear how things are achieved.
If we're really concerned about repeating the call to safe_sort, you can make a closure with the values
get_uniques_labels = lambda: safe_sort(uniques, labels, na_sentinel, assume_unique=True)
and call the lambda, but again, that's just unnecessary indirection IMO.
In this case I think just do whatever.
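For comparison, the placeholder style suggested in the review (a pass in the except, then a single fallback call if nothing was assigned) might look like the sketch below. This is one possible reading of that suggestion, written in plain Python: `slow_sort` is a hypothetical stand-in for the general fallback, and `argsort` and `sort_with_placeholder` are invented names.

```python
def argsort(values):
    """Return the indices that would sort ``values``."""
    return sorted(range(len(values)), key=values.__getitem__)


def slow_sort(uniques, labels, na_sentinel):
    """Hypothetical general fallback (plays the role of safe_sort here)."""
    order = sorted(range(len(uniques)),
                   key=lambda i: (isinstance(uniques[i], str), uniques[i]))
    order2 = argsort(order)
    new_labels = [na_sentinel if lab == na_sentinel else order2[lab]
                  for lab in labels]
    return [uniques[i] for i in order], new_labels


def sort_with_placeholder(uniques, labels, na_sentinel=-1):
    """Single fallback call, with None meaning 'not computed yet'."""
    new_uniques, new_labels = None, None
    if na_sentinel == -1:
        try:
            order = argsort(uniques)  # raises TypeError on mixed types
            order2 = argsort(order)
            new_labels = [-1 if lab == -1 else order2[lab] for lab in labels]
            new_uniques = [uniques[i] for i in order]
        except TypeError:
            pass  # mixed types: fall through to the general path
    if new_uniques is None:
        # Reached for custom sentinels and for the TypeError case alike.
        new_uniques, new_labels = slow_sort(uniques, labels, na_sentinel)
    return new_uniques, new_labels
```

The fallback is now called in one place only, at the cost of the reader having to check that the placeholders are always filled before the return, which is the trade-off raised in the comment above.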
I agree with Tom that this is not necessarily clearer than what it is now.
An additional problem is that [...]
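I looked a bit further into it. It is possible to move the logic into safe_sort, with:
- adding a check_outofbounds keyword to disable extra checks (otherwise the performance benefit of take_1d is lost)
- fixing safe_sort to work for EAs
This is how it looks: jorisvandenbossche@ba944eb. Would that be a preferable solution?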
A bit less clear, but perhaps better since all EAs can benefit from it?
On Tue, Mar 12, 2019 at 1:05 PM Joris Van den Bossche wrote:
I looked a bit further in it. It is possible to move the logic into
safe_sort, with:
- adding a check_outofbounds keyword to disable extra checks (otherwise the performance benefit of take_1d is lost)
- fixing safe_sort to work for EAs
how it looks like: jorisvandenbossche@ba944eb
Would that be a better solution?
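As a rough idea of what a keyword like check_outofbounds could control, here is a pure-Python sketch. It is an assumption about the intent, not the code in the linked commit: `remap_labels` is a hypothetical helper, and "extra checks" is read here as masking any label outside the valid range.

```python
def remap_labels(order2, labels, na_sentinel=-1, check_outofbounds=True):
    """Map old codes to new codes through the inverse permutation ``order2``.

    With ``check_outofbounds=True`` every label outside ``[0, len(order2))``
    is replaced by ``na_sentinel``. With ``False`` only ``na_sentinel`` itself
    is special-cased and the caller promises all other labels are valid.
    """
    n = len(order2)
    out = []
    for lab in labels:
        if lab == na_sentinel:
            out.append(na_sentinel)
        elif check_outofbounds and not 0 <= lab < n:
            out.append(na_sentinel)  # the extra, slower validation step
        else:
            out.append(order2[lab])
    return out


order2 = [1, 0, 2]

# Untrusted labels: 5 is out of bounds and gets masked.
checked = remap_labels(order2, [0, 1, -2, 2, 5], na_sentinel=-2)

# Labels known to be valid: skip the range check.
unchecked = remap_labels(order2, [0, 1, -2, 2], na_sentinel=-2,
                         check_outofbounds=False)
```

A caller that has just produced the labels itself can safely skip the check, which is how the performance argument in the list above would carry over.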
The discussion is mainly about a code style issue (a way to prevent calling [...]
With my release manager hat on (but I know, I am not really independent here), I am going to decide not to block this PR over such a minor issue. So I will merge this PR. But, to address the discussion here, I will also open a new PR with my other proposal mentioned above (moving the logic to [...]
PR is here: #25696
I am -1 on doing this again. merging bad code is not great here. There is no urgency on this at all. We hold up PRs all the time on code style.
Closes #25409