BUG: fix usage of na_sentinel with sort=True in factorize() #25592

jorisvandenbossche · 2019-03-07T14:59:36Z

jorisvandenbossche · 2019-03-07T15:17:47Z

pandas/core/algorithms.py

-        except TypeError:
-            # Mixed types, where uniques.argsort fails.
+        if na_sentinel == -1:
+            # GH-25409 take_1d only works for na_sentinels of -1


Adding this if/else check to keep the usage of take_1d when possible. It was added some time ago to improve performance compared to safe_sort, but it does not handle non-default na_sentinels.
If we want to keep the code simpler, I can also remove the take_1d altogether, but that would give a performance degradation compared to the last release.

This seems reasonable. Alternatively, safe_sort could do this try / except?

You mean the if / else? (as it would also have to deal with the non-default na_sentinels)

Hmm yeah, it would need that as well. But I think it's probably fine here, unless we see this pattern coming up often when using safe_sort.
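To illustrate why the `na_sentinel == -1` check matters, here is a minimal pure-Python sketch. It is not the pandas implementation: `take_minus_one_is_missing` and `resort_codes` are hypothetical names, and the toy take only imitates the behaviour described above (only `-1` is treated as missing).

```python
def take_minus_one_is_missing(values, indexer, fill_value):
    """Toy take: only an index of -1 is treated as missing (assumption)."""
    return [fill_value if i == -1 else values[i] for i in indexer]


def resort_codes(uniques, codes, na_sentinel=-1):
    """Sort uniques and remap codes so they point into the sorted uniques."""
    # order[new_pos] = old_pos, i.e. a list-based argsort of uniques
    order = sorted(range(len(uniques)), key=uniques.__getitem__)
    # inverse[old_pos] = new_pos (the "order.argsort()" step)
    inverse = [0] * len(order)
    for new_pos, old_pos in enumerate(order):
        inverse[old_pos] = new_pos

    if na_sentinel == -1:
        # Fast path: the take itself recognises -1 as missing.
        new_codes = take_minus_one_is_missing(inverse, codes, fill_value=-1)
    else:
        # General path: handle the custom sentinel explicitly.
        new_codes = [na_sentinel if c == na_sentinel else inverse[c] for c in codes]
    return [uniques[i] for i in order], new_codes


uniques, codes = ["b", "a", "c"], [0, 1, -2, 2]  # -2 marks the missing value

# Feeding a custom sentinel to the -1-only take mislabels the missing entry.
inverse = [1, 0, 2]
wrong = take_minus_one_is_missing(inverse, codes, fill_value=-2)
right = resort_codes(uniques, codes, na_sentinel=-2)
```

In this toy, the missing entry silently receives a valid-looking code because of Python's negative indexing; what the real take_1d returns in that situation is not claimed here.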

codecov · 2019-03-07T15:58:33Z

Codecov Report

Merging #25592 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #25592      +/-   ##
==========================================
+ Coverage   91.26%   91.26%   +<.01%     
==========================================
  Files         173      173              
  Lines       52966    52968       +2     
==========================================
+ Hits        48338    48340       +2     
  Misses       4628     4628

Flag	Coverage Δ
#multiple	`89.83% <100%> (ø)`	⬆️
#single	`41.71% <75%> (ø)`	⬆️

Impacted Files	Coverage Δ
pandas/core/algorithms.py	`94.79% <100%> (+0.01%)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update b72e7ed...7356997.

codecov · 2019-03-07T15:58:35Z

Codecov Report

Merging #25592 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #25592      +/-   ##
==========================================
- Coverage   91.29%   91.29%   -0.01%     
==========================================
  Files         173      173              
  Lines       52961    52963       +2     
==========================================
+ Hits        48350    48351       +1     
- Misses       4611     4612       +1

Flag	Coverage Δ
#multiple	`89.87% <100%> (ø)`	⬆️
#single	`41.73% <75%> (ø)`	⬆️

Impacted Files	Coverage Δ
pandas/core/algorithms.py	`94.79% <100%> (+0.01%)`	⬆️
pandas/util/testing.py	`88.98% <0%> (-0.1%)`	⬇️

Last update dc7b466...a9c880e.

jreback · 2019-03-07T22:07:12Z

pandas/core/algorithms.py

+                order2 = order.argsort()
+                labels = take_1d(order2, labels, fill_value=na_sentinel)
+                uniques = uniques.take(order)
+            except TypeError:


woa? this is way repetitive. why is what you have in the try not enough here?

take_1d cannot handle custom na_sentinels. See the discussion in the issue.

I read the issue. this fix is way too hacky. A custom sentinel is not a blocking issue for 0.24.2

Well, as I said above, I can also simply remove the if/else and try/except completely, by not using take_1d if that is preferred. A small performance gain is not worth introducing (now knowingly) a regression in behaviour.

But personally, I also have no problem with just merging this PR as is.

this is just adding technical debt and am -1 in merging as is

Some reasoning is explained here: #19938 (comment) (mainly performance I think)

ok, then let's just defer this change to 0.25.0 and do this w/o the multiple repetitive try/excepts.

If you don't like the extra if/else, but are OK with reverting to only safe_sort but only for 0.25.0: would you be OK with the current PR for 0.24.2, if I directly do a follow-up PR to remove the if/else try/except on master / for 0.25.0?

@jreback OK?

This would be good for 0.24.2.

What's the duplicate code / tech debt here? The call to safe_sort?

jreback · 2019-03-12T12:42:25Z

not sure why this is that important for 0.24.2. fixing this properly would block the release which doesn't make much sense.

TomAugspurger · 2019-03-12T12:54:50Z

I think it's essentially ready though, right?

jreback · 2019-03-12T13:05:37Z

this is just adding technical debt and am -1 on merging as is. Also the idea of merging a patch then immediately 'fixing' it in master is better, but still, why the urgency. This is a very small edge case, yet adding a horrible hack.

jorisvandenbossche · 2019-03-12T13:08:50Z

> Also the idea of merging a patch then immediately 'fixing' it in master is better, but still, why the urgency.

If you say it is better, do you also mean you are OK with that?

For sure it is not the most important bug fix, but it is a very clear error (the output is simply total nonsense), for which we have a fix that we can just merge, so why not?

TomAugspurger · 2019-03-12T13:08:52Z

Can you explain the tech-debt comment? I still don't quite understand that.

Would you prefer the try / except be moved inside its own function? I prefer not to make functions unless they're used in more than one place, since it makes the code harder to follow.

jreback · 2019-03-12T13:12:08Z

> If you say it is better, do you also mean you are OK with that?

no

> Can you explain the tech-debt comment? I still don't quite understand that.
>
> Would you prefer the try / except be moved inside its own function? I prefer not to make functions unless they're used in more than one place, since it makes the code harder to follow.

well this is adding multiple try/excepts with the same code. This is horrible.

jreback · 2019-03-12T13:12:25Z

again explain the urgency. this is NOT in any way important at all.

jorisvandenbossche · 2019-03-12T13:16:58Z

As I said above, it is indeed not urgent. But it is a ready fix for a known regression. So again, why not?

You say you want a proper solution. Can you then explain what you mean with this? What do you propose?

> well this is adding multiple try/excepts with the same code. This is horrible.

It is not adding multiple try/excepts. There was one, and there is still one. I only added one if/else case to check for a keyword argument depending on which the try/except should never be tried at all.

TomAugspurger · 2019-03-12T13:17:14Z

How should we proceed then?

jreback · 2019-03-12T13:19:35Z

i see no reason to merge this

this is making already complicated code way more complicated

jorisvandenbossche · 2019-03-12T13:23:08Z

Our current code is simply wrong. If you don't want the proposed solution here, can you propose an alternative solution?

TomAugspurger · 2019-03-12T13:24:45Z

Joris, maybe try moving the if condition into safe_sort itself? It’ll be the same code, but maybe a new diff would be refreshing.

TomAugspurger · 2019-03-12T14:19:57Z

Actually, doing the if na_sentinel == -1 check in safe_sort isn't quite straightforward, as that accepts list-like and we need array-like.

I'm still not sure where the "way repetitive" code is. The only repetition I see is

                uniques, labels = safe_sort(uniques, labels,
                                             na_sentinel=na_sentinel,
                                             assume_unique=True)

Am I missing something?

jreback · 2019-03-12T14:24:46Z

pandas/core/algorithms.py

+                uniques = uniques.take(order)
+            except TypeError:
+                # Mixed types, where uniques.argsort fails.
+                uniques, labels = safe_sort(uniques, labels,


I don't like the way this is written because it repeats the call to safe_sort. if you can avoid that (easy enough, just use a pass in the except), then call safe_sort if needed (assign uniques, labels as None, None initially), then this would be better

@jorisvandenbossche do you find this clearer? Personally, I find these dummy variables just for flow control a bit confusing (you have to verify that things are eventually set somewhere). With the way things are written, it's clear that each of the three branches has a uniques and labels, and it's clear how things are achieved.

If we're really concerned about repeating the call to safe_sort, you can make a closure with the values

get_uniques_labels = lambda: safe_sort(uniques, labels, na_sentinel, assume_unique=True)

and call the lambda, but again, that's just unnecessary indirection IMO.

In this case I think just do whatever.

I agree with Tom that this is not necessarily clearer than what it is now.
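For readers comparing the two control-flow shapes debated in this thread, here is a minimal pure-Python sketch. `fast_sort` and `general_sort` are hypothetical stand-ins for the take_1d path and safe_sort respectively; neither reproduces the pandas code, and the mixed-type ordering used by `general_sort` is an assumption made only for illustration.

```python
def fast_sort(uniques, codes):
    """Hypothetical fast path: assumes sentinel -1; raises TypeError on mixed types."""
    order = sorted(range(len(uniques)), key=uniques.__getitem__)
    inverse = [0] * len(order)
    for new_pos, old_pos in enumerate(order):
        inverse[old_pos] = new_pos
    return [uniques[i] for i in order], [-1 if c == -1 else inverse[c] for c in codes]


def general_sort(uniques, codes, na_sentinel):
    """Hypothetical fallback: any sentinel, mixed int/str values (non-strings first)."""
    def key(i):
        return (isinstance(uniques[i], str), uniques[i])

    order = sorted(range(len(uniques)), key=key)
    inverse = [0] * len(order)
    for new_pos, old_pos in enumerate(order):
        inverse[old_pos] = new_pos
    new_codes = [na_sentinel if c == na_sentinel else inverse[c] for c in codes]
    return [uniques[i] for i in order], new_codes


def sort_with_branches(uniques, codes, na_sentinel=-1):
    """Shape of the current diff: every branch assigns the result itself."""
    if na_sentinel == -1:
        try:
            uniques, codes = fast_sort(uniques, codes)
        except TypeError:
            # Mixed types: fall back to the general routine.
            uniques, codes = general_sort(uniques, codes, na_sentinel)
    else:
        uniques, codes = general_sort(uniques, codes, na_sentinel)
    return uniques, codes


def sort_with_placeholder(uniques, codes, na_sentinel=-1):
    """Shape suggested in review: one fallback call, guarded by a placeholder."""
    result = None
    if na_sentinel == -1:
        try:
            result = fast_sort(uniques, codes)
        except TypeError:
            pass  # leave result as None so the fallback below runs
    if result is None:
        result = general_sort(uniques, codes, na_sentinel)
    return result
```

Both functions return the same results for the same inputs; the difference is only whether the fallback call is written once behind a placeholder or once per branch.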

jorisvandenbossche · 2019-03-12T14:27:13Z

An additional problem is that safe_sort handles more cases, eg it also checks for out of bound labels (which we know we don't have here). So if using take_1d inside safe_sort, we would need extra code there to also handle out of bound labels when using take_1d

jorisvandenbossche · 2019-03-12T18:05:28Z

I looked a bit further into it. It is possible to move the logic into safe_sort, with:

- adding a check_outofbounds keyword to disable extra checks (otherwise the performance benefit of take_1d is lost)
- fixing safe_sort to work for EAs

How it looks: jorisvandenbossche@ba944eb

Would that be a preferable solution?
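As a rough illustration of the first point, the sketch below shows how a bounds-check flag could gate the extra work. Only the keyword name `check_outofbounds` comes from the comment above; `safe_sort_sketch` is a hypothetical pure-Python toy, and mapping out-of-bounds codes to the sentinel is an assumption, not a statement about the linked commit.

```python
def safe_sort_sketch(values, codes, na_sentinel=-1, check_outofbounds=True):
    """Toy model of a sort-and-remap helper with an optional bounds check."""
    n = len(values)
    order = sorted(range(n), key=values.__getitem__)
    # inverse[old_pos] = new_pos in the sorted values
    inverse = [0] * n
    for new_pos, old_pos in enumerate(order):
        inverse[old_pos] = new_pos

    def is_missing(code):
        if code == na_sentinel:
            return True
        # Assumption: out-of-bounds codes are mapped to the sentinel.
        return check_outofbounds and not (-n <= code < n)

    new_codes = [na_sentinel if is_missing(c) else inverse[c] for c in codes]
    return [values[i] for i in order], new_codes


# With the check on, stray codes are neutralised; with it off, the caller
# promises that every non-sentinel code is already a valid position.
checked = safe_sort_sketch(["b", "a"], [0, 7, -9, 1], na_sentinel=-9)
trusted = safe_sort_sketch(["b", "a"], [0, -9, 1], na_sentinel=-9,
                           check_outofbounds=False)
```

Skipping the check saves a comparison per code, which is the kind of saving the keyword is meant to preserve when the caller already knows the codes are in bounds.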

TomAugspurger · 2019-03-12T19:45:57Z

A bit less clear, but perhaps better since all EAs can benefit from it?

jorisvandenbossche · 2019-03-12T20:25:08Z

The discussion is mainly about a code style issue (a way to prevent calling safe_sort twice as Jeff proposes, vs calling the function twice as in the current diff), and not about the actual proposal (falling back to safe_sort for the impacted case). And code style is subjective, given that we disagree over which of the two is better. But since they are both basically equivalent, I would say this is a minor issue.

With my release manager hat on (but I know, I am not really independent here), I am going to decide not to block this PR over such a minor issue. So I will merge this PR. But, to address the discussion here, I will also open a new PR with my other proposal mentioned above (moving the logic to safe_sort and fixing that to handle EAs). But since this touches more parts, I prefer to keep this for 0.25.0.

jorisvandenbossche · 2019-03-12T20:34:24Z

PR is here: #25696

jreback · 2019-03-12T20:39:28Z

I am -1 on doing this again. merging bad code is not great here. There is no urgency on this at all. We hold up PRs all the time on code style.

BUG: fix usage of na_sentinel with sort=True in factorize()

7356997

jorisvandenbossche added this to the 0.24.2 milestone Mar 7, 2019

jorisvandenbossche added the Regression and Algos labels Mar 7, 2019

jorisvandenbossche commented Mar 7, 2019

jreback requested changes Mar 7, 2019

jreback modified the milestones: 0.24.2, 0.25.0 Mar 10, 2019

jorisvandenbossche added 2 commits March 11, 2019 18:20

fix dtype

e1ab3a4

Merge remote-tracking branch 'upstream/master' into factorize-na-sentinel

a9c880e

jorisvandenbossche modified the milestones: 0.25.0, 0.24.2 Mar 12, 2019

jreback modified the milestones: 0.24.2, 0.24.3 Mar 12, 2019

jorisvandenbossche modified the milestones: 0.24.3, 0.24.2 Mar 12, 2019

jreback reviewed Mar 12, 2019

jorisvandenbossche merged commit a8fad16 into pandas-dev:master Mar 12, 2019

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Mar 12, 2019

Backport PR pandas-dev#25592: BUG: fix usage of na_sentinel with sort=True in factorize()

5ce08c2

meeseeksmachine mentioned this pull request Mar 12, 2019

Backport PR #25592 on branch 0.24.x (BUG: fix usage of na_sentinel with sort=True in factorize()) #25695

Merged

jorisvandenbossche deleted the factorize-na-sentinel branch March 12, 2019 20:29

jorisvandenbossche mentioned this pull request Mar 12, 2019

CLN: handle EAs and fast path (no bounds checking) in safe_sort #25696

Merged

jorisvandenbossche pushed a commit that referenced this pull request Mar 12, 2019

Backport PR #25592: BUG: fix usage of na_sentinel with sort=True in factorize() (#25695)

d589e58
