BUG: fix usage of na_sentinel with sort=True in factorize() #25592

Merged

jorisvandenbossche
Member

Closes #25409

@jorisvandenbossche jorisvandenbossche added this to the 0.24.2 milestone Mar 7, 2019
@jorisvandenbossche jorisvandenbossche added the Regression (functionality that used to work in a prior pandas version) and Algos (non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff) labels Mar 7, 2019
except TypeError:
    # Mixed types, where uniques.argsort fails.
if na_sentinel == -1:
    # GH-25409 take_1d only works for na_sentinels of -1
Member Author

Adding this if/else check to keep the usage of take_1d when possible. It was added some time ago to improve performance compared to safe_sort, but it does not handle non-default na_sentinels.
If we want to keep the code simpler, I can also remove the take_1d altogether, but that would give a performance degradation compared to the last release.
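
The branching described in this comment can be sketched with plain NumPy. This is an illustration under stated assumptions, not the pandas code: `fast_relabel` and `general_relabel` are hypothetical stand-ins for the `take_1d` path and for `safe_sort`, and the `TypeError` fallback for mixed types is left out.

```python
import numpy as np


def fast_relabel(uniques, labels):
    """Fast path (hypothetical): only valid when missing labels are -1."""
    order = uniques.argsort()    # positions that sort the uniques
    inverse = order.argsort()    # maps old code -> new code
    # Append -1 so that a label of -1 (last position) maps back to -1.
    lookup = np.append(inverse, -1)
    return uniques.take(order), lookup[labels]


def general_relabel(uniques, labels, na_sentinel):
    """Slower path (hypothetical) that works for any sentinel value."""
    order = uniques.argsort()
    inverse = order.argsort()
    mask = labels == na_sentinel
    # Replace the sentinel by a valid position before indexing, then restore it.
    new_labels = inverse[np.where(mask, 0, labels)]
    new_labels[mask] = na_sentinel
    return uniques.take(order), new_labels


def sort_factorized(uniques, labels, na_sentinel=-1):
    # Keep the fast path when possible, otherwise use the general one.
    if na_sentinel == -1:
        return fast_relabel(uniques, labels)
    return general_relabel(uniques, labels, na_sentinel)
```

With `uniques = [3, 1, 2]` and `labels = [0, 1, -1, 2, 0]`, both branches give sorted uniques `[1, 2, 3]` and relabel the non-missing entries the same way, leaving the sentinel untouched.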

Contributor

This seems reasonable. Alternatively, safe_sort could do this try / except?

Member Author

You mean the if / else? (as it would also have to deal with the non-default na_sentinels)

Contributor

Hmm yeah, it would need that as well. But I think it's probably fine here, unless we see this pattern coming up often when using safe_sort.

codecov bot commented Mar 7, 2019

Codecov Report

Merging #25592 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #25592      +/-   ##
==========================================
+ Coverage   91.26%   91.26%   +<.01%     
==========================================
  Files         173      173              
  Lines       52966    52968       +2     
==========================================
+ Hits        48338    48340       +2     
  Misses       4628     4628
Flag Coverage Δ
#multiple 89.83% <100%> (ø) ⬆️
#single 41.71% <75%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/algorithms.py 94.79% <100%> (+0.01%) ⬆️

codecov bot commented Mar 7, 2019

Codecov Report

Merging #25592 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #25592      +/-   ##
==========================================
- Coverage   91.29%   91.29%   -0.01%     
==========================================
  Files         173      173              
  Lines       52961    52963       +2     
==========================================
+ Hits        48350    48351       +1     
- Misses       4611     4612       +1
Flag Coverage Δ
#multiple 89.87% <100%> (ø) ⬆️
#single 41.73% <75%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/algorithms.py 94.79% <100%> (+0.01%) ⬆️
pandas/util/testing.py 88.98% <0%> (-0.1%) ⬇️


    order2 = order.argsort()
    labels = take_1d(order2, labels, fill_value=na_sentinel)
    uniques = uniques.take(order)
except TypeError:
Contributor

woa? this is way repetitive. why is what you have in the try not enough here?

Member Author

take_1d cannot handle custom na_sentinels. See the discussion in the issue.
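
A small NumPy analogy for why a non-default sentinel is a problem for index-based relabeling. It does not reproduce the internals of `take_1d`; it only shows, under that assumption, how a sentinel such as -2 is read as an ordinary position unless it is masked first.

```python
import numpy as np

inverse = np.array([2, 0, 1])          # old code -> new code after sorting
labels = np.array([0, 1, -2, 2, 0])    # -2 is a custom sentinel for "missing"

# Plain positional indexing treats -2 as "second from the end",
# so the missing entry silently turns into a valid-looking code.
wrong = inverse[labels]

# Masking the sentinel before indexing keeps the missing entry intact.
mask = labels == -2
right = np.where(mask, -2, inverse[np.where(mask, 0, labels)])
```

Here `wrong` contains only non-negative codes, so the information about the missing value is lost, while `right` still carries -2 in the third position.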

Contributor

I read the issue. This fix is way too hacky. A custom sentinel is not a blocking issue for 0.24.2

Member Author

Well, as I said above, I can also simply remove the if/else and try/except completely, by not using take_1d if that is preferred. A small performance gain is not worth introducing (now knowingly) a regression in behaviour.

But personally, I also have no problem with just merging this PR as is.

Contributor

this is just adding technical debt and am -1 on merging as is

Member Author

Some reasoning is explained here: #19938 (comment) (mainly performance I think)

Contributor

ok, then let's just defer this change to 0.25.0 and do this w/o the multiple repetitive try/excepts.

Member Author

If you don't like the extra if/else, but are OK with reverting to only safe_sort, but only for 0.25.0: would you be OK with the current PR for 0.24.2 if I directly do a follow-up PR to remove the if/else try/except on master / for 0.25.0?

Member Author

@jreback OK?

Contributor

@TomAugspurger TomAugspurger Mar 11, 2019

This would be good for 0.24.2.

What's the duplicate code / tech debt here? The call to safe_sort?

@jreback jreback modified the milestones: 0.24.2, 0.25.0 Mar 10, 2019
@jorisvandenbossche jorisvandenbossche modified the milestones: 0.25.0, 0.24.2 Mar 12, 2019
Contributor

jreback commented Mar 12, 2019

not sure why this is that important for 0.24.2. fixing this properly would block the release which doesn't make much sense.

@jreback jreback modified the milestones: 0.24.2, 0.24.3 Mar 12, 2019
@TomAugspurger
Contributor

I think it's essentially ready though, right?

@jorisvandenbossche jorisvandenbossche modified the milestones: 0.24.3, 0.24.2 Mar 12, 2019
Contributor

jreback commented Mar 12, 2019

this is just adding technical debt and am -1 on merging as is. Also the idea of merging a patch then immediately 'fixing' it in master is better, but still, why the urgency. This is a very small edge case, yet adding a horrible hack.

@jorisvandenbossche
Member Author

> Also the idea of merging a patch then immediately 'fixing' it in master is better, but still, why the urgency.

If you say it is better, do you also mean you are OK with that?

For sure it is not the most important bug fix, but it is a very clear error (the output is simply total nonsense), for which we have a fix that we can just merge, so why not?

@TomAugspurger
Contributor

Can you explain the tech-debt comment? I still don't quite understand that.

Would you prefer the try / except be moved inside its own function? I prefer not to make functions unless they're used in more than one place, since it makes the code harder to follow.

Contributor

jreback commented Mar 12, 2019

> If you say it is better, do you also mean you are OK with that?

no

> Can you explain the tech-debt comment? I still don't quite understand that.
>
> Would you prefer the try / except be moved inside its own function? I prefer not to make functions unless they're used in more than one place, since it makes the code harder to follow.

well this is adding multiple try/excepts with the same code. This is horrible.

Contributor

jreback commented Mar 12, 2019

again explain the urgency. this is NOT in any way important at all.

@jorisvandenbossche
Member Author

As I said above, it is indeed not urgent. But it is a ready fix for a known regression. So again, why not?

You say you want a proper solution. Can you then explain what you mean with this? What do you propose?

> well this is adding multiple try/excepts with the same code. This is horrible.

It is not adding multiple try/excepts. There was one, and there is still one. I only added one if/else that checks a keyword argument; depending on its value, the try/except is skipped entirely.

Contributor

TomAugspurger commented Mar 12, 2019 via email

Contributor

jreback commented Mar 12, 2019

i see no reason to merge this

this is making already complicated code
way more complicated

@jorisvandenbossche
Member Author

Our current code is simply wrong. If you don't want the proposed solution here, can you propose an alternative solution?

@TomAugspurger
Contributor

Joris, maybe try moving the if condition into safe_sort itself? It’ll be the same code, but maybe a new diff would be refreshing.

@TomAugspurger
Contributor

Actually, doing the if na_sentinel == -1 check in safe_sort isn't quite straightforward, as that accepts list-like and we need array-like.

I'm still not sure where the "way repetitive" code is. The only repetition I see is

                uniques, labels = safe_sort(uniques, labels,
                                             na_sentinel=na_sentinel,
                                             assume_unique=True)

Am I missing something?

    uniques = uniques.take(order)
except TypeError:
    # Mixed types, where uniques.argsort fails.
    uniques, labels = safe_sort(uniques, labels,
Contributor

I don't like the way this is written because it repeats the call to safe_sort. If you can avoid that (easy enough, just use a pass in the except), then call safe_sort if needed (assign uniques, labels as None, None initially), then this would be better.
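
One possible reading of this suggestion, as a sketch only. `fast_path` and `slow_path` are hypothetical callables standing in for the `take_1d`-based code and for `safe_sort`; the function name and signature are made up for illustration.

```python
def relabel_once(uniques, labels, na_sentinel, fast_path, slow_path):
    """Try the fast path when allowed; call the slow path in one place only."""
    result = None  # placeholder that signals "not computed yet"
    if na_sentinel == -1:
        try:
            result = fast_path(uniques, labels)
        except TypeError:
            pass  # mixed types: fall through to the slow path below
    if result is None:
        result = slow_path(uniques, labels, na_sentinel)
    return result
```

The slow call appears once, at the cost of a placeholder variable whose final assignment the reader has to trace.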

Contributor

@jorisvandenbossche do you find this clearer? Personally, I find these dummy variables just for flow control a bit confusing (you have to verify that things are eventually set somewhere). With the way things are written, it's clear that each of the three branches has a uniques and labels, and it's clear how things are achieved.

If we're really concerned about repeating the call to safe_sort, you can make a closure with the values

get_uniques_labels = lambda: safe_sort(uniques, labels, na_sentinel, assume_unique=True)

and call the lambda, but again, that's just unnecessary indirection IMO.

In this case I think just do whatever.
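
For comparison, a sketch of the closure variant mentioned in this comment. As before, `fast_path` and `slow_path` are hypothetical stand-ins for the `take_1d`-based code and for `safe_sort`, and the wrapper function is invented for illustration.

```python
def relabel_with_closure(uniques, labels, na_sentinel, fast_path, slow_path):
    # Capture the arguments once so the slow call is spelled out a single time.
    fallback = lambda: slow_path(uniques, labels, na_sentinel)

    if na_sentinel != -1:
        return fallback()
    try:
        return fast_path(uniques, labels)
    except TypeError:
        # Mixed types: the fast path cannot sort, use the fallback.
        return fallback()
```

Every branch returns a value directly, but the reader has to look up what `fallback` does, which is the indirection this comment objects to.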

Member Author

I agree with Tom that this is not necessarily clearer than what it is now.

@jorisvandenbossche
Member Author

An additional problem is that safe_sort handles more cases, e.g. it also checks for out-of-bound labels (which we know we don't have here). So if using take_1d inside safe_sort, we would need extra code there to also handle out-of-bound labels when using take_1d.
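
A sketch of the kind of out-of-bound check meant here. The exact rule used by `safe_sort` is not shown in this thread, so treating every label outside `[0, n_uniques)` as missing is an assumption, and `mask_out_of_bounds` is a hypothetical helper.

```python
import numpy as np


def mask_out_of_bounds(labels, n_uniques, na_sentinel=-1):
    """Replace labels that cannot index into the uniques by the sentinel."""
    labels = np.asarray(labels)
    out_of_bounds = (labels < 0) | (labels >= n_uniques)
    return np.where(out_of_bounds, na_sentinel, labels)
```

This costs an extra pass over the labels, which is why skipping it matters when the caller already knows all labels are valid.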

Member Author

jorisvandenbossche commented Mar 12, 2019

I looked a bit further in it. It is possible to move the logic into safe_sort, with:

  • adding a check_outofbounds keyword to disable extra checks (otherwise the performance benefit of take_1d is lost)
  • fixing safe_sort to work for EAs

What it looks like: jorisvandenbossche@ba944eb

Would that be a more preferred solution?

Contributor

TomAugspurger commented Mar 12, 2019 via email

@jorisvandenbossche
Member Author

The discussion is mainly about a code style issue (a way to prevent calling safe_sort twice as Jeff proposes, vs calling the function twice as in the current diff), and not about the actual proposal (falling back to safe_sort for the impacted case). And code style is subjective, given that we disagree over which of the two is better. But since they are both basically equivalent, I would say this is a minor issue.

With my release manager hat on (but I know, I am not really independent here), I am going to decide not to block this PR over such a minor issue. So I will merge this PR. But, to address the discussion here, I will also open a new PR with my other proposal mentioned above (moving the logic to safe_sort and fixing that to handle EAs). But since this touches more parts, I prefer to keep this for 0.25.0.

@jorisvandenbossche
Member Author

PR is here: #25696

Contributor

jreback commented Mar 12, 2019

I am -1 on doing this again. Merging bad code is not great here. There is no urgency on this at all. We hold up PRs all the time on code style.

sighingnow added a commit to sighingnow/pandas that referenced this pull request Mar 14, 2019
* master: (22 commits)
  Fixturize tests/frame/test_operators.py (pandas-dev#25641)
  Update ValueError message in corr (pandas-dev#25729)
  DOC: fix some grammar and inconsistency issues in the User Guide (pandas-dev#25728)
  ENH: Add public start, stop, and step attributes to RangeIndex (pandas-dev#25720)
  Make Rolling.apply documentation clearer (pandas-dev#25712)
  pandas-dev#25707 - Fixed flakiness in stata write test (pandas-dev#25714)
  Json normalize nan support (pandas-dev#25619)
  TST: resolve issues with test_constructor_dtype_datetime64 (pandas-dev#24868)
  DEPR: Deprecate box kwarg for to_timedelta and to_datetime (pandas-dev#24486)
  BUG: Preserve name in DatetimeIndex.snap (pandas-dev#25585)
  Fix concat not respecting order of OrderedDict (pandas-dev#25224)
  CLN: remove pandas.core.categorical (pandas-dev#25655)
  TST/CLN: Remove more Panel tests (pandas-dev#25675)
  Pinned pycodestyle (pandas-dev#25701)
  DOC: update date of 0.24.2 release notes (pandas-dev#25699)
  BUG: Fix error in replace with strings that are large numbers (pandas-dev#25616) (pandas-dev#25644)
  BUG: fix usage of na_sentinel with sort=True in factorize() (pandas-dev#25592)
  BUG: Fix to_string output when using header (pandas-dev#16718) (pandas-dev#25602)
  CLN: Remove unused test code (pandas-dev#25670)
  CLN: remove Panel from concat error message (pandas-dev#25676)
  ...

# Conflicts:
#	doc/source/whatsnew/v0.25.0.rst
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Regression Functionality that used to work in a prior pandas version
3 participants