BUG: fix usage of na_sentinel with sort=True in factorize() #25592

Merged
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.24.2.rst
@@ -32,6 +32,7 @@ Fixed Regressions
 - Fixed regression in creating a period-dtype array from a read-only NumPy array of period objects. (:issue:`25403`)
 - Fixed regression in :class:`Categorical`, where constructing it from a categorical ``Series`` and an explicit ``categories=`` that differed from that in the ``Series`` created an invalid object which could trigger segfaults. (:issue:`25318`)
 - Fixed pip installing from source into an environment without NumPy (:issue:`25193`)
+- Fixed regression in :func:`factorize` when passing a custom ``na_sentinel`` value with ``sort=True`` (:issue:`25409`).
 - Fixed regression in :meth:`DataFrame.to_csv` writing duplicate line endings with gzip compress (:issue:`25311`)

 .. _whatsnew_0242.enhancements:
20 changes: 13 additions & 7 deletions pandas/core/algorithms.py
@@ -619,13 +619,19 @@ def factorize(values, sort=False, order=None, na_sentinel=-1, size_hint=None):

     if sort and len(uniques) > 0:
         from pandas.core.sorting import safe_sort
-        try:
-            order = uniques.argsort()
-            order2 = order.argsort()
-            labels = take_1d(order2, labels, fill_value=na_sentinel)
-            uniques = uniques.take(order)
-        except TypeError:
-            # Mixed types, where uniques.argsort fails.
+        if na_sentinel == -1:
+            # GH-25409 take_1d only works for na_sentinels of -1
Member Author

Adding this if/else check to keep the usage of take_1d when possible. It was added some time ago to improve performance compared to safe_sort, but it does not handle non-default na_sentinels.
If we want to keep the code simpler, I can also remove the take_1d altogether, but that would give a performance degradation compared to the last release.
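
To illustrate why the fast path is limited to the default sentinel, here is a minimal NumPy-only sketch. `take_with_fill` is a hypothetical stand-in written for this illustration, not pandas' `take_1d`; it only mimics the convention that `-1` in the indexer marks a missing position. Under that assumption, any other sentinel is treated as an ordinary index, so the missing label can be silently replaced by a real code.

```python
import numpy as np


def take_with_fill(arr, indexer, fill_value):
    """Hypothetical stand-in for take_1d: only -1 in `indexer` means missing."""
    indexer = np.asarray(indexer, dtype=np.intp)
    missing = indexer == -1
    # Use position 0 as a placeholder for missing entries, then overwrite them.
    out = arr[np.where(missing, 0, indexer)]
    out[missing] = fill_value
    return out


uniques = np.array(['b', 'a'], dtype=object)
order = uniques.argsort()   # positions that sort the uniques
order2 = order.argsort()    # maps each old code to its new code

# Default sentinel: the missing label survives the relabelling.
default_result = take_with_fill(order2, [0, 1, -1, 0], fill_value=-1)

# Custom sentinel -2: it is read as a normal (negative) index into order2,
# so the missing label is replaced by a valid code.
custom_result = take_with_fill(order2, [0, 1, -2, 0], fill_value=-2)

print(default_result.tolist())
print(custom_result.tolist())
```

In this sketch a sentinel outside the bounds of the array (for example `100`) would raise an `IndexError` from NumPy instead of returning a wrong code.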

Contributor

This seems reasonable. Alternatively, safe_sort could do this try / except?

Member Author

You mean the if / else? (as it would also have to deal with the non-default na_sentinels)

Contributor

Hmm yeah, it would need that as well. But I think it's probably fine here, unless we see this pattern coming up often when using safe_sort.

+            try:
+                order = uniques.argsort()
+                order2 = order.argsort()
+                labels = take_1d(order2, labels, fill_value=na_sentinel)
+                uniques = uniques.take(order)
+            except TypeError:
Contributor

woa? this is way repetitive. why is what you have in the try not enough here?

Member Author

take_1d cannot handle custom na_sentinels. See the discussion in the issue.

Contributor

I read the issue. This fix is way too hacky. A custom sentinel is not a blocking issue for 0.24.2

Member Author

Well, as I said above, I can also simply remove the if/else and try/except completely, by not using take_1d if that is preferred. A small performance gain is not worth introducing (now knowingly) a regression in behaviour.

But personally, I also have no problem with just merging this PR as is.
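
The simpler alternative mentioned here (a single code path that handles any sentinel) could look roughly like the following NumPy-only sketch. `sort_and_relabel` is a hypothetical helper written for this illustration; it is not pandas' `safe_sort`, and it assumes the uniques are mutually comparable (no mixed types).

```python
import numpy as np


def sort_and_relabel(uniques, labels, na_sentinel=-1):
    """Sort `uniques` and remap `labels`, keeping `na_sentinel` untouched.

    Hypothetical helper for illustration only; assumes sortable uniques.
    """
    uniques = np.asarray(uniques)
    labels = np.asarray(labels, dtype=np.intp)

    order = uniques.argsort()
    # Inverse permutation: old code -> position after sorting.
    new_code = np.empty(len(order), dtype=np.intp)
    new_code[order] = np.arange(len(order))

    # Mask the sentinel explicitly instead of relying on a -1 convention.
    missing = labels == na_sentinel
    new_labels = new_code[np.where(missing, 0, labels)]
    new_labels[missing] = na_sentinel
    return uniques[order], new_labels


sorted_uniques, new_labels = sort_and_relabel(
    np.array(['b', 'a'], dtype=object), [0, 1, -10, 0], na_sentinel=-10)
print(sorted_uniques.tolist(), new_labels.tolist())
```

The explicit mask is what makes the sentinel value irrelevant; the cost is an extra pass over the labels, which is the performance trade-off discussed in this thread.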

Contributor

this is just adding technical debt and I am -1 on merging as is

Member Author

Some reasoning is explained here: #19938 (comment) (mainly performance I think)

Contributor

ok, then let's just defer this change to 0.25.0 and do this w/o the multiple repetitive try/excepts.

Member Author

If you don't like the extra if/else, but are OK with reverting to only safe_sort, but only for 0.25.0: would you be OK with the current PR for 0.24.2 if I directly do a follow-up PR to remove the if/else try/except on master / for 0.25.0?

Member Author

@jreback OK?

Contributor (@TomAugspurger, Mar 11, 2019)

This would be good for 0.24.2.

What's the duplicate code / tech debt here? The call to safe_sort?

+                # Mixed types, where uniques.argsort fails.
                 uniques, labels = safe_sort(uniques, labels,
Contributor

I don't like the way this is written because it repeats the call to safe_sort. If you can avoid that (easy enough: just use a pass in the except, then call safe_sort if needed, with uniques, labels assigned as None, None initially), then this would be better.

Contributor

@jorisvandenbossche do you find this clearer? Personally, I find these dummy variables just for flow control a bit confusing (you have to verify that things are eventually set somewhere). With the way things are written, it's clear that each of the three branches has a uniques and labels, and it's clear how things are achieved.

If we're really concerned about repeating the call to safe_sort, you can make a closure over the values

get_uniques_labels = lambda: safe_sort(uniques, labels, na_sentinel, assume_unique=True)

and call the lambda, but again, that's just unnecessary indirection IMO.

In this case I think just do whatever.
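
For comparison, the single-call shape suggested earlier in this thread (a placeholder set to None, a `pass` in the `except`, and one fallback call) might be sketched as follows. `fast_sort` and `slow_sort` are hypothetical stand-ins written for this illustration, not the pandas functions; `slow_sort` only handles a mix of numbers and strings.

```python
import numpy as np


def fast_sort(uniques, labels):
    """Fast path stand-in; assumes missing labels are marked with -1."""
    order = uniques.argsort()  # raises TypeError for mixed, unorderable types
    order2 = order.argsort()
    missing = labels == -1
    new_labels = order2[np.where(missing, 0, labels)]
    new_labels[missing] = -1
    return uniques.take(order), new_labels


def slow_sort(uniques, labels, na_sentinel):
    """Slow path stand-in: any sentinel, numbers sorted before strings."""
    keys = sorted(range(len(uniques)),
                  key=lambda i: (isinstance(uniques[i], str), uniques[i]))
    order = np.array(keys, dtype=np.intp)
    order2 = order.argsort()
    missing = labels == na_sentinel
    new_labels = order2[np.where(missing, 0, labels)]
    new_labels[missing] = na_sentinel
    return uniques.take(order), new_labels


def sort_step(uniques, labels, na_sentinel=-1):
    result = None  # placeholder used purely for flow control
    if na_sentinel == -1:
        try:
            result = fast_sort(uniques, labels)
        except TypeError:
            pass  # mixed types: fall through to the single slow-path call
    if result is None:
        result = slow_sort(uniques, labels, na_sentinel)
    return result


labels = np.array([0, 1, -1, 0], dtype=np.intp)
mixed_uniques, mixed_labels = sort_step(np.array(['b', 1], dtype=object), labels)
print(mixed_uniques.tolist(), mixed_labels.tolist())
```

This avoids the repeated call at the cost of the placeholder variable, which is the readability trade-off being debated here.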

Member Author

I agree with Tom that this is not necessarily clearer than what it is now.

                                             na_sentinel=na_sentinel,
                                             assume_unique=True)
+        else:
+            uniques, labels = safe_sort(uniques, labels,
+                                        na_sentinel=na_sentinel,
+                                        assume_unique=True)
15 changes: 15 additions & 0 deletions pandas/tests/test_algos.py
@@ -326,6 +326,21 @@ def test_parametrized_factorize_na_value(self, data, na_value):
         tm.assert_numpy_array_equal(l, expected_labels)
         tm.assert_numpy_array_equal(u, expected_uniques)

+    @pytest.mark.parametrize('sort', [True, False])
+    @pytest.mark.parametrize('na_sentinel', [-1, -10, 100])
+    def test_factorize_na_sentinel(self, sort, na_sentinel):
+        data = np.array(['b', 'a', None, 'b'], dtype=object)
+        labels, uniques = algos.factorize(data, sort=sort,
+                                          na_sentinel=na_sentinel)
+        if sort:
+            expected_labels = np.array([1, 0, na_sentinel, 1], dtype=np.intp)
+            expected_uniques = np.array(['a', 'b'], dtype=object)
+        else:
+            expected_labels = np.array([0, 1, na_sentinel, 0], dtype=np.intp)
+            expected_uniques = np.array(['b', 'a'], dtype=object)
+        tm.assert_numpy_array_equal(labels, expected_labels)
+        tm.assert_numpy_array_equal(uniques, expected_uniques)


class TestUnique(object):
