ENH: Added key option to df/series.sort_values(key=...) and df/series.sort_index(key=...) sorting #27237

jacobaustin123 · 2019-07-04T23:04:26Z

closes Add key to sorting functions #3942
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

Added a key parameter to the DataFrame/Series.sort_values and sort_index functions, matching Python sorted semantics, to allow custom sort orders. Addresses open issue #3942.
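For readers skimming the thread, here is a minimal sketch of what the key argument looks like in use. It assumes a pandas release that includes this change (1.1.0 or later, going by the whatsnew file touched below). The column names and data are made up for illustration. Note that, unlike sorted, the callable receives the whole column as a Series rather than one element at a time, as discussed later in this thread.

```python
import pandas as pd

# Illustrative data only; not taken from the PR's tests.
df = pd.DataFrame({"name": ["banana", "apple", "Cherry"], "qty": [3, 1, 2]})

# Default ordering is case-sensitive, so the uppercase "Cherry" sorts first.
default_order = df.sort_values("name")["name"].tolist()

# With key, the callable gets the column as a Series and returns the
# values to sort by; the original values are what end up in the result.
ci_order = df.sort_values("name", key=lambda col: col.str.lower())["name"].tolist()

print(default_order)
print(ci_order)
```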

WillAyd

Nice PR! Can you add support for Index to go with DataFrame / Series as well?

pandas/core/frame.py

WillAyd · 2019-07-06T17:54:29Z

Add from typing import Callable at the top of the module, then use just Callable for the annotation.

(Email reply to @ja3067's review comment on pandas/core/frame.py, Jul 6, 2019, at the new key=None argument of sort_values:)

Would you just like key : typing.Callable?

pep8speaks · 2019-07-06T21:55:10Z

Hello @jacobaustin123! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-04-27 02:12:02 UTC

jacobaustin123 · 2019-07-06T22:05:40Z

Just pushed a new commit that adds support for Index.sort_values(key=...) and adds a key flag to nargsort(...) and lexsort_indexer(...). I've also added static type hints. I still use Index.map to apply the key function for sort_index functions, because it seems simpler and semantically better to apply the key to the values and not the codes.
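A small sketch of the two entry points mentioned in this comment, assuming a pandas version that ships the key argument. In both cases the callable is handed the index values (as an Index), not the integer codes, which is the behavior argued for above. The data is illustrative.

```python
import pandas as pd

idx = pd.Index(["b", "A", "c"])

# Index.sort_values: the key receives the Index itself.
sorted_idx = idx.sort_values(key=lambda x: x.str.lower())

# Series.sort_index: the key is applied to the index, and the
# values are reordered to follow it.
s = pd.Series([1, 2, 3], index=["b", "A", "c"])
sorted_s = s.sort_index(key=lambda x: x.str.lower())

print(sorted_idx.tolist())
print(sorted_s)
```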

doc/source/whatsnew/v0.25.1.rst

pandas/core/frame.py

pandas/core/indexes/base.py

pandas/core/series.py

WillAyd · 2019-09-05T17:03:08Z

@ja3067 is this still active?

jacobaustin123 · 2019-09-05T17:17:19Z

I will make the requested changes this weekend. The only decision I think needs to be made is how sort_values(key=...) and sort_index(key=...) should behave for multiple levels or columns. Should the same function be applied to every level/column, or should the key function be expected to operate on a tuple? Likewise, how should the key behave for sort_remaining? Should it be applied to everything?

WillAyd · 2019-09-20T14:55:28Z

The only decision I think needs to be made is how sort_values(key=...) and sort_index(key=...) should behave for multiple levels or columns.

Without a key it just operates on the levels individually, no? If I'm not mistaken, the behavior should be the same here.

jacobaustin123 · 2019-09-20T15:37:05Z

Without a key it just operates on the levels individually, no? If I'm not mistaken, the behavior should be the same here.

@WillAyd Agreed. In the case, say, of one numeric column and one string column, should we, for instance, support a dictionary of {column_name: Callable} so that different keys can be passed, or should the key be applied only to the primary column? Same question for sort_index.

WillAyd · 2019-09-22T21:58:08Z

I would leave that enhancement to a separate PR if requested

WillAyd · 2019-11-07T21:05:13Z

@ja3067 is this still active?

jacobaustin123 · 2019-11-13T04:26:57Z

@WillAyd I'll make the necessary updates this weekend. Sorry for the delay.

jacobaustin123 · 2019-11-19T21:16:09Z

@WillAyd Ok. The final PR is almost done. The last question is about the key for sort_index on a MultiIndex when level is specified. There are two options:

1. The key function is applied to the entire MultiIndex tuple, as the map function does currently. You could do something like key=lambda x : x[0].lower()
2. The key function is applied to each level in the MultiIndex separately. The easiest way to do this would be to add a method to the MultiIndex called idx.maplevel(self, mapper, level=0) that applies a map function to a set of levels. However, this would probably require modifying the Index API. Would this be OK/worthwhile? This would be a little less flexible, but maybe more reasonable.

What do you think?
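To make the two options concrete, here is a rough sketch using only existing public methods. MultiIndex.map stands in for option 1 (the callable sees each full tuple), and get_level_values stands in for option 2 (the callable sees one level at a time). The maplevel method proposed above is hypothetical and is not used here; the data and level names are made up.

```python
import pandas as pd

mi = pd.MultiIndex.from_tuples(
    [("b", 2), ("A", 1), ("c", 3)], names=["letter", "num"]
)

# Option 1: tuple-wise mapping; each element passed in is a full tuple.
tuple_keys = mi.map(lambda t: t[0].lower())

# Option 2: level-wise mapping; operate on one level's values as an Index.
level_keys = mi.get_level_values("letter").str.lower()

# Either set of keys can then drive the ordering via argsort + take.
order = level_keys.argsort()
sorted_mi = mi.take(order)

print(tuple_keys.tolist())
print(level_keys.tolist())
print(sorted_mi.tolist())
```

Both options produce the same sort keys in this small case; they differ in what the user's callable has to accept, and the level-wise form can stay vectorized.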

jacobaustin123 · 2019-11-21T15:00:49Z

@WillAyd I've incorporated the review changes, and it passes tests now. Should be good.

pandas/core/frame.py

pandas/core/indexes/datetimelike.py

WillAyd · 2019-11-29T17:53:56Z

Can you merge master and repush? Not sure what the 37 error was, but it may be resolved.

jorisvandenbossche · 2020-04-13T06:35:20Z

@jorisvandenbossche the only big change is how keys are applied to MultiIndex objects. Keys are vectorized, and they're applied per column to DataFrames and now per level to MultiIndex. So if you do df.sort_index(key=...), the key is passed each level of the index separately. Otherwise things are the same.

Thanks, that sounds good!
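The per-level behavior described above can be sketched as follows, assuming a pandas version that includes this PR. The key is called once per index level with that level's values, so a string method such as str.lower has to make sense for every level being sorted. The frame below is illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"v": [1, 2, 3]},
    index=pd.MultiIndex.from_tuples([("b", "x"), ("a", "y"), ("A", "Z")]),
)

# Without a key the sort is case-sensitive on every level.
default_v = df.sort_index()["v"].tolist()

# With a key, each level is lowercased separately before the
# lexicographic sort; the original labels are kept in the result.
keyed = df.sort_index(key=lambda level: level.str.lower())

print(default_v)
print(keyed)
```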

jorisvandenbossche

Few small nitpicks. For the rest looks good!

doc/source/user_guide/basics.rst

doc/source/whatsnew/v1.1.0.rst

jorisvandenbossche · 2020-04-13T06:42:48Z

doc/source/whatsnew/v1.1.0.rst

+We've added a ``key`` argument to the DataFrame and Series sorting methods, including
+:meth:`DataFrame.sort_values`, :meth:`DataFrame.sort_index`, :meth:`Series.sort_values`,
+and :meth:`Series.sort_index`. The ``key`` can be any callable function which is applied
+to the each column of a DataFrame before sorting is performed (:issue:`27237`). See


Suggested change

to the each column of a DataFrame before sorting is performed (:issue:`27237`). See

to each column of a DataFrame before sorting is performed (:issue:`27237`). See

Apart from the typo, I find "each column" a bit confusing, as it is of course only applied to those columns that are used for sorting?

See if the new version is clearer.

New version looks good!

doc/source/whatsnew/v1.1.0.rst

pandas/core/generic.py

pandas/core/indexes/datetimelike.py

pandas/core/sorting.py

jacobaustin123 · 2020-04-14T05:08:34Z

@WillAyd @jorisvandenbossche @jreback This is badly timed since I know we want to finish this, but I made some changes today that add support for dictionary keys, i.e.

df = DataFrame({0: ["Hello", "goodbye"], 1: [0, 1]})
result = df.sort_values([0, 1], key={0: lambda col: col.str.lower()})

So you can specify a different key function for different levels/columns. I have a commit ready I can push now, but if people prefer, we can finalize this and have that as a separate commit, although as part of this addition I refactored the code significantly so everything is pulled into ensure_key_mapped. @jreback might prefer that, since it means there's no more key in lexsort_indexer or nargsort. I just didn't want to push it if people wanted to merge this and be done.
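The dictionary form shown above is a proposal; going by the rest of this thread it was left for a follow-up rather than merged here. A similar effect can be approximated with existing API by transforming a temporary copy of the sort columns and reusing the resulting order. The helper name sort_values_with_key_map is hypothetical and not part of pandas.

```python
import pandas as pd


def sort_values_with_key_map(df, by, key_map):
    """Hypothetical helper (not pandas API): per-column key functions."""
    # Work on a positional copy of just the sort columns.
    tmp = df[by].reset_index(drop=True)
    for col, func in key_map.items():
        # Each callable receives one column as a Series.
        tmp[col] = func(tmp[col])
    # After reset_index, the sorted index holds the original row positions.
    order = tmp.sort_values(by).index.to_numpy()
    return df.iloc[order]


df = pd.DataFrame({0: ["Hello", "goodbye"], 1: [0, 1]})
result = sort_values_with_key_map(df, [0, 1], {0: lambda col: col.str.lower()})
print(result)
```

Columns without an entry in key_map are sorted on their raw values, which is the same intent as the dictionary proposal.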

jreback · 2020-04-14T05:15:54Z

let’s not add anything more

jacobaustin123 · 2020-04-14T14:46:56Z

pandas/core/indexes/datetimelike.py

-            _as = self.argsort()
-            if not ascending:
-                _as = _as[::-1]
-            sorted_index = self.take(_as)
            return sorted_index, _as
        else:


@jreback do you know what this second branch is here for? I'm tempted to make the argsort method the default since it's necessary for key sorting, unless it's an important optimization.

jacobaustin123 · 2020-04-14T14:48:11Z

pandas/core/sorting.py

+    if axis == 0:
+        new_df = DataFrame._from_arrays(new_levels, df.columns, df.index)
+    else:
+        new_df = DataFrame._from_arrays(new_levels, df.index, df.columns).transpose()


Is there a way to do this (construct a DataFrame along either row or column axis) that's better than transposing it like this?

jacobaustin123 · 2020-04-14T14:50:20Z

@jreback Ok. I just reverted those changes but kept some of the refactoring, so e.g. lexsort_indexer no longer has a key function, and ensure_key_mapped handles all the logic for different datatypes. There are two questions I point out in a review above. Otherwise I'm happier with this version, and I think it achieves your goal of having key mapping totally encapsulated in ensure_key_mapped.

Dict keys can be an easy PR on top.

jacobaustin123 · 2020-04-26T03:13:40Z

@jreback gentle bump. would love to finish this up.

jreback

@jacobaustin123 I would really like to merge this, but you keep changing an enormous amount of code. Please go back to when I made comments, I think 10 or 12 days ago.

You changed all of ensure_key_mapped again; this is a mess. Please just go back to the much simpler version.

doc/source/user_guide/basics.rst

doc/source/whatsnew/v1.1.0.rst

pandas/core/sorting.py

jacobaustin123 · 2020-04-26T19:52:06Z

@jreback ok. I can force revert to the previous version, or just address your new comments here. Let me know what you prefer. The motivation for this was to do as you asked and move all logic to ensure_key_mapped.

jreback · 2020-04-26T19:54:16Z

@jreback ok. I can force revert to the previous version, or just address your new comments here. Let me know what you prefer. The motivation for this was to do as you asked and move all logic to ensure_key_mapped.

well, I'd like to make the doc changes I have above, and not have 3 different ensure_key_mapped functions. You had 1 or maybe 2 before. Please simplify. I am happy to take changes as a follow-up, but this is too much now.

jacobaustin123 · 2020-04-26T19:55:40Z

@jreback. Understood. I'll address doc comments and revert.

jreback · 2020-04-26T19:56:30Z

@jreback. Understood. I'll address doc comments and revert.

thanks.

this is just a very large PR and hard to grok what has changed.

jacobaustin123 · 2020-04-27T01:34:22Z

@jreback ok I just reverted to the prior version and updated documentation. I had to force push, but I just reverted to the commit before the major changes and then added two commits on top.

jreback · 2020-04-27T01:38:02Z

kk will look soon thanks @jacobaustin123

jreback · 2020-04-27T13:04:29Z

pandas/core/series.py

@@ -2986,6 +3039,9 @@ def sort_values(
            )

        def _try_kind_sort(arr):
+            arr = ensure_key_mapped(arr, key)
+            arr = getattr(arr, "_values", arr)


in a follow-on you can use extract_array here

jreback · 2020-04-27T16:07:17Z

pandas/core/sorting.py

@@ -204,15 +218,14 @@ def lexsort_indexer(keys, orders=None, na_position: str = "last"):
    elif orders is None:
        orders = [True] * len(keys)

-    for key, order in zip(keys, orders):
-
+    for k, order in zip(keys, orders):


can you do this

jreback · 2020-04-27T16:08:56Z

doc/source/whatsnew/v1.1.0.rst

+We've added a ``key`` argument to the DataFrame and Series sorting methods, including
+:meth:`DataFrame.sort_values`, :meth:`DataFrame.sort_index`, :meth:`Series.sort_values`,
+and :meth:`Series.sort_index`. The ``key`` can be any callable function which is applied
+column-by-column to each column used for sorting, before sorting is performed (:issue:`27237`).


can you also add the original issue number here (or rather, replace this issue number)

jreback · 2020-04-27T16:10:11Z

thanks @jacobaustin123

3 followon comments if you would. Also happy to take a refactoring PR in sorting generally. Targeted PRs are best.

jacobaustin123 · 2020-04-27T18:03:29Z

@jreback awesome thank you! I'll address these and maybe do a small refactoring PR. And maybe a second PR to add dictionary key sorting.

jacobaustin123 mentioned this pull request Jul 4, 2019: Add key to sorting functions #3942 (Closed)

WillAyd requested changes Jul 5, 2019: pandas/core/frame.py (outdated)

WillAyd added API Design Enhancement and removed API Design labels Jul 5, 2019

jreback requested changes Jul 5, 2019: pandas/core/frame.py (outdated)

jacobaustin123 force-pushed the master branch from c282377 to d30a8c3 July 7, 2019 00:33

WillAyd requested changes Aug 28, 2019: doc/source/whatsnew/v0.25.1.rst, pandas/core/frame.py, pandas/core/indexes/base.py, pandas/core/series.py (all outdated)

jacobaustin123 force-pushed the master branch 4 times, most recently from c814ec8 to 47f9751 November 20, 2019 21:51

WillAyd reviewed Nov 21, 2019: pandas/core/frame.py, pandas/core/indexes/datetimelike.py (two comments) (all outdated)

jacobaustin123 force-pushed the master branch from 47f9751 to 4b51227 November 28, 2019 16:54

jacobaustin123 force-pushed the master branch 3 times, most recently from 4ae2bdd to 9ce3b26 November 30, 2019 01:53

jacobaustin123 added 2 commits April 13, 2020 01:36

removed trailing whitespace 2957e60

linting 7c6c2f0

jorisvandenbossche approved these changes Apr 13, 2020

jacobaustin123 added 3 commits April 13, 2020 11:36

fixed small bug with datetimelike, updated docs ad745c4

fixed trailing whitespace 3ad3358

Merge branch 'master' of https://github.com/pandas-dev/pandas e87a9a9

jacobaustin123 commented Apr 14, 2020

jreback requested changes Apr 26, 2020

reverted and updated documentation a5d5c6d

jacobaustin123 force-pushed the master branch from 2b705a8 to a5d5c6d April 27, 2020 01:29

merged and updated 56f73ba

jacobaustin123 added 2 commits April 26, 2020 21:49

fixed linting issue and added comments 4250e31

fixed small issue in tests 4d5ba53

jreback approved these changes Apr 27, 2020

jreback reviewed Apr 27, 2020

jreback merged commit dec736f into pandas-dev:master Apr 27, 2020

rhshadrach pushed a commit to rhshadrach/pandas that referenced this pull request May 10, 2020

ENH: Added key option to df/series.sort_values(key=...) and df/series.sort_index(key=...) sorting (pandas-dev#27237) 78a475f
