ENH: Added key option to df/series.sort_values(key=...) and df/series.sort_index(key=...) sorting #27237

Merged on Apr 27, 2020 (64 commits)
Changes from 57 commits
Commits
eddd918
ENH: added df/series.sort_values(key=...) and df/series.sort_index(ke…
jacobaustin123 Jul 4, 2019
e05462a
fixed a few small bugs
jacobaustin123 Jan 28, 2020
0f33c5c
bug fixes
jacobaustin123 Jan 28, 2020
b7d76cd
fixed
jacobaustin123 Jan 28, 2020
cf1fb5a
Merge branch 'master' of http://github.com/pandas-dev/pandas
jacobaustin123 Jan 28, 2020
8343f76
fixed
jacobaustin123 Jan 28, 2020
94281d3
fixed
jacobaustin123 Jan 28, 2020
c505dd9
updated docstrings
jacobaustin123 Jan 28, 2020
ecb6910
fixed documentation
jacobaustin123 Jan 28, 2020
55c444e
fixed
jacobaustin123 Jan 29, 2020
9d6762b
merged with master
jacobaustin123 Feb 11, 2020
d774b15
updated docs
jacobaustin123 Feb 11, 2020
64e70b4
linting
jacobaustin123 Feb 11, 2020
03d6573
fixed tests
jacobaustin123 Feb 11, 2020
9f5209e
merged
jacobaustin123 Mar 22, 2020
81c0172
reformatted
jacobaustin123 Mar 22, 2020
6d0d725
fixed linting issue
jacobaustin123 Mar 22, 2020
ef72542
fixed conflicts
jacobaustin123 Mar 27, 2020
0aabf56
fixed formatting
jacobaustin123 Mar 27, 2020
210df50
ENH: made sort_index apply the key to each level separately
jacobaustin123 Mar 28, 2020
b40a963
fixed a bug with duplicate names
jacobaustin123 Mar 28, 2020
90e2cfe
fixed strange bug with duplicate column names
jacobaustin123 Mar 28, 2020
8e12404
Merge branch 'master' of http://github.com/pandas-dev/pandas
jacobaustin123 Mar 28, 2020
447c48f
fixed bug
jacobaustin123 Mar 28, 2020
46171f0
fixed linting
jacobaustin123 Mar 28, 2020
a44a999
fixed linting issues
jacobaustin123 Mar 28, 2020
94b795c
disabled tests temporarily
jacobaustin123 Mar 28, 2020
6e651c0
fixed linting
jacobaustin123 Mar 28, 2020
5a92484
Merge branch 'master' of https://github.com/pandas-dev/pandas
jacobaustin123 Mar 31, 2020
fbdfc1e
reverted changes due to 33134
jacobaustin123 Mar 31, 2020
c56dbd6
updated documentation
jacobaustin123 Apr 1, 2020
77f44bf
fixed merge conflict
jacobaustin123 Apr 7, 2020
2106d86
Merge branch 'master' of https://github.com/pandas-dev/pandas
jacobaustin123 Apr 7, 2020
620f57a
updated docs
jacobaustin123 Apr 7, 2020
6a5bc32
fixed linting issue
jacobaustin123 Apr 7, 2020
5b244fb
try to recover from invalid type in output
jacobaustin123 Apr 7, 2020
6f15e66
fixed linting issue
jacobaustin123 Apr 7, 2020
7d2037b
added more tests
jacobaustin123 Apr 7, 2020
5048944
added some more tests
jacobaustin123 Apr 8, 2020
3b2d176
merged
jacobaustin123 Apr 10, 2020
0e239c8
fixed linting issue
jacobaustin123 Apr 10, 2020
bc44d0d
major documentation additions, removed key for Categorical
jacobaustin123 Apr 10, 2020
07d903c
doc linting issue
jacobaustin123 Apr 10, 2020
ecdbf4c
another linting fix
jacobaustin123 Apr 10, 2020
c376a74
fixed linting actually
jacobaustin123 Apr 10, 2020
f5e5808
moved apply_key to sorting.py
jacobaustin123 Apr 11, 2020
1058839
fixed tests
jacobaustin123 Apr 11, 2020
c87a527
satisfied mypy
jacobaustin123 Apr 11, 2020
e6026d6
fixed isort issues
jacobaustin123 Apr 11, 2020
ab0b887
fixed a doc issue
jacobaustin123 Apr 11, 2020
364cc5e
wow linting is hard
jacobaustin123 Apr 11, 2020
8db09d0
updated whatsnew
jacobaustin123 Apr 13, 2020
7477fd1
Merge branch 'master' of https://github.com/pandas-dev/pandas
jacobaustin123 Apr 13, 2020
1d0319c
cleaned up sorting.py
jacobaustin123 Apr 13, 2020
1f60689
fixed indentation
jacobaustin123 Apr 13, 2020
2957e60
removed trailing whitespace
jacobaustin123 Apr 13, 2020
7c6c2f0
linting
jacobaustin123 Apr 13, 2020
ad745c4
fixed small bug with datetimelike, updated docs
jacobaustin123 Apr 13, 2020
3ad3358
fixed trailing whitespace
jacobaustin123 Apr 13, 2020
e87a9a9
Merge branch 'master' of https://github.com/pandas-dev/pandas
jacobaustin123 Apr 13, 2020
a5d5c6d
reverted and updated documentation
jacobaustin123 Apr 27, 2020
56f73ba
merged and updated
jacobaustin123 Apr 27, 2020
4250e31
fixed linting issue and added comments
jacobaustin123 Apr 27, 2020
4d5ba53
fixed small issue in tests
jacobaustin123 Apr 27, 2020
45 changes: 45 additions & 0 deletions doc/source/user_guide/basics.rst
@@ -1781,6 +1781,24 @@ used to sort a pandas object by its index levels.
   # Series
   unsorted_df['three'].sort_index()

.. _basics.sort_index_key:

.. versionadded:: 1.1.0

Sorting by index also supports a ``key`` parameter that takes a callable
function to apply to the index being sorted. For `MultiIndex` objects,
the key is applied per-level to the levels specified by `level`.

.. ipython:: python

   s1 = pd.DataFrame({
       "a": ['B', 'a', 'C'],
       "b": [1, 2, 3],
       "c": [2, 3, 4]
   }).set_index(list("ab"))
   s1.sort_index(level="a")
   s1.sort_index(level="a", key=lambda idx: idx.str.lower())
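
A rough illustration of the per-level behavior described above (not part of the PR's diff), assuming pandas 1.1 or later. The index, level names, and data are made up for the example. When `level` is omitted, the key is called on every level, so it has to be valid for each of them.

```python
import pandas as pd

# Hypothetical two-level index where both levels hold mixed-case strings.
idx = pd.MultiIndex.from_tuples(
    [("B", "y"), ("a", "Z"), ("a", "x")], names=["outer", "inner"]
)
ser = pd.Series([1, 2, 3], index=idx)

# Without a key, uppercase labels sort before lowercase ones.
plain = ser.sort_index()

# With a key, each level is lowercased separately before comparison;
# the labels in the result are the original, untransformed ones.
folded = ser.sort_index(key=lambda level: level.str.lower())

print(plain.tolist())
print(folded.index.tolist())
```

The key only changes the order; the returned object still carries the original labels.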

.. _basics.sort_values:

By values
@@ -1813,6 +1831,33 @@ argument:
   s.sort_values()
   s.sort_values(na_position='first')

.. _basics.sort_value_key:

.. versionadded:: 1.1.0

Sorting also supports a ``key`` parameter that takes a callable function
to apply to the values being sorted.

.. ipython:: python

   s1 = pd.Series(['B', 'a', 'C'])
   s1.sort_values()
   s1.sort_values(key=lambda x: x.str.lower())
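
The key is not limited to string methods. As a small sketch with made-up data (assuming pandas 1.1 or later), any vectorized transformation that returns a result of the same length can drive the ordering, for instance sorting by magnitude:

```python
import pandas as pd

ser = pd.Series([-3, 1, -2])

# The key sees the whole Series at once and returns a same-length Series.
by_magnitude = ser.sort_values(key=lambda values: values.abs())

# The output keeps the original (signed) values; only the order changes.
print(by_magnitude.tolist())
```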

`key` will be given the :class:`Series` of values and should return a ``Series``
or array of the same shape with the transformed values. For `DataFrame` objects,
the key is applied per column, so the key should still expect a Series and return
a Series, e.g.

.. ipython:: python

   df = pd.DataFrame({"a": ['B', 'a', 'C'], "b": [1, 2, 3]})
   df.sort_values(by='a')
   df.sort_values(by='a', key=lambda col: col.str.lower())

The name or type of each column can be used to apply different functions to
different columns.
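
One possible way to dispatch on the column name or dtype is sketched here. The helper `per_column_key` and the data are hypothetical, chosen only to illustrate the idea, and the sketch assumes pandas 1.1 or later.

```python
import pandas as pd

df = pd.DataFrame({"a": ["B", "a", "b", "A"], "b": [-1, 3, 2, -2]})


def per_column_key(col: pd.Series) -> pd.Series:
    # Hypothetical helper: pick a transformation from the column's name or dtype.
    if col.name == "a":
        return col.str.lower()
    if pd.api.types.is_numeric_dtype(col):
        return col.abs()
    return col


# "a" is compared case-insensitively, "b" by absolute value.
out = df.sort_values(by=["a", "b"], key=per_column_key)
print(out)
```

The key is called once per column listed in `by`, and each call receives a `Series` whose `name` is the column label.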

.. _basics.sort_indexes_and_values:

By indexes and values
38 changes: 38 additions & 0 deletions doc/source/whatsnew/v1.1.0.rst
@@ -36,6 +36,44 @@ For example:
   ser["2014"]
   ser.loc["May 2015"]

.. _whatsnew_110.key_sorting:

Sorting with keys
^^^^^^^^^^^^^^^^^

We've added a ``key`` argument to the DataFrame and Series sorting methods, including
:meth:`DataFrame.sort_values`, :meth:`DataFrame.sort_index`, :meth:`Series.sort_values`,
and :meth:`Series.sort_index`. The ``key`` can be any callable function which is applied
to the each column of a DataFrame before sorting is performed (:issue:`27237`). See
Review comment (Member), suggested change (original line, then proposed line):
to the each column of a DataFrame before sorting is performed (:issue:`27237`). See
to each column of a DataFrame before sorting is performed (:issue:`27237`). See

Apart from the typo, I find "each column" a bit confusing, as it is of course only applied to those columns that are used for sorting?

Reply (Contributor Author): See if the new version is clearer.

Reply (Member): New version looks good!

:ref:`sort_values with keys <basics.sort_value_key>` and :ref:`sort_index with keys
<basics.sort_index_key>` for more information.

.. ipython:: python

   s = pd.Series(['C', 'a', 'B'])
   s.sort_values()


Note how this is sorted with capital letters first. If we apply the :meth:`Series.str.lower` method, we get

.. ipython:: python

   s.sort_values(key=lambda x: x.str.lower())
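
To make "applied before sorting is performed" concrete, here is a sketch (assuming pandas 1.1 or later) that compares the keyed sort with a manual equivalent: sort the transformed values, then select the original values in that order. The variable names are illustrative only.

```python
import pandas as pd

s = pd.Series(["C", "a", "B"])

# Keyed sort: ordering is decided by the lowercased values.
with_key = s.sort_values(key=lambda x: x.str.lower())

# Manual equivalent: sort the transformed Series, then reuse its index order.
manual = s.loc[s.str.lower().sort_values().index]

print(with_key.tolist())
print(with_key.equals(manual))
```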


When applied to a `DataFrame`, the key is applied per-column to all columns or a subset if
`by` is specified, e.g.

.. ipython:: python

   df = pd.DataFrame({'a': ['C', 'C', 'a', 'a', 'B', 'B'],
                      'b': [1, 2, 3, 4, 5, 6]})
   df.sort_values(by=['a'], key=lambda col: col.str.lower())


For more details, see examples and documentation in :meth:`DataFrame.sort_values`,
:meth:`Series.sort_values`, and :meth:`~DataFrame.sort_index`.

.. _whatsnew_110.timestamp_fold_support:

Fold argument support in Timestamp constructor
6 changes: 6 additions & 0 deletions pandas/_typing.py
@@ -75,7 +75,13 @@

# to maintain type information across generic functions and parametrization
T = TypeVar("T")

# used in decorators to preserve the signature of the function it decorates
# see https://mypy.readthedocs.io/en/stable/generics.html#declaring-decorators
FuncType = Callable[..., Any]
F = TypeVar("F", bound=FuncType)

# types of vectorized key functions for DataFrame::sort_values and
# DataFrame::sort_index, among others
ValueKeyFunc = Optional[Callable[["Series"], Union["Series", AnyArrayLike]]]
IndexKeyFunc = Optional[Callable[["Index"], Union["Index", AnyArrayLike]]]
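
As a sketch of how such an alias might be used by calling code: the alias below is a simplified local restatement (an assumption for illustration, using `np.ndarray` in place of `AnyArrayLike` and a concrete `pd.Series` instead of a forward reference), and `sort_with_key` is a hypothetical wrapper, not a pandas function.

```python
from typing import Callable, Optional, Union

import numpy as np
import pandas as pd

# Simplified, local stand-in for the alias added in pandas/_typing.py.
ValueKeyFunc = Optional[Callable[[pd.Series], Union[pd.Series, np.ndarray]]]


def sort_with_key(values: pd.Series, key: ValueKeyFunc = None) -> pd.Series:
    # Hypothetical wrapper: None means "no key", matching the Optional alias.
    return values.sort_values(key=key)


data = pd.Series([-3, 1, 2])
print(sort_with_key(data).tolist())
print(sort_with_key(data, key=lambda s: s.abs()).tolist())
```

Wrapping the callable type in `Optional` lets `key=None` remain the default without a separate overload.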
9 changes: 9 additions & 0 deletions pandas/conftest.py
@@ -1221,3 +1221,12 @@ def tick_classes(request):
    Fixture for Tick based datetime offsets available for a time series.
    """
    return request.param


@pytest.fixture(params=[None, lambda x: x])
def sort_by_key(request):
    """
    Simple fixture for testing keys in sorting methods.
    Tests None (no key) and the identity key.
    """
    return request.param
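
A test consuming this fixture might look roughly like the following. The function name and data are hypothetical; under pytest the `sort_by_key` argument would be injected once per fixture parameter, while here it is called by hand so the sketch runs on its own.

```python
import pandas as pd


def check_identity_key_matches_default(sort_by_key):
    # Both None and the identity key should give the same result as no key.
    ser = pd.Series([3, 1, 2])
    result = ser.sort_values(key=sort_by_key)
    expected = ser.sort_values()
    pd.testing.assert_series_equal(result, expected)


# Manual stand-in for pytest's parametrization over [None, lambda x: x].
for param in (None, lambda x: x):
    check_identity_key_matches_default(param)
```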
4 changes: 3 additions & 1 deletion pandas/core/arrays/categorical.py
@@ -1532,7 +1532,9 @@ def argsort(self, ascending=True, kind="quicksort", **kwargs):
        """
        return super().argsort(ascending=ascending, kind=kind, **kwargs)

    def sort_values(self, inplace=False, ascending=True, na_position="last"):
    def sort_values(
        self, inplace: bool = False, ascending: bool = True, na_position: str = "last",
    ):
        """
        Sort the Categorical by category value returning a new
        Categorical by default.
47 changes: 42 additions & 5 deletions pandas/core/frame.py
@@ -47,9 +47,11 @@
    Axis,
    Dtype,
    FilePathOrBuffer,
    IndexKeyFunc,
    Label,
    Level,
    Renamer,
    ValueKeyFunc,
)
from pandas.compat import PY37
from pandas.compat._optional import import_optional_dependency
@@ -139,6 +141,7 @@
)
from pandas.core.ops.missing import dispatch_fill_zeros
from pandas.core.series import Series
from pandas.core.sorting import ensure_key_mapped

from pandas.io.common import get_filepath_or_buffer
from pandas.io.formats import console, format as fmt
@@ -4935,10 +4938,10 @@ def f(vals):

    # ----------------------------------------------------------------------
    # Sorting

    # TODO: Just move the sort_values doc here.
    @Substitution(**_shared_doc_kwargs)
    @Appender(NDFrame.sort_values.__doc__)
    def sort_values(
    def sort_values(  # type: ignore[override] # NOQA # issue 27237
        self,
        by,
        axis=0,
@@ -4947,6 +4950,7 @@ def sort_values(
        kind="quicksort",
        na_position="last",
        ignore_index=False,
        key: ValueKeyFunc = None,
    ):
        inplace = validate_bool_kwarg(inplace, "inplace")
        axis = self._get_axis_number(axis)
@@ -4961,19 +4965,30 @@ def sort_values(
            from pandas.core.sorting import lexsort_indexer

            keys = [self._get_label_or_level_values(x, axis=axis) for x in by]
            indexer = lexsort_indexer(keys, orders=ascending, na_position=na_position)

            # need to rewrap columns in Series to apply key function
            if key is not None:
                keys = [Series(k, name=name) for (k, name) in zip(keys, by)]

            indexer = lexsort_indexer(
                keys, orders=ascending, na_position=na_position, key=key
            )
            indexer = ensure_platform_int(indexer)
        else:
            from pandas.core.sorting import nargsort

            by = by[0]
            k = self._get_label_or_level_values(by, axis=axis)

            # need to rewrap column in Series to apply key function
            if key is not None:
                k = Series(k, name=by)

            if isinstance(ascending, (tuple, list)):
                ascending = ascending[0]

            indexer = nargsort(
                k, kind=kind, ascending=ascending, na_position=na_position
                k, kind=kind, ascending=ascending, na_position=na_position, key=key
            )

        new_data = self._mgr.take(
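
The general shape of this hunk (map the column through the key, compute an indexer from the mapped values, then take rows by that indexer) can be sketched in plain Python. This is only an illustration of the idea, not the pandas `nargsort` or `lexsort_indexer` implementation; `argsort_with_key` is a hypothetical name and it ignores `ascending`, `kind`, and `na_position`.

```python
from typing import Callable, List, Optional, Sequence


def argsort_with_key(
    values: Sequence, key: Optional[Callable[[Sequence], Sequence]] = None
) -> List[int]:
    # Apply the vectorized key to the whole column, if one was given.
    mapped = values if key is None else key(values)
    if len(mapped) != len(values):
        raise ValueError("key must return a result of the same length as its input")
    # sorted() is stable, so ties keep their original relative order.
    return sorted(range(len(mapped)), key=lambda i: mapped[i])


column = ["B", "a", "C"]
plain_indexer = argsort_with_key(column)
keyed_indexer = argsort_with_key(column, key=lambda col: [v.lower() for v in col])

# "Taking" by the indexer reorders the original, untransformed values.
print([column[i] for i in plain_indexer])
print([column[i] for i in keyed_indexer])
```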
@@ -4999,6 +5014,7 @@ def sort_index(
        na_position: str = "last",
        sort_remaining: bool = True,
        ignore_index: bool = False,
        key: IndexKeyFunc = None,
    ):
        """
        Sort object by labels (along an axis).
@@ -5034,6 +5050,16 @@ def sort_index(

            .. versionadded:: 1.0.0

        key : callable, optional
            If not None, apply the key function to the index values
            before sorting. This is similar to the `key` argument in the
            builtin :meth:`sorted` function, with the notable difference that
            this `key` function should be *vectorized*. It should expect an
            ``Index`` and return an ``Index`` of the same shape. For MultiIndex
            inputs, the key is applied *per level*.

            .. versionadded:: 1.1.0

        Returns
        -------
        DataFrame
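
The difference from the builtin `sorted` key that the docstring mentions can be seen in a short sketch (assuming pandas 1.1 or later; the labels and the `lower_key` helper are made up): the builtin calls its key once per element, while the pandas key receives the whole `Index`.

```python
import pandas as pd

labels = ["d", "C", "b", "A"]

# Builtin sorted(): the key is called once per element.
builtin_order = sorted(labels, key=str.lower)

# pandas: the key is called with an Index and must return
# something of the same length.
seen = []


def lower_key(index):
    seen.append(isinstance(index, pd.Index))
    return index.str.lower()


df = pd.DataFrame({"a": [1, 2, 3, 4]}, index=labels)
result = df.sort_index(key=lower_key)

print(builtin_order)
print(result.index.tolist())
```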
@@ -5067,6 +5093,17 @@ def sort_index(
        100  1
        29   2
        1    4

        A key function can be specified which is applied to the index before
        sorting. For a ``MultiIndex`` this is applied to each level separately.

        >>> df = pd.DataFrame({"a": [1, 2, 3, 4]}, index=['A', 'b', 'C', 'd'])
        >>> df.sort_index(key=lambda x: x.str.lower())
           a
        A  1
        b  2
        C  3
        d  4
        """
        # TODO: this can be combined with Series.sort_index impl as
        # almost identical
@@ -5075,12 +5112,12 @@ def sort_index(

        axis = self._get_axis_number(axis)
        labels = self._get_axis(axis)
        labels = ensure_key_mapped(labels, key, levels=level)

        # make sure that the axis is lexsorted to start
        # if not we need to reconstruct to get the correct indexer
        labels = labels._sort_levels_monotonic()
        if level is not None:

            new_axis, indexer = labels.sortlevel(
                level, ascending=ascending, sort_remaining=sort_remaining
            )
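
The diff does not show the body of `ensure_key_mapped`, so the following is only a guess at the general idea, written against public pandas APIs: map the key over a flat index directly, or over the selected levels of a `MultiIndex`, and rebuild the index from the mapped arrays. `map_key_over_levels` is a hypothetical stand-in, not the pandas-internal function.

```python
import pandas as pd


def map_key_over_levels(index, key, levels=None):
    """Hypothetical stand-in; not the pandas-internal ensure_key_mapped."""
    if key is None:
        return index
    if not isinstance(index, pd.MultiIndex):
        mapped = pd.Index(key(index))
        if len(mapped) != len(index):
            raise ValueError("key must return a result of the same length")
        return mapped
    if levels is None:
        positions = set(range(index.nlevels))
    else:
        wanted = levels if isinstance(levels, (list, tuple)) else [levels]
        names = list(index.names)
        # Accept either level names or integer positions.
        positions = {names.index(lv) if lv in names else lv for lv in wanted}
    arrays = [
        key(index.get_level_values(i)) if i in positions else index.get_level_values(i)
        for i in range(index.nlevels)
    ]
    return pd.MultiIndex.from_arrays(arrays, names=index.names)


idx = pd.MultiIndex.from_tuples([("B", 1), ("a", 2)], names=["a", "b"])
mapped = map_key_over_levels(idx, lambda level: level.str.lower(), levels="a")
print(mapped.tolist())
```

Only the mapped index is used to compute the sort order; the frame itself keeps its original labels.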