ENH: Support sorting frames by a combo of columns and index levels (GH 14353) #17361

Merged
merged 18 commits into from
Jan 5, 2018
96 changes: 69 additions & 27 deletions doc/source/basics.rst
Expand Up @@ -226,11 +226,11 @@ We can also do elementwise :func:`divmod`:
Missing data / operations with fill values
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In Series and DataFrame, the arithmetic functions have the option of inputting
a *fill_value*, namely a value to substitute when at most one of the values at
a location are missing. For example, when adding two DataFrame objects, you may
wish to treat NaN as 0 unless both DataFrames are missing that value, in which
case the result will be NaN (you can later replace NaN with some other value
In Series and DataFrame, the arithmetic functions have the option of inputting
a *fill_value*, namely a value to substitute when at most one of the values at
a location are missing. For example, when adding two DataFrame objects, you may
wish to treat NaN as 0 unless both DataFrames are missing that value, in which
case the result will be NaN (you can later replace NaN with some other value
using ``fillna`` if you wish).
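
As a minimal sketch of the ``fill_value`` behaviour described above (the frames ``left`` and ``right`` and their contents are illustrative assumptions, not data from the pandas docs), the three cases of both present, one missing, and both missing can be seen side by side:

```python
import numpy as np
import pandas as pd

# Illustrative frames with NaN in different positions.
left = pd.DataFrame({"x": [1.0, np.nan, np.nan]})
right = pd.DataFrame({"x": [10.0, 20.0, np.nan]})

# Row 0: both values present, so they are simply added.
# Row 1: only `right` has a value, so the missing `left` value is treated as 0.
# Row 2: both values are missing, so the result stays NaN.
total = left.add(right, fill_value=0)
print(total)

# Any NaN that remains can be replaced afterwards with fillna.
filled = total.fillna(0)
print(filled)
```

Only the last row needs ``fillna``, because ``fill_value`` never applies when both inputs are missing at the same location.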

.. ipython:: python
Expand Down Expand Up @@ -260,8 +260,8 @@ arithmetic operations described above:
df.gt(df2)
df2.ne(df)

These operations produce a pandas object of the same type as the left-hand-side
input that is of dtype ``bool``. These ``boolean`` objects can be used in
These operations produce a pandas object of the same type as the left-hand-side
input that is of dtype ``bool``. These ``boolean`` objects can be used in
indexing operations, see the section on :ref:`Boolean indexing<indexing.boolean>`.
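
A short sketch of that point, assuming two small hypothetical frames ``a`` and ``b``: the comparison returns a ``DataFrame`` of ``bool`` columns, and one of those columns can then be used as a boolean indexer.

```python
import pandas as pd

# Hypothetical frames with matching labels.
a = pd.DataFrame({"x": [1, 5, 3], "y": [4, 2, 6]})
b = pd.DataFrame({"x": [2, 5, 1], "y": [4, 1, 9]})

# Elementwise "greater than"; the result has the same type as `a`
# and every column has dtype bool.
mask = a.gt(b)
print(mask.dtypes)

# Use one boolean column to select the rows of `a` where a.x > b.x.
rows = a[mask["x"]]
print(rows)
```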

.. _basics.reductions:
Expand Down Expand Up @@ -452,7 +452,7 @@ So, for instance, to reproduce :meth:`~DataFrame.combine_first` as above:
Descriptive statistics
----------------------

There exists a large number of methods for computing descriptive statistics and
There exists a large number of methods for computing descriptive statistics and
other related operations on :ref:`Series <api.series.stats>`, :ref:`DataFrame
<api.dataframe.stats>`, and :ref:`Panel <api.panel.stats>`. Most of these
are aggregations (hence producing a lower-dimensional result) like
Expand Down Expand Up @@ -540,7 +540,7 @@ will exclude NAs on Series input by default:
np.mean(df['one'])
np.mean(df['one'].values)

:meth:`Series.nunique` will return the number of unique non-NA values in a
:meth:`Series.nunique` will return the number of unique non-NA values in a
Series:

.. ipython:: python
Expand Down Expand Up @@ -852,7 +852,7 @@ Aggregation API
The aggregation API allows one to express possibly multiple aggregation operations in a single concise way.
This API is similar across pandas objects, see :ref:`groupby API <groupby.aggregate>`, the
:ref:`window functions API <stats.aggregate>`, and the :ref:`resample API <timeseries.aggregate>`.
The entry point for aggregation is :meth:`DataFrame.aggregate`, or the alias
The entry point for aggregation is :meth:`DataFrame.aggregate`, or the alias
:meth:`DataFrame.agg`.

We will use a similar starting frame from above:
Expand All @@ -864,8 +864,8 @@ We will use a similar starting frame from above:
tsdf.iloc[3:7] = np.nan
tsdf

Using a single function is equivalent to :meth:`~DataFrame.apply`. You can also
pass named methods as strings. These will return a ``Series`` of the aggregated
Using a single function is equivalent to :meth:`~DataFrame.apply`. You can also
pass named methods as strings. These will return a ``Series`` of the aggregated
output:

.. ipython:: python
Expand All @@ -887,7 +887,7 @@ Single aggregations on a ``Series`` this will return a scalar value:
Aggregating with multiple functions
+++++++++++++++++++++++++++++++++++

You can pass multiple aggregation arguments as a list.
You can pass multiple aggregation arguments as a list.
The results of each of the passed functions will be a row in the resulting ``DataFrame``.
These are naturally named from the aggregation function.
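
For instance, in the following sketch (``frame`` is an illustrative stand-in, not the ``tsdf`` used in the docs), passing two function names yields a two-row result whose index labels are those names:

```python
import numpy as np
import pandas as pd

# Illustrative frame with one missing value.
frame = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [4.0, np.nan, 6.0]})

# Each function in the list becomes one row of the result,
# labelled with the function's name.
result = frame.agg(["sum", "mean"])
print(result)
```

Missing values are skipped by these aggregations by default, so column ``B`` is summarised from its two non-NA entries.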

Expand Down Expand Up @@ -1430,7 +1430,7 @@ Series can also be used:
df.rename(columns={'one': 'foo', 'two': 'bar'},
index={'a': 'apple', 'b': 'banana', 'd': 'durian'})

If the mapping doesn't include a column/index label, it isn't renamed. Note that
If the mapping doesn't include a column/index label, it isn't renamed. Note that
extra labels in the mapping don't throw an error.
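
A minimal sketch of both statements, using a hypothetical two-column frame: labels absent from the mapping are left alone, and a mapping key that matches no label (``'missing'`` here) is silently ignored.

```python
import pandas as pd

# Hypothetical frame for illustration.
frame = pd.DataFrame({"one": [1, 2], "two": [3, 4]}, index=["a", "b"])

# 'two' and 'b' are not in the mappings, so they keep their names.
# 'missing' matches no column and does not raise an error.
renamed = frame.rename(columns={"one": "foo", "missing": "unused"},
                       index={"a": "apple"})
print(renamed)
```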

.. versionadded:: 0.21.0
Expand Down Expand Up @@ -1740,19 +1740,26 @@ description.
Sorting
-------

There are two obvious kinds of sorting that you may be interested in: sorting
by label and sorting by actual values.
Pandas supports three kinds of sorting: sorting by index labels,
sorting by column values, and sorting by a combination of both.

.. _basics.sort_index:

By Index
~~~~~~~~

The primary method for sorting axis
labels (indexes) are the ``Series.sort_index()`` and the ``DataFrame.sort_index()`` methods.
The :meth:`Series.sort_index` and :meth:`DataFrame.sort_index` methods are
used to sort a pandas object by its index levels.

.. ipython:: python

df = pd.DataFrame({'one' : pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
'two' : pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
'three' : pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})

unsorted_df = df.reindex(index=['a', 'd', 'c', 'b'],
columns=['three', 'two', 'one'])
unsorted_df

# DataFrame
unsorted_df.sort_index()
Expand All @@ -1762,20 +1769,22 @@ labels (indexes) are the ``Series.sort_index()`` and the ``DataFrame.sort_index(
# Series
unsorted_df['three'].sort_index()

.. _basics.sort_values:

By Values
~~~~~~~~~

The :meth:`Series.sort_values` and :meth:`DataFrame.sort_values` are the entry points for **value** sorting (i.e. the values in a column or row).
:meth:`DataFrame.sort_values` can accept an optional ``by`` argument for ``axis=0``
which will use an arbitrary vector or a column name of the DataFrame to
determine the sort order:
The :meth:`Series.sort_values` method is used to sort a ``Series`` by its values. The
:meth:`DataFrame.sort_values` method is used to sort a ``DataFrame`` by its column or row values.
The ``by`` parameter to :meth:`DataFrame.sort_values` specifies one or more columns
used to determine the sorted order.

.. ipython:: python

df1 = pd.DataFrame({'one':[2,1,1,1],'two':[1,3,2,4],'three':[5,4,3,2]})
df1.sort_values(by='two')

The ``by`` argument can take a list of column names, e.g.:
The ``by`` parameter can take a list of column names, e.g.:

.. ipython:: python

Expand All @@ -1790,6 +1799,39 @@ argument:
s.sort_values()
s.sort_values(na_position='first')

.. _basics.sort_indexes_and_values:

By Indexes and Values
~~~~~~~~~~~~~~~~~~~~~

.. versionadded:: 0.23.0

Strings passed as the ``by`` parameter to :meth:`DataFrame.sort_values` may
refer to either columns or index level names.

.. ipython:: python

# Build MultiIndex
idx = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('a', 2),
('b', 2), ('b', 1), ('b', 1)])
idx.names = ['first', 'second']

# Build DataFrame
df_multi = pd.DataFrame({'A': np.arange(6, 0, -1)},
index=idx)
df_multi

Sort by 'second' (index) and 'A' (column):

.. ipython:: python

df_multi.sort_values(by=['second', 'A'])

.. note::

If a string matches both a column name and an index level name then a
warning is issued and the column takes precedence. This will result in an
ambiguity error in a future version.
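
To make the benefit concrete, the sketch below compares sorting directly by an index level and a column with the ``reset_index`` / ``set_index`` round trip that was otherwise needed. It rebuilds the ``df_multi`` frame from this section; the names ``direct`` and ``roundtrip`` are illustrative, and it assumes a pandas version that includes this feature (0.23.0 or later).

```python
import numpy as np
import pandas as pd

# Same frame as in the section above.
idx = pd.MultiIndex.from_tuples(
    [("a", 1), ("a", 2), ("a", 2), ("b", 2), ("b", 1), ("b", 1)],
    names=["first", "second"],
)
df_multi = pd.DataFrame({"A": np.arange(6, 0, -1)}, index=idx)

# Sort by the 'second' index level, then by column 'A'.
direct = df_multi.sort_values(by=["second", "A"])

# Equivalent round trip: move the levels into columns, sort, restore the index.
roundtrip = (df_multi.reset_index()
             .sort_values(by=["second", "A"])
             .set_index(["first", "second"]))

print(direct)
print(direct["A"].tolist() == roundtrip["A"].tolist())
```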

.. _basics.searchsorted:

Expand Down Expand Up @@ -1881,7 +1923,7 @@ The main types stored in pandas objects are ``float``, ``int``, ``bool``,
``int64`` and ``int32``. See :ref:`Series with TZ <timeseries.timezone_series>`
for more detail on ``datetime64[ns, tz]`` dtypes.

A convenient :attr:`~DataFrame.dtypes` attribute for DataFrame returns a Series
A convenient :attr:`~DataFrame.dtypes` attribute for DataFrame returns a Series
with the data type of each column.

.. ipython:: python
Expand All @@ -1902,8 +1944,8 @@ On a ``Series`` object, use the :attr:`~Series.dtype` attribute.

dft['A'].dtype

If a pandas object contains data with multiple dtypes *in a single column*, the
dtype of the column will be chosen to accommodate all of the data types
If a pandas object contains data with multiple dtypes *in a single column*, the
dtype of the column will be chosen to accommodate all of the data types
(``object`` is the most general).
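
As a small illustration (the three ``Series`` below are made-up examples), mixing integers with a float upcasts the column to a float dtype, while adding a string forces the most general ``object`` dtype:

```python
import pandas as pd

# All integers: an integer dtype is kept.
ints = pd.Series([1, 2, 3])

# Integers and a float: upcast to a float dtype that can hold every value.
mixed_num = pd.Series([1, 2.5, 3])

# Numbers and a string: only object can accommodate all of the values.
mixed_obj = pd.Series([1, 2.5, "three"])

print(ints.dtype, mixed_num.dtype, mixed_obj.dtype)
```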

.. ipython:: python
Expand Down Expand Up @@ -1941,7 +1983,7 @@ defaults
~~~~~~~~

By default integer types are ``int64`` and float types are ``float64``,
*regardless* of platform (32-bit or 64-bit).
*regardless* of platform (32-bit or 64-bit).
The following will all result in ``int64`` dtypes.

.. ipython:: python
Expand Down
26 changes: 26 additions & 0 deletions doc/source/whatsnew/v0.23.0.txt
Expand Up @@ -62,6 +62,32 @@ levels <merging.merge_on_columns_and_levels>` documentation section.

left.merge(right, on=['key1', 'key2'])

.. _whatsnew_0230.enhancements.sort_by_columns_and_levels:

Sorting by a combination of columns and index levels
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Strings passed to :meth:`DataFrame.sort_values` as the ``by`` parameter may
now refer to either column names or index level names. This enables sorting
``DataFrame`` instances by a combination of index levels and columns without
resetting indexes. See the :ref:`Sorting by Indexes and Values
<basics.sort_indexes_and_values>` documentation section.
(:issue:`14353`)

.. ipython:: python

# Build MultiIndex
idx = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('a', 2),
('b', 2), ('b', 1), ('b', 1)])
idx.names = ['first', 'second']

# Build DataFrame
df_multi = pd.DataFrame({'A': np.arange(6, 0, -1)},
index=idx)
df_multi

# Sort by 'second' (index) and 'A' (column)
df_multi.sort_values(by=['second', 'A'])

.. _whatsnew_0230.enhancements.ran_inf:

Expand Down
30 changes: 14 additions & 16 deletions pandas/core/frame.py
Expand Up @@ -113,7 +113,15 @@
axes_single_arg="{0 or 'index', 1 or 'columns'}",
optional_by="""
by : str or list of str
Name or list of names which refer to the axis items.""",
Name or list of names to sort by.

- if `axis` is 0 or `'index'` then `by` may contain index
levels and/or column labels
- if `axis` is 1 or `'columns'` then `by` may contain column
levels and/or index labels

.. versionchanged:: 0.23.0
Allow specifying index or column level names.""",
versionadded_to_excel='',
optional_labels="""labels : array-like, optional
New labels / index to conform the axis specified by 'axis' to.""",
Expand Down Expand Up @@ -3623,7 +3631,7 @@ def sort_values(self, by, axis=0, ascending=True, inplace=False,
kind='quicksort', na_position='last'):
inplace = validate_bool_kwarg(inplace, 'inplace')
axis = self._get_axis_number(axis)
other_axis = 0 if axis == 1 else 1
stacklevel = 2 # Number of stack levels from df.sort_values

if not isinstance(by, list):
by = [by]
Expand All @@ -3635,10 +3643,8 @@ def sort_values(self, by, axis=0, ascending=True, inplace=False,

keys = []
for x in by:
k = self.xs(x, axis=other_axis).values
if k.ndim == 2:
raise ValueError('Cannot sort by duplicate column %s' %
str(x))
k = self._get_label_or_level_values(x, axis=axis,
stacklevel=stacklevel)
keys.append(k)
indexer = lexsort_indexer(keys, orders=ascending,
na_position=na_position)
Expand All @@ -3647,17 +3653,9 @@ def sort_values(self, by, axis=0, ascending=True, inplace=False,
from pandas.core.sorting import nargsort

by = by[0]
k = self.xs(by, axis=other_axis).values
if k.ndim == 2:

# try to be helpful
if isinstance(self.columns, MultiIndex):
raise ValueError('Cannot sort by column %s in a '
'multi-index you need to explicitly '
'provide all the levels' % str(by))
Review comment (Member): See comment below as well, so it would be nice to somehow keep this message

k = self._get_label_or_level_values(by, axis=axis,
stacklevel=stacklevel)

raise ValueError('Cannot sort by duplicate column %s' %
str(by))
if isinstance(ascending, (tuple, list)):
ascending = ascending[0]

Expand Down