ENH: Support sorting frames by a combo of columns and index levels (GH 14353) #17361

jonmmease · 2017-08-28T19:38:48Z

This PR implements the changes proposed in #14353. @jorisvandenbossche

closes ENH: Support sorting DataFrames by a combination of columns and index levels #14353
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry
Rework to use utility methods introduced in Support merging DataFrames on a combo of columns and index levels (GH 14355) #17484 after it is merged

gfyoung · 2017-08-28T20:30:13Z

doc/source/basics.rst

@@ -1726,8 +1726,9 @@ Sorting
   The sorting API is substantially changed in 0.17.0, see :ref:`here <whatsnew_0170.api_breaking.sorting>` for these changes.
   In particular, all sorting methods now return a new object by default, and **DO NOT** operate in-place (except by passing ``inplace=True``).

-There are two obvious kinds of sorting that you may be interested in: sorting
-by label and sorting by actual values.
+There are three obvious kinds of sorting that you may be interested in: sorting


I know this word was in the original doc, but I'm a little uneasy about judgmental words like "obvious." I think we can just remove it without impacting much.

Agreed, I've reworded it.

gfyoung · 2017-08-28T20:30:47Z

doc/source/basics.rst

@@ -1776,6 +1783,34 @@ argument:
   s.sort_values()
   s.sort_values(na_position='first')

+By Indexes and Values
+~~~~~~~~~~~~~~~~~~~~~
+.. versionadded:: 0.21


Not sure you need this version-added tag here since you add it below.

I left this versionadded tag and removed the one below in the note. This will help people quickly realize that the feature is new and it should be clear that the note below wouldn't apply without the feature.

gfyoung · 2017-08-28T20:34:40Z

pandas/core/generic.py

@@ -67,7 +67,8 @@
    args_transpose='axes to permute (int or label for object)',
    optional_by="""
        by : str or list of str
-            Name or list of names which refer to the axis items.""")
+            Name or list of names which refer to the axis items or index
+            levels.""")


Not blocking: My OCD really wants you to not "orphan" the "levels" word at the end of this sentence 😄 . However, that's largely a me problem, but if you do happen to come up with a wording that doesn't do this orphaning, by all means go for it.

Heh, now it bothers me too! I reworked it to fit on one line. "Name or list of names matching axis items or index levels."

codecov · 2017-08-29T12:32:57Z

Codecov Report

❗ No coverage uploaded for pull request base (master@36dadd7). Click here to learn what that means.
The diff coverage is 93.33%.

@@            Coverage Diff            @@
##             master   #17361   +/-   ##
=========================================
  Coverage          ?   91.02%           
=========================================
  Files             ?      162           
  Lines             ?    49574           
  Branches          ?        0           
=========================================
  Hits              ?    45123           
  Misses            ?     4451           
  Partials          ?        0

Flag	Coverage Δ
#multiple	`88.8% <93.33%> (?)`
#single	`40.24% <6.66%> (?)`

Impacted Files	Coverage Δ
pandas/core/generic.py	`91.99% <ø> (ø)`
pandas/core/frame.py	`97.68% <93.33%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 36dadd7...10b4e24. Read the comment docs.

codecov · 2017-08-29T12:32:58Z

Codecov Report

Merging #17361 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #17361      +/-   ##
==========================================
- Coverage   91.57%   91.57%   -0.01%     
==========================================
  Files         150      150              
  Lines       48941    48940       -1     
==========================================
- Hits        44817    44816       -1     
  Misses       4124     4124

Flag	Coverage Δ
#multiple	`89.93% <100%> (-0.01%)`	⬇️
#single	`41.75% <12.5%> (ø)`	⬆️

Impacted Files	Coverage Δ
pandas/core/frame.py	`97.62% <100%> (-0.01%)`	⬇️
pandas/core/groupby.py	`92.14% <100%> (ø)`	⬆️
pandas/core/generic.py	`95.95% <100%> (ø)`	⬆️
pandas/core/reshape/merge.py	`94.21% <100%> (ø)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c19bdc9...71748b6. Read the comment docs.

jreback · 2017-11-23T16:37:51Z

@jmmease this looks like an older version of #17484

jonmmease · 2017-11-23T17:01:20Z

@jreback This PR was for the sort_values col&index change. It had some shared utility code with #17484 (the merge col&index change) so i was holding off on updating it until that one was merged.

Do you want to keep this PR closed until #17484 is merged and I make the updates?

jreback · 2017-11-23T22:15:37Z

np can rebase and evaluate after #17484

# Conflicts: # doc/source/basics.rst # doc/source/whatsnew/v0.21.0.txt # pandas/core/frame.py

Update some documentation vocabulary to label/level distinction

- Move to test_sort_values_level_as_str.py - Use fixtures and parametrize - Added axis=1 test cases

jonmmease · 2017-12-02T19:56:16Z

Ok, I just pushed a bunch of updates:

Converted to use the new _get_label_or_level_values utility method introduced in Support merging DataFrames on a combo of columns and index levels (GH 14355) #17484
Moved tests to a new file, updated to use fixtures/parametrize, and added axis=1 test cases
I reviewed the sort_values documentation in the docstrings and in basics.rst and made small language updates throughout. I hope these changes make things a bit clearer and more consistent.
I added a full whatsnew section for 0.22.0

@jreback @jorisvandenbossche @gfyoung @TomAugspurger

jonmmease · 2017-12-06T14:39:20Z

Just following up on this.

@gfyoung if you have time to look this over at some point, I'd appreciate your feedback on these documentation/docstring updates.

@jreback does this approach look alright to you?

Thanks all

jreback · 2017-12-07T01:03:20Z

@jmmease lgtm. @TomAugspurger @jorisvandenbossche if you'd have a look.

jorisvandenbossche

@jmmease Added some comments!

jorisvandenbossche · 2017-12-07T09:51:19Z

doc/source/basics.rst

@@ -1738,19 +1738,26 @@ description.
 Sorting
 -------

-There are two obvious kinds of sorting that you may be interested in: sorting
-by label and sorting by actual values.
+Pandas supports three kinds of sorting: sorting by index levels,


"index levels" -> "index labels" ?

Maybe this isn't correct. But I was trying to be consistent in using level to refer to the names of the indexes or columns (df.index.names, df.columns.names) and label to refer to the values referenced by the levels (df.index.get_level_values(lv1), df.columns.get_level_values(lv1)). Is that in line with accepted vocabulary? @jorisvandenbossche

Personally I think "index level" is the equivalent of "column", and from there it would follow that the equivalent of "column values" is "index labels of one level / index level labels / index level values" (but none of those sound that good) and the equivalent of "column names/labels" would then be "index level name".
But it's not that we do that very consistently currently, or that this is necessarily agreed upon vocabulary. But I think it is in line more or less with what you say?

The main reason I commented it here, is because the next line uses "column values", and as you say "label to refer to the values" then to be consistent it would be "index labels".
And a second reason I commented is that I think "index level" is more complicated, as then a user needs to be familiar with the concept of MultiIndexes, while for this section about sorting, that is not needed.

Thanks for clarifying. After looking at it again, I do agree that "index labels" is the better parallel to "columns values".

jorisvandenbossche · 2017-12-07T09:53:32Z

doc/source/basics.rst

-which will use an arbitrary vector or a column name of the DataFrame to
-determine the sort order:
+The :meth:`Series.sort_values` and :meth:`DataFrame.sort_values` methods are
+used to sort a pandas object by its values.  The optional ``by`` parameter to


I would not use just "values" but indicate "column values" or "columns". Values in the index are also values of the object.

Ah, but that is because it can also be rows I suppose :-) Then I would keep the explicit "column or row values" like before.

I split this into two sentences so that a Series is sorted by "values" and a DataFrame is sorted by "column or row values".

jorisvandenbossche · 2017-12-07T09:57:40Z

doc/source/basics.rst

+                           index=idx)
+   df_multi
+
+   # Sort by 'second' (index) and 'A' (column)


I would put this as normal text between two code blocks instead of as comment (in general that makes it clearer IMO if you want a comment to stand out)

Sounds good

jorisvandenbossche · 2017-12-07T10:38:07Z

pandas/core/frame.py

@@ -113,7 +113,15 @@
    axes_single_arg="{0 or 'index', 1 or 'columns'}",
    optional_by="""
        by : str or list of str
-            Name or list of names which refer to the axis items.""",
+            Name or list of names matching axis levels or off-axis labels.


I find "off-axis labels" very complex. I mean I know what you mean and theoretically it is correct, but I think novice users will not understand this (the previous "axis items" was also not that good).

I am still thinking of a better wording (the annoying thing is that you cannot just say "column labels or index level names" which is probably the common case how this is used)

How about "Name or list of names to sort by"? And then things are clarified below.

Sounds good. Going with "Name or list of names to sort by"

jorisvandenbossche · 2017-12-07T10:42:51Z

pandas/tests/frame/test_sort_values_level_as_str.py

+
+    # Sorting by 'idx' should sort by the idx column and raise a
+    # FutureWarning
+    with tm.assert_produces_warning(FutureWarning, check_stacklevel=False):


Can you see if you can get this working without check_stacklevel=False ?

Sure. Could you offer any insight into what check_stacklevel is actually checking? I've had trouble understanding that @jorisvandenbossche

See #9584 and #10676.
The stacklevel is specified in the FutureWarning in _check_label_or_level_ambiguity, and is currenlty set at 2, which is not correct. This should be higher, depending on how many call steps there are between sort_values and calling that function.
The main problem here will probably be that _check_label_or_level_ambiguity is used in multiple places, so the needed stacklevel might differ for the different usages.

Thanks, that's helpful

The goal is to report the users's line of code that triggered the warning. In our code, when you emit the FutureWarning there's an optional stacklevel parameter that's controls which line of code the error message is reported with. It's supposed to be the number of function calls from user code to our method emitting the warning.

It's not always possible to get that correct, if you have multiple pandas methods calling the method emitting the warning.

Ok, I was able to remove check_stacklevel=False for these tests, and for the related tests for groupby and merge. I added a stacklevel parameter to _check_label_or_level_ambiguity and _get_label_or_level_values that behaves like the stacklevel parameter of warnings.warn itself (stacklevel=1 blames the caller, stacklevel=2 blames the callers parent, etc.).

Does this look like a good way to handle it? @TomAugspurger @jorisvandenbossche

jorisvandenbossche · 2017-12-07T10:50:41Z

pandas/tests/frame/test_sorting.py

            # multi-column 'by' is separate codepath
            df.sort_values(by=['a', 'b'])

        # with multi-index
        # GH4370
        df = DataFrame(np.random.randn(4, 2),
                       columns=MultiIndex.from_tuples([('a', 0), ('a', 1)]))
-        with tm.assert_raises_regex(ValueError, 'levels'):
+        with tm.assert_raises_regex(ValueError, 'not unique'):


I think for this case the old error message was more informative, so it would be nice if we could keep a separate message for the case of incomplete key for MI compared to just non-unique key

Ok, sure. The new error message will read as follows (for the MultiIndex case):

ValueError: The column label 'a' is not unique. For a multi-index, the label must be a tuple with elements corresponding to each level.

Does that sound clear? @jorisvandenbossche

jorisvandenbossche · 2017-12-07T10:51:13Z

pandas/core/frame.py

-                if isinstance(self.columns, MultiIndex):
-                    raise ValueError('Cannot sort by column %s in a '
-                                     'multi-index you need to explicitly '
-                                     'provide all the levels' % str(by))


See comment below as well, so it would be nice to somehow keep this message

TomAugspurger · 2017-12-07T13:16:01Z

doc/source/basics.rst

+.. versionadded:: 0.22.0
+
+Strings passed as the ``by`` parameter to :meth:`DataFrame.sort_values` may
+refer to either columns or index levels.


Same issue here with "index levels". Does changing it to "index names" make sense here, since that's what you're referring to?

Side-question, does this work on a regular (non-multiindex) index?

I see you use "index level names" in the whatsnew. That seems like a good option, if you could use it here too.

Changed to "index level names".

And yes, this does work for both the Index and MultIndex cases.
@TomAugspurger

TomAugspurger · 2017-12-07T13:19:43Z

pandas/core/frame.py

+            - if `axis` is 1 or `'columns'` then `by` may contain column
+              levels and/or index labels
+
+            Support for specify index/column levels was added in


This could be a ..versionmodified:

.. versionmodified:: 0.22.0 Allow specifying index or column level names.

TomAugspurger · 2017-12-07T13:21:27Z

pandas/core/frame.py

@@ -113,7 +113,15 @@
    axes_single_arg="{0 or 'index', 1 or 'columns'}",
    optional_by="""
        by : str or list of str
-            Name or list of names which refer to the axis items.""",
+            Name or list of names matching axis levels or off-axis labels.


How about "Name or list of names to sort by"? And then things are clarified below.

TomAugspurger · 2017-12-07T13:25:38Z

pandas/tests/frame/test_sort_values_level_as_str.py

+
+    # Sorting by 'idx' should sort by the idx column and raise a
+    # FutureWarning
+    with tm.assert_produces_warning(FutureWarning, check_stacklevel=False):


The goal is to report the users's line of code that triggered the warning. In our code, when you emit the FutureWarning there's an optional stacklevel parameter that's controls which line of code the error message is reported with. It's supposed to be the number of function calls from user code to our method emitting the warning.

It's not always possible to get that correct, if you have multiple pandas methods calling the method emitting the warning.

jonmmease · 2017-12-07T17:17:38Z

Thanks for the thorough comments @jorisvandenbossche and @TomAugspurger

I just pushed an update that I believe address all of them. Let me know if I missed anything or if you see any other updates you'd like before this is merged.

jonmmease · 2017-12-11T12:58:50Z

@TomAugspurger @jorisvandenbossche Did these latest updates address your comments? Are there any other changes you'd like to see before this is merged? Thanks!

cc @jreback

jonmmease · 2017-12-18T17:36:13Z

Friendly ping
@TomAugspurger @jorisvandenbossche @jreback

…h_14353 Move whatsnew to 0.23.0 # Conflicts: # doc/source/whatsnew/v0.22.0.txt

jonmmease · 2017-12-26T15:22:12Z

@jreback @jorisvandenbossche @TomAugspurger I noticed the milestone tag for this was moved to 0.23.0. I just moved the whatsnew to this version. Are we in a freeze for merging things into 0.23.0 until 0.22.0 is released?

…h_14353 # Conflicts: # doc/source/basics.rst

jreback · 2017-12-30T18:33:10Z

@jmmease 0.22 is a special release coming out today. so all previous 0.22 is now 0.23.

jreback · 2017-08-29T13:25:52Z

pandas/tests/frame/test_sorting.py

@@ -552,3 +559,185 @@ def test_sort_index_intervalindex(self):
            closed='right')
        result = result.columns.levels[1].categories
        tm.assert_index_equal(result, expected)
+
+    def test_sort_index_and_column(self):
+        # Build MultiIndex


add the issue number as a comment

jreback · 2017-08-29T13:26:25Z

pandas/tests/frame/test_sorting.py

+        expected = df_none.sort_values('inner').set_index('inner')
+        result = df_single.sort_values('inner')
+        assert_frame_equal(result, expected)
+        #   - Descending


jreback · 2017-08-29T13:26:43Z

pandas/tests/frame/test_sorting.py

+
+        result = df_multi.sort_values('inner', ascending=False)
+        assert_frame_equal(result, expected)
+        #   - Descending


jreback · 2017-12-30T18:39:49Z

@jmmease just a couple of minor comments. note that windows CI is currently failing. just wait to repush.

jonmmease · 2017-12-30T18:56:53Z

Thanks @jreback . If I'm not mistaken, It looks like the comments you just made ("blank line", and "issue number") are against an older version of the PR. The new tests were parametrized and moved to test_sort_values_level_as_str.py.

jonmmease · 2018-01-01T01:34:13Z

@jreback The AppVeyor build is passing now, but I'm getting a "The job exceeded the maximum time limit for jobs, and has been terminated." error for most of the Travis builds.

jreback · 2018-01-01T01:47:05Z

yeah travis just started failing
not sure why

jreback · 2018-01-01T14:20:07Z

rebase and push again, fixed some hanging by Travis CI

…h_14353

jonmmease · 2018-01-01T14:24:47Z

Done. Thanks @jreback . Happy New Year BTW!

jreback · 2018-01-01T14:44:35Z

happy new year. will have a final look, but this lgtm. @TomAugspurger

jonmmease · 2018-01-05T17:07:06Z

Thanks for taking a look @TomAugspurger
Anything else @jreback ?

jreback · 2018-01-05T19:15:52Z

thanks @jmmease really nice patch! keep em coming!

jonmmease · 2018-01-05T19:54:03Z

Awesome, thanks for all your time on this @jreback @TomAugspurger, @jorisvandenbossche, and @gfyoung !

Support sorting frames by a combo of columns and index levels GH 14353

6e05de5

gfyoung added API Design Enhancement labels Aug 28, 2017

gfyoung reviewed Aug 28, 2017

View reviewed changes

Documentation cleanup for review

10b4e24

Added whatsnew entry

5712269

jonmmease mentioned this pull request Sep 9, 2017

Support merging DataFrames on a combo of columns and index levels (GH 14355) #17484

Merged

5 tasks

jreback closed this Nov 23, 2017

jreback reopened this Nov 23, 2017

Jon M. Mease added 6 commits December 2, 2017 10:10

Merge remote-tracking branch 'upstream/master' into enh_14353

89a7f5f

# Conflicts: # doc/source/basics.rst # doc/source/whatsnew/v0.21.0.txt # pandas/core/frame.py

Moved whatsnew to 0.22.0

42d5ec3

Convert to new _get_label_or_level_values utility method

7c7edfe

Update some documentation vocabulary to label/level distinction

Fix up sorting test suite

a6dfd0a

Refactor new test cases.

acb13a4

- Move to test_sort_values_level_as_str.py - Use fixtures and parametrize - Added axis=1 test cases

Documentation cleanup.

14baf33

jreback added this to the 0.22.0 milestone Dec 7, 2017

jorisvandenbossche reviewed Dec 7, 2017

View reviewed changes

TomAugspurger reviewed Dec 7, 2017

View reviewed changes

Jon M. Mease added 2 commits December 7, 2017 09:25

Documentation updates per review

4a05ffa

Improve warning stacklevel handling for all uses of lable/level utils

ceacad4

Jon M. Mease added 2 commits December 7, 2017 12:05

Reintroduce specialized error message for the multi-index case

85f0363

Merge remote-tracking branch 'upstream/master' into enh_14353

85cafb6

Merge branch 'master' of https://github.com/pandas-dev/pandas into en…

de0f336

…h_14353 Move whatsnew to 0.23.0 # Conflicts: # doc/source/whatsnew/v0.22.0.txt

Jon M. Mease added 2 commits December 26, 2017 10:27

Update release version references to 0.23.0

bbbda0f

Merge branch 'master' of https://github.com/pandas-dev/pandas into en…

bbda441

…h_14353 # Conflicts: # doc/source/basics.rst

jreback approved these changes Dec 30, 2017

View reviewed changes

Added github issue numbers to new tests

3ba4ef6

Merge branch 'master' of https://github.com/pandas-dev/pandas into en…

71748b6

…h_14353

TomAugspurger approved these changes Jan 2, 2018

View reviewed changes

jreback merged commit 7663a63 into pandas-dev:master Jan 5, 2018

This was referenced May 16, 2018

What to do with future Ambiguity Error about overlapping names when sorting? #21080

Closed

What to do with future Ambiguity Error about overlapping names when sorting? #21081

Closed

ENH: Support sorting frames by a combo of columns and index levels (GH 14353) #17361

ENH: Support sorting frames by a combo of columns and index levels (GH 14353) #17361

Conversation

jonmmease commented Aug 28, 2017 • edited Loading

gfyoung Aug 28, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gfyoung Aug 28, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

codecov bot commented Aug 29, 2017

Codecov Report

codecov bot commented Aug 29, 2017 • edited Loading

Codecov Report

jreback commented Nov 23, 2017

jonmmease commented Nov 23, 2017

jreback commented Nov 23, 2017

jonmmease commented Dec 2, 2017

jonmmease commented Dec 6, 2017

jreback commented Dec 7, 2017

jorisvandenbossche left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jonmmease commented Dec 7, 2017

jonmmease commented Dec 11, 2017

jonmmease commented Dec 18, 2017

jonmmease commented Dec 26, 2017

jreback commented Dec 30, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback commented Dec 30, 2017 • edited Loading

jonmmease commented Dec 30, 2017

jonmmease commented Jan 1, 2018

jreback commented Jan 1, 2018

jreback commented Jan 1, 2018

jonmmease commented Jan 1, 2018

jreback commented Jan 1, 2018

jonmmease commented Jan 5, 2018

jreback commented Jan 5, 2018

jonmmease commented Jan 5, 2018

jonmmease commented Aug 28, 2017 •

edited

Loading

gfyoung Aug 28, 2017 •

edited

Loading

gfyoung Aug 28, 2017 •

edited

Loading

codecov bot commented Aug 29, 2017 •

edited

Loading

jreback commented Dec 30, 2017 •

edited

Loading