-
-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Support sorting frames by a combo of columns and index levels (GH 14353) #17361
Conversation
doc/source/basics.rst
Outdated
@@ -1726,8 +1726,9 @@ Sorting | |||
The sorting API is substantially changed in 0.17.0, see :ref:`here <whatsnew_0170.api_breaking.sorting>` for these changes. | |||
In particular, all sorting methods now return a new object by default, and **DO NOT** operate in-place (except by passing ``inplace=True``). | |||
|
|||
There are two obvious kinds of sorting that you may be interested in: sorting | |||
by label and sorting by actual values. | |||
There are three obvious kinds of sorting that you may be interested in: sorting |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this word was in the original doc, but I'm a little uneasy about judgmental words like "obvious." I think we can just remove it without impacting much.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, I've reworded it.
doc/source/basics.rst
Outdated
@@ -1776,6 +1783,34 @@ argument: | |||
s.sort_values() | |||
s.sort_values(na_position='first') | |||
|
|||
By Indexes and Values | |||
~~~~~~~~~~~~~~~~~~~~~ | |||
.. versionadded:: 0.21 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure you need this version-added tag here since you add it below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left this versionadded tag and removed the one below in the note. This will help people quickly realize that the feature is new and it should be clear that the note below wouldn't apply without the feature.
pandas/core/generic.py
Outdated
@@ -67,7 +67,8 @@ | |||
args_transpose='axes to permute (int or label for object)', | |||
optional_by=""" | |||
by : str or list of str | |||
Name or list of names which refer to the axis items.""") | |||
Name or list of names which refer to the axis items or index | |||
levels.""") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not blocking: My OCD really wants you to not "orphan" the "levels" word at the end of this sentence 😄 . However, that's largely a me problem, but if you do happen to come up with a wording that doesn't do this orphaning, by all means go for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heh, now it bothers me too! I reworked it to fit on one line. "Name or list of names matching axis items or index levels."
Codecov Report
@@ Coverage Diff @@
## master #17361 +/- ##
=========================================
Coverage ? 91.02%
=========================================
Files ? 162
Lines ? 49574
Branches ? 0
=========================================
Hits ? 45123
Misses ? 4451
Partials ? 0
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #17361 +/- ##
==========================================
- Coverage 91.57% 91.57% -0.01%
==========================================
Files 150 150
Lines 48941 48940 -1
==========================================
- Hits 44817 44816 -1
Misses 4124 4124
Continue to review full report at Codecov.
|
@jmmease this looks like an older version of #17484 |
np can rebase and evaluate after #17484 |
# Conflicts: # doc/source/basics.rst # doc/source/whatsnew/v0.21.0.txt # pandas/core/frame.py
Update some documentation vocabulary to label/level distinction
- Move to test_sort_values_level_as_str.py - Use fixtures and parametrize - Added axis=1 test cases
Ok, I just pushed a bunch of updates:
|
@jmmease lgtm. @TomAugspurger @jorisvandenbossche if you'd have a look. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jmmease Added some comments!
doc/source/basics.rst
Outdated
@@ -1738,19 +1738,26 @@ description. | |||
Sorting | |||
------- | |||
|
|||
There are two obvious kinds of sorting that you may be interested in: sorting | |||
by label and sorting by actual values. | |||
Pandas supports three kinds of sorting: sorting by index levels, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"index levels" -> "index labels" ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this isn't correct. But I was trying to be consistent in using level
to refer to the names of the indexes or columns (df.index.names
, df.columns.names
) and label
to refer to the values referenced by the levels (df.index.get_level_values(lv1)
, df.columns.get_level_values(lv1)
). Is that in line with accepted vocabulary? @jorisvandenbossche
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally I think "index level" is the equivalent of "column", and from there it would follow that the equivalent of "column values" is "index labels of one level / index level labels / index level values" (but none of those sound that good) and the equivalent of "column names/labels" would then be "index level name".
But it's not that we do that very consistently currently, or that this is necessarily agreed upon vocabulary. But I think it is in line more or less with what you say?
The main reason I commented it here, is because the next line uses "column values", and as you say "label to refer to the values" then to be consistent it would be "index labels".
And a second reason I commented is that I think "index level" is more complicated, as then a user needs to be familiar with the concept of MultiIndexes, while for this section about sorting, that is not needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for clarifying. After looking at it again, I do agree that "index labels" is the better parallel to "columns values".
doc/source/basics.rst
Outdated
which will use an arbitrary vector or a column name of the DataFrame to | ||
determine the sort order: | ||
The :meth:`Series.sort_values` and :meth:`DataFrame.sort_values` methods are | ||
used to sort a pandas object by its values. The optional ``by`` parameter to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would not use just "values" but indicate "column values" or "columns". Values in the index are also values of the object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, but that is because it can also be rows I suppose :-) Then I would keep the explicit "column or row values" like before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I split this into two sentences so that a Series
is sorted by "values" and a DataFrame
is sorted by "column or row values".
doc/source/basics.rst
Outdated
index=idx) | ||
df_multi | ||
|
||
# Sort by 'second' (index) and 'A' (column) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would put this as normal text between two code blocks instead of as comment (in general that makes it clearer IMO if you want a comment to stand out)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good
pandas/core/frame.py
Outdated
@@ -113,7 +113,15 @@ | |||
axes_single_arg="{0 or 'index', 1 or 'columns'}", | |||
optional_by=""" | |||
by : str or list of str | |||
Name or list of names which refer to the axis items.""", | |||
Name or list of names matching axis levels or off-axis labels. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find "off-axis labels" very complex. I mean I know what you mean and theoretically it is correct, but I think novice users will not understand this (the previous "axis items" was also not that good).
I am still thinking of a better wording (the annoying thing is that you cannot just say "column labels or index level names" which is probably the common case how this is used)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about "Name or list of names to sort by"? And then things are clarified below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. Going with "Name or list of names to sort by"
|
||
# Sorting by 'idx' should sort by the idx column and raise a | ||
# FutureWarning | ||
with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you see if you can get this working without check_stacklevel=False
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. Could you offer any insight into what check_stacklevel
is actually checking? I've had trouble understanding that @jorisvandenbossche
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #9584 and #10676.
The stacklevel
is specified in the FutureWarning in _check_label_or_level_ambiguity
, and is currenlty set at 2
, which is not correct. This should be higher, depending on how many call steps there are between sort_values
and calling that function.
The main problem here will probably be that _check_label_or_level_ambiguity
is used in multiple places, so the needed stacklevel might differ for the different usages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, that's helpful
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The goal is to report the users's line of code that triggered the warning. In our code, when you emit the FutureWarning
there's an optional stacklevel
parameter that's controls which line of code the error message is reported with. It's supposed to be the number of function calls from user code to our method emitting the warning.
It's not always possible to get that correct, if you have multiple pandas methods calling the method emitting the warning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I was able to remove check_stacklevel=False
for these tests, and for the related tests for groupby
and merge
. I added a stacklevel
parameter to _check_label_or_level_ambiguity
and _get_label_or_level_values
that behaves like the stacklevel
parameter of warnings.warn
itself (stacklevel=1
blames the caller, stacklevel=2
blames the callers parent, etc.).
Does this look like a good way to handle it? @TomAugspurger @jorisvandenbossche
pandas/tests/frame/test_sorting.py
Outdated
# multi-column 'by' is separate codepath | ||
df.sort_values(by=['a', 'b']) | ||
|
||
# with multi-index | ||
# GH4370 | ||
df = DataFrame(np.random.randn(4, 2), | ||
columns=MultiIndex.from_tuples([('a', 0), ('a', 1)])) | ||
with tm.assert_raises_regex(ValueError, 'levels'): | ||
with tm.assert_raises_regex(ValueError, 'not unique'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think for this case the old error message was more informative, so it would be nice if we could keep a separate message for the case of incomplete key for MI compared to just non-unique key
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, sure. The new error message will read as follows (for the MultiIndex
case):
ValueError: The column label 'a' is not unique.
For a multi-index, the label must be a tuple with elements corresponding to each level.
Does that sound clear? @jorisvandenbossche
if isinstance(self.columns, MultiIndex): | ||
raise ValueError('Cannot sort by column %s in a ' | ||
'multi-index you need to explicitly ' | ||
'provide all the levels' % str(by)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment below as well, so it would be nice to somehow keep this message
doc/source/basics.rst
Outdated
.. versionadded:: 0.22.0 | ||
|
||
Strings passed as the ``by`` parameter to :meth:`DataFrame.sort_values` may | ||
refer to either columns or index levels. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same issue here with "index levels". Does changing it to "index names" make sense here, since that's what you're referring to?
Side-question, does this work on a regular (non-multiindex) index?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see you use "index level names" in the whatsnew. That seems like a good option, if you could use it here too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to "index level names".
And yes, this does work for both the Index
and MultIndex
cases.
@TomAugspurger
pandas/core/frame.py
Outdated
- if `axis` is 1 or `'columns'` then `by` may contain column | ||
levels and/or index labels | ||
|
||
Support for specify index/column levels was added in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be a ..versionmodified:
.. versionmodified:: 0.22.0
Allow specifying index or column level names.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
pandas/core/frame.py
Outdated
@@ -113,7 +113,15 @@ | |||
axes_single_arg="{0 or 'index', 1 or 'columns'}", | |||
optional_by=""" | |||
by : str or list of str | |||
Name or list of names which refer to the axis items.""", | |||
Name or list of names matching axis levels or off-axis labels. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about "Name or list of names to sort by"? And then things are clarified below.
|
||
# Sorting by 'idx' should sort by the idx column and raise a | ||
# FutureWarning | ||
with tm.assert_produces_warning(FutureWarning, check_stacklevel=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The goal is to report the users's line of code that triggered the warning. In our code, when you emit the FutureWarning
there's an optional stacklevel
parameter that's controls which line of code the error message is reported with. It's supposed to be the number of function calls from user code to our method emitting the warning.
It's not always possible to get that correct, if you have multiple pandas methods calling the method emitting the warning.
Thanks for the thorough comments @jorisvandenbossche and @TomAugspurger I just pushed an update that I believe address all of them. Let me know if I missed anything or if you see any other updates you'd like before this is merged. |
@TomAugspurger @jorisvandenbossche Did these latest updates address your comments? Are there any other changes you'd like to see before this is merged? Thanks! cc @jreback |
Friendly ping |
…h_14353 Move whatsnew to 0.23.0 # Conflicts: # doc/source/whatsnew/v0.22.0.txt
@jreback @jorisvandenbossche @TomAugspurger I noticed the milestone tag for this was moved to 0.23.0. I just moved the whatsnew to this version. Are we in a freeze for merging things into 0.23.0 until 0.22.0 is released? |
…h_14353 # Conflicts: # doc/source/basics.rst
@jmmease 0.22 is a special release coming out today. so all previous 0.22 is now 0.23. |
pandas/tests/frame/test_sorting.py
Outdated
@@ -552,3 +559,185 @@ def test_sort_index_intervalindex(self): | |||
closed='right') | |||
result = result.columns.levels[1].categories | |||
tm.assert_index_equal(result, expected) | |||
|
|||
def test_sort_index_and_column(self): | |||
# Build MultiIndex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add the issue number as a comment
pandas/tests/frame/test_sorting.py
Outdated
expected = df_none.sort_values('inner').set_index('inner') | ||
result = df_single.sort_values('inner') | ||
assert_frame_equal(result, expected) | ||
# - Descending |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blank line
pandas/tests/frame/test_sorting.py
Outdated
|
||
result = df_multi.sort_values('inner', ascending=False) | ||
assert_frame_equal(result, expected) | ||
# - Descending |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
@jmmease just a couple of minor comments. note that windows CI is currently failing. just wait to repush. |
Thanks @jreback . If I'm not mistaken, It looks like the comments you just made ("blank line", and "issue number") are against an older version of the PR. The new tests were parametrized and moved to |
@jreback The AppVeyor build is passing now, but I'm getting a "The job exceeded the maximum time limit for jobs, and has been terminated." error for most of the Travis builds. |
yeah travis just started failing |
rebase and push again, fixed some hanging by Travis CI |
Done. Thanks @jreback . Happy New Year BTW! |
happy new year. will have a final look, but this lgtm. @TomAugspurger |
Thanks for taking a look @TomAugspurger |
thanks @jmmease really nice patch! keep em coming! |
Awesome, thanks for all your time on this @jreback @TomAugspurger, @jorisvandenbossche, and @gfyoung ! |
This PR implements the changes proposed in #14353. @jorisvandenbossche
git diff upstream/master -u -- "*.py" | flake8 --diff