Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Support sorting frames by a combo of columns and index levels (GH 14353) #17361

Merged
merged 18 commits into from
Jan 5, 2018

Conversation

jonmmease
Copy link
Contributor

@jonmmease jonmmease commented Aug 28, 2017

This PR implements the changes proposed in #14353. @jorisvandenbossche

@@ -1726,8 +1726,9 @@ Sorting
The sorting API is substantially changed in 0.17.0, see :ref:`here <whatsnew_0170.api_breaking.sorting>` for these changes.
In particular, all sorting methods now return a new object by default, and **DO NOT** operate in-place (except by passing ``inplace=True``).

There are two obvious kinds of sorting that you may be interested in: sorting
by label and sorting by actual values.
There are three obvious kinds of sorting that you may be interested in: sorting
Copy link
Member

@gfyoung gfyoung Aug 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this word was in the original doc, but I'm a little uneasy about judgmental words like "obvious." I think we can just remove it without impacting much.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, I've reworded it.

@@ -1776,6 +1783,34 @@ argument:
s.sort_values()
s.sort_values(na_position='first')

By Indexes and Values
~~~~~~~~~~~~~~~~~~~~~
.. versionadded:: 0.21
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure you need this version-added tag here since you add it below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left this versionadded tag and removed the one below in the note. This will help people quickly realize that the feature is new and it should be clear that the note below wouldn't apply without the feature.

@@ -67,7 +67,8 @@
args_transpose='axes to permute (int or label for object)',
optional_by="""
by : str or list of str
Name or list of names which refer to the axis items.""")
Name or list of names which refer to the axis items or index
levels.""")
Copy link
Member

@gfyoung gfyoung Aug 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not blocking: My OCD really wants you to not "orphan" the "levels" word at the end of this sentence 😄 . However, that's largely a me problem, but if you do happen to come up with a wording that doesn't do this orphaning, by all means go for it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heh, now it bothers me too! I reworked it to fit on one line. "Name or list of names matching axis items or index levels."

@codecov
Copy link

codecov bot commented Aug 29, 2017

Codecov Report

❗ No coverage uploaded for pull request base (master@36dadd7). Click here to learn what that means.
The diff coverage is 93.33%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master   #17361   +/-   ##
=========================================
  Coverage          ?   91.02%           
=========================================
  Files             ?      162           
  Lines             ?    49574           
  Branches          ?        0           
=========================================
  Hits              ?    45123           
  Misses            ?     4451           
  Partials          ?        0
Flag Coverage Δ
#multiple 88.8% <93.33%> (?)
#single 40.24% <6.66%> (?)
Impacted Files Coverage Δ
pandas/core/generic.py 91.99% <ø> (ø)
pandas/core/frame.py 97.68% <93.33%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 36dadd7...10b4e24. Read the comment docs.

@codecov
Copy link

codecov bot commented Aug 29, 2017

Codecov Report

Merging #17361 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17361      +/-   ##
==========================================
- Coverage   91.57%   91.57%   -0.01%     
==========================================
  Files         150      150              
  Lines       48941    48940       -1     
==========================================
- Hits        44817    44816       -1     
  Misses       4124     4124
Flag Coverage Δ
#multiple 89.93% <100%> (-0.01%) ⬇️
#single 41.75% <12.5%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/frame.py 97.62% <100%> (-0.01%) ⬇️
pandas/core/groupby.py 92.14% <100%> (ø) ⬆️
pandas/core/generic.py 95.95% <100%> (ø) ⬆️
pandas/core/reshape/merge.py 94.21% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c19bdc9...71748b6. Read the comment docs.

@jreback
Copy link
Contributor

jreback commented Nov 23, 2017

@jmmease this looks like an older version of #17484

@jreback jreback closed this Nov 23, 2017
@jonmmease
Copy link
Contributor Author

@jreback This PR was for the sort_values col&index change. It had some shared utility code with #17484 (the merge col&index change) so i was holding off on updating it until that one was merged.

Do you want to keep this PR closed until #17484 is merged and I make the updates?

@jreback jreback reopened this Nov 23, 2017
@jreback
Copy link
Contributor

jreback commented Nov 23, 2017

np can rebase and evaluate after #17484

Jon M. Mease added 6 commits December 2, 2017 10:10
# Conflicts:
#	doc/source/basics.rst
#	doc/source/whatsnew/v0.21.0.txt
#	pandas/core/frame.py
Update some documentation vocabulary to label/level distinction
  - Move to test_sort_values_level_as_str.py
  - Use fixtures and parametrize
  - Added axis=1 test cases
@jonmmease
Copy link
Contributor Author

Ok, I just pushed a bunch of updates:

  • Converted to use the new _get_label_or_level_values utility method introduced in Support merging DataFrames on a combo of columns and index levels (GH 14355) #17484
  • Moved tests to a new file, updated to use fixtures/parametrize, and added axis=1 test cases
  • I reviewed the sort_values documentation in the docstrings and in basics.rst and made small language updates throughout. I hope these changes make things a bit clearer and more consistent.
  • I added a full whatsnew section for 0.22.0

@jreback @jorisvandenbossche @gfyoung @TomAugspurger

@jonmmease
Copy link
Contributor Author

Just following up on this.

@gfyoung if you have time to look this over at some point, I'd appreciate your feedback on these documentation/docstring updates.

@jreback does this approach look alright to you?

Thanks all

@jreback jreback added this to the 0.22.0 milestone Dec 7, 2017
@jreback
Copy link
Contributor

jreback commented Dec 7, 2017

@jmmease lgtm. @TomAugspurger @jorisvandenbossche if you'd have a look.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jmmease Added some comments!

@@ -1738,19 +1738,26 @@ description.
Sorting
-------

There are two obvious kinds of sorting that you may be interested in: sorting
by label and sorting by actual values.
Pandas supports three kinds of sorting: sorting by index levels,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"index levels" -> "index labels" ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this isn't correct. But I was trying to be consistent in using level to refer to the names of the indexes or columns (df.index.names, df.columns.names) and label to refer to the values referenced by the levels (df.index.get_level_values(lv1), df.columns.get_level_values(lv1)). Is that in line with accepted vocabulary? @jorisvandenbossche

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I think "index level" is the equivalent of "column", and from there it would follow that the equivalent of "column values" is "index labels of one level / index level labels / index level values" (but none of those sound that good) and the equivalent of "column names/labels" would then be "index level name".
But it's not that we do that very consistently currently, or that this is necessarily agreed upon vocabulary. But I think it is in line more or less with what you say?

The main reason I commented it here, is because the next line uses "column values", and as you say "label to refer to the values" then to be consistent it would be "index labels".
And a second reason I commented is that I think "index level" is more complicated, as then a user needs to be familiar with the concept of MultiIndexes, while for this section about sorting, that is not needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarifying. After looking at it again, I do agree that "index labels" is the better parallel to "columns values".

which will use an arbitrary vector or a column name of the DataFrame to
determine the sort order:
The :meth:`Series.sort_values` and :meth:`DataFrame.sort_values` methods are
used to sort a pandas object by its values. The optional ``by`` parameter to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not use just "values" but indicate "column values" or "columns". Values in the index are also values of the object.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, but that is because it can also be rows I suppose :-) Then I would keep the explicit "column or row values" like before.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I split this into two sentences so that a Series is sorted by "values" and a DataFrame is sorted by "column or row values".

index=idx)
df_multi

# Sort by 'second' (index) and 'A' (column)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would put this as normal text between two code blocks instead of as comment (in general that makes it clearer IMO if you want a comment to stand out)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good

@@ -113,7 +113,15 @@
axes_single_arg="{0 or 'index', 1 or 'columns'}",
optional_by="""
by : str or list of str
Name or list of names which refer to the axis items.""",
Name or list of names matching axis levels or off-axis labels.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find "off-axis labels" very complex. I mean I know what you mean and theoretically it is correct, but I think novice users will not understand this (the previous "axis items" was also not that good).

I am still thinking of a better wording (the annoying thing is that you cannot just say "column labels or index level names" which is probably the common case how this is used)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about "Name or list of names to sort by"? And then things are clarified below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Going with "Name or list of names to sort by"


# Sorting by 'idx' should sort by the idx column and raise a
# FutureWarning
with tm.assert_produces_warning(FutureWarning, check_stacklevel=False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you see if you can get this working without check_stacklevel=False ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Could you offer any insight into what check_stacklevel is actually checking? I've had trouble understanding that @jorisvandenbossche

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #9584 and #10676.
The stacklevel is specified in the FutureWarning in _check_label_or_level_ambiguity, and is currenlty set at 2, which is not correct. This should be higher, depending on how many call steps there are between sort_values and calling that function.
The main problem here will probably be that _check_label_or_level_ambiguity is used in multiple places, so the needed stacklevel might differ for the different usages.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that's helpful

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The goal is to report the users's line of code that triggered the warning. In our code, when you emit the FutureWarning there's an optional stacklevel parameter that's controls which line of code the error message is reported with. It's supposed to be the number of function calls from user code to our method emitting the warning.

It's not always possible to get that correct, if you have multiple pandas methods calling the method emitting the warning.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I was able to remove check_stacklevel=False for these tests, and for the related tests for groupby and merge. I added a stacklevel parameter to _check_label_or_level_ambiguity and _get_label_or_level_values that behaves like the stacklevel parameter of warnings.warn itself (stacklevel=1 blames the caller, stacklevel=2 blames the callers parent, etc.).

Does this look like a good way to handle it? @TomAugspurger @jorisvandenbossche

# multi-column 'by' is separate codepath
df.sort_values(by=['a', 'b'])

# with multi-index
# GH4370
df = DataFrame(np.random.randn(4, 2),
columns=MultiIndex.from_tuples([('a', 0), ('a', 1)]))
with tm.assert_raises_regex(ValueError, 'levels'):
with tm.assert_raises_regex(ValueError, 'not unique'):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for this case the old error message was more informative, so it would be nice if we could keep a separate message for the case of incomplete key for MI compared to just non-unique key

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, sure. The new error message will read as follows (for the MultiIndex case):

ValueError: The column label 'a' is not unique.
For a multi-index, the label must be a tuple with elements corresponding to each level.

Does that sound clear? @jorisvandenbossche

if isinstance(self.columns, MultiIndex):
raise ValueError('Cannot sort by column %s in a '
'multi-index you need to explicitly '
'provide all the levels' % str(by))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment below as well, so it would be nice to somehow keep this message

.. versionadded:: 0.22.0

Strings passed as the ``by`` parameter to :meth:`DataFrame.sort_values` may
refer to either columns or index levels.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue here with "index levels". Does changing it to "index names" make sense here, since that's what you're referring to?

Side-question, does this work on a regular (non-multiindex) index?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see you use "index level names" in the whatsnew. That seems like a good option, if you could use it here too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to "index level names".

And yes, this does work for both the Index and MultIndex cases.
@TomAugspurger

- if `axis` is 1 or `'columns'` then `by` may contain column
levels and/or index labels

Support for specify index/column levels was added in
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be a ..versionmodified:

.. versionmodified:: 0.22.0
   Allow specifying index or column level names.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -113,7 +113,15 @@
axes_single_arg="{0 or 'index', 1 or 'columns'}",
optional_by="""
by : str or list of str
Name or list of names which refer to the axis items.""",
Name or list of names matching axis levels or off-axis labels.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about "Name or list of names to sort by"? And then things are clarified below.


# Sorting by 'idx' should sort by the idx column and raise a
# FutureWarning
with tm.assert_produces_warning(FutureWarning, check_stacklevel=False):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The goal is to report the users's line of code that triggered the warning. In our code, when you emit the FutureWarning there's an optional stacklevel parameter that's controls which line of code the error message is reported with. It's supposed to be the number of function calls from user code to our method emitting the warning.

It's not always possible to get that correct, if you have multiple pandas methods calling the method emitting the warning.

@jonmmease
Copy link
Contributor Author

Thanks for the thorough comments @jorisvandenbossche and @TomAugspurger

I just pushed an update that I believe address all of them. Let me know if I missed anything or if you see any other updates you'd like before this is merged.

@jonmmease
Copy link
Contributor Author

@TomAugspurger @jorisvandenbossche Did these latest updates address your comments? Are there any other changes you'd like to see before this is merged? Thanks!

cc @jreback

@jonmmease
Copy link
Contributor Author

Friendly ping
@TomAugspurger @jorisvandenbossche @jreback

…h_14353

Move whatsnew to 0.23.0

# Conflicts:
#	doc/source/whatsnew/v0.22.0.txt
@jonmmease
Copy link
Contributor Author

@jreback @jorisvandenbossche @TomAugspurger I noticed the milestone tag for this was moved to 0.23.0. I just moved the whatsnew to this version. Are we in a freeze for merging things into 0.23.0 until 0.22.0 is released?

@jreback
Copy link
Contributor

jreback commented Dec 30, 2017

@jmmease 0.22 is a special release coming out today. so all previous 0.22 is now 0.23.

@@ -552,3 +559,185 @@ def test_sort_index_intervalindex(self):
closed='right')
result = result.columns.levels[1].categories
tm.assert_index_equal(result, expected)

def test_sort_index_and_column(self):
# Build MultiIndex
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add the issue number as a comment

expected = df_none.sort_values('inner').set_index('inner')
result = df_single.sort_values('inner')
assert_frame_equal(result, expected)
# - Descending
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blank line


result = df_multi.sort_values('inner', ascending=False)
assert_frame_equal(result, expected)
# - Descending
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

@jreback
Copy link
Contributor

jreback commented Dec 30, 2017

@jmmease just a couple of minor comments. note that windows CI is currently failing. just wait to repush.

@jonmmease
Copy link
Contributor Author

Thanks @jreback . If I'm not mistaken, It looks like the comments you just made ("blank line", and "issue number") are against an older version of the PR. The new tests were parametrized and moved to test_sort_values_level_as_str.py.

@jonmmease
Copy link
Contributor Author

@jreback The AppVeyor build is passing now, but I'm getting a "The job exceeded the maximum time limit for jobs, and has been terminated." error for most of the Travis builds.

@jreback
Copy link
Contributor

jreback commented Jan 1, 2018

yeah travis just started failing
not sure why

@jreback
Copy link
Contributor

jreback commented Jan 1, 2018

rebase and push again, fixed some hanging by Travis CI

@jonmmease
Copy link
Contributor Author

Done. Thanks @jreback . Happy New Year BTW!

@jreback
Copy link
Contributor

jreback commented Jan 1, 2018

happy new year. will have a final look, but this lgtm. @TomAugspurger

@jonmmease
Copy link
Contributor Author

Thanks for taking a look @TomAugspurger
Anything else @jreback ?

@jreback jreback merged commit 7663a63 into pandas-dev:master Jan 5, 2018
@jreback
Copy link
Contributor

jreback commented Jan 5, 2018

thanks @jmmease really nice patch! keep em coming!

@jonmmease
Copy link
Contributor Author

Awesome, thanks for all your time on this @jreback @TomAugspurger, @jorisvandenbossche, and @gfyoung !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: Support sorting DataFrames by a combination of columns and index levels
5 participants