EHN/FIX: Add na_last parameter to DataFrame.sort. Fixes GH3917 #5231
interesting..this may go pretty far towards solving #5190
I'd like time to review this - please don't merge for now.
agreed..
First, you need to put the new argument in last position everywhere so On external API - how about calling it
Okay, I'll put the new argument at the end. I'm a little hesitant about adding an Instead of The change to the code may be minimal: something like
I don't think we need to mess with the na_last (as far as having complicated behavior)
It seems awkward to me (na_position=last seems better or na_order=last). And if we're starting from the premise that we want to unify the API, would this apply flexibly to all the other sort methods? What about Series and Index sorting? And to confirm, na_last=False makes them first? Removes them? And I'm assuming it's also supposed to maintain relative position?
Index doesn't have sort - so just Series and Panel. (groupby?) Assuming it applies to all.
this is just about sort ordering; you can't do a remove here; there are only 2 ways to do it, na_first or na_last (that said, you could do it with a 'how' argument). Not sure we need to break the API here though
Okay. na_last doesn't sound intuitive to me, but it seems like I'm in the
I like na_position better than na_last since I find
And you can just check first letter if you want (so accept last, l, first,
While I'm at it, would it be okay to change
Also, I neglected to say earlier - thanks for submitting this PR!
Not totally clear on what you mean. That said, we can't change existing
Not at all, my pleasure. It seems to me the keyword
If you change the parameter, nice to change name of PR too :)
@unutbu can you add a release notes/v0.13.0 entry (1-liner or small example in 0.13 if you want) under API changes
I've found a bug in my code. With
Glad you caught that! Still have some time to get this in for 0.13.
@@ -2632,34 +2638,22 @@ def sort_index(self, axis=0, by=None, ascending=True, inplace=False,
raise ValueError('Length of ascending (%d) != length of by'
' (%d)' % (len(ascending), len(by)))

if len(by) > 1:
I removed the `if len(by) > 1`. Now, whenever the `by` parameter is used, the sorting is done by `_lexsort_indexer`. Before, `argsort` was being used when `len(by) == 1`. But `argsort` always puts NaNs at the end, so if we want to be able to place NaNs at the beginning, it seems better now to use `_lexsort_indexer`.
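A minimal pure-Python sketch of the point made in this review comment: an indexer that separates NaNs from the other values can place them at either end, which a sort that always pushes NaNs to the end cannot. The helper name `nan_aware_argsort` and its `na_position` argument are hypothetical, chosen for illustration; this is not the code in the PR.

```python
import math


def nan_aware_argsort(values, ascending=True, na_position="last"):
    """Return the indices that sort `values`, with NaNs grouped at one end.

    Hypothetical helper for illustration only; not pandas' _lexsort_indexer.
    """
    if na_position not in ("first", "last"):
        raise ValueError("na_position must be 'first' or 'last'")

    # Split positions into missing and non-missing values.
    nan_idx = [i for i, v in enumerate(values) if math.isnan(v)]
    ok_idx = [i for i, v in enumerate(values) if not math.isnan(v)]

    # list.sort is stable (also with reverse=True), so ties keep
    # their original relative order.
    ok_idx.sort(key=lambda i: values[i], reverse=not ascending)

    # NaN placement is independent of the sort direction.
    if na_position == "first":
        return nan_idx + ok_idx
    return ok_idx + nan_idx


nan = float("nan")
values = [2.0, nan, 1.0, nan, 3.0]
print(nan_aware_argsort(values))                       # [2, 0, 4, 1, 3]
print(nan_aware_argsort(values, na_position="first"))  # [1, 3, 2, 0, 4]
```

Keeping NaN placement separate from the ascending/descending choice means a descending sort does not silently move the NaNs to the front.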
Warning: My PR only affects how DataFrame.sort works when sorting columns with the
As I shift code away from argsort and onto
okay, then you need to make the parameter None for now and then raise (NotImplementedError) if it is specified without a 'by' arg I guess. But that said - better if you supported that.
I like maybe want to add that warning now?
@jreback I've added code to @jtratner I've tried to extend
Hm. Travis-CI does not like the changes I've made. I'm setting up new virtualenvs to debug this...
I think there's some issue with mergesort not being fully implemented in numpy 1.6 - maybe you changed a default without realizing it?
This error here - https://travis-ci.org/unutbu/pandas/jobs/12751402 - "
(and by you here I mean "pandas with this PR")
@unutbu awesome!
The deprecation removal notice is in #6581
@unutbu there may be a couple of FutureWarnings from
@unutbu its
sorted 43baf9e
@unutbu realized that the bigger question I have is that practically every other sort is defaulting to
@jreback: Thanks for fixing
ok.. makes sense
…andas-dev#8239 DEPR: remove of na_last from Series.order/Series.sort, xref pandas-dev#5231
closes #3917

This is an attempt to fix the Nested Sort with NaN bug (#3917).

I've added tests to `test_frame.py` and `test_hashtable.py` to demonstrate the problem. `hashtable.Factorizer.factorize` has been modified to map `nan` to `na_sentinel`. Before, it was mapping `nan` to a label which was already being used. This, I believe, is the origin of the bug.

My first idea was to mimic/reuse code from `Series.order`, since this method already handles `nan`s nicely, and allows the user to choose if nans should be placed at the beginning or the end of the sort via the `na_last` parameter. Although I found a solution using code from `Series.order`, I eventually abandoned this when I realized this patches the problem at too high a level and that it could be handled more generally with a modification of `factorize`. I retained the idea that `df.sort` should have a `na_last` parameter, however.

To that end, `groupby._lexsort_indexer` has been modified to handle all possible combinations of `na_last` and `orders` settings. There are four tests (assertions) in `test_frame.py` to exercise the possibilities, one of which demonstrates that `df.sort(['A','B'])` now behaves correctly for the DataFrame shown in GH3917.
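
The two ideas in this description can be sketched in plain Python: give NaN its own sentinel label during factorization, then let a multi-key indexer decide where that sentinel sorts. Everything below is an assumption made for illustration (the function bodies, the `-1` sentinel value, assigning labels by sorted rank, and the sample data); it is not the pandas code touched by this PR, and the data is not the DataFrame from GH3917.

```python
import math

NA_SENTINEL = -1  # assumed sentinel label for missing values


def _is_nan(value):
    return isinstance(value, float) and math.isnan(value)


def factorize(values, na_sentinel=NA_SENTINEL):
    """Label each non-NaN value by its sorted rank; NaN gets the sentinel.

    Simplification: labels follow sorted order so they can double as sort keys.
    """
    uniques = sorted({v for v in values if not _is_nan(v)})
    rank = {v: i for i, v in enumerate(uniques)}
    labels = [na_sentinel if _is_nan(v) else rank[v] for v in values]
    return labels, uniques


def lexsort_indexer(keys, orders=None, na_last=True):
    """Return row indices sorted by several key columns, first key most significant."""
    if orders is None:
        orders = [True] * len(keys)

    columns = []
    for key, ascending in zip(keys, orders):
        labels, uniques = factorize(key)
        n = len(uniques)
        column = []
        for label in labels:
            if label == NA_SENTINEL:
                # The sentinel never collides with a real label, so NaN can be
                # pushed past either end regardless of sort direction.
                column.append(n if na_last else -1)
            else:
                column.append(label if ascending else n - 1 - label)
        columns.append(column)

    # sorted() is stable, so fully tied rows keep their original order.
    return sorted(range(len(keys[0])), key=lambda i: tuple(c[i] for c in columns))


nan = float("nan")
A = [1.0, nan, 1.0, 2.0]
B = [nan, 5.0, 3.0, 4.0]
print(lexsort_indexer([A, B]))                 # [2, 0, 3, 1]
print(lexsort_indexer([A, B], na_last=False))  # [1, 0, 2, 3]
```

If NaN shared a label with a real value, as the description says happened before the fix, rows with missing keys would be indistinguishable from rows holding that value, and the nested sort would interleave them.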