Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EHN/FIX: Add na_last parameter to DataFrame.sort. Fixes GH3917 #5231

Merged
merged 1 commit into from
Mar 27, 2014

Conversation

unutbu
Copy link
Contributor

@unutbu unutbu commented Oct 15, 2013

closes #3917

This is an attempt to fix the Nested Sort with NaN bug (#3917).

I've added tests to test_frame.py and test_hashtable.py to demonstrate the problem.

hashtable.Factorizer.factorize has been modified to map nan to na_sentinel. Before it was mapping nan to a label which was already being used. This, I believe is the origin of the bug.

My first idea was to mimick/reuse code from Series.order, since this method already handles nans nicely, and allows the user to choose if nans should be placed at the beginning or the end of the sort via the na_last parameter. Although I found a solution using code from Series.order, I eventually abandoned this when I realized this patches the problem at too high a level and that it could be handled more generally with a modification of factorize. I retained the idea that df.sort should have a na_last parameter, however.

To that end, groupby._lexsort_indexer has been modified to handle all possible combinations of na_last and orders settings. There are four tests (assertions) in test_frame.py to exercise the possibilities, one of which demonstrates that df.sort(['A','B']) now behaves correctly for the DataFrame shown in GH3917.

@jreback
Copy link
Contributor

jreback commented Oct 15, 2013

interesting..this may go pretty far towards solving #5190

@jtratner
Copy link
Contributor

I'd like time to review this - please don't merge for now.

@jreback
Copy link
Contributor

jreback commented Oct 15, 2013

agreed..

@jtratner
Copy link
Contributor

First, you need to put the new argument in last position everywhere so
you don't break anything (and don't accidentally have users entering
something wrong - esp because usually no difference between 1 and True)

On external API - how about calling it handle_na, and then you can pass
either "rm" (remove them), "fill" (pass fill value), "last" (end up as
last), could also do "first" (end up as first) etc. this reminds me that I
need to add na-handling on apply and groupby as well. internal methods can
handle however they want.

@unutbu
Copy link
Contributor Author

unutbu commented Oct 15, 2013

Okay, I'll put the new argument at the end.

I'm a little hesitant about adding an rm option. Should a sort really be able to remove items?

Instead of handle_na='fill', how would you feel about a key=func parameter? Then instead of replacing NaNs with fill_value, we could use func to provide proxy values for NaNs: key=lambda x: fill_value if np.isnan(x) else x.

The change to the code may be minimal: something like

k = func(self[x].values)

@jreback
Copy link
Contributor

jreback commented Oct 15, 2013

@jtratner

I don't think we need to mess with the na_last (as far as having complicated behavior)
why is the current not adequate ?

@jtratner
Copy link
Contributor

It seems awkward to me (na_position=last seems better or na_order=last). And if we're starting from the premise that we want to unify the API, would this apply flexibly to all the other sort methods? what about Series and Index sorting? And to confirm, na_last=False makes them first? Removes them? And I'm assuming it's also supposed to maintain relative position?

@jtratner
Copy link
Contributor

Index doesn't have sort - so just Series and Panel. (groupby?) Assuming it applies to all.

@jreback
Copy link
Contributor

jreback commented Oct 15, 2013

this is just about sort ordering; you can't do a remove here; there are only 2 ways to do it, na_first or na_last (that said, you could do ti with a 'how' argument). Not sure we need to break the API here though

@jtratner
Copy link
Contributor

Okay. na_last doesn't sound intuitive to me, but it seems like I'm in the
minority.
On Oct 15, 2013 4:47 PM, "jreback" notifications@github.com wrote:

this is just about sort ordering; you can't do a remove here; there are
only 2 ways to do it, na_first or na_last (that said, you could do ti with
a 'how' argument). Not sure we need to break the API here though


Reply to this email directly or view it on GitHubhttps://github.com//pull/5231#issuecomment-26370801
.

@unutbu
Copy link
Contributor Author

unutbu commented Oct 15, 2013

I like na_position better than na_last since I find na_last=False a bit confusing. But whatever you decide is fine with me. I'm going to hold off on adding a key=func parameter to a different pull request :)

@jtratner
Copy link
Contributor

And you can just check first letter if you want (so accept last, l, first,
f etc)
On Oct 15, 2013 5:03 PM, "unutbu" notifications@github.com wrote:

I like na_position better than na_last since I find na_last=False a bit
confusing. I'm going to hold off on adding a key=func parameter to a
different pull request :)


Reply to this email directly or view it on GitHubhttps://github.com//pull/5231#issuecomment-26371995
.

@unutbu
Copy link
Contributor Author

unutbu commented Oct 15, 2013

While I'm at it, would it be okay to change orders to ascending in groupby _lexsort_indexer?
Or maybe I shouldn't break the API...

@jtratner
Copy link
Contributor

Also, I neglected to say earlier - thanks for submitting this PR!

@jtratner
Copy link
Contributor

Not totally clear on what you mean. That said, we can't change existing
behavior without a warning (but can add keyword arguments with defaults to
change behavior)

@unutbu
Copy link
Contributor Author

unutbu commented Oct 15, 2013

Not at all, my pleasure.

It seems to me the keyword ascending as used in DataFrame.sort has the same meaning as orders as used in groupby's _lexsort_indexer. If that is true, it would be nice if only one name were used for the concept.

@jtratner
Copy link
Contributor

If you change the parameter, nice to change name of PR too :)

@jreback
Copy link
Contributor

jreback commented Oct 17, 2013

@unutbu can you add a release notes/v0.13.0 entry (1-liner or small example in 0.13 if you want)

under API changes

@unutbu
Copy link
Contributor Author

unutbu commented Oct 17, 2013

I've found a bug in my code. With df defined as in GH3917, df.sort(['A'], na_position='first') does not place NaN at the beginning of the result.

@jtratner
Copy link
Contributor

Glad you caught that! Still have some time to get this in for 0.13.

@@ -2632,34 +2638,22 @@ def sort_index(self, axis=0, by=None, ascending=True, inplace=False,
raise ValueError('Length of ascending (%d) != length of by'
' (%d)' % (len(ascending), len(by)))

if len(by) > 1:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the if len(by) > 1. Now, whenever the by parameter is used, the sorting is done by _lexsort_indexer. Before, argsort was being used when len(by) == 1. But argsort always puts NaNs at the end, so if we want to be able to place NaNs at the beginning, it seems better now to use _lexsort_indexer.

@unutbu
Copy link
Contributor Author

unutbu commented Oct 17, 2013

Warning: My PR only affects how DataFrame.sort works when sorting columns with the by parameter. It doesn't currently address sorting by label. If NaN is a label, na_position does not affect the way it gets sorted...

@unutbu
Copy link
Contributor Author

unutbu commented Oct 17, 2013

As I shift code away from argsort and onto _lexsort_indexer, the kind parameter is being used less and less. With my changes, kind is only used when sorting by label.

@jtratner
Copy link
Contributor

okay, then you need to make the parameter None for now and then raise (NotImplementedError) if it is specified without a 'by' arg I guess. But that said - better if you supported that.

@jreback
Copy link
Contributor

jreback commented Oct 18, 2013

I like na_position...so in 0.14, then we will deprecate na_last for Series.order and change to na_position, yes? (aside from other fixups to Series.sort/order)

maybe want to add that warning now?

@unutbu
Copy link
Contributor Author

unutbu commented Oct 19, 2013

@jreback I've added code to core/series.py to warn users that na_last is deprecated, and allow them to use the na_position parameter instead.

@jtratner I've tried to extend np.argsort by adding ascending and na_postion parameters. The extended function, _nargsort is in core/groupby. I used _nargsort in core/frame.py to allow DataFrame to sort a single column or label while respecting na_position.

@unutbu
Copy link
Contributor Author

unutbu commented Oct 19, 2013

Hm. Travis-CI does not like the changes I've made. I'm setting up new virtualenvs to debug this...

@jtratner
Copy link
Contributor

I think there's some issue with mergesort not being fully implemented in numpy 1.6 - maybe you changed a default without realizing it?

@jtratner
Copy link
Contributor

This error here - https://travis-ci.org/unutbu/pandas/jobs/12751402 - "TypeError: unorderable types: str() > float()", is likely occurring because you didn't mask out the nan values before the sort.

@jtratner
Copy link
Contributor

(and by you here I mean "pandas with this PR")

@jreback jreback merged commit 87e1212 into pandas-dev:master Mar 27, 2014
@jreback
Copy link
Contributor

jreback commented Mar 27, 2014

@unutbu awesome!

@jreback
Copy link
Contributor

jreback commented Mar 27, 2014

The deprecation removal notice is in #6581

@jreback
Copy link
Contributor

jreback commented Mar 27, 2014

@unutbu their may be a couple of FutureWarnings from na_last in the current test suite in test_groupby.py, can you do a followup to change the paremeter name (maybe elsewhere...just happened to notice that one)

@jreback
Copy link
Contributor

jreback commented Mar 27, 2014

@unutbu its Series.sort is passing na_last to order...i'll just make the change...pretty trivial

@jreback
Copy link
Contributor

jreback commented Mar 27, 2014

sorted 43baf9e

@jreback
Copy link
Contributor

jreback commented Apr 9, 2014

@unutbu realized that Series.sort does not propogate the na_position argument, no biggie...

bigger question I have is that sort defaults to quicksort and order defaults to mergesort probably because sort is a 'pass-thru' (or was) to np.sort which defaults that.

practically evey other sort is defaulting to quicksort any thoughts on this?

@jreback jreback mentioned this pull request Apr 9, 2014
@unutbu
Copy link
Contributor Author

unutbu commented Apr 9, 2014

@jreback: Thanks for fixing Series.sort. Making quicksort the default for all sorting methods makes sense to me.

@jreback
Copy link
Contributor

jreback commented Apr 9, 2014

ok.. makes sense

jreback added a commit to jreback/pandas that referenced this pull request Aug 18, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Nested sort with NaN
3 participants