BUG: Nested sort with NaN #3917

Closed
hayd opened this issue Jun 15, 2013 · 4 comments · Fixed by #5231
Labels
Bug Numeric Operations Arithmetic, Comparison, and Logical operations
Comments

@hayd
Contributor

hayd commented Jun 15, 2013

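Nested sort (sorting on more than one column) doesn't seem to work with NaNs: in the output below, the row with a = NaN lands in the middle instead of at the end. See this SO question.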

In [11]: df
Out[11]:
    a   b
0   1   9
1   2 NaN
2 NaN   5
3   1   2
4   6   5
5   8   4
6   4   5

In [12]: df.sort(columns=["a","b"])
Out[12]:
    a   b
3   1   2
0   1   9
1   2 NaN
2 NaN   5
6   4   5
4   6   5
5   8   4

(It works as expected when sorting on a single column.)
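
For anyone wanting to reproduce this, here is a minimal sketch that rebuilds the frame above. It assumes a pandas release that provides `DataFrame.sort_values`, which is a later API than the `df.sort(columns=...)` call used in this thread. With NaN placed last, the order one would expect from a nested sort on `a` then `b` is index 3, 0, 1, 6, 4, 5, 2.

```python
import numpy as np
import pandas as pd

# Same data as In [11] above.
df = pd.DataFrame({
    "a": [1, 2, np.nan, 1, 6, 8, 4],
    "b": [9, np.nan, 5, 2, 5, 4, 5],
})

# Assumption: sort_values is available (it is not the API used in this thread).
# Its na_position parameter defaults to "last".
result = df.sort_values(["a", "b"])
print(result)
```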

@hayd
Contributor Author

hayd commented Jun 15, 2013

And some hacks that also fail:

In [27]: df.sort("a").groupby("a", group_keys=False).apply(lambda x: x.sort("b"))
Out[27]:
   a   b
3  1   2
0  1   9
1  2 NaN
6  4   5
4  6   5
5  8   4
# row 2 (a = NaN) is missing.

In [28]: df.sort("a").groupby("a", group_keys=False).apply(lambda x: x)
Out[28]:
    a   b
0   1   9
3   1   2
1   2 NaN
6   4   5
4   6   5
5   8   4
2 NaN NaN

@cpcloud
Member

cpcloud commented Jun 15, 2013

I think inf should behave the same way too, and respect the ascending param.

@hayd
Contributor Author

hayd commented Jun 15, 2013

Note that this is the way it already works with one column (NaN sorted last):

In [19]: df.sort("a")
Out[19]:
    a   b
0   1   9
3   1   2
1   2 NaN
6   4   5
4   6   5
5   8   4
2 NaN   5

Also, the Series order method (pretty much the same as sort) offers an na_last argument:

na_last : boolean (optional, default=True)
    Put NaN's at beginning or end
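
To make the intended semantics concrete, here is a rough pure-Python sketch of what a nested sort with an `na_last`-style switch could look like. The function name `nested_sort` and the row layout are hypothetical and only for illustration; this is not how pandas implements it.

```python
import math

nan = float("nan")

# Rows of (a, b), in the same order as the frame in the original report.
rows = [(1, 9), (2, nan), (nan, 5), (1, 2), (6, 5), (8, 4), (4, 5)]

def nested_sort(rows, na_last=True):
    """Return row positions sorted by every column in turn, NaN last or first."""
    def key(item):
        _, values = item
        parts = []
        for v in values:
            missing = isinstance(v, float) and math.isnan(v)
            # The flag decides whether missing values sort after or before
            # real ones; the placeholder 0.0 keeps NaN out of comparisons.
            flag = missing if na_last else not missing
            parts.append((flag, 0.0 if missing else v))
        return tuple(parts)
    return [pos for pos, _ in sorted(enumerate(rows), key=key)]

print(nested_sort(rows))                 # NaN rows at the end
print(nested_sort(rows, na_last=False))  # NaN rows at the beginning
```

The key never compares NaN directly, so the ordering stays well defined regardless of where the missing values sit in the input.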

@jcjf
Contributor

jcjf commented Jul 26, 2013

I was shocked to discover this issue as well. I think the problem is in the Cython function called within the else statement in pandas.core.groupby._indexer_from_factorized:

if max_group > 1e6:
    # Use mergesort to avoid memory errors in counting sort
    indexer = comp_ids.argsort(kind='mergesort')
else:
    indexer, _ = _algos.groupsort_indexer(comp_ids.astype(np.int64),
                                          max_group)

Unfortunately, I don't know enough about debugging Cython code to help out more than this.
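
For readers unfamiliar with the two branches in that snippet, here is a simplified NumPy sketch of a stable counting-sort indexer next to the mergesort path. The function `counting_sort_indexer` is hypothetical and is not the Cython `_algos.groupsort_indexer`; it also assumes every label lies in `[0, ngroups)`. How missing values get encoded into `comp_ids` is not shown in this thread, so the sketch does not attempt to model it.

```python
import numpy as np

def counting_sort_indexer(comp_ids, ngroups):
    """Stable indexer for integer group labels in [0, ngroups)."""
    comp_ids = np.asarray(comp_ids, dtype=np.int64)
    # np.bincount raises ValueError on negative labels.
    counts = np.bincount(comp_ids, minlength=ngroups)
    # starts[g] is the output slot of the first row belonging to group g.
    starts = np.concatenate(([0], np.cumsum(counts)[:-1]))
    indexer = np.empty(len(comp_ids), dtype=np.int64)
    for row, g in enumerate(comp_ids):
        indexer[starts[g]] = row
        starts[g] += 1
    return indexer

comp_ids = np.array([2, 0, 1, 0, 2, 1])
by_counting = counting_sort_indexer(comp_ids, ngroups=3)
by_mergesort = comp_ids.argsort(kind="mergesort")
print(by_counting, by_mergesort)
```

Both paths are stable, so for valid non-negative labels they should agree; any difference in behaviour would have to come from labels outside that range or from how the labels were built.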


3 participants