BUG: nlargest/nsmallest gave wrong result (#22752) #22754
Conversation
Force-pushed d914975 to c09c5a2.
Hello @troels! Thanks for updating the PR.
Comment last updated on September 23, 2018 at 16:58 UTC.
Force-pushed 93062c6 to 03aa68d.
Codecov Report
@@ Coverage Diff @@
## master #22754 +/- ##
==========================================
+ Coverage 92.18% 92.18% +<.01%
==========================================
Files 169 169
Lines 50820 50823 +3
==========================================
+ Hits 46850 46853 +3
Misses 3970 3970
Continue to review full report at Codecov.
Force-pushed 280f095 to 7a68d93.
Can you run asv on the nlargest / nsmallest benchmarks: http://pandas-docs.github.io/pandas-docs-travis/contributing.html#running-the-performance-test-suite
Force-pushed 25a4ef3 to 9d94f27.
Hi @TomAugspurger, I ran the benchmarks, which showed no significant changes. I thought that was odd, since the old version seemed to do a lot of work. I checked, and the old benchmarks only tested sorting on a single column, which is fairly non-controversial. I therefore added two new benchmarks and increased the size of the tested dataframe a bit. Here are the results:

So the new version is also significantly faster, and more so the larger the original dataframe is and the more columns are being sorted on.
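For reference, a multi-column benchmark of the kind described might look like the following asv-style sketch. The class and method names here are illustrative, not the exact ones added in the PR; the setup mirrors the 100000 x 3 frame visible in the benchmark diff further down.

```python
# Hypothetical asv-style benchmark sketch for multi-column nlargest/nsmallest.
import numpy as np
from pandas import DataFrame

class NSortMultiColumn:
    params = ['first', 'last']
    param_names = ['keep']

    def setup(self, keep):
        # A much larger frame than the old benchmark's 1000 rows.
        self.df = DataFrame(np.random.randn(100000, 3), columns=list('ABC'))

    def time_nlargest_two_columns(self, keep):
        self.df.nlargest(100, ['A', 'B'], keep=keep)

    def time_nsmallest_two_columns(self, keep):
        self.df.nsmallest(100, ['A', 'B'], keep=keep)
```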
Force-pushed 9d94f27 to 45ab37e.
pandas/core/algorithms.py
Outdated
duplicated = values[duplicated_filter]
non_duplicated = values[~duplicated_filter]
indexer = get_indexer(indexer, non_duplicated.index)
last_value = values == values[values.index[-1]]
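To illustrate the last line of that diff, here is a small sketch with made-up data; `values` stands in for the sorted candidate block at this point in the algorithm.

```python
import pandas as pd

# After sorting and cutting the candidate block, mark every entry equal
# to the value sitting on the border (the last value kept); only these
# border duplicates need to go on to the next tie-break column.
values = pd.Series([2, 2, 1, 0], index=[10, 11, 12, 13])
last_value = values == values[values.index[-1]]
print(last_value.tolist())  # [False, False, False, True]
```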
hmm, why are you changing this duplicated logic?
Because the old code would pick all duplicates, not only those that were duplicates of the top-most/bottom-most value.
That meant that if you had a series with two distinct sets of duplicates, those sets would be treated as equal ties for the next iteration.
So picking the top three from e.g. [0 4] [0 3] [1 2] [1 1] would in fact pick
[0 4] [0 3] [1 2] instead of the correct
[1 2] [1 1] [0 4].
Now only the duplicates of the element on the border go on to the next iteration:
[1 2] [1 1] are determined on the first iteration and [0 4] on the next, giving the correct result.
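A small reproduction of that case, assuming the rows above are shown as [a b] and we sort on columns ['a', 'b']:

```python
import pandas as pd

# Two distinct sets of duplicates in the primary column 'a'.
df = pd.DataFrame({'a': [0, 0, 1, 1], 'b': [4, 3, 2, 1]})

# Only rows tied with the border value of 'a' should move on to the
# 'b' tie-break. With the fix this returns rows (1, 2), (1, 1), (0, 4);
# the old code tie-broke on *all* duplicated 'a' values and returned
# (0, 4), (0, 3), (1, 2) instead.
print(df.nlargest(3, ['a', 'b']))
```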
ok thanks, maybe add a comment to that effect
@@ -505,14 +505,21 @@ class NSort(object):
     param_names = ['keep']

     def setup(self, keep):
-        self.df = DataFrame(np.random.randn(1000, 3), columns=list('ABC'))
+        self.df = DataFrame(np.random.randn(100000, 3),
do the benchmarks show any difference? (e.g. from the prior impl)
Yes, it's significantly faster now, especially if I create large dataframes with many columns and many duplicates.
When sorting on only one column, the speed should be exactly the same.
ok @troels this looks good (just add some more comments as indicated), ping on green.
Force-pushed 45ab37e to a3444cf.
Ok, I've added some more comments. I hope the mechanism is clearer now.
lgtm. you have a lint error, ping on green.
Force-pushed a3444cf to 63218fd.
When asking for the n largest/smallest rows in a dataframe, nlargest/nsmallest sometimes failed to produce the correct result when tie-breaking on the latter columns.
@jreback The lint error should be fixed now. The failing CI run in pandas-dev.pandas looks like a flaky network connection while setting up the test run.
thanks @troels
nlargest function with multiple columns #22752
git diff upstream/master -u -- "*.py" | flake8 --diff
When asking for the n largest/smallest rows in a dataframe, nlargest/nsmallest sometimes failed to produce the correct result.

I looked at the nsmallest/nlargest implementation for data frames, and to me it looks wrong.

With ties in the first columns, the old algorithm picked all duplicates rather than only the values that lie on the border for the next tie-break iteration.
That meant that to find the top 5 values in a data frame like e.g.:
[0 5] [0 4] [1 0] [2 3] [2 2] [2 1]
(the correct order being [1 0] [2 1] [2 2] [2 3] [0 4] [0 5])
it would first drop the only row without a duplicated first-column value:
[1 0]
and then tie-break among all the remaining duplicated rows. That is: [2 3] [2 2] [2 1] [0 4] [0 5],
which is not the correct result: [1 0] [0 4] [2 3] [2 2] [2 1].
I've changed the algorithm so that instead of tie-breaking on all values having duplicates in an earlier column, it now tie-breaks only on the duplicates of the largest/smallest value in a given column, so it will do:
[1 0] [2 1] [2 2] [2 3] [0 4] [0 5]
[1 0] [2 1] [2 2] [2 3]
which is the correct result.
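As a rough illustration of the corrected strategy, here is a minimal sketch, not pandas' actual implementation: it works column by column and forwards only the border ties to the next tie-break.

```python
from typing import List, Tuple

def nlargest_sketch(rows: List[Tuple], n: int, col: int = 0) -> List[Tuple]:
    """Return the n largest rows, ordering by columns left to right."""
    if not rows or n <= 0 or col >= len(rows[0]):
        return rows[:n]
    ordered = sorted(rows, key=lambda r: r[col], reverse=True)
    head = ordered[:n]
    border = head[-1][col]
    # Rows strictly above the border value are already decided ...
    decided = [r for r in head if r[col] != border]
    # ... and only the duplicates of the border value are tie-broken on
    # the next column (the old code forwarded *all* duplicated values).
    tied = [r for r in ordered if r[col] == border]
    return decided + nlargest_sketch(tied, n - len(decided), col + 1)

# The example from the review discussion above:
print(nlargest_sketch([(0, 4), (0, 3), (1, 2), (1, 1)], 3))
# -> [(1, 2), (1, 1), (0, 4)]
```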