
BUG: nlargest/nsmallest gave wrong result (#22752) #22754

Merged
merged 1 commit into from
Sep 25, 2018

Conversation

troels
Contributor

@troels troels commented Sep 18, 2018

When asking for the n largest/smallest rows in a DataFrame,
nlargest/nsmallest sometimes failed to select
the correct rows.

I looked at the nsmallest/nlargest implementation for DataFrames, and to
me it looks wrong.

With ties in the first columns, the old algorithm carried all duplicates into
the next tie-break iteration, rather than only the values lying on the border.
That meant that to find the top 5 values in a DataFrame such as:

[0 5] [0 4] [1 0] [2 3] [2 2] [2 1]

  1. the algorithm would first pick the top 5 via the first column:
    (that would be [1 0] [2 1] [2 2] [2 3] [0 4] [0 5])
  2. It would then keep the values that were certain (that is, the rows whose
    first-column value is unique):
    [1 0]
  3. and go on to the next iteration with all rows whose first-column value was non-unique but among the top 5:
    that is: [2 3] [2 2] [2 1] [0 4] [0 5]
  4. It would then compare the remaining rows on the second column, picking: [0 5] [0 4] [2 3] [2 2]
  5. Finally returning: [1 0] [0 5] [0 4] [2 3] [2 2]

Which is not the correct result; the correct result is [1 0] [2 1] [2 2] [2 3] [0 5].

I've changed the algorithm: instead of tie-breaking on all values that have duplicates in an earlier column, it now tie-breaks only on the duplicates of the largest/smallest value in a given column. So it will do:

  1. Pick the top 5 of column 1:
    [1 0] [2 1] [2 2] [2 3] [0 4] [0 5]
  2. Zero is the border value; the rest are safe:
    [1 0] [2 1] [2 2] [2 3]
  3. [0 4] [0 5] go on to the next iteration.
  4. 5 > 4, so [0 5] is the top 1 of what's left.
  5. The result is: [1 0] [2 1] [2 2] [2 3] [0 5]

Which is the correct result.
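The border-value strategy above can be sketched in plain Python. This is an illustrative reimplementation over lists of row tuples, not the actual pandas code; `nlargest_rows` is a name invented for this sketch:

```python
def nlargest_rows(rows, n):
    """Pick the n largest rows (ordered by successive columns),
    tie-breaking only on duplicates of the border value."""
    result = []
    candidates = list(rows)
    col = 0
    while len(result) < n and candidates:
        need = n - len(result)
        if len(candidates) <= need:         # everything left fits
            result.extend(candidates)
            break
        if col >= len(candidates[0]):       # no more columns to break ties on
            result.extend(candidates[:need])
            break
        candidates.sort(key=lambda r: r[col], reverse=True)
        border = candidates[need - 1][col]  # value sitting on the cut-off
        safe = [r for r in candidates if r[col] > border]
        tied = [r for r in candidates if r[col] == border]
        result.extend(safe)                 # strictly above the border: certain
        if len(result) + len(tied) <= n:    # all ties fit, no tie-break needed
            result.extend(tied)
            break
        candidates = tied                   # only border duplicates go on
        col += 1
    return result


rows = [(0, 5), (0, 4), (1, 0), (2, 3), (2, 2), (2, 1)]
print(nlargest_rows(rows, 5))  # [(2, 3), (2, 2), (2, 1), (1, 0), (0, 5)]
```

On the walked-through example, only [0 4] and [0 5] ever reach the second-column comparison, matching the steps above.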

@pep8speaks

pep8speaks commented Sep 18, 2018

Hello @troels! Thanks for updating the PR.

Comment last updated on September 23, 2018 at 16:58 UTC

@troels troels changed the title BUG: nlargest/nsmallest gave wrong result (#22748) BUG: nlargest/nsmallest gave wrong result (#22752) Sep 18, 2018
@troels troels force-pushed the nlargest-nsmallest branch 2 times, most recently from 93062c6 to 03aa68d Compare September 19, 2018 10:43
@codecov

codecov bot commented Sep 19, 2018

Codecov Report

Merging #22754 into master will increase coverage by <.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master   #22754      +/-   ##
==========================================
+ Coverage   92.18%   92.18%   +<.01%     
==========================================
  Files         169      169              
  Lines       50820    50823       +3     
==========================================
+ Hits        46850    46853       +3     
  Misses       3970     3970
Flag       Coverage Δ
#multiple  90.6% <100%> (ø) ⬆️
#single    42.37% <0%> (-0.01%) ⬇️

Impacted Files             Coverage Δ
pandas/core/algorithms.py  94.72% <100%> (+0.02%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4cc0a71...63218fd. Read the comment docs.

@troels troels force-pushed the nlargest-nsmallest branch 2 times, most recently from 280f095 to 7a68d93 Compare September 19, 2018 12:07
@TomAugspurger
Contributor

Can you run asv on the nlargest / nsmallest benchmarks: http://pandas-docs.github.io/pandas-docs-travis/contributing.html#running-the-performance-test-suite

@troels troels force-pushed the nlargest-nsmallest branch 2 times, most recently from 25a4ef3 to 9d94f27 Compare September 23, 2018 11:36
@troels
Contributor Author

troels commented Sep 23, 2018

Hi @TomAugspurger,

I ran the benchmarks, which showed no significant changes.

I thought that was odd, since the old version seemed to do a lot of work. I checked, and the old benchmarks only tested sorting on a single column, which is kind of non-controversial.

I therefore added two new benchmarks and increased the size of the tested DataFrame a bit. Here are the results:

[ 56.25%] ··· frame_methods.NSort.time_nlargest_one_column                                                                                                 ok
[ 56.25%] ··· ======= ============
                keep              
              ------- ------------
               first   4.02±0.3ms 
                last   3.24±0.2ms 
                all    3.61±0.1ms 
              ======= ============

[ 62.50%] ··· frame_methods.NSort.time_nlargest_two_columns                                                                                                ok
[ 62.50%] ··· ======= ============
                keep              
              ------- ------------
               first   5.46±0.4ms 
                last   5.11±0.2ms 
                all    5.26±0.1ms 
              ======= ============

[ 68.75%] ··· frame_methods.NSort.time_nsmallest_one_column                                                                                                ok
[ 68.75%] ··· ======= ============
                keep              
              ------- ------------
               first   3.55±0.2ms 
                last   3.71±0.1ms 
                all    3.48±0.2ms 
              ======= ============

[ 75.00%] ··· frame_methods.NSort.time_nsmallest_two_columns                                                                                               ok
[ 75.00%] ··· ======= ============
                keep              
              ------- ------------
               first   4.87±0.2ms 
                last   4.73±0.2ms 
                all    4.73±0.2ms 
              ======= ============

[ 75.00%] · For pandas commit fb784caf <ci> (round 2/2):
[ 75.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
[ 75.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 81.25%] ··· frame_methods.NSort.time_nlargest_one_column                                                                                                 ok
[ 81.25%] ··· ======= ============
                keep              
              ------- ------------
               first   3.97±0.1ms 
                last   3.45±0.1ms 
                all    4.11±0.1ms 
              ======= ============

[ 87.50%] ··· frame_methods.NSort.time_nlargest_two_columns                                                                                                ok
[ 87.50%] ··· ======= ============
                keep              
              ------- ------------
               first   6.68±0.2ms 
                last   6.39±0.1ms 
                all    6.67±0.2ms 
              ======= ============

[ 93.75%] ··· frame_methods.NSort.time_nsmallest_one_column                                                                                                ok
[ 93.75%] ··· ======= =============
                keep               
              ------- -------------
               first   3.65±0.07ms 
                last   3.67±0.08ms 
                all    3.65±0.01ms 
              ======= =============

[100.00%] ··· frame_methods.NSort.time_nsmallest_two_columns                                                                                               ok
[100.00%] ··· ======= =============
                keep               
              ------- -------------
               first    6.29±0.2ms 
                last    6.64±0.1ms 
                all    6.30±0.07ms 
              ======= =============

       before           after         ratio
     [fb784caf]       [9d94f27c]
     <ci>             <nlargest-nsmallest>
-      6.29±0.2ms       4.87±0.2ms     0.77  frame_methods.NSort.time_nsmallest_two_columns('first')
-     6.30±0.07ms       4.73±0.2ms     0.75  frame_methods.NSort.time_nsmallest_two_columns('all')
-      6.64±0.1ms       4.73±0.2ms     0.71  frame_methods.NSort.time_nsmallest_two_columns('last')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

So the new version is also significantly faster, and more so the larger the original DataFrame is and the more columns are being sorted on.
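For context, the keep parameter the benchmarks are parametrized over controls what happens to ties at the cut-off. A minimal illustration (example data made up here, not from the PR):

```python
import pandas as pd

# Tie at the cut-off: two rows share the value 2.
df = pd.DataFrame({"a": [3, 2, 2, 1]})

df.nlargest(2, "a", keep="first")  # keeps the first tied row  (index 0, 1)
df.nlargest(2, "a", keep="last")   # keeps the last tied row   (index 0, 2)
df.nlargest(2, "a", keep="all")    # keeps all tied rows       (index 0, 1, 2)
```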

duplicated = values[duplicated_filter]
non_duplicated = values[~duplicated_filter]
indexer = get_indexer(indexer, non_duplicated.index)
last_value = values == values[values.index[-1]]
Contributor

hmm, why are you changing this duplicated logic?

Contributor Author

Because the old code would pick all duplicates, not only those that were duplicates of the top-most/bottom-most value.

That meant that if you had a series with two distinct sets of duplicates, those sets would be treated as equal in the next iteration.
So picking the top three from e.g. [0 4] [0 3] [1 2] [1 1] would in fact pick:
[0 4] [0 3] [1 2] instead of the correct
[1 2] [1 1] [0 4]

Now only the duplicates of the element on the border go on to the next iteration:
[1 2] [1 1] are determined in the first iteration and [0 4] in the next, giving the correct result.
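With the fix in place, this example can be checked directly through the public API (the column names `a` and `b` are invented for the illustration):

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 0, 1, 1], "b": [4, 3, 2, 1]})

# Top 3 by "a", tie-broken on "b": only the border duplicates
# (the rows with a == 0) need the second column at all.
df.nlargest(3, ["a", "b"])  # rows (1, 2), (1, 1), (0, 4)
```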

Contributor

ok thanks, maybe add a comment to that effect

@jreback jreback added Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Sep 23, 2018
@@ -505,14 +505,21 @@ class NSort(object):
param_names = ['keep']

def setup(self, keep):
self.df = DataFrame(np.random.randn(1000, 3), columns=list('ABC'))
self.df = DataFrame(np.random.randn(100000, 3),
Contributor

do the benchmarks show any difference? (e.g. from the prior impl)

Contributor Author

Yes, it's significantly faster now, especially if I create large DataFrames with many columns and many duplicates.

If sorting on only one column, the speed should be exactly the same.
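Reconstructed from the asv output above, the added benchmarks presumably look roughly like this sketch (the exact n and the column selections are guesses, not taken from the diff):

```python
import numpy as np
from pandas import DataFrame


class NSort:
    # asv runs each time_* method once per value of `keep`
    params = ["first", "last", "all"]
    param_names = ["keep"]

    def setup(self, keep):
        self.df = DataFrame(np.random.randn(100000, 3), columns=list("ABC"))

    def time_nlargest_one_column(self, keep):
        self.df.nlargest(100, "A", keep=keep)

    def time_nlargest_two_columns(self, keep):
        self.df.nlargest(100, ["A", "B"], keep=keep)

    def time_nsmallest_one_column(self, keep):
        self.df.nsmallest(100, "A", keep=keep)

    def time_nsmallest_two_columns(self, keep):
        self.df.nsmallest(100, ["A", "B"], keep=keep)
```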

@jreback
Contributor

jreback commented Sep 23, 2018

ok @troels this looks good (just add some more comments as indicated), ping on green.

@jreback jreback added this to the 0.24.0 milestone Sep 23, 2018
@troels
Contributor Author

troels commented Sep 23, 2018

Ok, I've added some more comments. I hope the mechanism is clearer now.

@jreback
Contributor

jreback commented Sep 23, 2018

lgtm. you have a lint error, ping on green.

When asking for the n largest/smallest rows in a DataFrame,
nlargest/nsmallest sometimes failed to select
the correct rows based on the latter columns.
@troels
Contributor Author

troels commented Sep 24, 2018

@jreback The lint error should be fixed now. The failing CI run in pandas-dev.pandas looks like a flaky network connection while setting up the test run.

@jreback jreback merged commit 1c4130d into pandas-dev:master Sep 25, 2018
@jreback
Contributor

jreback commented Sep 25, 2018

thanks @troels

Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018

Successfully merging this pull request may close these issues.

Unexpected output for nlargest function with multiple columns