BUG: nlargest/nsmallest gave wrong result (#22752) #22754
Conversation
Force-pushed d914975 to c09c5a2.
Hello @troels! Thanks for updating the PR.
Comment last updated on September 23, 2018 at 16:58 UTC.
Force-pushed 93062c6 to 03aa68d.
Codecov Report
@@ Coverage Diff @@
## master #22754 +/- ##
==========================================
+ Coverage 92.18% 92.18% +<.01%
==========================================
Files 169 169
Lines 50820 50823 +3
==========================================
+ Hits 46850 46853 +3
Misses 3970 3970
Continue to review full report at Codecov.
Force-pushed 280f095 to 7a68d93.
Can you run asv on the nlargest / nsmallest benchmarks: http://pandas-docs.github.io/pandas-docs-travis/contributing.html#running-the-performance-test-suite
Force-pushed 25a4ef3 to 9d94f27.
Hi @TomAugspurger, I ran the benchmarks, which showed no significant changes. I thought that was odd, since the old version seemed to do a lot of work. I checked, and the old benchmarks only tested sorting on a single column, which is fairly non-controversial. I therefore added two new benchmarks and increased the size of the tested dataframe a bit. Here are the results:

So the new version is also significantly faster, and more so the larger the original dataframe is and the more columns are being sorted on.
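For reference, a multi-column benchmark of the kind described might look like the following asv-style sketch. The class and method names here are illustrative, not the exact ones added in the PR; the setup mirrors the 100000 x 3 frame visible in the benchmark diff further down.

```python
# Hypothetical asv-style benchmark sketch for multi-column nlargest/nsmallest.
import numpy as np
from pandas import DataFrame

class NSortMultiColumn:
    params = ['first', 'last']
    param_names = ['keep']

    def setup(self, keep):
        # A much larger frame than the old benchmark's 1000 rows.
        self.df = DataFrame(np.random.randn(100000, 3), columns=list('ABC'))

    def time_nlargest_two_columns(self, keep):
        self.df.nlargest(100, ['A', 'B'], keep=keep)

    def time_nsmallest_two_columns(self, keep):
        self.df.nsmallest(100, ['A', 'B'], keep=keep)
```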
Force-pushed 9d94f27 to 45ab37e.
pandas/core/algorithms.py
Outdated
duplicated = values[duplicated_filter]
non_duplicated = values[~duplicated_filter]
indexer = get_indexer(indexer, non_duplicated.index)
last_value = values == values[values.index[-1]]
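To illustrate the last line of that diff, here is a small sketch with made-up data; `values` stands in for the sorted candidate block at this point in the algorithm.

```python
import pandas as pd

# After sorting and cutting the candidate block, mark every entry equal
# to the value sitting on the border (the last value kept); only these
# border duplicates need to go on to the next tie-break column.
values = pd.Series([2, 2, 1, 0], index=[10, 11, 12, 13])
last_value = values == values[values.index[-1]]
print(last_value.tolist())  # [False, False, False, True]
```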
hmm, why are you changing this duplicated logic?
Because the old code would pick all duplicates, not only those that were duplicates of the top-most/bottom-most value.
That meant that if you had a series with two distinct sets of duplicates, those sets would be treated as equal ties for the next iteration.
So picking the top three from e.g. [0 4] [0 3] [1 2] [1 1] would in fact pick
[0 4] [0 3] [1 2] instead of the correct
[1 2] [1 1] [0 4].
Now only the duplicates of the element on the border go on to the next iteration:
[1 2] [1 1] are determined on the first iteration and [0 4] on the next, giving the correct result.
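A small reproduction of that case, assuming the rows above are shown as [a b] and we sort on columns ['a', 'b']:

```python
import pandas as pd

# Two distinct sets of duplicates in the primary column 'a'.
df = pd.DataFrame({'a': [0, 0, 1, 1], 'b': [4, 3, 2, 1]})

# Only rows tied with the border value of 'a' should move on to the
# 'b' tie-break. With the fix this returns rows (1, 2), (1, 1), (0, 4);
# the old code tie-broke on *all* duplicated 'a' values and returned
# (0, 4), (0, 3), (1, 2) instead.
print(df.nlargest(3, ['a', 'b']))
```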
ok thanks, maybe add a comment to that effect
@@ -505,14 +505,21 @@ class NSort(object):
     param_names = ['keep']

     def setup(self, keep):
-        self.df = DataFrame(np.random.randn(1000, 3), columns=list('ABC'))
+        self.df = DataFrame(np.random.randn(100000, 3),
do the benchmarks show any difference? (e.g. from the prior impl)
Yes, it's significantly faster now, especially if I create large dataframes with many columns and many duplicates.
When sorting on only one column, the speed should be exactly the same.
ok @troels this looks good (just add some more comments as indicated), ping on green.
Force-pushed 45ab37e to a3444cf.
Ok, I've added some more comments. I hope the mechanism is clearer now.
lgtm. you have a lint error, ping on green.
Force-pushed a3444cf to 63218fd.
When asking for the n largest/smallest rows in a dataframe, nlargest/nsmallest sometimes failed to produce the correct result when tie-breaking on the latter columns.
@jreback The lint error should be fixed now. The failing CI run in pandas-dev.pandas looks like a flaky network connection while setting up the test run.
thanks @troels
nlargest function with multiple columns #22752
git diff upstream/master -u -- "*.py" | flake8 --diff
When asking for the n largest/smallest rows in a dataframe, nlargest/nsmallest sometimes failed to produce the correct result.

I looked at the nsmallest/nlargest implementation for data frames, and to me it looks wrong.

With ties in the first columns, the old algorithm picked all duplicates rather than only the values that lie on the border for the next tie-break iteration.
That meant that to find the top 5 values in a data frame like e.g.:
[0 5] [0 4] [1 0] [2 3] [2 2] [2 1]
(the correct order being [1 0] [2 1] [2 2] [2 3] [0 4] [0 5])
it would first drop the only row without a duplicated first-column value:
[1 0]
and then tie-break among all the remaining duplicated rows. That is: [2 3] [2 2] [2 1] [0 4] [0 5],
which is not the correct result: [1 0] [0 4] [2 3] [2 2] [2 1].
I've changed the algorithm so that instead of tie-breaking on all values having duplicates in an earlier column, it now tie-breaks only on the duplicates of the largest/smallest value in a given column, so it will do:
[1 0] [2 1] [2 2] [2 3] [0 4] [0 5]
[1 0] [2 1] [2 2] [2 3]
which is the correct result.
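As a rough illustration of the corrected strategy, here is a minimal sketch, not pandas' actual implementation: it works column by column and forwards only the border ties to the next tie-break.

```python
from typing import List, Tuple

def nlargest_sketch(rows: List[Tuple], n: int, col: int = 0) -> List[Tuple]:
    """Return the n largest rows, ordering by columns left to right."""
    if not rows or n <= 0 or col >= len(rows[0]):
        return rows[:n]
    ordered = sorted(rows, key=lambda r: r[col], reverse=True)
    head = ordered[:n]
    border = head[-1][col]
    # Rows strictly above the border value are already decided ...
    decided = [r for r in head if r[col] != border]
    # ... and only the duplicates of the border value are tie-broken on
    # the next column (the old code forwarded *all* duplicated values).
    tied = [r for r in ordered if r[col] == border]
    return decided + nlargest_sketch(tied, n - len(decided), col + 1)

# The example from the review discussion above:
print(nlargest_sketch([(0, 4), (0, 3), (1, 2), (1, 1)], 3))
# -> [(1, 2), (1, 1), (0, 4)]
```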