Unexpected output for `nlargest` function with multiple columns #22752

rileymcdowell · 2018-09-18T20:56:29Z

Code Sample

import pandas as pd

aas = [2,2,2,1,1,1]
bbs = [1,2,3,3,2,1]
n = 4

df = pd.DataFrame({'a': aas, 'b': bbs})

print('-- First --')
nlargest = df.nlargest(n, columns=['a', 'b']).sort_values(['a', 'b'], ascending=False)
print(nlargest)

print('-- Second --')
pseudo_nlargest = df.sort_values(['a', 'b'], ascending=False).head(n)
print(pseudo_nlargest)

Actual Output

-- First --
   a  b
2  2  3
1  2  2
3  1  3
4  1  2
-- Second --
   a  b
2  2  3 <same>
1  2  2 <same>
0  2  1 <different!>
3  1  3 <different!>

Text within angle brackets was added to call attention to rows with unexpected output.

Problem description

According to the documentation for `nlargest`, it should behave identically to `df.sort_values(columns, ascending=False).head(n)` but be more performant, presumably because it does not need to sort the entire DataFrame.

I am observing different behavior. In the example above, I expect the first and second DataFrames to be the same in both indices and values. (Note that I've sorted the output of `nlargest` to remove sort order as a difference.)
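The comparison described above can also be checked programmatically instead of by eye. The following is a minimal sketch that assumes the same data as the code sample; the names `expected`, `actual`, and `same` are illustrative only. On an affected version such as the 0.23.4 reported here, `same` would presumably be `False`.

```python
import pandas as pd

df = pd.DataFrame({'a': [2, 2, 2, 1, 1, 1], 'b': [1, 2, 3, 3, 2, 1]})
n = 4
columns = ['a', 'b']

# Reference result: sort the whole frame, then keep the first n rows.
expected = df.sort_values(columns, ascending=False).head(n)

# Result under test, re-sorted so that only row membership is compared.
actual = df.nlargest(n, columns=columns).sort_values(columns, ascending=False)

# DataFrame.equals compares both the index labels and the values.
same = bool(actual.equals(expected))
print('pandas', pd.__version__, '- results match:', same)
```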

Similar issues, but different enough that I opened a new one

#21426 - Deals with unsigned ints; this issue uses signed int64s.
#19563 - Differs by sort order only; in this issue the rows themselves are a different, non-unique subset of the original rows.

Expected Output

-- First --
   a  b
2  2  3
1  2  2
0  2  1
3  1  3
-- Second --
   a  b
2  2  3 <same>
1  2  2 <same>
0  2  1 <same>
3  1  3 <same>

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-34-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.4
pytest: None
pip: 9.0.3
setuptools: 40.4.1
Cython: None
numpy: 1.15.1
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.5.0
sphinx: None
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.3
openpyxl: 2.5.7
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: 1.2.11
pymysql: None
psycopg2: 2.7.5 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

When asking for the n largest/smallest rows in a dataframe nlargest/nsmallest sometimes failed to differentiate the correct result based on the latter columns.
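To illustrate what breaking ties on the later columns means, here is a small reference sketch. It is not the pandas implementation or the change made in #22754. The function name `nlargest_reference` is hypothetical, and the sketch assumes a unique index and no missing values. For each column it keeps the rows strictly above the cutoff value and passes only the rows tied at the cutoff on to the next column.

```python
import pandas as pd


def nlargest_reference(df, n, columns):
    """Illustrative multi-column n-largest; assumes a unique index and no NaNs."""
    selected = []      # index labels already known to be in the result
    candidates = df    # rows still tied on all columns examined so far
    for col in columns:
        remaining = n - len(selected)
        if remaining <= 0 or candidates.empty:
            break
        if len(candidates) <= remaining:
            # Every remaining candidate fits, so no tie-breaking is needed.
            selected.extend(candidates.index)
            candidates = candidates.iloc[0:0]
            break
        values = candidates[col]
        # Value of the last row that still makes the cut on this column.
        threshold = values.sort_values(ascending=False).iloc[remaining - 1]
        # Rows strictly above the cutoff are certainly part of the result.
        selected.extend(candidates[values > threshold].index)
        # Rows equal to the cutoff are tied; the next column decides.
        candidates = candidates[values == threshold]
    # Rows still tied after the last column are taken in original order.
    remaining = n - len(selected)
    if remaining > 0:
        selected.extend(candidates.index[:remaining])
    return df.loc[selected].sort_values(columns, ascending=False)


df = pd.DataFrame({'a': [2, 2, 2, 1, 1, 1], 'b': [1, 2, 3, 3, 2, 1]})
result = nlargest_reference(df, 4, ['a', 'b'])
print(result)
```

For the data in this issue, column `a` selects rows 0, 1, and 2 outright and leaves rows 3, 4, and 5 tied at `a == 1`; column `b` then picks row 3 from that tie, which matches the expected output.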

troels mentioned this issue Sep 18, 2018

BUG: nlargest/nsmallest gave wrong result (#22752) #22754

Merged

WillAyd added this to the Contributions Welcome milestone Sep 19, 2018

WillAyd added the Bug, Algos (non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff), and DataFrame (DataFrame data structure) labels Sep 19, 2018

jreback modified the milestones: Contributions Welcome, 0.24.0 Sep 23, 2018

jreback closed this as completed in #22754 Sep 25, 2018

jreback pushed a commit that referenced this issue Sep 25, 2018

BUG: nlargest/nsmallest gave wrong result (#22752) (#22754)

1c4130d

Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this issue Oct 1, 2018

BUG: nlargest/nsmallest gave wrong result (pandas-dev#22752) (pandas-…

11928f0
