Unexpected output for nlargest function with multiple columns #22752

Closed
rileymcdowell opened this issue Sep 18, 2018 · 0 comments · Fixed by #22754
Labels: Algos (Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff), Bug, DataFrame (DataFrame data structure)

Code Sample

import pandas as pd

aas = [2,2,2,1,1,1]
bbs = [1,2,3,3,2,1]
n = 4

df = pd.DataFrame({'a': aas, 'b':  bbs})

print('-- First --')
nlargest = df.nlargest(n, columns=['a', 'b']).sort_values(['a', 'b'], ascending=False)
print(nlargest)

print('-- Second --')
pseudo_nlargest = df.sort_values(['a', 'b'], ascending=False).head(n)
print(pseudo_nlargest)

Actual Output

-- First --
   a  b
2  2  3
1  2  2
3  1  3
4  1  2
-- Second --
   a  b
2  2  3 <same>
1  2  2 <same>
0  2  1 <different!>
3  1  3 <different!>

Text within angle brackets was added to call attention to rows with unexpected output.

Problem description

According to the documentation for nlargest, the function should be equivalent to df.sort_values(columns, ascending=False).head(n), but more performant. Presumably the performance gain comes from not needing to sort the entire DataFrame.

I am observing different behavior. In the example above, I expect the first and second DataFrames to be the same in both indices and values. (Note that I've sorted the output of nlargest to rule out sort order as a difference.)
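
The comparison described above can be made explicit with a small helper. This is a minimal sketch: the name matches_sort_head is hypothetical and not part of pandas. It compares row labels as sets, so sort order does not affect the result. Going by the output reported above, it would be expected to return False on pandas 0.23.4 for this frame, and True on releases that include the fix referenced in this issue.

```python
import pandas as pd


def matches_sort_head(df, n, columns):
    """Return True if nlargest picks the same rows as sort_values + head.

    Hypothetical helper for illustration; not part of pandas.
    """
    via_nlargest = df.nlargest(n, columns=columns)
    via_sort = df.sort_values(columns, ascending=False).head(n)
    # Compare row labels as sets so that ordering differences are ignored.
    return set(via_nlargest.index) == set(via_sort.index)


df = pd.DataFrame({'a': [2, 2, 2, 1, 1, 1], 'b': [1, 2, 3, 3, 2, 1]})
reference = df.sort_values(['a', 'b'], ascending=False).head(4)
print(list(reference.index))
print(matches_sort_head(df, 4, ['a', 'b']))
```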

Similar issues, but different enough that I opened a new one

#21426 - Deals with unsigned ints, this issue uses signed int64s.
#19563 - Differs by sort order only; in this issue the returned rows themselves are a different, non-unique subset of the original rows.

Expected Output

-- First --
   a  b
2  2  3
1  2  2
0  2  1
3  1  3
-- Second --
   a  b
2  2  3 <same>
1  2  2 <same>
0  2  1 <same>
3  1  3 <same>

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-34-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.4
pytest: None
pip: 9.0.3
setuptools: 40.4.1
Cython: None
numpy: 1.15.1
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.5.0
sphinx: None
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.3
openpyxl: 2.5.7
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: 1.2.11
pymysql: None
psycopg2: 2.7.5 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

troels added a commit to troels/pandas that referenced this issue Sep 18, 2018
When asking for the n largest/smallest rows in a dataframe
nlargest/nsmallest sometimes failed to differentiate
the correct result based on the latter columns.
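
The selection this commit message describes can be modelled in plain Python by ranking rows on the tuple (a, b), so that the later column breaks ties in the earlier one. This is only an illustrative sketch with a hypothetical function name; it is not the code from the referenced commit and not how pandas implements nlargest.

```python
import heapq


def top_n_rows(rows, n):
    """Return positions of the n largest rows, compared lexicographically.

    Hypothetical illustration; not the pandas implementation.
    """
    # Tuples compare element by element, so the second column only
    # matters when the first column is tied.
    return heapq.nlargest(n, range(len(rows)), key=lambda i: rows[i])


# Same data as the code sample in this issue: columns 'a' and 'b'.
rows = list(zip([2, 2, 2, 1, 1, 1], [1, 2, 3, 3, 2, 1]))
print(top_n_rows(rows, 4))  # [2, 1, 0, 3]
```

The positions match the index of the expected output above (2, 1, 0, 3).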
WillAyd added this to the Contributions Welcome milestone Sep 19, 2018
WillAyd added the Bug, Algos, and DataFrame labels Sep 19, 2018
troels added a commit to troels/pandas that referenced this issue Sep 19, 2018
When asking for the n largest/smallest rows in a dataframe
nlargest/nsmallest sometimes failed to differentiate
the correct result based on the latter columns.
troels added a commit to troels/pandas that referenced this issue Sep 22, 2018
When asking for the n largest/smallest rows in a dataframe
nlargest/nsmallest sometimes failed to differentiate
the correct result based on the latter columns.
troels added a commit to troels/pandas that referenced this issue Sep 23, 2018
When asking for the n largest/smallest rows in a dataframe
nlargest/nsmallest sometimes failed to differentiate
the correct result based on the latter columns.
jreback modified the milestones: Contributions Welcome, 0.24.0 Sep 23, 2018