Skip to content

Commit

Permalink
BUG: nlargest/nsmallest gave wrong result (pandas-dev#22752)
Browse files Browse the repository at this point in the history
When asking for the n largest/smallest rows in a dataframe
nlargest/nsmallest sometimes failed to differentiate
the correct result based on the latter columns.
  • Loading branch information
troels committed Sep 18, 2018
1 parent 1a12c41 commit c09c5a2
Show file tree
Hide file tree
Showing 3 changed files with 44 additions and 11 deletions.
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.24.0.txt
Original file line number Diff line number Diff line change
Expand Up @@ -804,6 +804,7 @@ Other
- :meth:`~pandas.io.formats.style.Styler.background_gradient` now takes a ``text_color_threshold`` parameter to automatically lighten the text color based on the luminance of the background color. This improves readability with dark background colors without the need to limit the background colormap range. (:issue:`21258`)
- Require at least 0.28.2 version of ``cython`` to support read-only memoryviews (:issue:`21688`)
- :meth:`~pandas.io.formats.style.Styler.background_gradient` now also supports tablewise application (in addition to rowwise and columnwise) with ``axis=None`` (:issue:`15204`)
- :meth:`DataFrame.nlargest` and :meth:`DataFrame.nsmallest` now returns the correct n values when keep != 'all' also when tied on the first columns (:issue:`22752`)
- :meth:`~pandas.io.formats.style.Styler.bar` now also supports tablewise application (in addition to rowwise and columnwise) with ``axis=None`` and setting clipping range with ``vmin`` and ``vmax`` (:issue:`21548` and :issue:`21526`). ``NaN`` values are also handled properly.
- Logical operations ``&, |, ^`` between :class:`Series` and :class:`Index` will no longer raise ``ValueError`` (:issue:`22092`)
-
Expand Down
31 changes: 20 additions & 11 deletions pandas/core/algorithms.py
Original file line number Diff line number Diff line change
Expand Up @@ -1221,27 +1221,36 @@ def get_indexer(current_indexer, other_indexer):
# and break
# Otherwise we must save the index of the non duplicated values
# and set the next cur_frame to cur_frame filtered on all
# duplcicated values (#GH15297)
# duplicated values (#GH15297)
series = cur_frame[column]
values = getattr(series, method)(cur_n, keep=self.keep)
is_last_column = len(columns) - 1 == i
if is_last_column or values.nunique() == series.isin(values).sum():

# Last column in columns or values are unique in
# series => values
# is all that matters
if is_last_column:
# Last column reached if two values are identical
# we can't separate them or
indexer = get_indexer(indexer, values.index)
break

duplicated_filter = series.duplicated(keep=False)
duplicated = values[duplicated_filter]
non_duplicated = values[~duplicated_filter]
indexer = get_indexer(indexer, non_duplicated.index)
# Find the value at the border
if method == 'nlargest':
border_value = values.min()
else:
border_value = values.max()

border_values = series == border_value
if border_values.count() == 1:
# The largest/smallest value is unique
# we are done.
indexer = get_indexer(indexer, values.index)
break

# Must set cur frame to include all duplicated values
safe_values = values[values != border_value]
indexer = get_indexer(indexer, safe_values.index)
# Must set cur frame to include all border values
# to consider for the next column, we also can reduce
# cur_n by the current length of the indexer
cur_frame = cur_frame[series.isin(duplicated)]
cur_frame = cur_frame[border_values]
cur_n = n - len(indexer)

frame = frame.take(indexer)
Expand Down
23 changes: 23 additions & 0 deletions pandas/tests/series/test_analytics.py
Original file line number Diff line number Diff line change
Expand Up @@ -1996,6 +1996,29 @@ def test_duplicate_keep_all_ties(self):
expected = Series([6, 7, 7, 7, 7], index=[7, 3, 4, 5, 6])
assert_series_equal(result, expected)

@pytest.mark.parametrize('method,expected', [
('nlargest',
pd.DataFrame({'a': [2, 2, 2, 1], 'b': [3, 2, 1, 3]},
index=[2, 1, 0, 3])),
('nsmallest',
pd.DataFrame({'a': [2, 1, 1, 1], 'b': [1, 3, 2, 1]},
index=[0, 3, 4, 5]))])
def test_duplicates_on_starter_columns(self, method, expected):
# regression test for #22752

df = pd.DataFrame({
'a': [2, 2, 2, 1, 1, 1],
'b': [1, 2, 3, 3, 2, 1]
})

result = getattr(df, method)(
4, columns=['a', 'b']
).sort_values(
['a', 'b'], ascending=False
)

assert_frame_equal(result, expected)


class TestCategoricalSeriesAnalytics(object):

Expand Down

0 comments on commit c09c5a2

Please sign in to comment.